
p-Adic Statistical Field Theory and Convolutional Deep Boltzmann Machines

W. A. Zúñiga-Galindo
University of Texas Rio Grande Valley
School of Mathematical and Statistical Sciences
One West University Blvd.,
Brownsville, TX 78520, United States.
The author was partially supported by the Lokenath Debnath Endowed Professorship.
   Cuiyu He
Oklahoma State University, Department of Mathematics
MSCS 425, Stillwater, OK, United States.
   B. A. Zambrano-Luna
University of Texas Rio Grande Valley
School of Mathematical and Statistical Sciences
One West University Blvd.,
Brownsville, TX 78520, United States.
Abstract

Understanding how deep learning architectures work is a central scientific problem. Recently, a correspondence between neural networks (NNs) and Euclidean quantum field theories (QFTs) has been proposed. This work investigates this correspondence in the framework of $p$-adic statistical field theories (SFTs) and neural networks (NNs). In this case, the fields are real-valued functions defined on an infinite regular rooted tree with valence $p$, a fixed prime number. This infinite tree provides the topology for a continuous deep Boltzmann machine (DBM), which is identified with a statistical field theory (SFT) on this infinite tree. In the $p$-adic framework, there is a natural method to discretize SFTs. Each discrete SFT corresponds to a Boltzmann machine (BM) with a tree-like topology. This method allows us to recover the standard DBMs and gives new convolutional DBMs. The new networks use $O(N)$ parameters while the classical ones use $O(N^{2})$ parameters.

1 Introduction

Deep neural networks have been successfully applied to various tasks, including self-driving cars, natural language processing, and visual recognition, among many others, [1]-[2]. There is consensus about the need to develop a theoretical framework to understand how deep learning architectures work. Recently, physicists have proposed the existence of a correspondence between neural networks (NNs) and quantum field theories (QFTs), more precisely statistical field theories (SFTs), see [3]-[12], and the references therein. This correspondence takes different forms depending on the architecture of the networks involved.

In [12], the study of the above-mentioned correspondence was initiated in the framework of non-Archimedean statistical field theory (SFT). In this case, the background space (the set of real numbers) is replaced by the set of $p$-adic numbers, where $p$ is a fixed prime number. The $p$-adics are organized in a tree-like structure; this feature facilitates the description of hierarchical architectures. In [12], a $p$-adic version of the convolutional deep Boltzmann machines is introduced, where only binary data is considered and no implementation is given. By adapting the mathematical techniques introduced by Le Roux and Bengio in [13], the author shows that these machines are universal approximators for binary data tasks.

In this article, we continue discussing the correspondence between SFTs and NNs in the $p$-adic framework. Compared with [12], here we consider more general architectures and data types. We note that dealing with general data is challenging both in theory and in implementation. We argue that $p$-adic analysis still provides the right framework to understand the dynamics of NNs with large tree-like hierarchical architectures. In our approach, an NN corresponds to the discretization of a $p$-adic SFT. The discretization process is carried out in a rigorous and general way. Moreover, such discretization allows us to obtain many recently developed deep BMs. For instance, the NNs constructed in [5] are a particular case of the ones introduced here. We also discuss the implementation of a class of $p$-adic convolutional networks and obtain the desired results on a feature detection task based on images of handwritten decimal digits.

The main novelty of our $p$-adic convolutional DBMs is that they use significantly fewer parameters than the conventional ones. A detailed discussion is given in Section 6. We note that the connections between $p$-adic numbers and neural networks have been considered before. Neural networks whose states are $p$-adic numbers were studied in [14, 15]. These models are completely different from the ones considered here. These ideas have been used to develop non-Archimedean models of brain activity and mental processes [16]. In [17, 18], $p$-adic versions of the cellular neural networks were studied. These models involved abstract evolution equations.

2 $p$-Adic Numbers

In this section, we introduce basic concepts for the $p$-adic numbers. For more detailed information, refer to Section 7 (Appendix A).

From now on, $p$ denotes a fixed prime number. Any non-zero $p$-adic number $x$ has a unique expansion of the form

$$x=x_{-k}p^{-k}+x_{-k+1}p^{-k+1}+\ldots+x_{0}+x_{1}p+\ldots,$$

with $x_{-k}\neq 0$, where $k$ is an integer, and the $x_{j}$s are numbers from the set $\{0,1,\ldots,p-1\}$. The set of all possible numbers of such form constitutes the field of $p$-adic numbers $\mathbb{Q}_{p}$. There are natural field operations, sum and multiplication, on $p$-adic numbers, see, e.g., [36]. There is also a natural norm in $\mathbb{Q}_{p}$, defined for a nonzero $p$-adic number $x$ as $|x|_{p}=p^{k}$, where $k$ is the integer appearing in the expansion of $x$.
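As a small illustration of the norm, the following Python sketch computes the $p$-adic order and norm of a rational number. The function names `ord_p` and `norm_p` are hypothetical helpers introduced here; they are not part of the paper, and the sketch only covers rational inputs, not general $p$-adic numbers.

```python
from fractions import Fraction

def ord_p(x, p):
    """p-adic order of a nonzero rational x: the exponent of p in x."""
    x = Fraction(x)
    if x == 0:
        raise ValueError("the order of 0 is +infinity")
    num, den, gamma = x.numerator, x.denominator, 0
    while num % p == 0:      # powers of p in the numerator raise the order
        num //= p
        gamma += 1
    while den % p == 0:      # powers of p in the denominator lower the order
        den //= p
        gamma -= 1
    return gamma

def norm_p(x, p):
    """p-adic norm |x|_p = p^(-ord_p(x)), with |0|_p = 0."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    return Fraction(p) ** (-ord_p(x, p))
```

For instance, `norm_p(18, 3)` is `Fraction(1, 9)` because $18=2\cdot 3^{2}$, while `norm_p(Fraction(1, 9), 3)` is `9`.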

Figure 1: The rooted tree associated with the group $\mathbb{Z}_{2}/2^{3}\mathbb{Z}_{2}$. The elements of $\mathbb{Z}_{2}/2^{3}\mathbb{Z}_{2}$ have the form $i=i_{0}+i_{1}2+i_{2}2^{2}$, with $i_{0}$, $i_{1}$, $i_{2}\in\{0,1\}$. The distance satisfies $-\log_{2}|i-j|_{2}=$ level of the first common ancestor of $i$, $j$.

The field of $p$-adic numbers with the distance induced by $|\cdot|_{p}$ is a complete ultrametric space. The ultrametric property refers to the fact that $|x-y|_{p}\leq\max\{|x-z|_{p},|z-y|_{p}\}$ for any $x$, $y$, $z$ in $\mathbb{Q}_{p}$. In this article, we work with $p$-adic integers, which are the $p$-adic numbers satisfying $-k\geq 0$. All such $p$-adic integers constitute the unit ball $\mathbb{Z}_{p}$. The unit ball is closed under addition and multiplication, so it is a commutative ring. Throughout this article, we work mainly with locally constant functions supported in the unit ball, i.e., with functions of type $\varphi:\mathbb{Z}_{p}\rightarrow\mathbb{R}$ such that $\varphi(a+x)=\varphi(a)$ for all $a$ in $\mathbb{Z}_{p}$ whenever $|x|_{p}\leq p^{-l}$, where the integer $l$ is fixed and independent of $a$. The simplest example of such a function is the characteristic function $1_{\mathbb{Z}_{p}}(x)$ of the unit ball $\mathbb{Z}_{p}$: $1_{\mathbb{Z}_{p}}(x)=1$ if $|x|_{p}\leq 1$, otherwise $1_{\mathbb{Z}_{p}}(x)=0$. To check that $1_{\mathbb{Z}_{p}}(a+x)=1_{\mathbb{Z}_{p}}(a)$ for all $x$ in $\mathbb{Z}_{p}$, we use that $\mathbb{Z}_{p}$ is closed under addition. We denote by $\mathcal{D}(\mathbb{Z}_{p})$ the real vector space of test functions supported in the unit ball. There is a natural integration theory so that $\int_{\mathbb{Z}_{p}}\varphi(x)\,dx$ gives a well-defined real number. The measure $dx$ is the so-called Haar measure of $\mathbb{Q}_{p}$. Further details are given in Section 7.3.

Since the $p$-adic numbers are infinite series, any computational implementation involving these numbers requires a truncation process: $x\longmapsto x_{0}+x_{1}p+\ldots+x_{l-1}p^{l-1}$, $l\geq 1$. The set of all truncated integers mod $p^{l}$ is denoted by $G_{l}=\mathbb{Z}_{p}/p^{l}\mathbb{Z}_{p}$. This set can be represented as a rooted tree with $l$ levels; see Figure 1.
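The truncation and the tree distance of Figure 1 can be sketched in a few lines of Python. This is only an illustration under the assumption that elements of $G_{l}$ are stored as ordinary integers in the range $0,\ldots,p^{l}-1$; the helper names `truncate`, `digits`, and `common_ancestor_level` are hypothetical.

```python
def truncate(x, p, l):
    """Representative in {0, ..., p^l - 1} of the integer x in G_l."""
    return x % (p ** l)

def digits(i, p, l):
    """Digits [i_0, ..., i_{l-1}] with i = i_0 + i_1 p + ... + i_{l-1} p^(l-1)."""
    i = truncate(i, p, l)
    out = []
    for _ in range(l):
        out.append(i % p)
        i //= p
    return out

def common_ancestor_level(i, j, p, l):
    """Number of leading digits shared by i and j.

    For i != j this equals -log_p |i - j|_p, the level of the first
    common ancestor in the tree (the root is at level 0).
    """
    di, dj = digits(i, p, l), digits(j, p, l)
    level = 0
    while level < l and di[level] == dj[level]:
        level += 1
    return level
```

With $p=2$ and $l=3$, the elements $1$ and $5$ share the digits $i_{0}=1$, $i_{1}=0$, so their first common ancestor is at level $2$, in agreement with $|1-5|_{2}=2^{-2}$.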

The unit ball $\mathbb{Z}_{p}$ is an infinite rooted tree with fractal structure; see Figure 2. Section 7 (Appendix A) provides a review of the basic aspects of the $p$-adic analysis required here. We note that the word ‘field’ will be used here in two different contexts throughout the article. In a mathematical context, we refer to algebraic fields; in a physical context, we refer to Euclidean quantum fields.

3 Non-Archimedean $\{\boldsymbol{v},\boldsymbol{h}\}^{4}$-statistical field theories

We fix $a(x)$, $b(x)$, $c(x)$, $d(x)\in\mathcal{D}(\mathbb{Z}_{p})$, $e\in\mathbb{R}$, and an integrable function $w(x):\mathbb{Z}_{p}\rightarrow\mathbb{R}$. A $p$-adic continuous Boltzmann machine (or a $p$-adic continuous BM) is a statistical field theory involving two scalar fields $\boldsymbol{v}$, $\boldsymbol{h}$. The function $\boldsymbol{v}(x)\in\mathcal{D}(\mathbb{Z}_{p})$ is called the visible field and the function $\boldsymbol{h}(x)\in\mathcal{D}(\mathbb{Z}_{p})$ is called the hidden field. We assume that the field $\{\boldsymbol{v},\boldsymbol{h}\}$ performs thermal fluctuations and that the expectation value of the field is zero.

Figure 2: Based upon [37], we construct an embedding $\mathfrak{f}:\mathbb{Z}_{p}\rightarrow\mathbb{R}^{2}$. The figure shows the images $\mathfrak{f}(\mathbb{Z}_{2})$ and $\mathfrak{f}(\mathbb{Z}_{3})$. This computation requires a truncation of the $p$-adic integers. We use $\mathbb{Z}_{2}/2^{14}\mathbb{Z}_{2}$ and $\mathbb{Z}_{3}/3^{10}\mathbb{Z}_{3}$, respectively.

The size of the fluctuations is controlled by an energy functional of the form

$$E(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta}):=E(\boldsymbol{v},\boldsymbol{h})=E_{0}(\boldsymbol{v},\boldsymbol{h})+E_{\text{int}}(\boldsymbol{v},\boldsymbol{h}),$$

where $\boldsymbol{\theta}=(w,a,b,c,d,e)$. The first term

$$E_{0}(\boldsymbol{v},\boldsymbol{h})=-\int_{\mathbb{Z}_{p}}a(x)\boldsymbol{v}(x)\,dx-\int_{\mathbb{Z}_{p}}b(x)\boldsymbol{h}(x)\,dx+\frac{e}{2}\int_{\mathbb{Z}_{p}}\boldsymbol{v}^{2}(x)\,dx+\frac{e}{2}\int_{\mathbb{Z}_{p}}\boldsymbol{h}^{2}(x)\,dx$$

is an analogue of the free-field energy. The second term

$$E_{\text{int}}(\boldsymbol{v},\boldsymbol{h})=-\iint_{\mathbb{Z}_{p}\times\mathbb{Z}_{p}}\boldsymbol{h}(y)w(x-y)\boldsymbol{v}(x)\,dx\,dy+\int_{\mathbb{Z}_{p}}c(x)\boldsymbol{v}^{4}(x)\,dx+\int_{\mathbb{Z}_{p}}d(x)\boldsymbol{h}^{4}(x)\,dx$$

is an analogue of the interaction energy. The results presented in this section are valid for more general functionals in which the first term in $E_{\text{int}}(\boldsymbol{v},\boldsymbol{h})$ is replaced by

$$-\iint_{\mathbb{Z}_{p}\times\mathbb{Z}_{p}}\boldsymbol{h}(y)w(x,y)\boldsymbol{v}(x)\,dx\,dy.$$

The motivation behind the definition of the energy functionals $E(\boldsymbol{v},\boldsymbol{h})$ is that the discretizations of these functionals give the energy functionals considered in [5], [13], [38]-[39].

All the thermodynamic properties of the system are described by the partition function of the fluctuating fields, which is defined as

$$Z^{\text{phys}}(\boldsymbol{\theta})=\int d\boldsymbol{v}\,d\boldsymbol{h}\;e^{-\frac{E(\boldsymbol{v},\boldsymbol{h})}{K_{B}T}},$$

where $K_{B}$ is the Boltzmann constant and $T$ is the temperature. We normalize in such a way that $K_{B}T=1$. The measure $d\boldsymbol{v}\,d\boldsymbol{h}$ is ill-defined. However, it is expected that such a measure can be defined rigorously by a limit process. The statistical field theory corresponding to the energy functional $E(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta})$ is defined as the probability measure

$$\boldsymbol{P}^{\text{phys}}(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta})=d\boldsymbol{v}\,d\boldsymbol{h}\,\frac{\exp\left(-E(\boldsymbol{v},\boldsymbol{h})\right)}{Z^{\text{phys}}},$$

on the space $\mathcal{D}(\mathbb{Z}_{p})\times\mathcal{D}(\mathbb{Z}_{p})$.

The information about the local properties of the system is contained in the correlation functions $G_{\mathbb{I},\mathbb{K}}^{(n)}(x_{1},\ldots,x_{n})$ of the field $\{\boldsymbol{v},\boldsymbol{h}\}$: for $n\geq 1$, and two disjoint subsets $\mathbb{I}$, $\mathbb{K}\subset\{1,2,\ldots,n\}$, with $\mathbb{I}\coprod\mathbb{K}=\{1,2,\ldots,n\}$, where $\coprod$ is the disjoint union, we set

$$\begin{aligned}G_{\mathbb{I},\mathbb{K}}^{(n)}(x_{1},\ldots,x_{n})&=\left\langle\prod_{i\in\mathbb{I}}\boldsymbol{v}(x_{i})\prod_{j\in\mathbb{K}}\boldsymbol{h}(x_{j})\right\rangle\\&:=\frac{1}{Z^{\text{phys}}}\int d\boldsymbol{v}\,d\boldsymbol{h}\;\prod_{i\in\mathbb{I}}\boldsymbol{v}(x_{i})\prod_{j\in\mathbb{K}}\boldsymbol{h}(x_{j})\;e^{-E(\boldsymbol{v},\boldsymbol{h})}.\end{aligned}$$

These functions are also called the $n$-point Green functions.

To study these functions, one introduces two auxiliary external fields $J_{0}(x)$, $J_{1}(x)\in\mathcal{D}(\mathbb{Z}_{p})$ called currents, and adds to the energy functional $E$ a linear interaction energy of these currents with the field $\{\boldsymbol{v},\boldsymbol{h}\}$,

$$E_{\text{source}}(\boldsymbol{v},\boldsymbol{h},J_{0},J_{1})=-\int_{\mathbb{Z}_{p}}J_{0}(x)\boldsymbol{v}(x)\,dx-\int_{\mathbb{Z}_{p}}J_{1}(x)\boldsymbol{h}(x)\,dx,$$

so that the energy functional becomes $E(\boldsymbol{v},\boldsymbol{h},J_{0},J_{1})=E(\boldsymbol{v},\boldsymbol{h})+E_{\text{source}}(\boldsymbol{v},\boldsymbol{h},J_{0},J_{1})$. The partition function formed with this energy is

$$Z(J_{0},J_{1})=\frac{1}{Z_{0}^{\text{phys}}}\int d\boldsymbol{v}\,d\boldsymbol{h}\;e^{-E(\boldsymbol{v},\boldsymbol{h},J_{0},J_{1})},$$

where

$$Z_{0}^{\text{phys}}=\int d\boldsymbol{v}\,d\boldsymbol{h}\;e^{-E_{0}(\boldsymbol{v},\boldsymbol{h})}.$$

The functional derivatives of $Z(J_{0},J_{1})$ with respect to $J_{0}(x)$, $J_{1}(x)$ evaluated at $J_{0}=0$, $J_{1}=0$ give the correlation functions of the system:

$$G_{\mathbb{I},\mathbb{K}}^{(n)}(x_{1},\ldots,x_{n})=\frac{1}{Z}\left[\prod_{i\in\mathbb{I}}\frac{\delta}{\delta J_{0}(x_{i})}\prod_{j\in\mathbb{K}}\frac{\delta}{\delta J_{1}(x_{j})}Z(J_{0},J_{1})\right]_{\substack{J_{0}=0\\J_{1}=0}},$$

where $Z=\frac{Z^{\text{phys}}}{Z_{0}^{\text{phys}}}$. The functional $Z(J_{0},J_{1})$ is called the generating functional of the theory.

The description of the $\{\boldsymbol{v},\boldsymbol{h}\}^{4}$-SFTs presented above is based on the classical version of these theories [40]-[41]. In [31], a mathematically rigorous formulation of $\phi^{4}$-SFTs is presented; there, the fields are functions from $\mathbb{Q}_{p}^{N}$ into $\mathbb{R}$, with $N$ arbitrary. We expect that this theory can be extended to the $\{\boldsymbol{v},\boldsymbol{h}\}^{4}$-SFTs presented here.

4 Discrete SFTs and $p$-adic discrete Boltzmann machines

A central difference between the $p$-adic SFTs and the classical ones is that in the $p$-adic case, the discretization process can be carried out in an easy and rigorous way. More specifically, the discretization of a $p$-adic SFT is constructed by restricting the energy functional $E(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta})$ to a finite-dimensional vector subspace $\mathcal{D}^{l}(\mathbb{Z}_{p})$ of the space of test functions $\mathcal{D}(\mathbb{Z}_{p})$. The test functions in $\mathcal{D}^{l}(\mathbb{Z}_{p})$ have the form

$$\varphi(x)=\sum_{i\in G_{l}}\varphi(i)\,\Omega\left(p^{l}|x-i|_{p}\right),\quad\varphi(i)\in\mathbb{R},$$

where $i=i_{0}+i_{1}p+\ldots+i_{l-1}p^{l-1}\in G_{l}=\mathbb{Z}_{p}/p^{l}\mathbb{Z}_{p}$, $l\geq 1$, and $\Omega\left(p^{l}|x-i|_{p}\right)$ is the characteristic function of the ball $B_{-l}(i)$. Here, it is important to notice that $G_{l}$ is a finite, Abelian, additive group. In the $p$-adic world, the discrete functions are a particular case of the $p$-adic continuous functions; more precisely, $\mathcal{D}(\mathbb{Z}_{p})=\cup_{l\in\mathbb{N}}\mathcal{D}^{l}(\mathbb{Z}_{p})$ and $\mathcal{D}^{l}(\mathbb{Z}_{p})\subset\mathcal{D}^{l+1}(\mathbb{Z}_{p})$. There is no Archimedean counterpart of this result.

By taking $\boldsymbol{v},\boldsymbol{h}\in\mathcal{D}^{l}(\mathbb{Z}_{p})$ and $l$ sufficiently large, the restriction $E_{l}(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta})$ of the energy functional $E(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta})$ to $\mathcal{D}^{l}(\mathbb{Z}_{p})$ has the form

$$\begin{aligned}E_{l}(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta})={}&-\sum_{j\in G_{l}}\sum_{k\in G_{l}}w_{k}v_{j+k}h_{j}-\sum_{j\in G_{l}}a_{j}v_{j}-\sum_{j\in G_{l}}b_{j}h_{j}+\frac{e}{2}\sum_{j\in G_{l}}v_{j}^{2}+\frac{e}{2}\sum_{j\in G_{l}}h_{j}^{2}\\&+\sum_{j\in G_{l}}c_{j}v_{j}^{4}+\sum_{j\in G_{l}}d_{j}h_{j}^{4}.\end{aligned}\tag{1}$$

We refer to Section 8 (Appendix B) for further details of this calculation. From now on, we refer to (1) as the energy functional for a discrete $\{\boldsymbol{v},\boldsymbol{h}\}^{4}$-SFT.
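To make the discrete energy functional (1) concrete, here is a minimal pure-Python sketch. It assumes that the elements of $G_{l}$ are represented by the integers $0,\ldots,p^{l}-1$ and that the group addition $j+k$ is addition modulo $p^{l}$; the function name `energy` and the list-based storage of $v$, $h$, $w$, $a$, $b$, $c$, $d$ are illustrative choices, not the authors' implementation.

```python
def energy(v, h, w, a, b, c, d, e, p, l):
    """Discrete {v,h}^4 energy E_l for states and parameters indexed by G_l."""
    n = p ** l
    assert all(len(s) == n for s in (v, h, w, a, b, c, d))
    # Convolutional interaction: sum_j sum_k w_k v_{j+k} h_j (indices mod p^l).
    interaction = sum(w[k] * v[(j + k) % n] * h[j]
                      for j in range(n) for k in range(n))
    # Linear (bias) terms.
    linear = sum(a[j] * v[j] + b[j] * h[j] for j in range(n))
    # Quadratic terms with the common coefficient e/2.
    quadratic = 0.5 * e * sum(v[j] ** 2 + h[j] ** 2 for j in range(n))
    # Quartic terms.
    quartic = sum(c[j] * v[j] ** 4 + d[j] * h[j] ** 4 for j in range(n))
    return -interaction - linear + quadratic + quartic
```

The double loop costs $O(p^{2l})$ operations per evaluation; a practical implementation would likely vectorize the interaction term, which this sketch deliberately does not attempt.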

In the general case, $E_{l}(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta})$ is a $p$-adic analogue of the $\{\boldsymbol{v},\boldsymbol{h}\}^{4}$ neural networks introduced in [5]. In [5], each hidden state $h_{j}$ and visible state $v_{i}$ interact through a weight $w_{ij}$. Therefore, that model requires $(\#G_{l})^{2}$ weights, whereas our counterpart requires only $\#G_{l}$ weights. See Section 6 for further discussion.
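The difference in the number of weights can be tabulated directly. The sketch below assumes that $N$ in the abstract stands for $\#G_{l}=p^{l}$, the number of units per layer; the function name `weight_counts` is hypothetical.

```python
def weight_counts(p, l):
    """Number of weights w for layers indexed by G_l, with #G_l = p^l."""
    n = p ** l
    return {
        "units_per_layer": n,
        "convolutional_weights": n,   # one weight w_k per k in G_l
        "dense_weights": n * n,       # one weight w_{ij} per pair (i, j)
    }
```

For $p=3$ and $l=6$ this gives $729$ convolutional weights against $531441$ weights for a fully connected interaction.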

We now attach to the discrete energy functional a $p$-adic discrete BM. For any visible and hidden states, $\boldsymbol{v}=[v_{i}]_{i\in G_{l}}$ and $\boldsymbol{h}=[h_{i}]_{i\in G_{l}}$, the Boltzmann probability distribution attached to $E_{l}(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta})$ is given by

$$\boldsymbol{P}_{l}(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta})=\frac{\exp\left(-E_{l}(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta})\right)}{\sum_{\boldsymbol{v},\boldsymbol{h}}\exp\left(-E_{l}(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta})\right)}.$$

When there is no risk of confusion, we will omit $\boldsymbol{\theta}$ in the notations. When $c_{j}=d_{j}=0$ for all $j\in G_{l}$, the energy functional $E_{l}(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta})$ corresponds to a $p$-adic analogue of the convolutional deep belief networks studied in [38].
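For very small trees the Boltzmann distribution can be computed by brute-force enumeration, which is a useful sanity check. The sketch below assumes binary states and the bilinear energy with $c_{j}=d_{j}=0$ and no quadratic term; indices are integers modulo $n=p^{l}$, and the names `binary_energy` and `boltzmann_distribution` are hypothetical. The enumeration has $4^{n}$ terms, so it is only feasible for toy sizes.

```python
import itertools
import math

def binary_energy(v, h, w, a, b):
    """Bilinear convolutional energy for binary states indexed mod n."""
    n = len(v)
    inter = sum(w[k] * v[(j + k) % n] * h[j]
                for j in range(n) for k in range(n))
    return (-inter
            - sum(a[j] * v[j] for j in range(n))
            - sum(b[j] * h[j] for j in range(n)))

def boltzmann_distribution(w, a, b):
    """Map each pair (v, h) of binary tuples to its Boltzmann probability."""
    n = len(w)
    states = list(itertools.product((0, 1), repeat=n))
    weights = {(v, h): math.exp(-binary_energy(v, h, w, a, b))
               for v in states for h in states}
    z = sum(weights.values())  # partition function of the discrete model
    return {state: x / z for state, x in weights.items()}
```

With all parameters equal to zero, every one of the $4^{n}$ configurations receives the same probability.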

We note that the energy functional $E_{l}$ has translational symmetry, i.e., $E_{l}$ is invariant under the transformations $j\rightarrow j+j_{0}$, $k\rightarrow k+k_{0}$, for any $j_{0}$, $k_{0}\in G_{l}$. This transformation is well-defined since $G_{l}$ is an additive group. In the case of applications to image processing, the group property also implies that the convolution operation does not alter image dimensions. The convolutional $p$-adic discrete BMs introduced here are a specific type of deep Boltzmann machines (DBMs) (also called deep belief networks, DBNs).
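One way to see the role of the group structure numerically is to check that the interaction term does not change when the visible and hidden states are shifted by the same element $j_{0}$, and that the group convolution returns as many values as it receives. This is one reading of the symmetry, sketched under the assumption that indices are integers modulo $n=p^{l}$; the helpers `convolve`, `interaction`, and `shift` are hypothetical.

```python
def convolve(v, w):
    """Group convolution (w * v)_j = sum_k w_k v_{j+k}; same length as v."""
    n = len(v)
    return [sum(w[k] * v[(j + k) % n] for k in range(n)) for j in range(n)]

def interaction(v, h, w):
    """Interaction term -sum_j sum_k w_k v_{j+k} h_j of the discrete energy."""
    conv = convolve(v, w)
    return -sum(conv[j] * h[j] for j in range(len(h)))

def shift(s, j0):
    """Translate a state by j0: the new value at j is the old value at j + j0."""
    n = len(s)
    return [s[(j + j0) % n] for j in range(n)]
```

Because the sum over $j$ runs over the whole group, re-indexing by $j_{0}$ only permutes its terms, which is what the check below relies on.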

5 Experimental Results

We implement a $p$-adic discrete Boltzmann machine for processing binary images; thus $\boldsymbol{v},\boldsymbol{h}:\mathbb{Z}_{p}\rightarrow\{0,1\}\subset\mathbb{R}$. In this case, we use the following energy functional:

$$E_{l}(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta})=-\sum_{j\in G_{l}}\sum_{\substack{k\in G_{l}\\|k|_{p}\leq p^{-N}}}w_{k}v_{j+k}h_{j}-\sum_{j\in G_{l}}a_{j}v_{j}-\sum_{j\in G_{l}}b_{j}h_{j},$$

for some natural number $N$ with $0\leq N\leq l$. Note that, compared with Equation (1), the quadratic and biquadratic terms are omitted since they do not play any role in the case in which $\boldsymbol{v},\boldsymbol{h}$ are binary variables. The condition $N\leq l$ implies that the convolution operation is restricted to a small neighborhood of radius $p^{-N}$ for each pixel. The condition $N=0$ means that the convolution involves all the points in the image.
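The restriction $|k|_{p}\leq p^{-N}$ simply selects the elements of $G_{l}$ divisible by $p^{N}$. A short sketch, assuming integer representatives $0,\ldots,p^{l}-1$ and a hypothetical helper name `kernel_support`:

```python
def kernel_support(p, l, N):
    """Elements k of G_l with |k|_p <= p^(-N), i.e. k divisible by p^N."""
    if not 0 <= N <= l:
        raise ValueError("N must satisfy 0 <= N <= l")
    return [k for k in range(p ** l) if k % (p ** N) == 0]
```

The support has $p^{l-N}$ elements, so for $p=3$, $l=6$, $N=3$ the kernel $w$ carries $27$ weights, and for $N=0$ it carries all $729$.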

Our numerical experiment is based on the MNIST dataset, where each image is considered as a sample of the visible state. Our first task is to train the network to maximize the log-likelihood of the visible states. We choose $p=3$ and $l=6$ since the image dimension is $3^{3}\times 3^{3}$. In general, $p$ and $l$ depend on the size of the images to be processed. Typically $p$ is chosen as a small prime number, e.g., $2$ or $3$.

To tune the parameters $a_{j}$, $b_{j}$, $w_{j}$, $j\in G_{6}$ of the network, we first transform each image $I$ into a test function $\text{Test}(I)\in\mathcal{D}^{6}(\mathbb{Z}_{3})$. The test function $\text{Test}(I)$ is defined in terms of the tree structure of $G_{6}$ in the following way: we define $I$ as the root of the tree. Next, we divide $I$ into three equal horizontal slices (sub-images). These sub-images are the vertices at level $1$, and they are the children of $I$. Each sub-image $I_{j}$ at level $1$ is then divided vertically into $3$ sub-images; these are the children of $I_{j}$. All the $3^{2}$ new sub-images correspond to the vertices at level $2$. We repeat this process until reaching level $6$. At level $6$, each vertex corresponds to a pixel, and we denote its value by $I_{i}$ for $i\in G_{6}$. Then $\text{Test}(I)$ is defined as $\sum_{i\in G_{6}}I_{i}\,\Omega(3^{6}|x-i|_{3})$. See Figure 3 for the construction of the tree corresponding to a $3^{2}\times 3^{2}$ image. Figure 4 shows the graph of a test function corresponding to an image from the MNIST data set. For further details, the reader may consult the Appendix in [18].
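The alternating horizontal and vertical subdivision can be turned into an explicit pixel-to-index map. The sketch below is one possible reading of the construction, with assumptions stated plainly: slices are numbered from top to bottom and from left to right, odd levels split rows and even levels split columns, and the digit chosen at level $k$ is the coefficient of $p^{k-1}$. The names `pixel_index` and `image_to_test_function` are hypothetical and are not taken from the paper.

```python
def pixel_index(row, col, p=3, l=6):
    """Index i in G_l of pixel (row, col) in a p^(l/2) x p^(l/2) image."""
    if l % 2 != 0:
        raise ValueError("this sketch assumes an even number of levels")
    half = l // 2
    # Base-p digits of row and col, most significant first: the coarsest
    # slice is chosen at the first subdivision.
    r_digits = [(row // p ** (half - 1 - t)) % p for t in range(half)]
    c_digits = [(col // p ** (half - 1 - t)) % p for t in range(half)]
    i = 0
    for t in range(half):
        i += r_digits[t] * p ** (2 * t)       # horizontal cut, level 2t + 1
        i += c_digits[t] * p ** (2 * t + 1)   # vertical cut, level 2t + 2
    return i

def image_to_test_function(image, p=3, l=6):
    """Values [I_i] for i in G_l, read from a square image given as rows."""
    side = p ** (l // 2)
    values = [0] * (p ** l)
    for row in range(side):
        for col in range(side):
            values[pixel_index(row, col, p, l)] = image[row][col]
    return values
```

The map is a bijection between the $27\times 27$ pixels and $G_{6}$, so the inverse transformation from a test function back to an image only requires reading the same table in the other direction.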

Figure 3: The construction of the tree corresponding to a $3\times 3$ image.
Figure 4: The processing of data using $p$-adic networks requires the transformation of the actual data to its $p$-adic form. The left image shows a visualization of the test function corresponding to the right image. The visualization uses the embedding $\mathfrak{f}:\mathbb{Z}_{p}\rightarrow\mathbb{R}^{2}$ shown in Figure 2.

We note that in a $p$-adic discrete deep RBM, the visible and hidden states are functions on a finite tree. Only the vertices at the top level, which are marked as orange and blue balls, are allowed to have states. See Figure 5 for the case $p=2$, $l=2$. The remaining vertices of the trees (see the black dots in Figure 5) only codify the hierarchical relation between states. The visible and hidden vertices connected by the same type of lines share the same weight $w$.

We adapted the contrastive divergence learning method to the $p$-adic framework, [48]. The technical details are presented in Section 9.2. We implement two different types of networks. In the first type, the function $w_{k}$ is supported in the entire tree $G_{6}$; in the second type, the function $w_{k}$ is supported in a proper subset of the tree $G_{6}$. We use the full MNIST handwritten digits, without considering labels, to train a six-layer $3$-adic feature detector. The results are shown in Figure 6.
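For orientation, a single step of standard contrastive divergence (CD-1) for the binary convolutional energy can be sketched as follows. This is a generic textbook-style sketch and not the authors' procedure from Section 9.2: the conditional probabilities are derived from the binary energy functional above, indices are integers modulo $n=p^{l}$, `support` is the list of indices $k$ where $w_{k}$ may be nonzero, and all function names and the learning rate `lr` are assumptions made for illustration.

```python
import math
import random

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)

def hidden_probs(v, w, b, support):
    """P(h_j = 1 | v) = sigmoid(b_j + sum_k w_k v_{j+k})."""
    n = len(v)
    return [sigmoid(b[j] + sum(w[k] * v[(j + k) % n] for k in support))
            for j in range(n)]

def visible_probs(h, w, a, support):
    """P(v_i = 1 | h) = sigmoid(a_i + sum_k w_k h_{i-k})."""
    n = len(h)
    return [sigmoid(a[i] + sum(w[k] * h[(i - k) % n] for k in support))
            for i in range(n)]

def sample(probs, rng):
    """Draw independent Bernoulli samples with the given probabilities."""
    return [1 if rng.random() < q else 0 for q in probs]

def cd1_step(v0, w, a, b, support, lr, rng):
    """One CD-1 update of w, a, b in place; returns the reconstruction v1."""
    n = len(v0)
    ph0 = hidden_probs(v0, w, b, support)
    h0 = sample(ph0, rng)
    v1 = sample(visible_probs(h0, w, a, support), rng)
    ph1 = hidden_probs(v1, w, b, support)
    for k in support:
        # Data statistics minus reconstruction statistics for weight w_k.
        pos = sum(v0[(j + k) % n] * ph0[j] for j in range(n))
        neg = sum(v1[(j + k) % n] * ph1[j] for j in range(n))
        w[k] += lr * (pos - neg)
    for j in range(n):
        a[j] += lr * (v0[j] - v1[j])
        b[j] += lr * (ph0[j] - ph1[j])
    return v1
```

Weights outside `support` are never touched, which is how the second type of network, with $w_{k}$ supported in a proper subset of the tree, would be expressed in this sketch.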

Figure 5: A $p$-adic RBM for $p=2$, $l=2$.

After the processing by the network ends, it is necessary to transform the test function into an image.

6 Conclusions

The standard RBMs [48] and the $\phi^{4}$-neural networks introduced in [5] are particular classes of $p$-adic discrete BMs. Indeed, if $w(x,y)$ is a test function and the interaction between the visible and hidden field has the form $-\iint_{\mathbb{Z}_{p}\times\mathbb{Z}_{p}}\boldsymbol{h}(y)w(x,y)\boldsymbol{v}(x)\,dx\,dy$, then in the corresponding discrete energy functional, after a suitable rescaling of the weights $w_{k,j}$, the interaction of the visible and hidden states takes the form $-\sum_{j\in G_{l}}\sum_{k\in G_{l}}w_{k,j}v_{k}h_{j}$. Here $[w_{k,j}]$ is an ordinary matrix, which means that its entries $w_{k,j}$ depend neither on the algebraic structure of $G_{l}$ nor on its topology.

In the case in which the interaction between the visible and hidden field has the form $-\iint_{\mathbb{Z}_{p}\times\mathbb{Z}_{p}}\boldsymbol{h}(y)w(x-y)\boldsymbol{v}(x)\,dx\,dy$, the corresponding discrete energy functional depends on the group structure of $G_{l}$, and the corresponding neural network is a particular case of a DBM, see Figure 5.

The condition $l\geq l_{0}$, for some constant $l_{0}$, means that a $p$-adic discrete BM admits arbitrarily large copies; this is a consequence of the fact that the $p$-adic numbers have a tree-like structure. We expect that for $l$ sufficiently large the statistical properties of the network can be studied using a $p$-adic continuous SFT.

Our numerical experiments show that $p$-adic discrete convolutional deep BMs alone can be used to process real data; this opens the possibility of using these networks as layers in specialized NNs.

In [12] the first author conjectured that the limit

$$\frac{e^{-E(\boldsymbol{v},\boldsymbol{h})}}{Z^{\text{phys}}}\,d\boldsymbol{v}\,d\boldsymbol{h}=\lim_{l\rightarrow\infty}\boldsymbol{P}_{l}(\boldsymbol{v},\boldsymbol{h})\,d^{\#G_{l}}\boldsymbol{v}\,d^{\#G_{l}}\boldsymbol{h}\tag{2}$$

exists in some sense; here $d^{\#G_{l}}\boldsymbol{x}$ denotes the Lebesgue measure of $\mathbb{R}^{\#G_{l}}$. Then the correlation between the network activity in different regions of the underlying tree $G_{l}$ can be understood by computing the correlation functions of the corresponding continuous SFT.

For practical applications the NNs should be discrete entities. This type of NN naturally corresponds to discrete SFTs. To use Euclidean QFT to study NNs, it is convenient to have continuous versions of these networks. Thus, a clear way of passing from discrete SFTs to continuous ones is required. The existence of the limit (2) is a very difficult problem in classical QFT. In [31] the existence of this limit was established for $p$-adic $\phi^{4}$-theories involving one scalar field; we expect that these techniques can be extended to the QFTs considered here.

It is widely accepted in the artificial intelligence community that the probability distributions $\boldsymbol{P}_{l}(\boldsymbol{v},\boldsymbol{h})$ should approximate any finite probability distribution very well, which means that the corresponding NNs are universal approximators. We argue that this property is connected with the topology and the structure of the NNs, and that the problem of designing good architectures for NNs is out of the scope of QFT. We expect that QFT techniques will be useful to understand the qualitative behavior of large NNs, which can be well-approximated as ‘continuous’ NNs.

The study of the correspondence between $p$-adic Euclidean QFTs and NNs is just starting. We envision that the next step is to develop perturbative calculations of the correlation functions via Feynman diagrams, to study connections with Ginzburg–Landau theory, and to develop practical applications of the $p$-adic convolutional BMs.

Figure 6: (a) is the original input image. (b) is the reconstructed image using Gibbs sampling with $w$ having $\mathbb{Z}_{3}$ as support. (c) is the reconstructed image with $w$ supported in $3^{3}\mathbb{Z}_{3}$. (d) is the reconstructed image with $w$ supported in $3^{4}\mathbb{Z}_{3}$.

7 Appendix A: Basic facts on $p$-adic analysis

In this section we review some basic results on $p$-adic analysis required in this article. For a detailed exposition on $p$-adic analysis the reader may consult [20], [43]-[45]. For a quick review of $p$-adic analysis the reader may consult [46].

7.1 The field of $p$-adic numbers

The field of $p$-adic numbers $\mathbb{Q}_{p}$ is defined as the completion of the field of rational numbers $\mathbb{Q}$ with respect to the $p$-adic norm $|\cdot|_{p}$, which is defined as

$$|x|_{p}=\begin{cases}0&\text{if }x=0\\ p^{-\gamma}&\text{if }x=p^{\gamma}\dfrac{a}{b},\end{cases}\tag{3}$$

where $a$ and $b$ are integers coprime with $p$. The integer $\gamma=ord_{p}(x):=ord(x)$, with $ord(0):=+\infty$, is called the $p$-adic order of $x$. The metric space $\left(\mathbb{Q}_{p},|\cdot|_{p}\right)$ is a complete ultrametric space. Ultrametric means that $|x+y|_{p}\leq\max\{|x|_{p},|y|_{p}\}$. As a topological space $\mathbb{Q}_{p}$ is homeomorphic to a Cantor-like subset of the real line, see, e.g., [20], [37], [43].

Any $p$-adic number $x\neq 0$ has a unique expansion of the form

$$x=p^{ord(x)}\sum_{j=0}^{\infty}x_{j}p^{j},$$

where $x_{j}\in\{0,1,2,\dots,p-1\}$ and $x_{0}\neq 0$. In addition, for any $x\in\mathbb{Q}_{p}\smallsetminus\{0\}$ we have

$$|x|_{p}=p^{-ord(x)}.$$

7.2 Topology of $\mathbb{Q}_{p}$

For $r\in\mathbb{Z}$, denote by $B_{r}(a)=\{x\in\mathbb{Q}_{p};|x-a|_{p}\leq p^{r}\}$ the ball of radius $p^{r}$ with center at $a\in\mathbb{Q}_{p}$, and take $B_{r}(0):=B_{r}$. The ball $B_{0}$ equals the ring of $p$-adic integers $\mathbb{Z}_{p}$. The balls are both open and closed subsets in $\mathbb{Q}_{p}$. We use $\Omega\left(p^{-r}|x-a|_{p}\right)$ to denote the characteristic function of the ball $B_{r}(a)$. Two balls in $\mathbb{Q}_{p}$ are either disjoint or one is contained in the other. As a topological space $\left(\mathbb{Q}_{p},|\cdot|_{p}\right)$ is totally disconnected, i.e., the only connected subsets of $\mathbb{Q}_{p}$ are the empty set and the points. A subset of $\mathbb{Q}_{p}$ is compact if and only if it is closed and bounded in $\mathbb{Q}_{p}$, see e.g. [20, Section 1.3], or [43, Section 1.8]. The balls and spheres are compact subsets. Thus $\left(\mathbb{Q}_{p},|\cdot|_{p}\right)$ is a locally compact topological space.

7.2.1 Tree-like structures

Any nonzero $p$-adic integer $i$ admits an expansion of the form $i=i_{k}p^{k}+i_{k+1}p^{k+1}+\ldots$ for some $k\geq 0$, $i_{k}\neq 0$. The set of $p$-adic truncated integers modulo $p^{l}$, $l\geq 1$, consists of all the integers of the form $i=i_{0}+i_{1}p+\ldots+i_{l-1}p^{l-1}$. These numbers form a complete set of representatives for the elements of the additive group $G_{l}=\mathbb{Z}_{p}/p^{l}\mathbb{Z}_{p}$, which is isomorphic to the group $\mathbb{Z}/p^{l}\mathbb{Z}$ of integers modulo $p^{l}$ (written in base $p$). By restricting $\left|\cdot\right|_{p}$ to $G_{l}$, it becomes a normed space, and $\left|G_{l}\right|_{p}=\left\{0,p^{-\left(l-1\right)},\cdots,p^{-1},1\right\}$. With the metric induced by $\left|\cdot\right|_{p}$, $G_{l}$ becomes a finite ultrametric space. In addition, $G_{l}$ can be identified with the set of branches (vertices at the top level) of a rooted tree with $l+1$ levels and $p^{l}$ branches. By definition, the root of the tree is the only vertex at level $0$. There are exactly $p$ vertices at level $1$, which correspond to the possible values of the digit $i_{0}$ in the $p$-adic expansion of $i$. Each of these vertices is connected to the root by a non-directed edge. At level $k$, with $2\leq k\leq l$, there are exactly $p^{k}$ vertices; each vertex corresponds to a truncated expansion of $i$ of the form $i_{0}+\cdots+i_{k-1}p^{k-1}$. The vertex corresponding to $i_{0}+\cdots+i_{k-1}p^{k-1}$ is connected to a vertex $i_{0}^{\prime}+\cdots+i_{k-2}^{\prime}p^{k-2}$ at level $k-1$ if and only if $\left(i_{0}+\cdots+i_{k-1}p^{k-1}\right)-\left(i_{0}^{\prime}+\cdots+i_{k-2}^{\prime}p^{k-2}\right)$ is divisible by $p^{k-1}$. See Figure 1. The balls $B_{-r}(a)=a+p^{r}\mathbb{Z}_{p}$ are infinite rooted trees.
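The tree description of $G_{l}$ can be made concrete with a few lines of code. This is an illustrative sketch only: the representatives of $G_{l}$ are assumed to be the ordinary integers $0,\ldots,p^{l}-1$, and the helper names `digits`, `vertex_at_level`, and `dist` are hypothetical.

```python
from fractions import Fraction


def digits(i, p, l):
    """Base-p digits i_0, ..., i_{l-1} of a representative i of G_l."""
    return [(i // p**k) % p for k in range(l)]


def vertex_at_level(i, p, k):
    """Vertex at level k on the path from the root to branch i.

    It is the truncated expansion i_0 + ... + i_{k-1} p^(k-1), i.e. i mod p^k.
    """
    return i % p**k


def dist(i, j, p, l):
    """p-adic distance |i - j|_p between two elements of G_l."""
    d = (i - j) % p**l
    if d == 0:
        return Fraction(0)
    k = 0
    while d % p == 0:        # count how many times p divides the difference
        d //= p
        k += 1
    return Fraction(1, p**k)
```

Under these conventions, two branches pass through the same vertex at level $k$ exactly when their distance is at most $p^{-k}$, which is the ultrametric reading of the tree.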

7.3 The Haar measure

Since $(\mathbb{Z}_{p},+)$ is a locally compact topological group, there exists a Haar measure $dx$, which is invariant under translations, i.e., $d(x+a)=dx$, [47]. If we normalize this measure by the condition $\int_{\mathbb{Z}_{p}}dx=1$, then $dx$ is unique. It follows immediately that

$$\int_{B_{r}(a)}dx=\int_{a+p^{-r}\mathbb{Z}_{p}}dx=p^{r}\int_{\mathbb{Z}_{p}}dy=p^{r},\quad r\in\mathbb{Z}.$$
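For balls contained in $\mathbb{Z}_{p}$, the volume formula can be checked by counting cosets. The sketch below assumes $0\leq r\leq l$ and computes the measure of $a+p^{r}\mathbb{Z}_{p}$ as the fraction of elements of $\mathbb{Z}/p^{l}\mathbb{Z}$ lying in it; the function name `ball_measure` is illustrative.

```python
from fractions import Fraction


def ball_measure(a, r, p, l):
    """Haar measure of the ball a + p^r Z_p, for 0 <= r <= l.

    The ball is a union of cosets of p^l Z_p, each of measure p^(-l),
    so counting the residues i mod p^l with i = a mod p^r is enough.
    """
    if not 0 <= r <= l:
        raise ValueError("this sketch only covers balls inside Z_p with r <= l")
    hits = sum(1 for i in range(p**l) if (i - a) % p**r == 0)
    return Fraction(hits, p**l)
```

In the notation above, $a+p^{r}\mathbb{Z}_{p}=B_{-r}(a)$, so the expected value is $p^{-r}$, independently of the center $a$ and of the auxiliary level $l$.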

On a few occasions we use the two-dimensional Haar measure $dxdy$ of the additive group $(\mathbb{Z}_{p}\times\mathbb{Z}_{p},+)$, normalized by the condition $\int_{\mathbb{Z}_{p}}\int_{\mathbb{Z}_{p}}dxdy=1$. For a quick review of integration in the $p$-adic framework, the reader may consult [46] and the references therein.

7.4 The Bruhat-Schwartz space in the unit ball

A real-valued function $\varphi$ defined on $\mathbb{Z}_{p}$ is called a Bruhat-Schwartz function (or a test function) if for any $x\in\mathbb{Z}_{p}$ there exists an integer $l(x)\in\mathbb{Z}$ such that

$$\varphi(x+x^{\prime})=\varphi(x)\text{ for any }x^{\prime}\in B_{l(x)}. \tag{4}$$

The $\mathbb{R}$-vector space of Bruhat-Schwartz functions supported in the unit ball is denoted by $\mathcal{D}(\mathbb{Z}_{p})$. For $\varphi\in\mathcal{D}(\mathbb{Z}_{p})$, the largest number $l=l(\varphi)$ satisfying (4) is called the exponent of local constancy (or the parameter of constancy) of $\varphi$. A function $\varphi$ in $\mathcal{D}(\mathbb{Z}_{p})$ can be written as

$$\varphi\left(x\right)=\sum_{j=1}^{M}\varphi\left(\widetilde{x}_{j}\right)\Omega\left(p^{r_{j}}\left|x-\widetilde{x}_{j}\right|_{p}\right),$$

where the $\widetilde{x}_{j}$, $j=1,\ldots,M$, are points in $\mathbb{Z}_{p}$, the $r_{j}$, $j=1,\ldots,M$, are integers, and $\Omega\left(p^{r_{j}}\left|x-\widetilde{x}_{j}\right|_{p}\right)$ denotes the characteristic function of the ball $B_{-r_{j}}(\widetilde{x}_{j})=\widetilde{x}_{j}+p^{r_{j}}\mathbb{Z}_{p}$.

8 Appendix B: Discretization of the energy functional

The elements of $G_{l}=\mathbb{Z}_{p}/p^{l}\mathbb{Z}_{p}$, $l\geq 1$, have the form $i=i_{0}+i_{1}p+\ldots+i_{l-1}p^{l-1}$, where the $i_{k}$s are $p$-adic digits. We denote by $\mathcal{D}^{l}(\mathbb{Z}_{p})$ the $\mathbb{R}$-vector space of all test functions of the form

$$\varphi\left(x\right)=\sum_{i\in G_{l}}\varphi\left(i\right)\Omega\left(p^{l}\left|x-i\right|_{p}\right),\quad\varphi\left(i\right)\in\mathbb{R},$$

where $\Omega\left(p^{l}\left|x-i\right|_{p}\right)$ is the characteristic function of the ball $i+p^{l}\mathbb{Z}_{p}$. Notice that $\varphi$ is supported on $\mathbb{Z}_{p}$ and that $\mathcal{D}^{l}(\mathbb{Z}_{p})$ is a finite dimensional vector space spanned by the basis $\left\{\Omega\left(p^{l}\left|x-i\right|_{p}\right)\right\}_{i\in G_{l}}$.
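An element of $\mathcal{D}^{l}(\mathbb{Z}_{p})$ is therefore just a list of $p^{l}$ real values, one per ball $i+p^{l}\mathbb{Z}_{p}$. The sketch below is an illustration under two assumptions: a $p$-adic integer is represented by an ordinary integer $x$, and $G_{l}$ by the residues $0,\ldots,p^{l}-1$. The names `make_test_function` and `integrate` are hypothetical.

```python
from fractions import Fraction


def make_test_function(values, p, l):
    """Return phi in D^l(Z_p) with phi(i) = values[i] for i = 0, ..., p^l - 1."""
    modulus = p**l
    if len(values) != modulus:
        raise ValueError("need exactly p^l values, one per ball i + p^l Z_p")

    def phi(x):
        # x lies in the ball i + p^l Z_p exactly when x = i mod p^l,
        # so phi is constant on each of these balls.
        return values[x % modulus]

    return phi


def integrate(values, p, l):
    """Integral of phi over Z_p: each ball i + p^l Z_p has Haar measure p^(-l)."""
    return Fraction(1, p**l) * sum(Fraction(v) for v in values)
```

With $p=2$, $l=2$ and values `[1, -2, 0, 3]`, the function takes the value $-2$ at $x=5$ (since $5\equiv 1 \bmod 4$) and its integral is $\tfrac{1}{4}(1-2+0+3)=\tfrac{1}{2}$.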

By identifying $\varphi\in\mathcal{D}^{l}(\mathbb{Z}_{p})$ with the column vector $\left[\varphi\left(i\right)\right]_{i\in G_{l}}\in\mathbb{R}^{\#G_{l}}$, we get that $\mathcal{D}^{l}(\mathbb{Z}_{p})$ is isomorphic to $\mathbb{R}^{\#G_{l}}$ endowed with the norm $\left\|\left[\varphi\left(i\right)\right]_{i\in G_{l}}\right\|=\max_{i\in G_{l}}\left|\varphi\left(i\right)\right|$. Furthermore,

$$\mathcal{D}^{l}\hookrightarrow\mathcal{D}^{l+1}\hookrightarrow\mathcal{D}(\mathbb{Z}_{p}),$$

where $\hookrightarrow$ denotes a continuous embedding.

The restriction of $E$ to the subspace $\mathcal{D}^{l}(\mathbb{Z}_{p})$ gives a discretization of $E$ denoted as $E_{l}$. Indeed, by assuming that

$$\begin{aligned}
\boldsymbol{v}\left(x\right) &=\sum_{i\in G_{l}}\boldsymbol{v}\left(i\right)\Omega\left(p^{l}\left|x-i\right|_{p}\right),\\
\boldsymbol{h}\left(x\right) &=\sum_{i\in G_{l}}\boldsymbol{h}\left(i\right)\Omega\left(p^{l}\left|x-i\right|_{p}\right),
\end{aligned}$$

we have

$$\begin{aligned}
&\iint_{\mathbb{Z}_{p}\times\mathbb{Z}_{p}}w\left(x-y\right)\Omega\left(p^{l}\left|x-i\right|_{p}\right)\Omega\left(p^{l}\left|y-j\right|_{p}\right)dxdy\\
&=\iint_{p^{l}\mathbb{Z}_{p}\times p^{l}\mathbb{Z}_{p}}w((i-j)+(x-y))dxdy,
\end{aligned}$$

and by using that $a\left(x\right)$, $b(x)$, $c(x)$, $d(x)$ are test functions supported in the unit ball, and taking $l$ sufficiently large, we have

$$\begin{aligned}
a\left(x\right)\Omega\left(p^{l}\left|x-i\right|_{p}\right) &=a\left(i\right)\Omega\left(p^{l}\left|x-i\right|_{p}\right), &
b(x)\Omega\left(p^{l}\left|x-i\right|_{p}\right) &=b(i)\Omega\left(p^{l}\left|x-i\right|_{p}\right),\\
c\left(x\right)\Omega\left(p^{l}\left|x-i\right|_{p}\right) &=c\left(i\right)\Omega\left(p^{l}\left|x-i\right|_{p}\right), &
d(x)\Omega\left(p^{l}\left|x-i\right|_{p}\right) &=d(i)\Omega\left(p^{l}\left|x-i\right|_{p}\right),
\end{aligned}$$

and consequently

$$\begin{aligned}
E_{l}\left(\boldsymbol{v},\boldsymbol{h}\right)={}&-\sum_{\substack{i,j\in G_{l}\\ i\neq j}}\boldsymbol{v}\left(i\right)\boldsymbol{h}\left(j\right)\left(\iint_{p^{l}\mathbb{Z}_{p}\times p^{l}\mathbb{Z}_{p}}w(i-j+x-y)dxdy\right)\\
&-p^{-l}\sum_{i\in G_{l}}a\left(i\right)\boldsymbol{v}\left(i\right)-p^{-l}\sum_{i\in G_{l}}b\left(i\right)\boldsymbol{h}\left(i\right)+\frac{ep^{-l}}{2}\sum_{i\in G_{l}}\boldsymbol{v}^{2}\left(i\right)+\frac{ep^{-l}}{2}\sum_{i\in G_{l}}\boldsymbol{h}^{2}\left(i\right)\\
&+\frac{p^{-l}}{2}\sum_{i\in G_{l}}c\left(i\right)\boldsymbol{v}^{4}\left(i\right)+\frac{p^{-l}}{2}\sum_{i\in G_{l}}d\left(i\right)\boldsymbol{h}^{4}\left(i\right).
\end{aligned}$$

We take $v_{i}=\boldsymbol{v}\left(i\right)$, $h_{i}=\boldsymbol{h}\left(i\right)$,

$$w_{i-j}=\iint_{p^{l}\mathbb{Z}_{p}\times p^{l}\mathbb{Z}_{p}}w((i-j)+(x-y))dxdy,$$

$a_{i}=p^{-l}a\left(i\right)$, $b_{i}=p^{-l}b\left(i\right)$, $c_{i}=p^{-l}c\left(i\right)$, $d_{i}=p^{-l}d\left(i\right)$, for $i$, $j\in G_{l}$, and $\boldsymbol{\theta}=\left\{w_{ij},a_{i},b_{i},c_{i},d_{i}\right\}$. We also rescale $e$ to $ep^{l}$; then

$$\begin{aligned}
E_{l}\left(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta}\right)={}&-\sum_{i,j\in G_{l}}w_{i-j}v_{i}h_{j}-\sum_{i\in G_{l}}a_{i}v_{i}-\sum_{i\in G_{l}}b_{i}h_{i}\\
&+\frac{e}{2}\sum_{i\in G_{l}}v_{i}^{2}+\frac{e}{2}\sum_{i\in G_{l}}h_{i}^{2}+\sum_{i\in G_{l}}c_{i}v_{i}^{4}+\sum_{i\in G_{l}}d_{i}h_{i}^{4}.
\end{aligned}$$

We now recall that $G_{l}$ is an additive group; setting $k=i-j$, we get

$$\sum_{i,j\in G_{l}}w_{i-j}v_{i}h_{j}=\sum_{j\in G_{l}}\sum_{k\in G_{l}}w_{k}v_{j+k}h_{j},$$

and consequently

$$\begin{aligned}
E_{l}\left(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta}\right)={}&-\sum_{j\in G_{l}}\sum_{k\in G_{l}}w_{k}v_{j+k}h_{j}-\sum_{j\in G_{l}}a_{j}v_{j}-\sum_{j\in G_{l}}b_{j}h_{j}\\
&+\frac{e}{2}\sum_{j\in G_{l}}v_{j}^{2}+\frac{e}{2}\sum_{j\in G_{l}}h_{j}^{2}+\sum_{j\in G_{l}}c_{j}v_{j}^{4}+\sum_{j\in G_{l}}d_{j}h_{j}^{4}.
\end{aligned}$$
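The re-indexing used above, and the resulting convolutional energy, can be sketched numerically. This is an illustration under the assumption that $G_{l}$ is identified with $\mathbb{Z}/N\mathbb{Z}$, $N=p^{l}$, so that index arithmetic is taken modulo $N$; the function names are hypothetical and the parameters are passed as plain lists of length $N$.

```python
def pairwise_form(v, h, w):
    """Sum over i, j in G_l of w_{i-j} v_i h_j, with indices modulo N."""
    N = len(v)
    return sum(w[(i - j) % N] * v[i] * h[j] for i in range(N) for j in range(N))


def convolution_form(v, h, w):
    """Sum over j, k in G_l of w_k v_{j+k} h_j (same quantity, with k = i - j)."""
    N = len(v)
    return sum(w[k] * v[(j + k) % N] * h[j] for j in range(N) for k in range(N))


def energy(v, h, w, a, b, c, d, e):
    """Discrete energy E_l(v, h; theta) in its convolutional form."""
    N = len(v)
    linear = sum(a[j] * v[j] + b[j] * h[j] for j in range(N))
    quadratic = 0.5 * e * sum(v[j] ** 2 + h[j] ** 2 for j in range(N))
    quartic = sum(c[j] * v[j] ** 4 + d[j] * h[j] ** 4 for j in range(N))
    return -convolution_form(v, h, w) - linear + quadratic + quartic
```

Note that the interaction is described by the $N$ numbers $w_{k}$, $k\in G_{l}$, rather than by a full $N\times N$ matrix, which is the source of the $O(N)$ parameter count mentioned in the abstract.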

9 Appendix C: Some probability distributions

From now on, we assume that visible and hidden fields are binary variables. However, most of our mathematical formulation is valid under the assumption that visible and hidden fields are discrete variables. We set

$$\boldsymbol{V}=\{V_{1},V_{2},\ldots,V_{N}\},\quad\boldsymbol{H}=\{H_{1},H_{2},\ldots,H_{N}\},$$

$N=p^{l}$, to be the respective visible and hidden random variable sets. The random variables $(\boldsymbol{V},\boldsymbol{H})$ take values $(\boldsymbol{v},\boldsymbol{h})\in\{0,1\}^{2N}$. For the sake of simplicity, we will identify the random vector $\boldsymbol{V}$ with $\boldsymbol{v}$, and the random vector $\boldsymbol{H}$ with $\boldsymbol{h}$. We identify $G_{l}$ with the set of branches (vertices at the top level) of a rooted tree with $l+1$ levels and $p^{l}$ branches. Attached to each branch $i\in G_{l}$ there are two states: $v_{i}$, $h_{i}$. With this notation, $\boldsymbol{v}=\left[v_{i}\right]_{i\in G_{l}}$ is a realization of the visible field, and $\boldsymbol{h}=\left[h_{i}\right]_{i\in G_{l}}$ is a realization of the hidden field. The joint distribution of the random vectors $(\boldsymbol{v},\boldsymbol{h})$ is given by the following Boltzmann probability distribution:

$$\boldsymbol{P}_{l}(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta})=\frac{\exp\left(-E_{l}\left(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta}\right)\right)}{Z_{l}}, \tag{5}$$

where

$$Z_{l}=\sum_{\boldsymbol{v},\boldsymbol{h}}\exp\left(-E_{l}\left(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta}\right)\right),$$

and $E_{l}$ is defined in Equation 1.

By using the joint distribution of the visible field and the hidden field (5), we compute the marginal probability distributions as follows:

$$\begin{aligned}
\boldsymbol{P}_{l}(\boldsymbol{v};\boldsymbol{\theta}) &=\sum_{\boldsymbol{h}}\boldsymbol{P}_{l}(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta})=\frac{\sum_{\boldsymbol{h}}\exp\left(-E_{l}\left(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta}\right)\right)}{\sum_{\boldsymbol{v},\boldsymbol{h}}\exp\left(-E_{l}\left(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta}\right)\right)},\\
\boldsymbol{P}_{l}(\boldsymbol{h};\boldsymbol{\theta}) &=\sum_{\boldsymbol{v}}\boldsymbol{P}_{l}(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta})=\frac{\sum_{\boldsymbol{v}}\exp\left(-E_{l}\left(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta}\right)\right)}{\sum_{\boldsymbol{v},\boldsymbol{h}}\exp\left(-E_{l}\left(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta}\right)\right)}.
\end{aligned}$$

Since we are assuming that $\boldsymbol{v},\boldsymbol{h}$ are binary, the energy functional takes the form

$$E_{l}(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta})=-\sum_{k\in G_{l}}\sum_{j\in G_{l}}w_{k}v_{j+k}h_{j}-\sum_{j\in G_{l}}a_{j}v_{j}-\sum_{j\in G_{l}}b_{j}h_{j}.$$

The classical RBM has the advantage of independence between the visible units as well as between the hidden units. The $p$-adic BM shares the same advantage, i.e., by fixing the hidden field $\boldsymbol{h}$, the random variables $v_{i}$, $i\in G_{l}$, become independent. An analogous assertion is valid if we fix the visible field $\boldsymbol{v}$. More precisely, the conditional probability distributions satisfy

$$\boldsymbol{P}_{l}(\boldsymbol{v}\mid\boldsymbol{h};\boldsymbol{\theta})=\prod_{j\in G_{l}}\boldsymbol{P}_{l}\left(v_{j}\mid\boldsymbol{h};\boldsymbol{\theta}\right),\quad\text{and}\quad\boldsymbol{P}_{l}(\boldsymbol{h}\mid\boldsymbol{v};\boldsymbol{\theta})=\prod_{j\in G_{l}}\boldsymbol{P}_{l}\left(h_{j}\mid\boldsymbol{v};\boldsymbol{\theta}\right).$$

Indeed, by direct computation, we have

$$\begin{aligned}
\boldsymbol{P}_{l}(\boldsymbol{v}\mid\boldsymbol{h};\boldsymbol{\theta}) &=\frac{\boldsymbol{P}_{l}(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta})}{\boldsymbol{P}_{l}(\boldsymbol{h};\boldsymbol{\theta})}\\
&=\frac{\exp\left(-E_{l}\left(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta}\right)\right)}{\sum_{\boldsymbol{v}}\exp\left(-E_{l}\left(\boldsymbol{v},\boldsymbol{h};\boldsymbol{\theta}\right)\right)}=\frac{\prod_{j\in G_{l}}\exp\left(\sum_{k\in G_{l}}w_{k}v_{j}h_{j-k}+a_{j}v_{j}\right)}{\sum_{\boldsymbol{v}}\prod_{j\in G_{l}}\exp\left(\sum_{k\in G_{l}}w_{k}v_{j}h_{j-k}+a_{j}v_{j}\right)}\\
&=\prod_{j\in G_{l}}\frac{\exp\left(v_{j}\sum_{k\in G_{l}}w_{k}h_{j-k}+a_{j}v_{j}\right)}{\sum_{v_{j}}\exp\left(v_{j}\sum_{k\in G_{l}}w_{k}h_{j-k}+a_{j}v_{j}\right)}=\prod_{j\in G_{l}}\frac{\boldsymbol{P}_{l}\left(v_{j},\boldsymbol{h};\boldsymbol{\theta}\right)}{\sum_{v_{j}}\boldsymbol{P}_{l}\left(v_{j},\boldsymbol{h};\boldsymbol{\theta}\right)}=\prod_{j\in G_{l}}\boldsymbol{P}_{l}\left(v_{j}\mid\boldsymbol{h};\boldsymbol{\theta}\right).
\end{aligned}$$

Similarly, we can prove that

$$\boldsymbol{P}_{l}(\boldsymbol{h}\mid\boldsymbol{v};\boldsymbol{\theta})=\prod_{j\in G_{l}}\boldsymbol{P}_{l}\left(h_{j}\mid\boldsymbol{v};\boldsymbol{\theta}\right). \tag{6}$$
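For a small tree, the factorization of the conditional distribution can be checked by brute force. The sketch below is only a numerical illustration under the assumption $G_{l}\cong\mathbb{Z}/N\mathbb{Z}$ with binary units; the helper names are hypothetical, and the enumeration over $2^{N}$ states is feasible only for small $N$.

```python
import itertools
import math


def energy(v, h, w, a, b):
    """Binary energy E_l(v, h; theta), with indices taken modulo N = p^l."""
    N = len(v)
    interaction = sum(w[k] * v[(j + k) % N] * h[j] for j in range(N) for k in range(N))
    return -interaction - sum(a[j] * v[j] + b[j] * h[j] for j in range(N))


def conditional_v_given_h(h, w, a, b):
    """P_l(v | h) for every binary v, normalised by enumerating all v."""
    N = len(h)
    states = list(itertools.product((0, 1), repeat=N))
    weights = [math.exp(-energy(v, h, w, a, b)) for v in states]
    total = sum(weights)
    return {v: wt / total for v, wt in zip(states, weights)}


def single_site_marginals(cond, N):
    """P_l(v_j = 1 | h), obtained by summing the conditional over the other sites."""
    return [sum(q for v, q in cond.items() if v[j] == 1) for j in range(N)]


def product_form(v, marginals):
    """Product over j of P_l(v_j | h), built from the single-site marginals."""
    return math.prod(m if vj == 1 else 1.0 - m for vj, m in zip(v, marginals))
```

Comparing `conditional_v_given_h(h, ...)[v]` with `product_form(v, ...)` for every `v` confirms the product formula numerically; the same check applies to $\boldsymbol{P}_{l}(\boldsymbol{h}\mid\boldsymbol{v};\boldsymbol{\theta})$ after exchanging the roles of the two fields.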

9.1 Gradient of the Log-likelihood

The log-likelihood given a single visible state $\boldsymbol{v}$ is given by

$$\ln\boldsymbol{P}_{l}(\boldsymbol{v}|\boldsymbol{\theta})=\ln\dfrac{1}{Z}\sum_{\boldsymbol{h}}e^{-E(\boldsymbol{v},\boldsymbol{h})}=\ln\sum_{\boldsymbol{h}}e^{-E(\boldsymbol{v},\boldsymbol{h})}-\ln\sum_{\boldsymbol{h},\boldsymbol{v}}e^{-E(\boldsymbol{v},\boldsymbol{h})}.$$

Taking the derivative with respect to the parameters gives the following mean-like representation:

$$\begin{split}
&\dfrac{\partial}{\partial\boldsymbol{\theta}}\ln\boldsymbol{P}_{l}(\boldsymbol{v}|\boldsymbol{\theta})=\dfrac{\partial}{\partial\boldsymbol{\theta}}\ln\sum_{\boldsymbol{h}}e^{-E(\boldsymbol{v},\boldsymbol{h})}-\dfrac{\partial}{\partial\boldsymbol{\theta}}\ln\sum_{\boldsymbol{h},\boldsymbol{v}}e^{-E(\boldsymbol{v},\boldsymbol{h})}\\
=&-\dfrac{1}{\sum\limits_{\boldsymbol{h}}e^{-E(\boldsymbol{v},\boldsymbol{h})}}\sum_{\boldsymbol{h}}e^{-E(\boldsymbol{v},\boldsymbol{h})}\dfrac{\partial}{\partial\boldsymbol{\theta}}E(\boldsymbol{v},\boldsymbol{h})+\dfrac{1}{\sum\limits_{\boldsymbol{v},\boldsymbol{h}}e^{-E(\boldsymbol{v},\boldsymbol{h})}}\sum_{\boldsymbol{v},\boldsymbol{h}}e^{-E(\boldsymbol{v},\boldsymbol{h})}\dfrac{\partial}{\partial\boldsymbol{\theta}}E(\boldsymbol{v},\boldsymbol{h})\\
=&-\sum_{\boldsymbol{h}}\boldsymbol{P}_{l}(\boldsymbol{h}|\boldsymbol{v})\dfrac{\partial}{\partial\boldsymbol{\theta}}E(\boldsymbol{v},\boldsymbol{h})+\sum_{\boldsymbol{v},\boldsymbol{h}}\boldsymbol{P}_{l}(\boldsymbol{v},\boldsymbol{h})\dfrac{\partial}{\partial\boldsymbol{\theta}}E(\boldsymbol{v},\boldsymbol{h}).
\end{split} \tag{7}$$

In the case of multiple visible states $S=\{\boldsymbol{v}_{1},\boldsymbol{v}_{2},\cdots,\boldsymbol{v}_{s}\}$, the log-likelihood is defined in the average sense, i.e., $\dfrac{1}{s}\sum\limits_{\boldsymbol{v}\in S}\ln\boldsymbol{P}_{l}(\boldsymbol{v}|\boldsymbol{\theta})$.

Taking the derivative with respect to $w_{k}$ gives

$$\begin{split}
&\dfrac{\partial}{\partial w_{k}}\ln\boldsymbol{P}_{l}(\boldsymbol{v}|\boldsymbol{\theta})=\sum_{\boldsymbol{h}}\left(\boldsymbol{P}_{l}(\boldsymbol{h}|\boldsymbol{v})\sum_{j\in G_{l}}v_{j+k}h_{j}\right)-\sum_{\boldsymbol{v},\boldsymbol{h}}\left(\boldsymbol{P}_{l}(\boldsymbol{h},\boldsymbol{v})\sum_{j\in G_{l}}v_{j+k}h_{j}\right)\\
=&\sum_{\boldsymbol{h}}\left(\boldsymbol{P}_{l}(\boldsymbol{h}|\boldsymbol{v})\sum_{j\in G_{l}}v_{j+k}h_{j}\right)-\sum_{\boldsymbol{v}}\boldsymbol{P}_{l}(\boldsymbol{v})\left(\sum_{\boldsymbol{h}}\left(\boldsymbol{P}_{l}(\boldsymbol{h}|\boldsymbol{v})\sum_{j\in G_{l}}v_{j+k}h_{j}\right)\right).
\end{split} \tag{8}$$

Now, let $\mathbb{I}$ be the ordered set of indices $i\in G_{l}$ of the positive states in $\boldsymbol{v}$, i.e., those with $v_{i}=1$. Then

$$\begin{split}
&\sum_{\boldsymbol{h}}\left(\boldsymbol{P}_{l}(\boldsymbol{h}|\boldsymbol{v})\sum_{i\in G_{l}}v_{i+k}h_{i}\right)=\sum_{\boldsymbol{h}}\boldsymbol{P}_{l}(\boldsymbol{h}|\boldsymbol{v})\left(\sum_{i\in\mathbb{I}}h_{i-k}\right)\\
=&\sum_{\boldsymbol{h}_{\mathbb{I}}}\sum_{\boldsymbol{h}_{G_{l}\setminus\mathbb{I}}}\prod_{i\in\mathbb{I}}\boldsymbol{P}_{l}(h_{i-k}|\boldsymbol{v})\prod_{i\in G_{l}\setminus\mathbb{I}}\boldsymbol{P}_{l}(h_{i-k}|\boldsymbol{v})\left(\sum_{i\in\mathbb{I}}h_{i-k}\right)\\
=&\left(\sum_{\boldsymbol{h}_{\mathbb{I}}}\prod_{i\in\mathbb{I}}\boldsymbol{P}_{l}(h_{i-k}|\boldsymbol{v})\left(\sum_{i\in\mathbb{I}}h_{i-k}\right)\right)\underbrace{\left(\sum_{\boldsymbol{h}_{G_{l}\setminus\mathbb{I}}}\prod_{i\in G_{l}\setminus\mathbb{I}}\boldsymbol{P}_{l}(h_{i-k}|\boldsymbol{v})\right)}_{=1}\\
=&\sum_{\boldsymbol{h}_{\mathbb{I}}}\left(\prod_{i\in\mathbb{I}}\boldsymbol{P}_{l}(h_{i-k}|\boldsymbol{v})\sum_{i\in\mathbb{I}}h_{i-k}\right)\\
=&\sum_{i\in\mathbb{I}}\boldsymbol{P}_{l}(h_{i-k}=1|\boldsymbol{v})=\sum_{i\in G_{l}}\boldsymbol{P}_{l}(h_{i-k}=1|\boldsymbol{v})v_{i}.
\end{split} \tag{9}$$

Combining eq. 8 and eq. 9 gives

$$\frac{\partial\log\boldsymbol{P}_{l}(\boldsymbol{v}|\boldsymbol{\theta})}{\partial w_{k}}=\sum_{i\in G_{l}}\boldsymbol{P}_{l}(h_{i-k}=1|\boldsymbol{v})v_{i}-\sum_{\boldsymbol{v}}\boldsymbol{P}_{l}(\boldsymbol{v})\left(\sum_{i\in G_{l}}\boldsymbol{P}_{l}(h_{i-k}=1|\boldsymbol{v})v_{i}\right). \tag{10}$$

Note that the above formula is different from the classical RBM, [48, Formula 29].

We derive the derivatives with respect to $a_{j}$ and $b_{j}$ as in the classical RBM:

$$\begin{aligned}
\dfrac{\partial}{\partial a_{j}}\ln\boldsymbol{P}_{l}(\boldsymbol{v}|\boldsymbol{\theta}) &=-\sum_{\boldsymbol{h}}\boldsymbol{P}_{l}(\boldsymbol{h}|\boldsymbol{v})\dfrac{\partial}{\partial a_{j}}E_{l}(\boldsymbol{v},\boldsymbol{h})+\sum_{\boldsymbol{v},\boldsymbol{h}}\boldsymbol{P}_{l}(\boldsymbol{h},\boldsymbol{v})\dfrac{\partial}{\partial a_{j}}E_{l}(\boldsymbol{v},\boldsymbol{h})\\
&=\sum_{\boldsymbol{h}}\boldsymbol{P}_{l}(\boldsymbol{h}|\boldsymbol{v})v_{j}-\sum_{\boldsymbol{v},\boldsymbol{h}}\boldsymbol{P}_{l}(\boldsymbol{h},\boldsymbol{v})v_{j}=v_{j}-\sum_{\boldsymbol{v}}\boldsymbol{P}_{l}(\boldsymbol{v})v_{j},
\end{aligned} \tag{11}$$

and

$$\begin{aligned}
\dfrac{\partial}{\partial b_{j}}\ln\boldsymbol{P}_{l}(\boldsymbol{v}|\boldsymbol{\theta}) &=-\sum_{\boldsymbol{h}}\boldsymbol{P}_{l}(\boldsymbol{h}|\boldsymbol{v})\dfrac{\partial}{\partial b_{j}}E_{l}(\boldsymbol{v},\boldsymbol{h})+\sum_{\boldsymbol{v},\boldsymbol{h}}\boldsymbol{P}_{l}(\boldsymbol{h},\boldsymbol{v})\dfrac{\partial}{\partial b_{j}}E_{l}(\boldsymbol{v},\boldsymbol{h})\\
&=\sum_{\boldsymbol{h}}\boldsymbol{P}_{l}(\boldsymbol{h}|\boldsymbol{v})h_{j}-\sum_{\boldsymbol{v},\boldsymbol{h}}\boldsymbol{P}_{l}(\boldsymbol{h},\boldsymbol{v})h_{j}=\boldsymbol{P}_{l}(h_{j}=1|\boldsymbol{v})-\sum_{\boldsymbol{v}}\boldsymbol{P}_{l}(\boldsymbol{v})\boldsymbol{P}_{l}(h_{j}=1|\boldsymbol{v}).
\end{aligned} \tag{12}$$
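For very small $N=p^{l}$, the gradient formulas can be evaluated exactly by enumeration and compared against finite differences of the log-likelihood. The following sketch assumes $G_{l}\cong\mathbb{Z}/N\mathbb{Z}$ and binary units; all function names are illustrative, and each gradient is computed as a data-conditional expectation minus the corresponding expectation under the model distribution of $\boldsymbol{v}$.

```python
import itertools
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def energy(v, h, w, a, b):
    N = len(v)
    inter = sum(w[k] * v[(j + k) % N] * h[j] for j in range(N) for k in range(N))
    return -inter - sum(a[j] * v[j] + b[j] * h[j] for j in range(N))


def visible_distribution(w, a, b):
    """P_l(v) for every binary v, by summing exp(-E) over all hidden states."""
    N = len(a)
    states = list(itertools.product((0, 1), repeat=N))
    unnorm = {v: sum(math.exp(-energy(v, h, w, a, b)) for h in states) for v in states}
    Z = sum(unnorm.values())
    return {v: u / Z for v, u in unnorm.items()}


def log_likelihood(v, w, a, b):
    return math.log(visible_distribution(w, a, b)[tuple(v)])


def p_hidden_on(v, w, b):
    """List of P_l(h_i = 1 | v) = sigma(sum_k w_k v_{i+k} + b_i)."""
    N = len(v)
    return [sigmoid(sum(w[k] * v[(i + k) % N] for k in range(N)) + b[i]) for i in range(N)]


def data_term(k, v, w, b):
    """Sum over i of P_l(h_{i-k} = 1 | v) v_i."""
    N = len(v)
    ph = p_hidden_on(v, w, b)
    return sum(ph[(i - k) % N] * v[i] for i in range(N))


def grad_w(k, v, w, a, b):
    """Derivative of ln P_l(v) with respect to w_k: data term minus model average."""
    pv = visible_distribution(w, a, b)
    model = sum(q * data_term(k, u, w, b) for u, q in pv.items())
    return data_term(k, v, w, b) - model


def grad_b(j, v, w, a, b):
    """Derivative of ln P_l(v) with respect to b_j: data term minus model average."""
    pv = visible_distribution(w, a, b)
    model = sum(q * p_hidden_on(u, w, b)[j] for u, q in pv.items())
    return p_hidden_on(v, w, b)[j] - model
```

The cost of `visible_distribution` grows like $4^{N}$, which illustrates why an approximation scheme is needed for realistic sizes.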

9.2 Contrastive divergence learning

As in the classical case, the exact computation of the gradient of the log-likelihood involves an exponential number of terms, see (10). We adapt the contrastive divergence (CD) method, introduced by Hinton [49], to the $p$-adic case in order to approximate the gradient of the log-likelihood.

The approximation of eqs. (10)-(12) using the contrastive divergence method can be represented as follows:

$$\begin{split}
&\frac{\partial\log\boldsymbol{P}_{l}(\boldsymbol{v}|\boldsymbol{\theta})}{\partial w_{k}}\approx\sum_{i\in G_{l}}\boldsymbol{P}_{l}(h_{i-k}=1|\boldsymbol{v}^{(0)})v_{i}^{(0)}-\sum_{i\in G_{l}}\boldsymbol{P}_{l}(h_{i-k}=1|\boldsymbol{v}^{(m)})v_{i}^{(m)},\\
&\frac{\partial\log\boldsymbol{P}_{l}(\boldsymbol{v}|\boldsymbol{\theta})}{\partial a_{j}}\approx v_{j}^{(0)}-v_{j}^{(m)},\\
&\frac{\partial\log\boldsymbol{P}_{l}(\boldsymbol{v}|\boldsymbol{\theta})}{\partial b_{j}}\approx\boldsymbol{P}_{l}(h_{j}=1|\boldsymbol{v}^{(0)})-\boldsymbol{P}_{l}(h_{j}=1|\boldsymbol{v}^{(m)}),
\end{split} \tag{13}$$

where $m$ is a pre-determined positive integer, $\boldsymbol{v}^{(0)}$ is a training example, and $\boldsymbol{v}^{(m)}$ is a sample of the Gibbs chain after $m$ steps. More precisely, we implement the Gibbs sampling in the following way. First, we obtain a sample $\boldsymbol{h}^{(0)}$ using the conditional distribution $\boldsymbol{P}_{l}(\boldsymbol{h}|\boldsymbol{v}^{(0)})$. Then, we obtain a sample $\boldsymbol{v}^{(1)}$ using $\boldsymbol{P}_{l}(\boldsymbol{v}|\boldsymbol{h}^{(0)})$. We repeat this process until we get $\boldsymbol{v}^{(m)}$.

The following formulas are used in the calculation. Let $\boldsymbol{h}_{-i}$ denote the state of all hidden units except for the $i$-th one:

$$\begin{aligned}
\boldsymbol{P}_{l}(h_{i}=1|\boldsymbol{v}) &=\boldsymbol{P}_{l}(h_{i}=1|\boldsymbol{h}_{-i},\boldsymbol{v})=\dfrac{\boldsymbol{P}_{l}(h_{i}=1,\boldsymbol{h}_{-i},\boldsymbol{v})}{\boldsymbol{P}_{l}(\boldsymbol{h}_{-i},\boldsymbol{v})}\\
&=\dfrac{\boldsymbol{P}_{l}(h_{i}=1,\boldsymbol{h}_{-i},\boldsymbol{v})}{\boldsymbol{P}_{l}(h_{i}=1,\boldsymbol{h}_{-i},\boldsymbol{v})+\boldsymbol{P}_{l}(h_{i}=0,\boldsymbol{h}_{-i},\boldsymbol{v})}=\dfrac{1}{1+\dfrac{\boldsymbol{P}_{l}(h_{i}=0,\boldsymbol{h}_{-i},\boldsymbol{v})}{\boldsymbol{P}_{l}(h_{i}=1,\boldsymbol{h}_{-i},\boldsymbol{v})}}\\
&=\dfrac{1}{1+\dfrac{1}{\exp\left(\sum\limits_{k\in G_{l}}w_{k}v_{i+k}+b_{i}\right)}}=:\sigma\left(\sum_{k\in G_{l}}w_{k}v_{i+k}+b_{i}\right).
\end{aligned}$$

Similarly, we have

$$\begin{split}
&\boldsymbol{P}_{l}(h_{i}=0|\boldsymbol{v})=\sigma\left(-\sum_{k\in G_{l}}w_{k}v_{i+k}-b_{i}\right),\\
&\boldsymbol{P}_{l}(v_{i}=1|\boldsymbol{h})=\sigma\left(\sum_{k\in G_{l}}w_{i-k}h_{k}+a_{i}\right),\\
&\boldsymbol{P}_{l}(v_{i}=0|\boldsymbol{h})=\sigma\left(-\sum_{k\in G_{l}}w_{i-k}h_{k}-a_{i}\right).
\end{split} \tag{14}$$
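Put together, the sigmoid conditionals and the Gibbs chain yield the contrastive divergence estimates. The sketch below is a minimal illustration, not the authors' implementation: it assumes $G_{l}\cong\mathbb{Z}/N\mathbb{Z}$, binary units, and parameters stored as plain lists; the names `p_h`, `p_v`, `bernoulli`, and `cd_gradients` are hypothetical.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def p_h(v, w, b):
    """List of P_l(h_i = 1 | v) = sigma(sum_k w_k v_{i+k} + b_i)."""
    N = len(v)
    return [sigmoid(sum(w[k] * v[(i + k) % N] for k in range(N)) + b[i]) for i in range(N)]


def p_v(h, w, a):
    """List of P_l(v_i = 1 | h) = sigma(sum_k w_{i-k} h_k + a_i)."""
    N = len(h)
    return [sigmoid(sum(w[(i - k) % N] * h[k] for k in range(N)) + a[i]) for i in range(N)]


def bernoulli(probs, rng):
    """Independent binary samples, valid because the conditionals factorize."""
    return [1 if rng.random() < q else 0 for q in probs]


def cd_gradients(v0, w, a, b, m, rng):
    """CD-m estimates of the gradients with respect to w_k, a_j, b_j."""
    N = len(v0)
    vm = list(v0)
    for _ in range(m):                      # m steps of block Gibbs sampling
        h = bernoulli(p_h(vm, w, b), rng)   # sample h given the current v
        vm = bernoulli(p_v(h, w, a), rng)   # sample the next v given h
    ph0, phm = p_h(v0, w, b), p_h(vm, w, b)
    grad_w = [
        sum(ph0[(i - k) % N] * v0[i] - phm[(i - k) % N] * vm[i] for i in range(N))
        for k in range(N)
    ]
    grad_a = [v0[j] - vm[j] for j in range(N)]
    grad_b = [ph0[j] - phm[j] for j in range(N)]
    return grad_w, grad_a, grad_b
```

A gradient-ascent step would then add a small multiple of these estimates to $w_{k}$, $a_{j}$, $b_{j}$; the learning rate and the number of steps $m$ are left to the user in this sketch.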

References

  • [1] LeCun, Y., Bengio, Y. & Hinton, G. E. Deep learning. Nature 521, 436–444 (2015).
  • [2] Bahri Y., Kadmon J., Pennington J., Schoenholz S.S., Sohl-Dickstein J., Ganguli S. Statistical Mechanics of Deep Learning. Annual Review of Condensed Matter Physics 11, 501-528 (2020).
  • [3] Buice Michael A., Chow Carson C. Beyond mean field theory: statistical field theory for neural networks. J. Stat. Mech. Theory Exp. 3, P03003, 21 pp. (2013).
  • [4] Buice Michael A., Cowan Jack D., Field-theoretic approach to fluctuation effects in neural networks. Phys. Rev. E 75, no. 5, 051919, 14 pp. (2007).
  • [5] Bachtis D., Aarts G., & Lucini B. Quantum field-theoretic machine learning. Physical Review D, 103 (7), 074510 pp. 14 (2021).
  • [6] Dyer E., Gur-Ari G. Asymptotics of wide networks from Feynman diagrams.  https://doi.org/10.48550/arXiv.1909.11304.
  • [7] Erbin H., Lahoche V., Ousmane Samary D. Nonperturbative renormalization for the neural network-QFT correspondence. Mach. Learn.: Sci. Technol. 3 015027 (2022).
  • [8] Halverson J., Maiti A. and Stoner K.  Neural networks and quantum field theory. Mach. Learn. Sci. Technol. 2, 035002 (2021).
  • [9] Maiti A., Stoner K. and Halverson J. Symmetry-via-duality: Invariant neural network densities from parameter-space correlators. https://arxiv.org/abs/2106.00694.
  • [10] Helias Moritz, Dahmen David, Statistical field theory for neural networks. Lecture Notes in Physics, 970 ( Springer, Cham, 2020).
  • [11] Yaida, S. (2020, August). Non-Gaussian processes and neural networks at finite widths. In Mathematical and Scientific Machine Learning (pp. 165-192). PMLR.
  • [12] Zúñiga-Galindo, W. A. (2023). p-adic statistical field theory and deep belief networks. Physica A: Statistical Mechanics and its Applications, 128492.
  • [13] Le Roux Nicolas, Bengio Yoshua. Representational power of restricted Boltzmann machines and deep belief networks. Neural Comput. 20 , no. 6, 1631–1649 (2008).
  • [14] Albeverio, Sergio; Khrennikov, Andrei; Tirozzi, Brunello p-adic dynamical systems and neural networks, Math. Models Methods Appl. Sci. 9, no. 9, 1417–1437, 1999.
  • [15] Khrennikov, Andrei; Tirozzi, Brunello Learning of p-adic neural networks. Stochastic processes, physics and geometry: new interplays, II (Leipzig, 1999), 395–401, CMS Conf. Proc., 29, Amer. Math. Soc., Providence, RI, 2000.
  • [16] Khrennikov A, Information Dynamics in Cognitive, Psychological, Social and Anomalous Phenomena; Springer: Berlin/Heidelberg, Germany, 2004.
  • [17] Zambrano-Luna, B.A., Zúñiga-Galindo, W.A. p-adic Cellular Neural Networks. J Nonlinear Math Phys (2022). https://doi.org/10.1007/s44198-022-00071-8
  • [18] Zambrano-Luna, B. A., Zúñiga-Galindo, W. A. (2023). p-adic cellular neural networks: Applications to image processing. Physica D: Nonlinear Phenomena, 133668.
  • [19] Volovich I. V., Number theory as the ultimate physical theory. p-Adic Numbers Ultrametric Anal. Appl. 2, 77–87 (2010).
  • [20] Vladimirov V. S., Volovich I. V., Zelenov E. I. p-Adic analysis and mathematical physics (Singapore, World Scientific, 1994).
  • [21] Dragovich B., Khrennikov A. Yu., Kozyrev S. V., Volovich I. V. On p-adic mathematical physics. p-Adic Numbers Ultrametric Anal. Appl. 1 (1), 1–17 (2009).
  • [22] Lerner E. Y., Misarov M. D. Scalar models in p-adic quantum field theory and hierarchical models. Theor. Math. Phys. 78, 248–257 (1989).
  • [23] Missarov M. D. p-Adic $\varphi^{4}$-theory as a functional equation problem. Lett. Math. Phys. 39(3), 253-260 (1997).
  • [24] Missarov M. D. p-Adic renormalization group solutions and the Euclidean renormalization group conjectures. p-Adic Numbers Ultrametric Anal. Appl. 4(2), 109-114 (2012).
  • [25] Khrennikov A. Yu. p-Adic Valued Distributions in Mathematical Physics (Dordrecht, Kluwer Academic Publishers, 1994).
  • [26] Kochubei Anatoly N., Sait-Ametov Mustafa R. Interaction measures on the space of distributions over the field of p-adic numbers. Infin. Dimens. Anal. Quantum Probab. Relat. Top. 6(3), 389–411 (2003).
  • [27] Khrennikov, Andrei, Kozyrev, Sergei, Zúñiga-Galindo, W. A. Ultrametric Equations and its Applications. Encyclopedia of Mathematics and its Applications 168 (Cambridge, Cambridge University Press, 2018).
  • [28] Zúñiga-Galindo W. A. Non-Archimedean white noise, pseudodifferential stochastic equations, and massive Euclidean fields. J. Fourier Anal. Appl. 23 (2), 288–323 (2017)
  • [29] Mendoza-Martínez M. L., Vallejo J. A., Zúñiga-Galindo, W. A. Acausal quantum theory for non-Archimedean scalar fields. Rev. Math. Phys. 31(4), 1950011, 46 pp. (2019).
  • [30] Arroyo-Ortiz, Edilberto, Zúñiga-Galindo, W. A. Construction of p-adic covariant quantum fields in the framework of white noise analysis. Rep. Math. Phys. 84(1), 1–34 (2019).
  • [31] Zúñiga-Galindo W. A., Non-Archimedean statistical field theory, Rev. Math. Phys. 34 (8), Paper No. 2250022, 41 pp. (2022).
  • [32] Liu Ding et al. Machine learning by unitary tensor network of hierarchical tree structure. New J. Phys. 21, 073059 (2019).
  • [33] Cheng Song, Wang Lei, Xiang Tao, and Zhang Pan. Tree tensor networks for generative modeling. Phys. Rev. B 99, 155131 (2019).
  • [34] Li Sujie, Feng Pan, Zhou Pengfei, and Zhang Pan. Boltzmann machines as two-dimensional tensor networks. Phys. Rev. B 104, 075154 (2021).
  • [35] Orús, R. Tensor networks for complex quantum systems. Nat Rev Phys 1, 538–550 (2019).
  • [36] Koblitz Neal. p-Adic Numbers, p-adic Analysis, and Zeta-Functions. Graduate Texts in Mathematics No. 58 (New York, Springer-Verlag, 1984).
  • [37] Chistyakov D. V. Fractal geometry of images of continuous embeddings of p-adic numbers and solenoids into Euclidean spaces. Theoret. and Math. Phys. 109 (1996), no. 3, 1495–1507 (1997).
  • [38] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks. Communications of the ACM, vol. 54, no. 10, pp. 95-103 (2011).
  • [39] Le Roux Nicolas, Bengio Yoshua. Deep belief networks are compact universal approximators. Neural Comput. 22, no. 8, 2192–2207 (2010).
  • [40] Kleinert Hagen, Schulte-Frohlinde V. Critical properties of $\phi^{4}$-theories (Singapore, World Scientific, 2001).
  • [41] Mussardo Giuseppe. Statistical Field Theory. An Introduction to Exactly Solved Models of Statistical Physics (Oxford , Oxford University Press, 2010).
  • [42] G. E. Hinton, R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science 313, 5786 (2006), doi: 10.1126/science.112764.
  • [43] Albeverio S., Khrennikov A. Yu., Shelkovich V. M. Theory of p-adic distributions: linear and nonlinear models (Cambridge University Press, Cambridge 2010).
  • [44] Kochubei Anatoly N. Pseudo-differential equations and stochastics over non-Archimedean fields (New York, Marcel Dekker, 2001).
  • [45] Taibleson M. H. Fourier analysis on local fields (Princeton University Press, 1975).
  • [46] Bocardo-Gaspar Miriam, García-Compeán H., Zúñiga-Galindo W. A. Regularization of p-adic string amplitudes, and multivariate local zeta functions. Lett. Math. Phys. 109, no. 5, 1167–1204 (2019).
  • [47] Halmos P. Measure Theory (D. Van Nostrand Company Inc., New York, 1950).
  • [48] Fischer A., Igel C. An Introduction to Restricted Boltzmann Machines. In: Alvarez, L., Mejail, M., Gomez, L., Jacobo, J. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2012. Lecture Notes in Computer Science, vol 7441. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33275-3_2.
  • [49] Hinton G. E., Training products of experts by minimizing contrastive divergence, Neural Comput. 14, no. 8, pp. 1771–1800 (2002).
  • [50] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques (The MIT Press, Cambridge, MA, 2009).