
Gossiped and Quantized Online Multi-Kernel Learning

Tomas Ortega and Hamid Jafarkhani. The authors are with the Center for Pervasive Communications & Computing and the EECS Department, University of California, Irvine, Irvine, CA 92697 USA (e-mail: {tomaso, hamidj}@uci.edu). This work was supported in part by NSF Award ECCS-2207457.
Abstract

In instances of online kernel learning where little prior information is available and centralized learning is infeasible, past research has shown that distributed and online multi-kernel learning provides sub-linear regret as long as every pair of nodes in the network can communicate (i.e., the communication network is a complete graph). In addition, to manage the communication load, which is often a performance bottleneck, communications between nodes can be quantized. This letter expands these results to non-fully connected graphs, which are common in wireless sensor networks. To address this challenge, we propose a gossip algorithm and prove that it achieves sub-linear regret. Experiments with real datasets confirm our findings.

Index Terms:
Kernel-based learning, quantization, gossip algorithms, federated learning, sensor networks.

I Introduction

Many of the latest efforts in machine learning are focused on bringing learning as close to data collection as possible. This is of practical interest in a diverse array of applications, sensor networks in particular. In sensor networks, nodes are often communication-constrained, operate distributively, and can send only a few bits to their neighbors.

There are several papers in the literature that consider the problem of designing distributed learning methods for Online Multi-Kernel Learning (OMKL). In particular, [1] contains the design and regret analysis of quantized and distributed OMKL for a fully connected (complete) graph. Although the fully connected case is important, it is not applicable in many sensor networks. This letter designs a new algorithm that extends the theoretical guarantees of the distributed and quantized OMKL in [1] to general graphs that need not be fully connected. Unlike [1], our method uses a novel hidden-state quantization scheme in conjunction with a gossip algorithm. One major advantage, compared with [1], is that our algorithm does not need to communicate the loss function. Another example of a distributed OMKL that works for a non-fully connected graph is presented in [2]. It requires communicating an unlimited number of bits between neighbors, and its assumptions are stricter than those of [1]. To manage the required communication throughput, which is often a performance bottleneck [3, 4], this letter attains the properties of [2] under looser assumptions while communicating only a limited number of bits between neighboring nodes.

In [5], a distributed stochastic gradient descent scheme with quantized communication is discussed. The proposed algorithm uses an extra shared “hidden” state and can be used for the single-kernel case, but not multi-kernel settings. In addition to extending [1] to general non-fully connected networks, our work can be considered as the extension of [5] to multi-kernel cases.

There are a variety of learning tasks that OMKL is suited for, such as multi-kernel classification, regression, and clustering [6, 7, 8]. For a more in-depth review of the literature, see [1].

II Problem formulation

Let us consider a network with $J$ nodes and model it as an undirected connected graph, where two nodes are connected if and only if they can communicate. Our set of nodes $V$ and set of edges $E$ define a communication graph $G=(V,E)$. At each instant of time $t$, Node $j$ receives the data $\bm{x}^{j}_{t}\in\mathbb{R}^{d}$ and the desired response $y^{j}_{t}\in\mathbb{R}$. Our task is to find an approximating function $f(\bm{x}^{j}_{t})$ for $y^{j}_{t}$. As in [1], $f$ belongs to a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}=\left\{f\mid f(\bm{x})=\sum_{t=1}^{\infty}\sum_{j=1}^{J}\alpha^{j}_{t}\kappa(\bm{x},\bm{x}^{j}_{t})\right\}$, where $\kappa(\bm{x},\bm{x}^{j}_{t}):\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}$ is a kernel function that measures the similarity between $\bm{x}$ and $\bm{x}^{j}_{t}$. Let us consider the following optimization problem

\min_{f\in\mathcal{H}}\sum_{t=1}^{T}\sum_{j=1}^{J}\mathcal{C}\left(f(\bm{x}^{j}_{t}),y^{j}_{t}\right)+\lambda\Omega\left(\lVert f\rVert_{\mathcal{H}}^{2}\right), (1)

where $\mathcal{C}$ is a cost function and $\lambda>0$ is a regularization parameter that controls an increasing function $\Omega$. An optimal solution for this problem exists in the form $\hat{f}(\bm{x})=\sum_{t=1}^{T}\sum_{j=1}^{J}\alpha_{t}^{j}\kappa(\bm{x},\bm{x}_{t}^{j})$, where $\alpha_{t}^{j}$ are real numbers [9].

In multi-kernel learning, a weighted combination of several kernels is selected to improve the performance, compared to single-kernel learning. A convex combination of the kernels $\left\{\kappa_{p}\right\}_{p=1}^{P}$, where each $\kappa_{p}$ belongs to an RKHS $\mathcal{H}_{p}$, is also a kernel of an RKHS, denoted by $\bar{\mathcal{H}}=\mathcal{H}_{1}\oplus\cdots\oplus\mathcal{H}_{P}$ [9]. Here, $\oplus$ indicates the direct sum of Hilbert spaces. Using $\bar{\mathcal{H}}$ instead of $\mathcal{H}$, we turn problem (1) into:

\min_{\left\{\bar{w}_{p}^{j}\right\},\left\{f_{p}\right\}}\sum_{t=1}^{T}\sum_{j=1}^{J}\mathcal{C}\left(\sum_{p=1}^{P}\bar{w}_{p}^{j}f_{p}(\bm{x}_{t}^{j}),y_{t}^{j}\right)+\lambda\Omega\left(\lVert\sum_{p=1}^{P}\bar{w}_{p}^{j}f_{p}\rVert^{2}_{\bar{\mathcal{H}}}\right), (2a)
\text{s.t. }\sum_{p=1}^{P}\bar{w}_{p}^{j}=1,\,\bar{w}_{p}^{j}\geq 0,\,f_{p}\in\mathcal{H}_{p}. (2b)

To evade the curse of dimensionality, a Random Feature (RF) approximation is used [1, 10]. For normalized shift-invariant kernels, i.e., $\kappa(\bm{x},\bm{x}^{\prime})=\kappa(\bm{x}-\bm{x}^{\prime})$, their Fourier transforms $\pi_{\kappa}(v)$ exist and, due to normalization, integrate to one. Viewing $\pi_{\kappa}$ as a pdf, $\kappa(\bm{x}-\bm{x}^{\prime})=\int\pi_{\kappa}(v)e^{jv^{\top}(\bm{x}-\bm{x}^{\prime})}\,dv=\mathbb{E}_{v}[e^{jv^{\top}(\bm{x}-\bm{x}^{\prime})}]$. Gaussian, Laplacian, and Cauchy kernels are three examples of normalized shift-invariant kernels that we can use [11]; however, any family of shift-invariant kernels can be used for the RF approximation we describe next. Drawing $D$ i.i.d. samples $\{\bm{v}_{i}\}_{i=1}^{D}$ from $\pi_{\kappa}(v)$, we define

z_{V}(\bm{x})=\frac{1}{\sqrt{D}}\left[\sin(\bm{v}_{1}^{\top}\bm{x}),\ldots,\sin(\bm{v}_{D}^{\top}\bm{x}),\cos(\bm{v}_{1}^{\top}\bm{x}),\ldots,\cos(\bm{v}_{D}^{\top}\bm{x})\right]^{\top}, (3)

which has the property that $\kappa(\bm{x},\bm{x^{\prime}})=\mathbb{E}_{v}[z_{V}(\bm{x^{\prime}})^{\top}z_{V}(\bm{x})]$. So, given a fixed $\{\bm{v}_{i}\}_{i=1}^{D}$, we can approximate $\kappa(\bm{x},\bm{x^{\prime}})\approx z_{V}(\bm{x^{\prime}})^{\top}z_{V}(\bm{x})$. This is called the RF approximation, and it is an unbiased and consistent estimate of $\kappa(\bm{x},\bm{x^{\prime}})$ [10]. Thus, a weight vector $\bm{\theta}\in\mathbb{R}^{2D}$ can be constructed such that

\hat{f}(\bm{x})=\sum_{t=1}^{T}\sum_{j=1}^{J}\alpha_{t}^{j}\kappa(\bm{x},\bm{x}_{t}^{j})\approx\sum_{t=1}^{T}\sum_{j=1}^{J}\alpha_{t}^{j}z_{V}(\bm{x}_{t}^{j})^{\top}z_{V}(\bm{x})=\bm{\theta}^{\top}z_{V}(\bm{x}). (4)
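For concreteness, the short Python sketch below illustrates the RF approximation for a Gaussian kernel with parameter $\sigma$, whose spectral density $\pi_{\kappa}$ is a zero-mean Gaussian with covariance $\sigma^{-2}\bm{I}$; the function name gaussian_rf_map is our own illustrative choice and is reused in later sketches.

import numpy as np

def gaussian_rf_map(X, V):
    # Random-feature map z_V of (3); V holds the D spectral samples v_i as rows.
    D = V.shape[0]
    proj = X @ V.T                               # inner products v_i^T x
    return np.hstack([np.sin(proj), np.cos(proj)]) / np.sqrt(D)

rng = np.random.default_rng(0)
d, D, sigma = 2, 2000, 1.0
V = rng.normal(scale=1.0 / sigma, size=(D, d))   # D i.i.d. samples from pi_kappa

x, xp = rng.normal(size=(1, d)), rng.normal(size=(1, d))
approx = (gaussian_rf_map(x, V) @ gaussian_rf_map(xp, V).T).item()
exact = np.exp(-np.linalg.norm(x - xp) ** 2 / (2.0 * sigma ** 2))
print(approx, exact)     # close for large D, reflecting the unbiased RF estimate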

Then, the loss function is defined as

\mathcal{L}(f(\bm{x}))=\mathcal{C}(\bm{\theta}^{\top}z_{V}(\bm{x}),y)+\lambda\Omega(\lVert\bm{\theta}\rVert^{2}), (5)

and the following equations are obtained for Kernel $p$ and Node $j$:

\hat{f}^{j}_{p,t}(\bm{x}_{t}^{j}) = {\bm{\theta}^{j}_{p,t}}^{\top}z_{V_{p}}(\bm{x}_{t}^{j}), (6)
\bm{\theta}_{p,t+1}^{j} = \bm{\theta}_{p,t}^{j}-\eta\nabla\mathcal{L}({\bm{\theta}^{j}_{p,t}}^{\top}z_{V_{p}}(\bm{x}_{t}^{j}),y_{t}^{j}), (7)
w_{p,t+1}^{j} = w_{p,t}^{j}\exp\left(-\eta\mathcal{L}\left(\hat{f}^{j}_{p,t}(\bm{x}_{t}^{j})\right)\right), (8)

where $\eta\in(0,1)$ is a learning rate. The weights are normalized as $\bar{w}_{p,t}^{j}=w_{p,t}^{j}/\sum_{p=1}^{P}w_{p,t}^{j}$ to have

\hat{f}^{j}_{t}(\bm{x}_{t}^{j})=\sum_{p=1}^{P}\bar{w}_{p,t}^{j}\hat{f}^{j}_{p,t}(\bm{x}_{t}^{j}). (9)

The above equations represent how local online multi-kernel learning models can be built [1, 10]. In contrast to previous research, each node $j$ in our scenario can only communicate with its neighboring nodes in the set $\mathcal{S}_{j}\subseteq V$. Thus, the local model $\bm{\theta}^{j}_{p,t}$ may differ from node to node, and our algorithm must ensure that all nodes converge to the same model. For this purpose, the local models are propagated using a gossip algorithm. At each communication round, the nodes calculate the weighted average of the information provided by their neighbors and use it to update their local information. The weight Node $j$ assigns to the information coming from Node $i$ is denoted $w_{ij}^{\prime}$. Notice that $\sum_{i\in\mathcal{S}_{j}}w_{ij}^{\prime}=1$ for all $j$, since the weights of a weighted average must add up to one. We impose $w_{ij}^{\prime}=w_{ji}^{\prime}$ so that we can associate the weights with the edges of the undirected communication graph $G$. Thus, the weights define the gossip matrix $\bm{W}^{\prime}$, which is $J\times J$ and doubly stochastic [12, 13, 14]. The spectral gap of $\bm{W}^{\prime}$ is denoted by $\rho=1-\lambda_{2}(\bm{W}^{\prime})\in(0,1]$, where $\lambda_{2}(\bm{W}^{\prime})$ is the second-largest eigenvalue of $\bm{W}^{\prime}$.
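A gossip matrix with these properties can be constructed in several standard ways; the Python sketch below uses the Metropolis-Hastings rule, which yields a symmetric, doubly stochastic $\bm{W}^{\prime}$ for any connected undirected graph. The function metropolis_gossip_matrix and the four-node path example are our own illustrations, not a construction prescribed by this letter.

import numpy as np

def metropolis_gossip_matrix(adj):
    # Symmetric doubly stochastic gossip matrix W' from a 0/1 adjacency matrix.
    J = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((J, J))
    for i in range(J):
        for k in range(J):
            if i != k and adj[i, k]:
                W[i, k] = 1.0 / (1.0 + max(deg[i], deg[k]))
        W[i, i] = 1.0 - W[i].sum()        # self-weight so each row sums to one
    return W

# Example: a path graph with J = 4 nodes.
adj = np.array([[0., 1., 0., 0.],
                [1., 0., 1., 0.],
                [0., 1., 0., 1.],
                [0., 0., 1., 0.]])
W = metropolis_gossip_matrix(adj)
lam2 = np.sort(np.linalg.eigvalsh(W))[::-1][1]    # second-largest eigenvalue
print(W.sum(axis=0), W.sum(axis=1), 1.0 - lam2)   # rows/columns sum to 1; rho > 0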

Such a gossip algorithm works very well when the nodes communicate their states perfectly with their neighbors. In practice, however, only a few bits can be communicated, i.e., information needs to be quantized before being shared with the neighbors. We use a random quantizer $Q:\mathbb{R}^{n}\to\mathbb{R}^{n}$ to represent an arbitrary vector $\bm{x}\in\mathbb{R}^{n}$ with $Q(\bm{x})$ in an efficient way. Our results hold for any random quantizer that satisfies

\mathbb{E}_{Q}\lVert Q(\bm{x})-\bm{x}\rVert^{2}\leq(1-\delta)\lVert\bm{x}\rVert^{2},\quad\forall\bm{x}\in\mathbb{R}^{n}, (10)

for some $\delta>0$, which we call the compression parameter. Here, $\mathbb{E}_{Q}$ denotes the expected value with respect to the internal randomness of $Q(\cdot)$. In simulations, we use the randomized quantizer of [15]. Each element $v_{i}$ of a non-zero vector $\bm{v}\in\mathbb{R}^{n}$ is quantized by

Q_{M}(v_{i})=\lVert\bm{v}\rVert\operatorname{sign}(v_{i})\,\xi_{i}(\bm{v},M), (11)

where $M=2^{b}-1$ is the number of quantization levels and, defining $l=\lfloor M\tfrac{|v_{i}|}{\lVert\bm{v}\rVert}\rfloor$,

\xi_{i}(\bm{v},M)=\begin{cases}\frac{l}{M}&\text{with probability }1-M\tfrac{|v_{i}|}{\lVert\bm{v}\rVert}+l,\\ \frac{l+1}{M}&\text{otherwise},\end{cases} (12)

is represented by $b$ bits. Following [15, Lemma 3.1], Eq. (10) holds for $\delta=1-\min\left(\tfrac{2D}{M^{2}},\tfrac{\sqrt{2D}}{M}\right)>0$.
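A possible implementation of the randomized quantizer (11)-(12) is sketched below; it returns the dequantized vector rather than the packed $b$-bit representation, and the function name randomized_quantizer is our own.

import numpy as np

def randomized_quantizer(v, num_levels, rng):
    # Randomized quantizer of (11)-(12): returns the dequantized vector Q_M(v).
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    ratio = np.abs(v) / norm                 # |v_i| / ||v|| in [0, 1]
    lower = np.floor(num_levels * ratio)     # l = floor(M |v_i| / ||v||)
    prob_up = num_levels * ratio - lower     # probability of rounding up to (l+1)/M
    xi = (lower + (rng.random(v.shape) < prob_up)) / num_levels
    return norm * np.sign(v) * xi

rng = np.random.default_rng(0)
v = np.array([0.3, -1.2, 0.05, 0.7])
print(randomized_quantizer(v, num_levels=7, rng=rng))   # b = 3 bits, M = 2^3 - 1 = 7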

III Algorithm and Regret Analysis

We present Algorithm 1 for gossiped and quantized OMKL. For ease of notation, we denote $\mathcal{L}({\bm{\theta}^{j}_{p,t}}^{\top}z_{V_{p}}(\bm{x}_{t}^{j}),y_{t}^{j})$ by $\mathcal{L}(\bm{\theta}^{j}_{p,t})$ from now on. In Algorithm 1, at each time instance, nodes collect their local data and transform them according to the RF approximation. Then, the kernel losses are computed and used to update the kernel weights. For the gossip step, we define a hidden state $\bm{h}_{p,t}^{j}\in\mathbb{R}^{2D}$ for each node $j$ that is known to all neighbors in $\mathcal{S}_{j}$, because it is updated by the same quantized values known to all neighbors. The new local state $\bm{\theta}_{p,t}^{j}$ is the sum of $\bm{\theta}_{p,t-1}^{\prime j}$, an auxiliary variable where we store the local learning, plus a gossip step of size $\gamma$ using the information from the hidden states. Subsequently, each node $j$ prepares the update to the hidden state by quantizing the difference between its local state $\bm{\theta}_{p,t}^{j}$ and the common hidden state $\bm{h}_{p,t}^{j}$. This quantized difference is sent to the neighbors, which use it to collectively update the hidden state. Finally, each node performs local learning with a step size $\eta$ and stores the result in the auxiliary variable $\bm{\theta}_{p,t}^{\prime j}$. A Python sketch that simulates the algorithm end to end is given after the pseudocode.

The role of the hidden state and the quantized update is to maintain an accurate representation of the neighbors' states without broadcasting the full state at each time instance.

Algorithm 1 Gossiped and Quantized OMKL at Node $j$
1:  Initialize $w_{p}^{j}=1/P$ and $\bm{h}_{p}^{j}=\bm{\theta}_{p}^{j}=\bm{\theta}_{p}^{\prime j}=0$ for all $p$.
2:  for $t=1,\ldots,T$ do
3:     Obtain data $\bm{x}_{t}^{j}$ and construct $z_{p}(\bm{x}_{t}^{j})$ for all $p$.
4:     Compute $\mathcal{L}(\hat{f}^{j}_{p,t}(\bm{x}_{t}^{j}))$ for all $p$.
5:     Update $w_{p,t}^{j}$ according to (8).
6:     Compute $\mathcal{L}(\hat{f}^{j}_{t}(\bm{x}_{t}^{j}))$.
7:     for $p=1,\ldots,P$ do
8:        Update using the hidden states: $\bm{\theta}_{p,t}^{j}=\bm{\theta}_{p,t-1}^{\prime j}+\gamma\sum_{i=1}^{J}w^{\prime}_{ij}(\bm{h}_{p,t}^{i}-\bm{h}_{p,t}^{j})$.
9:        Compress $\bm{q}_{p,t}^{j}=Q(\bm{\theta}_{p,t}^{j}-\bm{h}_{p,t}^{j})$.
10:        for $i$ neighbor of $j$, including $j$ do
11:           Send $\bm{q}_{p,t}^{j}$ to $i$ and receive $\bm{q}_{p,t}^{i}$.
12:           Update $\bm{h}_{p,t+1}^{i}=\bm{h}_{p,t}^{i}+\bm{q}_{p,t}^{i}$.
13:        end for
14:        As in (7), update $\bm{\theta}_{p,t}^{\prime j}=\bm{\theta}_{p,t}^{j}-\eta\nabla\mathcal{L}(\bm{\theta}_{p,t}^{j})$.
15:     end for
16:  end for
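The sketch below simulates all $J$ nodes of Algorithm 1 in a single process. It reuses the gaussian_rf_map and randomized_quantizer sketches from Section II, uses the KLR loss (27) of Section IV for $\mathcal{L}$, and, as a simplification, evaluates the kernel losses in steps 4-6 at the most recent local state; it is an illustration rather than the exact implementation used in our experiments.

import numpy as np

def klr_loss_and_grad(theta, z, y, lam=0.001):
    # Kernel logistic regression loss (27) and its gradient with respect to theta.
    margin = y * (theta @ z)
    loss = np.logaddexp(0.0, -margin) + lam * theta @ theta
    grad = -y * z / (1.0 + np.exp(margin)) + 2.0 * lam * theta
    return loss, grad

def gossiped_quantized_omkl(X, Y, V_list, W, T, eta=0.01, gamma=0.009,
                            num_levels=7, rng=None):
    # X: (J, T, d) data, Y: (J, T) labels in {-1, +1}, V_list: one (D, d) spectral
    # sample matrix per kernel, W: (J, J) doubly stochastic gossip matrix.
    rng = np.random.default_rng(0) if rng is None else rng
    J, P, twoD = W.shape[0], len(V_list), 2 * V_list[0].shape[0]
    w = np.ones((J, P)) / P                      # kernel weights, step 1
    theta = np.zeros((J, P, twoD))               # local states theta_{p,t}^j
    theta_prime = np.zeros((J, P, twoD))         # auxiliary states theta'_{p,t}^j
    h = np.zeros((J, P, twoD))                   # shared hidden states h_{p,t}^j
    for t in range(T):
        z = np.stack([[gaussian_rf_map(X[j, t][None, :], V)[0] for V in V_list]
                      for j in range(J)])        # (J, P, 2D), step 3
        for j in range(J):                       # steps 4-5: kernel losses and weights
            losses = np.array([klr_loss_and_grad(theta[j, p], z[j, p], Y[j, t])[0]
                               for p in range(P)])
            w[j] *= np.exp(-eta * losses)
        for p in range(P):
            for j in range(J):                   # step 8: gossip with hidden states
                theta[j, p] = theta_prime[j, p] + gamma * (
                    W[:, j] @ (h[:, p] - h[j, p]))
            q = np.stack([randomized_quantizer(theta[j, p] - h[j, p],
                                               num_levels, rng) for j in range(J)])
            h[:, p] += q                         # steps 9-13: exchange and update
            for j in range(J):                   # step 14: local learning
                _, g = klr_loss_and_grad(theta[j, p], z[j, p], Y[j, t])
                theta_prime[j, p] = theta[j, p] - eta * g
    return theta, w / w.sum(axis=1, keepdims=True)

A prediction at Node $j$ would then combine the per-kernel outputs as in (9), i.e., $\sum_{p}\bar{w}_{p}^{j}{\bm{\theta}_{p}^{j}}^{\top}z_{V_{p}}(\bm{x})$.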

In what follows, we provide a regret analysis to study the performance of our algorithm. First, we make the following assumptions, similar to those in the literature.

Assumption 1.

Let us assume a $K$-smooth loss function

\lVert\nabla\mathcal{L}(\bm{\theta}_{1})-\nabla\mathcal{L}(\bm{\theta}_{2})\rVert\leq K\lVert\bm{\theta}_{1}-\bm{\theta}_{2}\rVert\quad\forall\bm{\theta}_{1},\bm{\theta}_{2}\in\mathbb{R}^{2D}, (13)

with a bounded gradient $\lVert\nabla\mathcal{L}(\bm{\theta})\rVert\leq L$.

Assumption 2.

Let us assume a convex loss function with respect to $\bm{\theta}$, i.e.,

\mathcal{L}(\bm{\theta}_{p,t}^{j})-\mathcal{L}(\bm{\theta})\leq\nabla\mathcal{L}(\bm{\theta}_{p,t}^{j})^{\top}(\bm{\theta}_{p,t}^{j}-\bm{\theta})\quad\forall\bm{\theta}\in\mathbb{R}^{2D}. (14)
Assumption 3.

Let us assume that we have a connected communication graph and a doubly stochastic gossip matrix $\bm{W}^{\prime}$ whose entries are zero if and only if their corresponding edges are not present in the communication graph.

Theorem 1.

Under Assumptions 1, 2, and 3, the sequences $\{\hat{f}^{j}_{p,t}\}$ and $\{\bar{w}_{p,t}\}$ generated by the gossiped and quantized OMKL satisfy

\sum_{j=1}^{J}\sum_{t=1}^{T}\mathbb{E}\left[\mathcal{L}\left(\sum_{p=1}^{P}\bar{w}_{p,t}^{j}\hat{f}^{j}_{p,t}(\bm{x}_{t}^{j})\right)\right]-\sum_{j=1}^{J}\sum_{t=1}^{T}\mathcal{L}\left(\hat{f}_{p}^{*}(\bm{x}_{t}^{j})\right)\leq J\frac{\lVert\bm{\theta}^{*}_{p}\rVert^{2}}{2\eta}+\eta\frac{JTL^{2}}{2}+\eta\frac{JTL\sqrt{12}}{c}+\frac{J\ln P}{\eta}+\eta JT, (15)

where $\bm{\theta}^{*}_{p}$ is the optimal solution to the problem for the $p$-th kernel, $c=\frac{\rho^{2}\delta}{82}$, $\rho$ is the spectral gap of the gossip matrix $\bm{W}^{\prime}$, and $\delta>0$ is the quantizer's compression parameter.
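To get a feel for the constant $c$, the short computation below (our own illustration, reusing the metropolis_gossip_matrix sketch from Section II together with the values $D=20$ and $M=7$ used in Section IV) evaluates $\delta$, $\rho$, and $c$ for a path graph with $J=20$ nodes.

import numpy as np

J, D, M = 20, 20, 7                                   # nodes, random features, levels
delta = 1.0 - min(2 * D / M**2, np.sqrt(2 * D) / M)   # compression parameter, Sec. II
adj = np.diag(np.ones(J - 1), 1) + np.diag(np.ones(J - 1), -1)   # path topology
W = metropolis_gossip_matrix(adj)                     # sketch from Section II
rho = 1.0 - np.sort(np.linalg.eigvalsh(W))[::-1][1]   # spectral gap of W'
c = rho**2 * delta / 82.0
print(delta, rho, c)      # delta is about 0.18; rho and c are small for a path graph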

To prove Thm. 1, we first need to present Lemma 1, which bounds the difference between a single kernel loss and the optimal loss, and Lemma 2, which bounds the error added by the kernel weights.

Lemma 1.

Let us choose the consensus step size as in [16, Thm. 4.1], i.e., $\gamma=\frac{\rho^{2}\delta}{16\rho+\rho^{2}+4\beta^{2}+2\rho\beta^{2}-8\rho\delta}$, where we have defined $\beta=\lVert\bm{I}-\bm{W}^{\prime}\rVert_{2}\in[0,2]$. Then, for $\eta\in(0,1]$ and under Assumptions 1, 2, and 3, we have

\sum_{j=1}^{J}\sum_{t=1}^{T}\mathbb{E}\left[\mathcal{L}\left(\hat{f}^{j}_{p,t}(\bm{x}_{t}^{j})\right)\right]-\sum_{j=1}^{J}\sum_{t=1}^{T}\mathcal{L}\left(\hat{f}_{p}^{*}(\bm{x}_{t}^{j})\right)\leq J\frac{\lVert\bm{\theta}^{*}_{p}\rVert^{2}}{2\eta}+\eta\frac{JTL^{2}}{2}+\eta\frac{JTL\sqrt{12}}{c}. (16)
Proof.

Similar to [1], let us consider an arbitrary $\bm{\theta}\in\mathbb{R}^{2D}$. Define $\bar{\bm{\theta}}_{p,t}=\frac{1}{J}\sum_{j=1}^{J}\bm{\theta}_{p,t}^{j}$, and define $\bar{\bm{\theta}}_{p,t}^{\prime}$ and $\nabla\bar{\mathcal{L}}(\bm{\theta}_{p,t})$ analogously. Since our gossip scheme preserves averages, we can deduce $\bar{\bm{\theta}}_{p,t+1}=\bar{\bm{\theta}}_{p,t}^{\prime}=\bar{\bm{\theta}}_{p,t}-\eta\nabla\bar{\mathcal{L}}(\bm{\theta}_{p,t})$. Thus,

\lVert\bar{\bm{\theta}}_{p,t+1}-\bm{\theta}\rVert^{2}=\lVert\bar{\bm{\theta}}_{p,t}-\eta\nabla\bar{\mathcal{L}}(\bm{\theta}_{p,t})-\bm{\theta}\rVert^{2} (17)
=\lVert\bar{\bm{\theta}}_{p,t}-\bm{\theta}\rVert^{2}+\eta^{2}\lVert\nabla\bar{\mathcal{L}}(\bm{\theta}_{p,t})\rVert^{2} (18)
\quad-2\eta\nabla\bar{\mathcal{L}}(\bm{\theta}_{p,t})^{\top}(\bar{\bm{\theta}}_{p,t}-\bm{\theta}). (19)

By Asm. 2, we have

\mathcal{L}(\bm{\theta}_{p,t}^{j})-\mathcal{L}(\bm{\theta})\leq\nabla\mathcal{L}(\bm{\theta}_{p,t}^{j})^{\top}(\bm{\theta}_{p,t}^{j}+\bar{\bm{\theta}}_{p,t}-\bar{\bm{\theta}}_{p,t}-\bm{\theta})\leq\nabla\mathcal{L}(\bm{\theta}_{p,t}^{j})^{\top}(\bar{\bm{\theta}}_{p,t}-\bm{\theta})+L\lVert\bm{\theta}_{p,t}^{j}-\bar{\bm{\theta}}_{p,t}\rVert. (20)

Inserting (20) into (19) and summing over $j$, we obtain

\sum_{j=1}^{J}\left(\mathcal{L}(\bm{\theta}_{p,t}^{j})-\mathcal{L}(\bm{\theta})\right)\leq\frac{\eta}{2}J\lVert\nabla\bar{\mathcal{L}}(\bm{\theta}_{p,t})\rVert^{2}+J\frac{\lVert\bar{\bm{\theta}}_{p,t}-\bm{\theta}\rVert^{2}-\lVert\bar{\bm{\theta}}_{p,t+1}-\bm{\theta}\rVert^{2}}{2\eta}+L\sum_{j=1}^{J}\lVert\bm{\theta}_{p,t}^{j}-\bar{\bm{\theta}}_{p,t}\rVert, (21)

where we have used Asm. 1 to bound the gradient. Summing over $t$ and bounding again, we derive

\sum_{t=1}^{T}\sum_{j=1}^{J}\left(\mathcal{L}(\bm{\theta}_{p,t}^{j})-\mathcal{L}(\bm{\theta})\right)\leq\frac{\eta}{2}JTL^{2}+L\sum_{t=1}^{T}\sum_{j=1}^{J}\lVert\bm{\theta}_{p,t}^{j}-\bar{\bm{\theta}}_{p,t}\rVert+J\frac{\lVert\bar{\bm{\theta}}_{p,1}-\bm{\theta}\rVert^{2}-\lVert\bar{\bm{\theta}}_{p,T+1}-\bm{\theta}\rVert^{2}}{2\eta}. (22)

Applying [16, Lemma A.2] and Cauchy-Schwarz results in

\sum_{j=1}^{J}\mathbb{E}\lVert\bm{\theta}_{p,t}^{j}-\bar{\bm{\theta}}_{p,t}\rVert^{2}\leq\eta^{2}\frac{12JL^{2}}{c^{2}}\implies (23)
\sum_{j=1}^{J}\mathbb{E}\lVert\bm{\theta}_{p,t}^{j}-\bar{\bm{\theta}}_{p,t}\rVert\leq\eta\frac{JL\sqrt{12}}{c}. (24)

The proof is complete by using (24) to bound (22) and choosing $\bm{\theta}_{p,1}^{j}=\bm{0}$ and $\bm{\theta}=\bm{\theta}_{p}^{*}$. ∎

Figure 1: Comparison between the gossiped OMKL, the OMKL from [1], and the single-kernel SGD from [5], labeled Koloskova. We have used $\sigma=1$ for the latter. The labels indicate the topology used for each algorithm.
Lemma 2.

Under the same conditions as Thm. 1,

\sum_{j=1}^{J}\sum_{t=1}^{T}\sum_{p=1}^{P}\bar{w}_{p,t}^{j}\mathcal{L}\left(\hat{f}^{j}_{p,t}(\bm{x}_{t}^{j})\right)-\sum_{j=1}^{J}\sum_{t=1}^{T}\mathcal{L}\left(\hat{f}^{j}_{p^{\prime},t}(\bm{x}_{t}^{j})\right)\leq\frac{J\ln P}{\eta}+\eta JT. (25)

The proof of Lemma 2 follows the same steps as in [1, Lemma 3].

From the convexity of $\mathcal{L}(\cdot)$,

\mathcal{L}\left(\sum_{p=1}^{P}\bar{w}_{p,t}\hat{f}^{j}_{p,t}(\bm{x}_{t})\right)\leq\sum_{p=1}^{P}\bar{w}_{p,t}\mathcal{L}\left(\hat{f}^{j}_{p,t}(\bm{x}_{t})\right), (26)

and Thm. 1 follows immediately using Lemma 1 and Lemma 2.

IV Experimental results and conclusions

To validate Thm. 1, we have tested Algorithm 1 with real datasets for binary classification, with different topologies, and with different quantization levels. We have used the well-known Kernel Logistic Regression (KLR) loss function [9], i.e., the loss function (5) is defined as

\ln\left(1+\exp(-y\cdot\bm{\theta}^{\top}z_{V}(\bm{x}))\right)+\lambda\lVert\bm{\theta}\rVert^{2}, (27)

where we have used $\Omega(\lVert\bm{\theta}\rVert^{2})=\lVert\bm{\theta}\rVert^{2}$. We have conducted our experiments with three datasets: Banana, Credit-Card, and MNIST. The synthetic data from the Banana dataset [17] ($n=5300$, $d=2$) consist of two non-linearly separable clusters. The Credit-Card dataset [18] ($n=30000$, $d=2$) contains data from credit card users, such as their credit history and whether they have defaulted or not; it is obtained from the UCI machine learning repository [19]. The MNIST dataset [20] ($n=70000$, $d=784$) is a labeled set of handwritten digits from 0 to 9. We divide them into two classes: those that are the digit 8 and those that are not.

Our experimental setup has $J=20$ nodes, dimension $D=20$ for our RF approximation, regularization parameter $\lambda=0.001$, and three Gaussian kernels with $\sigma\in\{1,3,5\}$, where a Gaussian kernel with parameter $\sigma$ is defined as $\kappa(\bm{x},\bm{x^{\prime}})=\exp\left[-\lVert\bm{x}-\bm{x^{\prime}}\rVert^{2}/(2\sigma^{2})\right]$. Simulations have been performed 100 times with different sets of Random Features, and the corresponding mean of the loss function is plotted. We report the loss instead of the classification accuracy because the difference in the latter is minimal. We have used the quantizer described in (12) with 7 quantization levels, that is, 3 bits per element in any transmitted array. Our learning rates are $\eta=0.01$ and $\gamma=0.9\eta=0.009$.
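As a usage illustration of the Algorithm 1 sketch from Section III, the configuration above can be wired together as follows; the synthetic data merely stand in for the real datasets, and all helper names are the ones introduced in our earlier sketches.

import numpy as np

rng = np.random.default_rng(0)
J, T, d, D = 20, 100, 2, 20
sigmas = [1.0, 3.0, 5.0]
# One spectral sample matrix per Gaussian kernel (pi_kappa is N(0, sigma^{-2} I)).
V_list = [rng.normal(scale=1.0 / s, size=(D, d)) for s in sigmas]
path_adj = np.diag(np.ones(J - 1), 1) + np.diag(np.ones(J - 1), -1)
W = metropolis_gossip_matrix(path_adj)            # path topology, Section II sketch
X = rng.normal(size=(J, T, d))                    # synthetic stand-in for a dataset
Y = np.sign(X[..., 0] * X[..., 1] + 0.1 * rng.normal(size=(J, T)))  # labels in {-1, +1}
theta, w_bar = gossiped_quantized_omkl(X, Y, V_list, W, T,
                                       eta=0.01, gamma=0.009, num_levels=7)
print(w_bar[0])                                   # learned kernel weights at Node 1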

Fig. 1 compares the performance of our algorithm against two benchmarks. Since Algorithm 1 can be viewed as an extension of [1] to non-complete graphs, the first benchmark is the OMKL algorithm from [1]. There is only a small difference between the performance of OMKL and that of our gossiped OMKL, although OMKL requires many more inter-node communications. This is despite the fact that OMKL runs over a complete topology, whereas the gossiped OMKL is executed over the worst-case scenario, i.e., a path topology, which includes a much smaller number of connections and as a result requires much less communication.

Gossiped OMKL can also be considered as an extension of [5] to multi-kernel learning, which justifies using the single-kernel SGD from [5] as our second benchmark. Gossiped OMKL using three kernels clearly outperforms the single-kernel approach, named Koloskova in Fig. 1.

Figure 2: The effect of graph topology, without quantization. Average loss function at each iteration, with the Credit-Card dataset. Since the Credit-Card dataset has $n=30000$ data observations and $J=20$ nodes, we have run the algorithm for $1500$ iterations.

Although not shown in the figure, we have observed that, for the values of $\eta$ and $\gamma$ chosen in Fig. 1, or smaller values, the choice of topology does not affect the performance of our algorithm. To observe the effect of topology on algorithm performance, larger step sizes ($\eta=0.1$ and $\gamma=0.09$) are chosen in Fig. 2. The figure shows simulations using the Credit-Card dataset, without quantization, on three different topologies: the complete graph, the ring, and the path. We observe that more densely connected communication graphs lead to better performance. We have also tested the algorithm for $M\geq 7$ quantization levels, which satisfy the condition $\delta>0$ from Section II, and all of them perform as well as the non-quantized version, with loss function differences on the order of $10^{-6}$.

The results show that our gossiped OMKL algorithm can successfully extend the OMKL algorithm [1] to non-complete graphs and the distributed learning algorithm in [5] to multi-kernel learning.

References

  • [1] Y. Shen, S. Karimi-Bidhendi, and H. Jafarkhani, “Distributed and quantized online multi-kernel learning,” IEEE Transactions on Signal Processing, vol. 69, pp. 5496–5511, 2021.
  • [2] S. Hong and J. Chae, “Distributed online learning with multiple kernels,” IEEE Transactions on Neural Networks and Learning Systems, 2021.
  • [3] S. Savazzi, M. Nicoli, M. Bennis, S. Kianoush, and L. Barbieri, “Opportunities of federated learning in connected, cooperative, and automated industrial systems,” IEEE Communications Magazine, vol. 59, no. 2, pp. 16–21, 2021.
  • [4] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings et al., “Advances and open problems in federated learning,” Foundations and Trends® in Machine Learning, vol. 14, no. 1–2, pp. 1–210, 2021.
  • [5] A. Koloskova, S. U. Stich, and M. Jaggi, “Decentralized stochastic optimization and gossip algorithms with compressed communication,” 36th International Conference on Machine Learning, ICML 2019, vol. 2019-June, pp. 6088–6111, 2019.
  • [6] S. C. H. Hoi, R. Jin, P. Zhao, and T. Yang, “Online multiple kernel classification,” Machine Learning, vol. 90, no. 2, p. 289–316, Feb 2013. [Online]. Available: https://doi.org/10.1007/s10994-012-5319-2
  • [7] D. Sahoo, S. C. Hoi, and B. Li, “Online multiple kernel regression,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining.   New York New York USA: ACM, Aug 2014, p. 293–302. [Online]. Available: https://dl.acm.org/doi/10.1145/2623330.2623712
  • [8] C. Tang, Z. Li, J. Wang, X. Liu, W. Zhang, and E. Zhu, “Unified one-step multi-view spectral clustering,” IEEE Transactions on Knowledge and Data Engineering, p. 1–1, 2022.
  • [9] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, ser. Adaptive computation and machine learning.   MIT Press, 2002.
  • [10] Y. Shen, T. Chen, and G. B. Giannakis, “Random feature-based online multi-kernel learning in environments with unknown dynamics,” Journal of Machine Learning Research, vol. 20, 2019.
  • [11] A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” in Advances in Neural Information Processing Systems, vol. 20.   Curran Associates, Inc., 2007. [Online]. Available: https://proceedings.neurips.cc/paper/2007/hash/013a006f03dbc5392effeb8f18fda755-Abstract.html
  • [12] L. Xiao and S. Boyd, “Fast linear iterations for distributed averaging,” Systems & Control Letters, vol. 53, no. 1, pp. 65–78, 2004. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167691104000398
  • [13] S. Boyd, P. Diaconis, and L. Xiao, “Fastest mixing Markov chain on a graph,” SIAM Review, vol. 46, no. 4, pp. 667–689, 2004.
  • [14] B. Gharesifard and J. Cortés, “When does a digraph admit a doubly stochastic adjacency matrix?” in Proceedings of the 2010 American Control Conference.   IEEE, 2010, pp. 2440–2445.
  • [15] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [16] A. Koloskova, T. Lin, S. U. Stich, and M. Jaggi, “Decentralized deep learning with arbitrary communication compression,” in International Conference on Learning Representations, 2019.
  • [17] G. Rätsch, T. Onoda, and K.-R. Müller, “Soft margins for adaboost,” Machine learning, vol. 42, no. 3, pp. 287–320, 2001.
  • [18] I.-C. Yeh and C.-h. Lien, “The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients,” Expert systems with applications, vol. 36, no. 2, pp. 2473–2480, 2009.
  • [19] D. Dua and C. Graff, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml
  • [20] Y. LeCun, C. Cortes, and C. J. C. Burges. (1998) The MNIST database of handwritten digits. [Online]. Available: http://yann.lecun.com/exdb/mnist/