
Gossiped and Quantized Online Multi-Kernel Learning

Tomas Ortega and Hamid Jafarkhani. The authors are with the Center for Pervasive Communications & Computing and the EECS Department, University of California, Irvine, Irvine, CA 92697 USA (e-mail: {tomaso, hamidj}@uci.edu). This work was supported in part by NSF Award ECCS-2207457.
Abstract

In instances of online kernel learning where little prior information is available and centralized learning is infeasible, past research has shown that distributed and online multi-kernel learning provides sub-linear regret as long as every pair of nodes in the network can communicate (i.e., the communication network is a complete graph). In addition, to manage the communication load, which is often a performance bottleneck, communications between nodes can be quantized. This letter expands these results to non-fully connected graphs, which are common in wireless sensor networks. To address this challenge, we propose a gossip algorithm and prove that it achieves sub-linear regret. Experiments with real datasets confirm our findings.

Index Terms:
Kernel-based learning, quantization, gossip algorithms, federated learning, sensor networks.

I Introduction

Many of the latest efforts in machine learning are focused on bringing learning as close to data collection as possible. This is of practical interest in a diverse array of applications, sensor networks in particular. In sensor networks, nodes are often communication-constrained, operate distributively, and can send only a few bits to their neighbors.

There are several papers in the literature that consider the problem of designing distributed learning methods for Online Multi-Kernel Learning (OMKL). In particular, [1] contains the design and regret analysis of quantized and distributed OMKL for a fully connected (complete) graph. Although the fully connected case is important, it is not applicable in many sensor networks. This letter designs a new algorithm that extends the theoretical guarantees of the distributed and quantized OMKL in [1] to general graphs that need not be fully connected. Unlike [1], our method uses a novel hidden-state quantization scheme in conjunction with a gossip algorithm. One major advantage, compared with [1], is that our algorithm does not need to communicate the loss function. Another example of a distributed OMKL that works for a non-fully connected graph is presented in [2]. It requires communicating an unlimited number of bits between neighbors, and its assumptions are stricter than those of [1]. To manage the required communication throughput, which is often a performance bottleneck [3, 4], this letter attains the properties of [2] under looser assumptions while communicating only a limited number of bits between neighboring nodes.

In [5], a distributed stochastic gradient descent scheme with quantized communication is discussed. The proposed algorithm uses an extra shared “hidden” state and can be used for the single-kernel case, but not multi-kernel settings. In addition to extending [1] to general non-fully connected networks, our work can be considered as the extension of [5] to multi-kernel cases.

There are a variety of learning tasks that OMKL is suited for, such as multi-kernel classification, regression, and clustering [6, 7, 8]. For a more in-depth review of the literature, see [1].

II Problem formulation

Let us consider a network with $J$ nodes and model it as an undirected connected graph, where two nodes are connected if and only if they can communicate. Our set of nodes $V$ and set of edges $E$ define a communication graph $G=(V,E)$. At each instant of time $t$, Node $j$ receives the data $\bm{x}^{j}_{t}\in\mathbb{R}^{d}$ and the desired response $y^{j}_{t}\in\mathbb{R}$. Our task is to find an approximating function $f(\bm{x}^{j}_{t})$ for $y^{j}_{t}$. As in [1], $f$ belongs to a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}=\left\{f\mid f(\bm{x})=\sum_{t=1}^{\infty}\sum_{j=1}^{J}\alpha^{j}_{t}\kappa(\bm{x},\bm{x}^{j}_{t})\right\}$, where $\kappa(\bm{x},\bm{x}^{j}_{t}):\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}$ is a kernel function that measures the similarity between $\bm{x}$ and $\bm{x}^{j}_{t}$. Let us consider the following optimization problem

\min_{f\in\mathcal{H}}\sum_{t=1}^{T}\sum_{j=1}^{J}\mathcal{C}\left(f(\bm{x}^{j}_{t}),y^{j}_{t}\right)+\lambda\Omega\left(\lVert f\rVert_{\mathcal{H}}^{2}\right), (1)

where $\mathcal{C}$ is a cost function and $\lambda>0$ is a regularization parameter that controls an increasing function $\Omega$. An optimal solution for this problem exists in the form $\hat{f}(\bm{x})=\sum_{t=1}^{T}\sum_{j=1}^{J}\alpha_{t}^{j}\kappa(\bm{x},\bm{x}_{t}^{j})$, where $\alpha_{t}^{j}$ are real numbers [9].

In multi-kernel learning, a weighted combination of several kernels is selected to improve the performance, compared to single-kernel learning. A convex combination of the kernels $\left\{\kappa_{p}\right\}_{p=1}^{P}$, where each $\kappa_{p}$ belongs to an RKHS $\mathcal{H}_{p}$, is also a kernel of an RKHS, denoted by $\bar{\mathcal{H}}=\mathcal{H}_{1}\oplus\cdots\oplus\mathcal{H}_{P}$ [9]. Here, $\oplus$ indicates the direct sum of Hilbert spaces. Using $\bar{\mathcal{H}}$ instead of $\mathcal{H}$, we turn problem (1) into:

\min_{\left\{\bar{w}_{p}^{j}\right\},\left\{f_{p}\right\}}\sum_{t=1}^{T}\sum_{j=1}^{J}\mathcal{C}\left(\sum_{p=1}^{P}\bar{w}_{p}^{j}f_{p}(\bm{x}_{t}^{j}),y_{t}^{j}\right)+\lambda\Omega\left(\lVert\sum_{p=1}^{P}\bar{w}_{p}^{j}f_{p}\rVert^{2}_{\bar{\mathcal{H}}}\right), (2a)
\text{s.t. }\sum_{p=1}^{P}\bar{w}_{p}^{j}=1,\,\bar{w}_{p}^{j}\geq 0,\,f_{p}\in\mathcal{H}_{p}. (2b)

To evade the curse of dimensionality, a Random Feature (RF) approximation is used [1, 10]. For normalized shift-invariant kernels, i.e., $\kappa(\bm{x},\bm{x}^{\prime})=\kappa(\bm{x}-\bm{x}^{\prime})$, their Fourier transforms $\pi_{\kappa}(v)$ exist and, due to normalization, integrate to one. Viewing $\pi_{\kappa}$ as a pdf, $\kappa(\bm{x}-\bm{x}^{\prime})=\int\pi_{\kappa}(v)e^{jv^{\top}(\bm{x}-\bm{x}^{\prime})}\,dv=\mathbb{E}_{v}[e^{jv^{\top}(\bm{x}-\bm{x}^{\prime})}]$. Gaussian, Laplacian, and Cauchy kernels are three examples of normalized shift-invariant kernels that we can use [11]; however, any family of shift-invariant kernels can be used for the RF approximation we describe next. Drawing $D$ i.i.d. samples $\{\bm{v}_{i}\}_{i=1}^{D}$ from $\pi_{\kappa}(v)$, we define

z_{V}(\bm{x})=\frac{1}{\sqrt{D}}\left[\sin(\bm{v}_{1}^{\top}\bm{x}),\ldots,\sin(\bm{v}_{D}^{\top}\bm{x}),\cos(\bm{v}_{1}^{\top}\bm{x}),\ldots,\cos(\bm{v}_{D}^{\top}\bm{x})\right]^{\top}, (3)

which has the property that $\kappa(\bm{x},\bm{x^{\prime}})=\mathbb{E}_{v}[z_{V}(\bm{x^{\prime}})^{\top}z_{V}(\bm{x})]$. So, given a fixed $\{\bm{v}_{i}\}_{i=1}^{D}$, we can approximate $\kappa(\bm{x},\bm{x^{\prime}})\approx z_{V}(\bm{x^{\prime}})^{\top}z_{V}(\bm{x})$. This is called the RF approximation, and it is an unbiased and consistent estimate of $\kappa(\bm{x},\bm{x^{\prime}})$ [10]. Thus, a weight vector $\bm{\theta}\in\mathbb{R}^{2D}$ can be constructed such that

\hat{f}(\bm{x})=\sum_{t=1}^{T}\sum_{j=1}^{J}\alpha_{t}^{j}\kappa(\bm{x},\bm{x}_{t}^{j})\approx\sum_{t=1}^{T}\sum_{j=1}^{J}\alpha_{t}^{j}z_{V}(\bm{x}_{t}^{j})^{\top}z_{V}(\bm{x})=\bm{\theta}^{\top}z_{V}(\bm{x}). (4)
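For concreteness, the short Python sketch below illustrates the RF approximation for a Gaussian kernel with parameter $\sigma$, whose spectral density $\pi_{\kappa}$ is a zero-mean Gaussian with covariance $\sigma^{-2}\bm{I}$; the function name gaussian_rf_map is our own illustrative choice and is reused in later sketches.

import numpy as np

def gaussian_rf_map(X, V):
    # Random-feature map z_V of (3); V holds the D spectral samples v_i as rows.
    D = V.shape[0]
    proj = X @ V.T                               # inner products v_i^T x
    return np.hstack([np.sin(proj), np.cos(proj)]) / np.sqrt(D)

rng = np.random.default_rng(0)
d, D, sigma = 2, 2000, 1.0
V = rng.normal(scale=1.0 / sigma, size=(D, d))   # D i.i.d. samples from pi_kappa

x, xp = rng.normal(size=(1, d)), rng.normal(size=(1, d))
approx = (gaussian_rf_map(x, V) @ gaussian_rf_map(xp, V).T).item()
exact = np.exp(-np.linalg.norm(x - xp) ** 2 / (2.0 * sigma ** 2))
print(approx, exact)     # close for large D, reflecting the unbiased RF estimate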

Then, the loss function is defined as

\mathcal{L}(f(\bm{x}))=\mathcal{C}(\bm{\theta}^{\top}z_{V}(\bm{x}),y)+\lambda\Omega(\lVert\bm{\theta}\rVert^{2}), (5)

and the following equations are obtained for Kernel $p$ and Node $j$:

\hat{f}^{j}_{p,t}(\bm{x}_{t}^{j}) = {\bm{\theta}^{j}_{p,t}}^{\top}z_{V_{p}}(\bm{x}_{t}^{j}), (6)
\bm{\theta}_{p,t+1}^{j} = \bm{\theta}_{p,t}^{j}-\eta\nabla\mathcal{L}({\bm{\theta}^{j}_{p,t}}^{\top}z_{V_{p}}(\bm{x}_{t}^{j}),y_{t}^{j}), (7)
w_{p,t+1}^{j} = w_{p,t}^{j}\exp\left(-\eta\mathcal{L}\left(\hat{f}^{j}_{p,t}(\bm{x}_{t}^{j})\right)\right), (8)

where $\eta\in(0,1)$ is a learning rate. The weights are normalized as $\bar{w}_{p,t}^{j}=w_{p,t}^{j}/\sum_{p=1}^{P}w_{p,t}^{j}$ to have

\hat{f}^{j}_{t}(\bm{x}_{t}^{j})=\sum_{p=1}^{P}\bar{w}_{p,t}^{j}\hat{f}^{j}_{p,t}(\bm{x}_{t}^{j}). (9)

The above equations represent how local online multi-kernel learning models can be built [1, 10]. In contrast to previous research, each node $j$ in our scenario can only communicate with its neighboring nodes in the set $\mathcal{S}_{j}\subseteq V$. Thus, the local model $\bm{\theta}^{j}_{p,t}$ may differ from node to node, and our algorithm must ensure that all nodes converge to the same model. For this purpose, the local models are propagated using a gossip algorithm. At each communication round, the nodes calculate the weighted average of the information provided by their neighbors and use it to update their local information. The weight Node $j$ assigns to the information coming from Node $i$ is denoted $w_{ij}^{\prime}$. Notice that $\sum_{i\in\mathcal{S}_{j}}w_{ij}^{\prime}=1$ for all $j$, since the weights of a weighted average must add up to one. We impose $w_{ij}^{\prime}=w_{ji}^{\prime}$ so that we can associate the weights with the edges of the undirected communication graph $G$. Thus, the weights define the gossip matrix $\bm{W}^{\prime}$, which is $J\times J$ and doubly stochastic [12, 13, 14]. The spectral gap of $\bm{W}^{\prime}$ is denoted by $\rho=1-\lambda_{2}(\bm{W}^{\prime})\in(0,1]$, where $\lambda_{2}(\bm{W}^{\prime})$ is the second-largest eigenvalue of $\bm{W}^{\prime}$.
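A gossip matrix with these properties can be constructed in several standard ways; the Python sketch below uses the Metropolis-Hastings rule, which yields a symmetric, doubly stochastic $\bm{W}^{\prime}$ for any connected undirected graph. The function metropolis_gossip_matrix and the four-node path example are our own illustrations, not a construction prescribed by this letter.

import numpy as np

def metropolis_gossip_matrix(adj):
    # Symmetric doubly stochastic gossip matrix W' from a 0/1 adjacency matrix.
    J = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((J, J))
    for i in range(J):
        for k in range(J):
            if i != k and adj[i, k]:
                W[i, k] = 1.0 / (1.0 + max(deg[i], deg[k]))
        W[i, i] = 1.0 - W[i].sum()        # self-weight so each row sums to one
    return W

# Example: a path graph with J = 4 nodes.
adj = np.array([[0., 1., 0., 0.],
                [1., 0., 1., 0.],
                [0., 1., 0., 1.],
                [0., 0., 1., 0.]])
W = metropolis_gossip_matrix(adj)
lam2 = np.sort(np.linalg.eigvalsh(W))[::-1][1]    # second-largest eigenvalue
print(W.sum(axis=0), W.sum(axis=1), 1.0 - lam2)   # rows/columns sum to 1; rho > 0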

Such a gossip algorithm works very well when the nodes communicate their states perfectly with their neighbors. In practice, however, only a few bits can be communicated, i.e., information needs to be quantized before being shared with the neighbors. We use a random quantizer $Q:\mathbb{R}^{n}\to\mathbb{R}^{n}$ to represent an arbitrary vector $\bm{x}\in\mathbb{R}^{n}$ with $Q(\bm{x})$ in an efficient way. Our results hold for any random quantizer that satisfies

\mathbb{E}_{Q}\lVert Q(\bm{x})-\bm{x}\rVert^{2}\leq(1-\delta)\lVert\bm{x}\rVert^{2},\quad\forall\bm{x}\in\mathbb{R}^{n}, (10)

for some $\delta>0$, which we call the compression parameter. Here, $\mathbb{E}_{Q}$ denotes the expected value with respect to the internal randomness of $Q(\cdot)$. In simulations, we use the randomized quantizer of [15]. Each element $v_{i}$ of a non-zero vector $\bm{v}\in\mathbb{R}^{n}$ is quantized by

Q_{M}(v_{i})=\lVert\bm{v}\rVert\operatorname{sign}(v_{i})\,\xi_{i}(\bm{v},M), (11)

where $M=2^{b}-1$ is the number of quantization levels and, defining $l=\lfloor M\tfrac{|v_{i}|}{\lVert\bm{v}\rVert}\rfloor$,

\xi_{i}(\bm{v},M)=\begin{cases}\frac{l}{M}&\text{with probability }1-M\tfrac{|v_{i}|}{\lVert\bm{v}\rVert}+l,\\ \frac{l+1}{M}&\text{otherwise},\end{cases} (12)

is represented by $b$ bits. Following [15, Lemma 3.1], Eq. (10) holds for $\delta=1-\min\left(\tfrac{2D}{M^{2}},\tfrac{\sqrt{2D}}{M}\right)>0$.
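A possible implementation of the randomized quantizer (11)-(12) is sketched below; it returns the dequantized vector rather than the packed $b$-bit representation, and the function name randomized_quantizer is our own.

import numpy as np

def randomized_quantizer(v, num_levels, rng):
    # Randomized quantizer of (11)-(12): returns the dequantized vector Q_M(v).
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    ratio = np.abs(v) / norm                 # |v_i| / ||v|| in [0, 1]
    lower = np.floor(num_levels * ratio)     # l = floor(M |v_i| / ||v||)
    prob_up = num_levels * ratio - lower     # probability of rounding up to (l+1)/M
    xi = (lower + (rng.random(v.shape) < prob_up)) / num_levels
    return norm * np.sign(v) * xi

rng = np.random.default_rng(0)
v = np.array([0.3, -1.2, 0.05, 0.7])
print(randomized_quantizer(v, num_levels=7, rng=rng))   # b = 3 bits, M = 2^3 - 1 = 7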

III Algorithm and Regret Analysis

We present Algorithm 1 for gossiped and quantized OMKL. For ease of notation, we denote $\mathcal{L}({\bm{\theta}^{j}_{p,t}}^{\top}z_{V_{p}}(\bm{x}_{t}^{j}),y_{t}^{j})$ by $\mathcal{L}(\bm{\theta}^{j}_{p,t})$ from now on. In Algorithm 1, at each time instance, nodes collect their local data and transform them according to the RF approximation. Then, the kernel losses are computed and used to update the kernel weights. For the gossip step, we define a hidden state $\bm{h}_{p,t}^{j}\in\mathbb{R}^{2D}$ for each node $j$ that is known to all neighbors in $\mathcal{S}_{j}$, because it is updated by the same quantized values known to all neighbors. The new local state $\bm{\theta}_{p,t}^{j}$ is the sum of $\bm{\theta}_{p,t-1}^{\prime j}$, an auxiliary variable where we store the local learning, plus a gossip step of size $\gamma$ using the information from the hidden states. Subsequently, each node $j$ prepares the update to the hidden state by quantizing the difference between its local state $\bm{\theta}_{p,t}^{j}$ and the common hidden state $\bm{h}_{p,t}^{j}$. This quantized difference is sent to the neighbors, which use it to collectively update the hidden state. Finally, each node performs local learning with a step size $\eta$ and stores the result in the auxiliary variable $\bm{\theta}_{p,t}^{\prime j}$. A Python sketch that simulates the algorithm end to end is given after the pseudocode.

The role of the hidden state and the quantized update is to maintain an accurate representation of the neighbors' states without broadcasting the full state at each time instance.

Algorithm 1 Gossiped and Quantized OMKL at Node $j$
1:  Initialize $w_{p}^{j}=1/P$ and $\bm{h}_{p}^{j}=\bm{\theta}_{p}^{j}=\bm{\theta}_{p}^{\prime j}=0$ for all $p$.
2:  for $t=1,\ldots,T$ do
3:     Obtain data $\bm{x}_{t}^{j}$ and construct $z_{p}(\bm{x}_{t}^{j})$ for all $p$.
4:     Compute $\mathcal{L}(\hat{f}^{j}_{p,t}(\bm{x}_{t}^{j}))$ for all $p$.
5:     Update $w_{p,t}^{j}$ according to (8).
6:     Compute $\mathcal{L}(\hat{f}^{j}_{t}(\bm{x}_{t}^{j}))$.
7:     for $p=1,\ldots,P$ do
8:        Update using the hidden states: $\bm{\theta}_{p,t}^{j}=\bm{\theta}_{p,t-1}^{\prime j}+\gamma\sum_{i=1}^{J}w^{\prime}_{ij}(\bm{h}_{p,t}^{i}-\bm{h}_{p,t}^{j})$.
9:        Compress $\bm{q}_{p,t}^{j}=Q(\bm{\theta}_{p,t}^{j}-\bm{h}_{p,t}^{j})$.
10:        for $i$ neighbor of $j$, including $j$ do
11:           Send $\bm{q}_{p,t}^{j}$ to $i$ and receive $\bm{q}_{p,t}^{i}$.
12:           Update $\bm{h}_{p,t+1}^{i}=\bm{h}_{p,t}^{i}+\bm{q}_{p,t}^{i}$.
13:        end for
14:        As in (7), update $\bm{\theta}_{p,t}^{\prime j}=\bm{\theta}_{p,t}^{j}-\eta\nabla\mathcal{L}(\bm{\theta}_{p,t}^{j})$.
15:     end for
16:  end for
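The sketch below simulates all $J$ nodes of Algorithm 1 in a single process. It reuses the gaussian_rf_map and randomized_quantizer sketches from Section II, uses the KLR loss (27) of Section IV for $\mathcal{L}$, and, as a simplification, evaluates the kernel losses in steps 4-6 at the most recent local state; it is an illustration rather than the exact implementation used in our experiments.

import numpy as np

def klr_loss_and_grad(theta, z, y, lam=0.001):
    # Kernel logistic regression loss (27) and its gradient with respect to theta.
    margin = y * (theta @ z)
    loss = np.logaddexp(0.0, -margin) + lam * theta @ theta
    grad = -y * z / (1.0 + np.exp(margin)) + 2.0 * lam * theta
    return loss, grad

def gossiped_quantized_omkl(X, Y, V_list, W, T, eta=0.01, gamma=0.009,
                            num_levels=7, rng=None):
    # X: (J, T, d) data, Y: (J, T) labels in {-1, +1}, V_list: one (D, d) spectral
    # sample matrix per kernel, W: (J, J) doubly stochastic gossip matrix.
    rng = np.random.default_rng(0) if rng is None else rng
    J, P, twoD = W.shape[0], len(V_list), 2 * V_list[0].shape[0]
    w = np.ones((J, P)) / P                      # kernel weights, step 1
    theta = np.zeros((J, P, twoD))               # local states theta_{p,t}^j
    theta_prime = np.zeros((J, P, twoD))         # auxiliary states theta'_{p,t}^j
    h = np.zeros((J, P, twoD))                   # shared hidden states h_{p,t}^j
    for t in range(T):
        z = np.stack([[gaussian_rf_map(X[j, t][None, :], V)[0] for V in V_list]
                      for j in range(J)])        # (J, P, 2D), step 3
        for j in range(J):                       # steps 4-5: kernel losses and weights
            losses = np.array([klr_loss_and_grad(theta[j, p], z[j, p], Y[j, t])[0]
                               for p in range(P)])
            w[j] *= np.exp(-eta * losses)
        for p in range(P):
            for j in range(J):                   # step 8: gossip with hidden states
                theta[j, p] = theta_prime[j, p] + gamma * (
                    W[:, j] @ (h[:, p] - h[j, p]))
            q = np.stack([randomized_quantizer(theta[j, p] - h[j, p],
                                               num_levels, rng) for j in range(J)])
            h[:, p] += q                         # steps 9-13: exchange and update
            for j in range(J):                   # step 14: local learning
                _, g = klr_loss_and_grad(theta[j, p], z[j, p], Y[j, t])
                theta_prime[j, p] = theta[j, p] - eta * g
    return theta, w / w.sum(axis=1, keepdims=True)

A prediction at Node $j$ would then combine the per-kernel outputs as in (9), i.e., $\sum_{p}\bar{w}_{p}^{j}{\bm{\theta}_{p}^{j}}^{\top}z_{V_{p}}(\bm{x})$.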

In what follows, we provide a regret analysis to study the performance of our algorithm. First, we make the following assumptions, similar to those in the literature.

Assumption 1.

Let us assume a $K$-smooth loss function

\lVert\nabla\mathcal{L}(\bm{\theta}_{1})-\nabla\mathcal{L}(\bm{\theta}_{2})\rVert\leq K\lVert\bm{\theta}_{1}-\bm{\theta}_{2}\rVert\quad\forall\bm{\theta}_{1},\bm{\theta}_{2}\in\mathbb{R}^{2D}, (13)

with a bounded gradient $\lVert\nabla\mathcal{L}(\bm{\theta})\rVert\leq L$.

Assumption 2.

Let us assume a convex loss function with respect to $\bm{\theta}$, i.e.,

\mathcal{L}(\bm{\theta}_{p,t}^{j})-\mathcal{L}(\bm{\theta})\leq\nabla\mathcal{L}(\bm{\theta}_{p,t}^{j})^{\top}(\bm{\theta}_{p,t}^{j}-\bm{\theta})\quad\forall\bm{\theta}\in\mathbb{R}^{2D}. (14)
Assumption 3.

Let us assume that we have a connected communication graph and a doubly stochastic gossip matrix $\bm{W}^{\prime}$ whose entries are zero if and only if their corresponding edges are not present in the communication graph.

Theorem 1.

Under Assumptions 1, 2, and 3, the sequences $\{\hat{f}^{j}_{p,t}\}$ and $\{\bar{w}_{p,t}\}$ generated by the gossiped and quantized OMKL satisfy

\sum_{j=1}^{J}\sum_{t=1}^{T}\mathbb{E}\left[\mathcal{L}\left(\sum_{p=1}^{P}\bar{w}_{p,t}^{j}\hat{f}^{j}_{p,t}(\bm{x}_{t}^{j})\right)\right]-\sum_{j=1}^{J}\sum_{t=1}^{T}\mathcal{L}\left(\hat{f}_{p}^{*}(\bm{x}_{t}^{j})\right)\leq J\frac{\lVert\bm{\theta}^{*}_{p}\rVert^{2}}{2\eta}+\eta\frac{JTL^{2}}{2}+\eta\frac{JTL\sqrt{12}}{c}+\frac{J\ln P}{\eta}+\eta JT, (15)

where $\bm{\theta}^{*}_{p}$ is the optimal solution to the problem for the $p$-th kernel, $c=\frac{\rho^{2}\delta}{82}$, $\rho$ is the spectral gap of the gossip matrix $\bm{W}^{\prime}$, and $\delta>0$ is the quantizer's compression parameter.
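To get a feel for the constant $c$, the short computation below (our own illustration, reusing the metropolis_gossip_matrix sketch from Section II together with the values $D=20$ and $M=7$ used in Section IV) evaluates $\delta$, $\rho$, and $c$ for a path graph with $J=20$ nodes.

import numpy as np

J, D, M = 20, 20, 7                                   # nodes, random features, levels
delta = 1.0 - min(2 * D / M**2, np.sqrt(2 * D) / M)   # compression parameter, Sec. II
adj = np.diag(np.ones(J - 1), 1) + np.diag(np.ones(J - 1), -1)   # path topology
W = metropolis_gossip_matrix(adj)                     # sketch from Section II
rho = 1.0 - np.sort(np.linalg.eigvalsh(W))[::-1][1]   # spectral gap of W'
c = rho**2 * delta / 82.0
print(delta, rho, c)      # delta is about 0.18; rho and c are small for a path graph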

To prove Thm. 1, we first need to present Lemma 1, which bounds the difference between a single kernel loss and the optimal loss, and Lemma 2, which bounds the error added by the kernel weights.

Lemma 1.

Let us choose the consensus step size as in [16, Thm. 4.1], i.e., $\gamma=\frac{\rho^{2}\delta}{16\rho+\rho^{2}+4\beta^{2}+2\rho\beta^{2}-8\rho\delta}$, where we have defined $\beta=\lVert\bm{I}-\bm{W}^{\prime}\rVert_{2}\in[0,2]$. Then, for $\eta\in(0,1]$ and under Assumptions 1, 2, and 3, we have

\sum_{j=1}^{J}\sum_{t=1}^{T}\mathbb{E}\left[\mathcal{L}\left(\hat{f}^{j}_{p,t}(\bm{x}_{t}^{j})\right)\right]-\sum_{j=1}^{J}\sum_{t=1}^{T}\mathcal{L}\left(\hat{f}_{p}^{*}(\bm{x}_{t}^{j})\right)\leq J\frac{\lVert\bm{\theta}^{*}_{p}\rVert^{2}}{2\eta}+\eta\frac{JTL^{2}}{2}+\eta\frac{JTL\sqrt{12}}{c}. (16)
Proof.

Similar to [1], let us consider an arbitrary $\bm{\theta}\in\mathbb{R}^{2D}$. Define $\bar{\bm{\theta}}_{p,t}=\frac{1}{J}\sum_{j=1}^{J}\bm{\theta}_{p,t}^{j}$, and define $\bar{\bm{\theta}}_{p,t}^{\prime}$ and $\nabla\bar{\mathcal{L}}(\bm{\theta}_{p,t})$ analogously. Since our gossip scheme preserves averages, we can deduce $\bar{\bm{\theta}}_{p,t+1}=\bar{\bm{\theta}}_{p,t}^{\prime}=\bar{\bm{\theta}}_{p,t}-\eta\nabla\bar{\mathcal{L}}(\bm{\theta}_{p,t})$. Thus,

\lVert\bar{\bm{\theta}}_{p,t+1}-\bm{\theta}\rVert^{2}=\lVert\bar{\bm{\theta}}_{p,t}-\eta\nabla\bar{\mathcal{L}}(\bm{\theta}_{p,t})-\bm{\theta}\rVert^{2} (17)
=\lVert\bar{\bm{\theta}}_{p,t}-\bm{\theta}\rVert^{2}+\eta^{2}\lVert\nabla\bar{\mathcal{L}}(\bm{\theta}_{p,t})\rVert^{2} (18)
\quad-2\eta\nabla\bar{\mathcal{L}}(\bm{\theta}_{p,t})^{\top}(\bar{\bm{\theta}}_{p,t}-\bm{\theta}). (19)

By Asm. 2, we have

\mathcal{L}(\bm{\theta}_{p,t}^{j})-\mathcal{L}(\bm{\theta})\leq\nabla\mathcal{L}(\bm{\theta}_{p,t}^{j})^{\top}(\bm{\theta}_{p,t}^{j}+\bar{\bm{\theta}}_{p,t}-\bar{\bm{\theta}}_{p,t}-\bm{\theta})\leq\nabla\mathcal{L}(\bm{\theta}_{p,t}^{j})^{\top}(\bar{\bm{\theta}}_{p,t}-\bm{\theta})+L\lVert\bm{\theta}_{p,t}^{j}-\bar{\bm{\theta}}_{p,t}\rVert. (20)

Inserting (20) into (19) and summing over $j$, we obtain

\sum_{j=1}^{J}\left(\mathcal{L}(\bm{\theta}_{p,t}^{j})-\mathcal{L}(\bm{\theta})\right)\leq\frac{\eta}{2}J\lVert\nabla\bar{\mathcal{L}}(\bm{\theta}_{p,t})\rVert^{2}+J\frac{\lVert\bar{\bm{\theta}}_{p,t}-\bm{\theta}\rVert^{2}-\lVert\bar{\bm{\theta}}_{p,t+1}-\bm{\theta}\rVert^{2}}{2\eta}+L\sum_{j=1}^{J}\lVert\bm{\theta}_{p,t}^{j}-\bar{\bm{\theta}}_{p,t}\rVert, (21)

where we have used Asm. 1 to bound the gradient. Summing over $t$ and bounding again, we derive

\sum_{t=1}^{T}\sum_{j=1}^{J}\left(\mathcal{L}(\bm{\theta}_{p,t}^{j})-\mathcal{L}(\bm{\theta})\right)\leq\frac{\eta}{2}JTL^{2}+L\sum_{t=1}^{T}\sum_{j=1}^{J}\lVert\bm{\theta}_{p,t}^{j}-\bar{\bm{\theta}}_{p,t}\rVert+J\frac{\lVert\bar{\bm{\theta}}_{p,1}-\bm{\theta}\rVert^{2}-\lVert\bar{\bm{\theta}}_{p,T+1}-\bm{\theta}\rVert^{2}}{2\eta}. (22)

Applying [16, Lemma A.2] and Cauchy-Schwarz results in

\sum_{j=1}^{J}\mathbb{E}\lVert\bm{\theta}_{p,t}^{j}-\bar{\bm{\theta}}_{p,t}\rVert^{2}\leq\eta^{2}\frac{12JL^{2}}{c^{2}}\implies (23)
\sum_{j=1}^{J}\mathbb{E}\lVert\bm{\theta}_{p,t}^{j}-\bar{\bm{\theta}}_{p,t}\rVert\leq\eta\frac{JL\sqrt{12}}{c}. (24)

The proof is complete by using (24) to bound (22) and choosing $\bm{\theta}_{p,1}^{j}=\bm{0}$ and $\bm{\theta}=\bm{\theta}_{p}^{*}$. ∎

Figure 1: Comparison between the gossiped OMKL, the OMKL from [1], and the single-kernel SGD from [5], labeled Koloskova. We have used $\sigma=1$ for the latter. The labels indicate the topology used for each algorithm.
Lemma 2.

Under the same conditions as Thm. 1,

\sum_{j=1}^{J}\sum_{t=1}^{T}\sum_{p=1}^{P}\bar{w}_{p,t}^{j}\mathcal{L}\left(\hat{f}^{j}_{p,t}(\bm{x}_{t}^{j})\right)-\sum_{j=1}^{J}\sum_{t=1}^{T}\mathcal{L}\left(\hat{f}^{j}_{p^{\prime},t}(\bm{x}_{t}^{j})\right)\leq\frac{J\ln P}{\eta}+\eta JT. (25)

The proof of Lemma 2 follows the same steps as in [1, Lemma 3].

From the convexity of $\mathcal{L}(\cdot)$,

\mathcal{L}\left(\sum_{p=1}^{P}\bar{w}_{p,t}\hat{f}^{j}_{p,t}(\bm{x}_{t})\right)\leq\sum_{p=1}^{P}\bar{w}_{p,t}\mathcal{L}\left(\hat{f}^{j}_{p,t}(\bm{x}_{t})\right), (26)

and Thm. 1 follows immediately using Lemma 1 and Lemma 2.

IV Experimental results and conclusions

To validate Thm. 1, we have tested Algorithm 1 with real datasets for binary classification, with different topologies, and with different quantization levels. We have used the well-known Kernel Logistic Regression (KLR) loss function [9], i.e., the loss function (5) is defined as

\ln\left(1+\exp(-y\cdot\bm{\theta}^{\top}z_{V}(\bm{x}))\right)+\lambda\lVert\bm{\theta}\rVert^{2}, (27)

where we have used $\Omega(\lVert\bm{\theta}\rVert^{2})=\lVert\bm{\theta}\rVert^{2}$. We have conducted our experiments with three datasets: Banana, Credit-Card, and MNIST. The synthetic data from the Banana dataset [17] ($n=5300$, $d=2$) consist of two non-linearly separable clusters. The Credit-Card dataset [18] ($n=30000$, $d=2$) contains data from credit card users, such as their credit history and whether they have defaulted or not; it is obtained from the UCI machine learning repository [19]. The MNIST dataset [20] ($n=70000$, $d=784$) is a labeled set of handwritten digits from 0 to 9. We divide them into two classes: those that are the digit 8 and those that are not.

Our experimental setup has $J=20$ nodes, dimension $D=20$ for our RF approximation, regularization parameter $\lambda=0.001$, and three Gaussian kernels with $\sigma\in\{1,3,5\}$, where a Gaussian kernel with parameter $\sigma$ is defined as $\kappa(\bm{x},\bm{x^{\prime}})=\exp\left[-\lVert\bm{x}-\bm{x^{\prime}}\rVert^{2}/(2\sigma^{2})\right]$. Simulations have been performed 100 times with different sets of Random Features, and the corresponding mean of the loss function is plotted. We report the loss instead of the classification accuracy because the difference in the latter is minimal. We have used the quantizer described in (12) with 7 quantization levels, that is, 3 bits per element in any transmitted array. Our learning rates are $\eta=0.01$ and $\gamma=0.9\eta=0.009$.
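As a usage illustration of the Algorithm 1 sketch from Section III, the configuration above can be wired together as follows; the synthetic data merely stand in for the real datasets, and all helper names are the ones introduced in our earlier sketches.

import numpy as np

rng = np.random.default_rng(0)
J, T, d, D = 20, 100, 2, 20
sigmas = [1.0, 3.0, 5.0]
# One spectral sample matrix per Gaussian kernel (pi_kappa is N(0, sigma^{-2} I)).
V_list = [rng.normal(scale=1.0 / s, size=(D, d)) for s in sigmas]
path_adj = np.diag(np.ones(J - 1), 1) + np.diag(np.ones(J - 1), -1)
W = metropolis_gossip_matrix(path_adj)            # path topology, Section II sketch
X = rng.normal(size=(J, T, d))                    # synthetic stand-in for a dataset
Y = np.sign(X[..., 0] * X[..., 1] + 0.1 * rng.normal(size=(J, T)))  # labels in {-1, +1}
theta, w_bar = gossiped_quantized_omkl(X, Y, V_list, W, T,
                                       eta=0.01, gamma=0.009, num_levels=7)
print(w_bar[0])                                   # learned kernel weights at Node 1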

Fig. 1 compares the performance of our algorithm against two benchmarks. Since Algorithm 1 can be viewed as an extension of [1] to non-complete graphs, the first benchmark is the OMKL algorithm from [1]. There is only a small difference between the performance of OMKL and that of our gossiped OMKL, although OMKL requires many more inter-node communications. This is despite the fact that OMKL runs over a complete topology, whereas the gossiped OMKL is executed over the worst-case scenario, i.e., a path topology, which includes a much smaller number of connections and as a result requires much less communication.

Gossiped OMKL can also be considered as an extension of [5] to multi-kernel learning, which justifies using the single-kernel SGD from [5] as our second benchmark. Gossiped OMKL using three kernels clearly outperforms the single-kernel approach, named Koloskova in Fig. 1.

Figure 2: The effect of graph topology, without quantization. Average loss function at each iteration, with the Credit-Card dataset. Since the Credit-Card dataset has $n=30000$ data observations and $J=20$ nodes, we have run the algorithm for $1500$ iterations.

Although not shown in the figure, we have observed that, for the values of $\eta$ and $\gamma$ chosen in Fig. 1, or smaller values, the choice of topology does not affect the performance of our algorithm. To observe the effect of topology on algorithm performance, larger step sizes ($\eta=0.1$ and $\gamma=0.09$) are chosen in Fig. 2. The figure shows simulations using the Credit-Card dataset, without quantization, on three different topologies: the complete graph, the ring, and the path. We observe that more densely connected communication graphs lead to better performance. We have also tested the algorithm for $M\geq 7$ quantization levels, which satisfy the condition $\delta>0$ from Section II, and all of them perform as well as the non-quantized version, with loss function differences on the order of $10^{-6}$.

The results show that our gossiped OMKL algorithm can successfully extend the OMKL algorithm [1] to non-complete graphs and the distributed learning algorithm in [5] to multi-kernel learning.

References

  • [1] Y. Shen, S. Karimi-Bidhendi, and H. Jafarkhani, “Distributed and quantized online multi-kernel learning,” IEEE Transactions on Signal Processing, vol. 69, pp. 5496–5511, 2021.
  • [2] S. Hong and J. Chae, “Distributed online learning with multiple kernels,” IEEE Transactions on Neural Networks and Learning Systems, 2021.
  • [3] S. Savazzi, M. Nicoli, M. Bennis, S. Kianoush, and L. Barbieri, “Opportunities of federated learning in connected, cooperative, and automated industrial systems,” IEEE Communications Magazine, vol. 59, no. 2, pp. 16–21, 2021.
  • [4] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings et al., “Advances and open problems in federated learning,” Foundations and Trends® in Machine Learning, vol. 14, no. 1–2, pp. 1–210, 2021.
  • [5] A. Koloskova, S. U. Stich, and M. Jaggi, “Decentralized stochastic optimization and gossip algorithms with compressed communication,” 36th International Conference on Machine Learning, ICML 2019, vol. 2019-June, pp. 6088–6111, 2019.
  • [6] S. C. H. Hoi, R. Jin, P. Zhao, and T. Yang, “Online multiple kernel classification,” Machine Learning, vol. 90, no. 2, p. 289–316, Feb 2013. [Online]. Available: https://doi.org/10.1007/s10994-012-5319-2
  • [7] D. Sahoo, S. C. Hoi, and B. Li, “Online multiple kernel regression,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining.   New York New York USA: ACM, Aug 2014, p. 293–302. [Online]. Available: https://dl.acm.org/doi/10.1145/2623330.2623712
  • [8] C. Tang, Z. Li, J. Wang, X. Liu, W. Zhang, and E. Zhu, “Unified one-step multi-view spectral clustering,” IEEE Transactions on Knowledge and Data Engineering, p. 1–1, 2022.
  • [9] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, ser. Adaptive computation and machine learning.   MIT Press, 2002.
  • [10] Y. Shen, T. Chen, and G. B. Giannakis, “Random feature-based online multi-kernel learning in environments with unknown dynamics,” Journal of Machine Learning Research, vol. 20, 2019.
  • [11] A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” in Advances in Neural Information Processing Systems, vol. 20.   Curran Associates, Inc., 2007. [Online]. Available: https://proceedings.neurips.cc/paper/2007/hash/013a006f03dbc5392effeb8f18fda755-Abstract.html
  • [12] L. Xiao and S. Boyd, “Fast linear iterations for distributed averaging,” Systems & Control Letters, vol. 53, no. 1, pp. 65–78, 2004. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167691104000398
  • [13] S. Boyd, P. Diaconis, and L. Xiao, “Fastest mixing Markov chain on a graph,” SIAM Review, vol. 46, no. 4, pp. 667–689, 2004.
  • [14] B. Gharesifard and J. Cortés, “When does a digraph admit a doubly stochastic adjacency matrix?” in Proceedings of the 2010 American Control Conference.   IEEE, 2010, pp. 2440–2445.
  • [15] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [16] A. Koloskova, T. Lin, S. U. Stich, and M. Jaggi, “Decentralized deep learning with arbitrary communication compression,” in International Conference on Learning Representations, 2019.
  • [17] G. Rätsch, T. Onoda, and K.-R. Müller, “Soft margins for adaboost,” Machine learning, vol. 42, no. 3, pp. 287–320, 2001.
  • [18] I.-C. Yeh and C.-h. Lien, “The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients,” Expert systems with applications, vol. 36, no. 2, pp. 2473–2480, 2009.
  • [19] D. Dua and C. Graff, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml
  • [20] Y. LeCun, C. Cortes, and C. J. C. Burges. (1998) The MNIST database of handwritten digits. [Online]. Available: http://yann.lecun.com/exdb/mnist/