Gossiped and Quantized Online Multi-Kernel Learning
Abstract
In instances of online kernel learning where little prior information is available and centralized learning is infeasible, past research has shown that distributed and online multi-kernel learning provides sub-linear regret as long as every pair of nodes in the network can communicate (i.e., the communications network is a complete graph). In addition, to manage the communication load, which is often a performance bottleneck, communications between nodes can be quantized. This letter expands on these results to non-fully connected graphs, which is often the case in wireless sensor networks. To address this challenge, we propose a gossip algorithm and provide a proof that it achieves sub-linear regret. Experiments with real datasets confirm our findings.
Index Terms:
Kernel-based learning, quantization, gossip algorithms, federated learning, sensor networks.
I Introduction
Many of the latest efforts in machine learning are focused on bringing learning as close to data collection as possible. This is of practical interest in a diverse array of applications, sensor networks in particular. In sensor networks, nodes are often communication-constrained, operate distributively, and can send only a few bits to their neighbors.
There are several papers in the literature that consider the problem of designing distributed learning methods for Online Multi-Kernel Learning (OMKL). In particular, [1] contains the design and regret analysis of quantized and distributed OMKL for a fully connected (complete) graph. Although the fully connected case is important, it is not applicable in many sensor networks. This letter designs a new algorithm that expands the theoretical guarantees of the distributed and quantized OMKL in [1] to an arbitrary connected, not necessarily fully connected, graph. Unlike [1], our method uses a novel hidden-state quantization scheme in conjunction with a gossip algorithm. One major advantage, compared with [1], is that our algorithm does not need to communicate the loss function. Another example of a distributed OMKL that works for a non-fully connected graph is presented in [2]. It requires communicating an unlimited number of bits between neighbors, and its assumptions are stricter than those of [1]. To manage the required communication throughput, which is often a performance bottleneck [3, 4], this letter achieves the guarantees of [2] under looser assumptions while communicating only a limited number of bits between neighboring nodes.
In [5], a distributed stochastic gradient descent scheme with quantized communication is discussed. The proposed algorithm uses an extra shared “hidden” state and can be used for the single-kernel case, but not multi-kernel settings. In addition to extending [1] to general non-fully connected networks, our work can be considered as the extension of [5] to multi-kernel cases.
II Problem formulation
Let us consider a network with $K$ nodes and model it as an undirected connected graph, where two nodes are connected if and only if they can communicate. Our set of nodes $\mathcal{N}$ and the set of edges $\mathcal{E}$ define a communication graph $\mathcal{G} = (\mathcal{N}, \mathcal{E})$. At each instant of time $t$, Node $k$ receives the data $\mathbf{x}_{k,t} \in \mathbb{R}^{d}$ and the desired response $y_{k,t}$. Our task is to find a function $f$ such that $f(\mathbf{x}_{k,t})$ approximates $y_{k,t}$. As in [1], $f$ belongs to a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$, where the associated kernel $\kappa(\mathbf{x}, \mathbf{x}')$ is a function that measures the similarity between $\mathbf{x}$ and $\mathbf{x}'$. Let us consider the following optimization problem
$$\min_{f \in \mathcal{H}} \; \sum_{k=1}^{K}\sum_{t=1}^{T} \mathcal{C}\big(f(\mathbf{x}_{k,t}),\, y_{k,t}\big) + \lambda\, \Omega\big(\|f\|_{\mathcal{H}}^{2}\big) \tag{1}$$
where $\mathcal{C}(\cdot,\cdot)$ is a cost function and $\lambda > 0$ is a regularization parameter that controls an increasing function $\Omega(\cdot)$. An optimal solution for this problem exists in the form $f(\mathbf{x}) = \sum_{k=1}^{K}\sum_{t=1}^{T} \alpha_{k,t}\, \kappa(\mathbf{x}, \mathbf{x}_{k,t})$, where the coefficients $\alpha_{k,t}$ are real numbers [9].
In multi-kernel learning, a weighted combination of several kernels is selected to improve the performance compared to single-kernel learning. A convex combination of the kernels $\{\kappa_p\}_{p=1}^{P}$, where each $\kappa_p$ has an RKHS $\mathcal{H}_p$, is also a kernel with an RKHS, denoted by $\bar{\mathcal{H}} = \mathcal{H}_1 \oplus \cdots \oplus \mathcal{H}_P$ [9]. Here, $\oplus$ indicates the direct sum of Hilbert spaces. Using $\bar{\mathcal{H}}$ instead of $\mathcal{H}$, we turn problem (1) into:
$$\min_{\{\bar{w}_p\},\,\{f_p \in \mathcal{H}_p\}} \; \sum_{k=1}^{K}\sum_{t=1}^{T} \mathcal{C}\Big(\sum_{p=1}^{P} \bar{w}_p f_p(\mathbf{x}_{k,t}),\, y_{k,t}\Big) + \lambda\, \Omega\Big(\Big\|\sum_{p=1}^{P} \bar{w}_p f_p\Big\|_{\bar{\mathcal{H}}}^{2}\Big) \tag{2a}$$
$$\text{s.t.} \quad \sum_{p=1}^{P} \bar{w}_p = 1, \qquad \bar{w}_p \ge 0, \; p = 1, \dots, P. \tag{2b}$$
To evade the curse of dimensionality, a Random Feature (RF) approximation is used [1, 10]. For normalized shift-invariant kernels, i.e., $\kappa(\mathbf{x}, \mathbf{x}') = \kappa(\mathbf{x} - \mathbf{x}')$ with $\kappa(\mathbf{0}) = 1$, their Fourier transforms $\pi_\kappa(\mathbf{v})$ exist, and due to normalization, $\int \pi_\kappa(\mathbf{v})\, d\mathbf{v} = 1$. Also, viewing $\pi_\kappa(\mathbf{v})$ as a pdf, $\kappa(\mathbf{x} - \mathbf{x}') = \mathbb{E}_{\mathbf{v} \sim \pi_\kappa}\big[e^{j\mathbf{v}^{\top}(\mathbf{x} - \mathbf{x}')}\big]$. Gaussian, Laplacian, and Cauchy kernels are three examples of normalized shift-invariant kernels that we can use [11]. However, any family of shift-invariant kernels can be used for the RF approximation we describe next. Drawing $D$ i.i.d. samples $\{\mathbf{v}_i\}_{i=1}^{D}$ from $\pi_\kappa(\mathbf{v})$, we define
$$\mathbf{z}_V(\mathbf{x}) = \frac{1}{\sqrt{D}}\big[\sin(\mathbf{v}_1^{\top}\mathbf{x}), \dots, \sin(\mathbf{v}_D^{\top}\mathbf{x}), \cos(\mathbf{v}_1^{\top}\mathbf{x}), \dots, \cos(\mathbf{v}_D^{\top}\mathbf{x})\big]^{\top} \tag{3}$$
which has the property that $\mathbb{E}_{\mathbf{v}}\big[\mathbf{z}_V(\mathbf{x})^{\top}\mathbf{z}_V(\mathbf{x}')\big] = \kappa(\mathbf{x} - \mathbf{x}')$. So, given a fixed set of samples $V = \{\mathbf{v}_i\}_{i=1}^{D}$, we can approximate $\kappa(\mathbf{x} - \mathbf{x}') \approx \mathbf{z}_V(\mathbf{x})^{\top}\mathbf{z}_V(\mathbf{x}')$. This is called the RF approximation, and it is an unbiased and consistent estimation of $\kappa$ [10]. Thus, a weight vector $\boldsymbol{\theta}$ can be constructed such that
$$\hat{f}(\mathbf{x}) = \boldsymbol{\theta}^{\top}\mathbf{z}_V(\mathbf{x}) \tag{4}$$
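To make the construction in (3)-(4) concrete, the following is a minimal numpy sketch of the RF map for a Gaussian kernel; the function name gaussian_rf_map and its signature are illustrative assumptions, not part of the algorithm described in this letter.

```python
import numpy as np

def gaussian_rf_map(D, d, sigma, rng):
    """Random-feature map of Eq. (3), sketched for a Gaussian kernel with parameter sigma."""
    # For kappa(x - x') = exp(-||x - x'||^2 / (2 sigma^2)), the spectral density
    # pi_kappa is Gaussian, so the frequencies are drawn as v_i ~ N(0, I / sigma^2).
    V = rng.normal(scale=1.0 / sigma, size=(D, d))

    def z(x):
        proj = V @ x                                  # v_i^T x for i = 1, ..., D
        return np.concatenate([np.sin(proj), np.cos(proj)]) / np.sqrt(D)

    return z

# z(x)^T z(x') is an unbiased estimate of the kernel value kappa(x - x').
rng = np.random.default_rng(0)
z = gaussian_rf_map(D=200, d=5, sigma=1.0, rng=rng)
x, xp = rng.normal(size=5), rng.normal(size=5)
rf_value = z(x) @ z(xp)
true_value = np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * 1.0 ** 2))
```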
Then, the loss function is defined as
$$\mathcal{L}\big(\boldsymbol{\theta}^{\top}\mathbf{z}_V(\mathbf{x}),\, y\big) = \mathcal{C}\big(\boldsymbol{\theta}^{\top}\mathbf{z}_V(\mathbf{x}),\, y\big) + \lambda\, \Omega\big(\|\boldsymbol{\theta}\|^{2}\big) \tag{5}$$
and the following equations are obtained for Kernel $p$ and Node $k$:
$$\hat{f}_{p,k,t}(\mathbf{x}_{k,t}) = \boldsymbol{\theta}_{p,k,t}^{\top}\,\mathbf{z}_p(\mathbf{x}_{k,t}) \tag{6}$$
$$w_{p,k,t+1} = w_{p,k,t}\, \exp\!\big(-\eta\, \mathcal{L}\big(\hat{f}_{p,k,t}(\mathbf{x}_{k,t}),\, y_{k,t}\big)\big) \tag{7}$$
$$\boldsymbol{\theta}_{p,k,t+1} = \boldsymbol{\theta}_{p,k,t} - \eta\, \nabla\mathcal{L}\big(\boldsymbol{\theta}_{p,k,t}^{\top}\mathbf{z}_p(\mathbf{x}_{k,t}),\, y_{k,t}\big) \tag{8}$$
where $\eta > 0$ is a learning rate. The weights are normalized as $\bar{w}_{p,k,t} = w_{p,k,t} \big/ \sum_{p'=1}^{P} w_{p',k,t}$ to have
$$\hat{f}_{k,t}(\mathbf{x}) = \sum_{p=1}^{P} \bar{w}_{p,k,t}\, \boldsymbol{\theta}_{p,k,t}^{\top}\mathbf{z}_p(\mathbf{x}) \tag{9}$$
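As a reading aid, here is a minimal sketch of one local update in the spirit of Eqs. (6)-(9), with the regularization gradient omitted for brevity; the function name, the argument names, and the callables loss and dloss (the derivative of the cost with respect to the prediction) are illustrative assumptions rather than the notation of this letter.

```python
import numpy as np

def omkl_local_step(thetas, weights, z_maps, x, y, eta, loss, dloss):
    """One local multi-kernel step: per-kernel predictions, multiplicative
    weight updates, gradient steps, and the combined prediction."""
    feats = [z(x) for z in z_maps]                                   # RF features per kernel
    preds = [theta @ f for theta, f in zip(thetas, feats)]           # Eq. (6)
    weights = [w * np.exp(-eta * loss(p, y))                         # Eq. (7)
               for w, p in zip(weights, preds)]
    thetas = [theta - eta * dloss(p, y) * f                          # Eq. (8), regularizer omitted
              for theta, p, f in zip(thetas, preds, feats)]
    w_bar = np.array(weights) / np.sum(weights)                      # weight normalization
    combined = float(w_bar @ np.array(preds))                        # Eq. (9)
    return thetas, weights, combined
```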
The above equations represent how local online multi-kernel learning models can be built [1, 10]. In contrast to previous research, each node in our scenario can only communicate with its neighboring nodes in the set $\mathcal{N}_k$. Thus, the local model may differ from node to node. Therefore, our algorithm must ensure that all nodes converge to the same model. For this purpose, the local models are propagated using a gossip algorithm. The nodes calculate a weighted average of the information provided by their neighbors at each communication round and use it to update their local information. The weight Node $k$ assigns to the information coming from Node $l$ is denoted by $W_{k,l}$. Notice that $\sum_{l} W_{k,l} = 1$ for all $k$ since the weights of a weighted average must add up to one. We impose $W_{k,l} = W_{l,k}$ and $W_{k,l} = 0$ whenever $(k,l) \notin \mathcal{E}$ such that we can associate the weights with the edges of the undirected communication graph $\mathcal{G}$. Thus, the weights define the gossip matrix $\mathbf{W}$, which is symmetric and doubly stochastic [12, 13, 14]. The spectral gap of $\mathbf{W}$ is denoted by $\rho = 1 - |\lambda_2(\mathbf{W})|$, where $\lambda_2(\mathbf{W})$ represents the second eigenvalue of $\mathbf{W}$ in descending order.
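As an illustration, the sketch below builds one such symmetric, doubly stochastic matrix with a standard construction from the distributed-averaging literature (the Metropolis-Hastings rule) for a path topology, and computes its spectral gap; the function name is illustrative.

```python
import numpy as np

def metropolis_gossip_matrix(adjacency):
    """Symmetric, doubly stochastic gossip matrix W from an undirected
    adjacency matrix, using Metropolis-Hastings weights."""
    K = adjacency.shape[0]
    deg = adjacency.sum(axis=1)
    W = np.zeros((K, K))
    for k in range(K):
        for l in range(K):
            if k != l and adjacency[k, l]:
                W[k, l] = 1.0 / (1.0 + max(deg[k], deg[l]))
        W[k, k] = 1.0 - W[k].sum()          # remaining mass on the diagonal
    return W

# Example: path topology on 5 nodes and its spectral gap 1 - |lambda_2(W)|.
A = np.diag(np.ones(4), 1)
A = A + A.T
W = metropolis_gossip_matrix(A)
eigvals = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
spectral_gap = 1.0 - eigvals[1]
```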
Such a gossip algorithm works very well when the nodes communicate their states perfectly with their neighbors. However, in practice, only a few bits can be communicated, i.e., the information needs to be quantized before being shared with the neighbors. We use a random quantizer $Q(\cdot)$ to represent an arbitrary vector $\mathbf{x} \in \mathbb{R}^{d}$ with $Q(\mathbf{x})$ in an efficient way. Our results work for any random quantizer that satisfies
$$\mathbb{E}_Q\big[\|Q(\mathbf{x}) - \mathbf{x}\|^{2}\big] \le (1 - \omega)\, \|\mathbf{x}\|^{2} \tag{10}$$
for some $\omega \in (0, 1]$, which we will call the compression parameter. Here, $\mathbb{E}_Q$ denotes the expected value with respect to the internal randomness of $Q$. In simulations, we use the randomized quantizer of [15]. Each element $x_i$ of a non-zero vector $\mathbf{x} \in \mathbb{R}^{d}$, i.e., $\mathbf{x} \neq \mathbf{0}$, is quantized by
$$Q(x_i) = \|\mathbf{x}\| \cdot \operatorname{sign}(x_i) \cdot \xi_i(\mathbf{x}, s) \tag{11}$$
where $s$ is the number of quantization levels and, defining $l \in \{0, 1, \dots, s-1\}$ such that $|x_i| / \|\mathbf{x}\| \in [l/s, (l+1)/s]$,
$$\xi_i(\mathbf{x}, s) = \begin{cases} l/s, & \text{with probability } 1 - \big(s\,|x_i|/\|\mathbf{x}\| - l\big), \\ (l+1)/s, & \text{otherwise.} \end{cases} \tag{12}$$
Each element of $Q(\mathbf{x})$ is then represented by a few bits (its sign and its level index $l$), together with the scalar $\|\mathbf{x}\|$. Following [15, Lemma 3.1], the quantization error satisfies $\mathbb{E}_Q\|Q(\mathbf{x}) - \mathbf{x}\|^{2} \le \min\!\big(d/s^{2}, \sqrt{d}/s\big)\|\mathbf{x}\|^{2}$, so Eq. (10) holds with $\omega = 1 - \min\!\big(d/s^{2}, \sqrt{d}/s\big)$ whenever the number of levels $s$ makes this quantity positive.
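For concreteness, a minimal numpy sketch of the stochastic rounding in (11)-(12) follows; random_quantize and its signature are illustrative and are not taken from [15].

```python
import numpy as np

def random_quantize(x, s, rng):
    """Randomized quantizer in the style of (11)-(12): each coordinate keeps its
    sign and is rounded stochastically to one of s levels of |x_i| / ||x||."""
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    ratio = np.abs(x) / norm                  # in [0, 1]
    scaled = ratio * s
    low = np.floor(scaled)                    # level l with |x_i|/||x|| in [l/s, (l+1)/s]
    prob_up = scaled - low                    # round up with probability s|x_i|/||x|| - l
    level = low + (rng.random(x.shape) < prob_up)
    return norm * np.sign(x) * level / s

# Unbiasedness check: averaging many draws recovers x up to Monte Carlo error.
rng = np.random.default_rng(0)
x = rng.normal(size=8)
mean_q = np.mean([random_quantize(x, s=7, rng=rng) for _ in range(2000)], axis=0)
```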
III Algorithm and Regret Analysis
We present Algorithm 1 for gossiped and quantized OMKL. For ease of notation, we write $\mathcal{L}_{p,k,t}$ for the loss incurred by Kernel $p$ at Node $k$ and time $t$. In Algorithm 1, at each time instance, nodes collect their local data and transform them according to the RF approximation. Then, the kernel losses are computed and used to update the kernel weights. For the gossip step, we define a hidden state $\hat{\boldsymbol{\theta}}_{p,k,t}$ for each node that is known to all neighbors in $\mathcal{N}_k$ because it is updated by the same quantized values known to all neighbors. The new local state, $\boldsymbol{\theta}_{p,k,t}$, will be the sum of $\boldsymbol{\theta}^{\mathrm{aux}}_{p,k,t}$, which is an auxiliary variable where we store the local learning, plus a gossip step of size $\gamma$ with the information from the hidden states. Subsequently, each node prepares the update to the hidden state by quantizing the difference between its local state $\boldsymbol{\theta}_{p,k,t}$ and the common hidden state $\hat{\boldsymbol{\theta}}_{p,k,t}$. This quantized difference is sent to the neighbors and is used by them to collectively update the hidden state. Finally, each node performs local learning with a step size $\eta$ and stores the result in the auxiliary variable $\boldsymbol{\theta}^{\mathrm{aux}}_{p,k,t+1}$.
The role of the hidden state and quantized update is to maintain an accurate representation of the neighbors' states without the need for broadcasting the full state at each time instance.
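The communication round can be summarized by the following sketch for a single kernel, which assumes the notation introduced above (theta_aux for the auxiliary variables, theta_hat for the hidden states, gamma for the gossip step size) and takes the quantizer as a callable; the loop over all nodes is only for compactness, since each node actually uses just the quantized messages of its own neighbors.

```python
def gossip_quantized_round(theta_aux, theta_hat, W, gamma, quantize):
    """One communication round (sketch): mix hidden states into the local state,
    then broadcast a quantized correction so every neighbor can update the
    shared hidden state identically."""
    K = len(theta_aux)
    # Gossip step of size gamma using the hidden states known to all neighbors.
    theta = [theta_aux[k] + gamma * sum(W[k, l] * (theta_hat[l] - theta_hat[k])
                                        for l in range(K))
             for k in range(K)]
    # Each node quantizes the difference between its local and hidden state ...
    q = [quantize(theta[k] - theta_hat[k]) for k in range(K)]
    # ... and all neighbors apply the same update, keeping hidden states in sync.
    theta_hat = [theta_hat[k] + q[k] for k in range(K)]
    return theta, theta_hat
```

In such a sketch, the local gradient step with step size $\eta$ would then be applied to each theta[k] and stored back into theta_aux[k] before the next round.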
In what follows, we provide regret analysis to study the performance of our algorithm. First, we make the following assumptions similar to what is done in the literature.
Assumption 1.
Let us assume an $L$-smooth loss function, i.e., for all $\boldsymbol{\theta}_1, \boldsymbol{\theta}_2$,
$$\big\|\nabla\mathcal{L}\big(\boldsymbol{\theta}_1^{\top}\mathbf{z},\, y\big) - \nabla\mathcal{L}\big(\boldsymbol{\theta}_2^{\top}\mathbf{z},\, y\big)\big\| \le L\, \|\boldsymbol{\theta}_1 - \boldsymbol{\theta}_2\| \tag{13}$$
with a bounded gradient $\big\|\nabla\mathcal{L}\big(\boldsymbol{\theta}^{\top}\mathbf{z},\, y\big)\big\| \le G$.
Assumption 2.
Let us assume a loss function that is convex with respect to $\boldsymbol{\theta}$, i.e.,
$$\mathcal{L}\big(\boldsymbol{\theta}_1^{\top}\mathbf{z},\, y\big) \ge \mathcal{L}\big(\boldsymbol{\theta}_2^{\top}\mathbf{z},\, y\big) + \nabla\mathcal{L}\big(\boldsymbol{\theta}_2^{\top}\mathbf{z},\, y\big)^{\top}\big(\boldsymbol{\theta}_1 - \boldsymbol{\theta}_2\big). \tag{14}$$
Assumption 3.
Let us assume that we have a connected communication graph $\mathcal{G}$ and a doubly stochastic gossip matrix $\mathbf{W}$ whose off-diagonal entries are zero if and only if their corresponding edges are not present in the communication graph.
Theorem 1.
Under Assumptions 1-3, and for appropriately chosen learning rate $\eta$ and gossip step size $\gamma$, Algorithm 1 achieves sub-linear regret with respect to the best fixed function in the RKHS $\bar{\mathcal{H}}$.
To prove Thm. 1, we first need to present Lemma 1, which bounds the difference between a single kernel loss and the optimal loss, and Lemma 2, which bounds the error added by the kernel weights.
Lemma 1.
Proof.
Similar to [1], let us consider an arbitrary fixed $\boldsymbol{\theta}^{*}$. Define the network average $\bar{\boldsymbol{\theta}}_{t} = \frac{1}{K}\sum_{k=1}^{K}\boldsymbol{\theta}_{k,t}$, and define the averages of the hidden and auxiliary states analogously. Since our gossip scheme preserves averages, we can deduce that the average state is driven only by the local gradient steps. Thus,
(17)
(18)
(19)
By Asm. 2, we have
(20)
Inserting (20) into (19) and summing over $t$, we obtain
(21)
where we have used Asm. 1 to bound the gradient. Summing over $k$ and bounding the gradient again, we derive
(22)
Applying [16, Lemma A.2] and Cauchy-Schwarz results in
(23)
(24)
The proof is completed by using (24) to bound (22) and choosing the step sizes $\eta$ and $\gamma$ appropriately. ∎
Lemma 2.
Under the same conditions as Thm. 1,
(25)
IV Experimental results and conclusions
To validate Thm. 1, we have tested Algorithm 1 with real datasets for binary classification, with different topologies, and using different numbers of quantization levels. We have used the well-known Kernel Logistic Regression (KLR) loss function [9], i.e., the loss function (5) is defined as
$$\mathcal{L}\big(\boldsymbol{\theta}^{\top}\mathbf{z}_V(\mathbf{x}),\, y\big) = \ln\!\big(1 + e^{-y\, \boldsymbol{\theta}^{\top}\mathbf{z}_V(\mathbf{x})}\big) + \lambda\, \|\boldsymbol{\theta}\|^{2} \tag{27}$$
where we have used $\Omega(\|\boldsymbol{\theta}\|^{2}) = \|\boldsymbol{\theta}\|^{2}$. We have conducted our experiments with three datasets: Banana, Credit-Card, and MNIST. The synthetic data from the Banana dataset [17] are two non-linearly separable clusters. The Credit-Card dataset [18] contains data from credit card users, such as their credit history and whether they have defaulted or not. It is obtained from the UCI machine learning repository [19]. The MNIST dataset [20] is a labeled set of handwritten digits from 0 to 9. We divide the digits into two classes: those that are the number 8 and those that are not.
Our experimental setup has $K$ nodes, dimension $D$ for our RF approximation, regularization parameter $\lambda$, and three Gaussian kernels with different parameters $\sigma$, where a Gaussian kernel with parameter $\sigma$ is defined as $\kappa(\mathbf{x}, \mathbf{x}') = \exp\big(-\|\mathbf{x} - \mathbf{x}'\|^{2}/(2\sigma^{2})\big)$. Simulations have been performed 100 times with different sets of Random Features, and the corresponding mean of the loss function is plotted. We report the loss instead of the classification accuracy because the difference in the latter is minimal. We have used the quantizer described in (12) with 7 levels of quantization, that is, 3 bits per element in any transmitted array. The learning rate $\eta$ and the gossip step size $\gamma$ are kept fixed.
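As a reference for the loss used in these experiments, a minimal sketch of (27) and of its derivative with respect to the prediction follows, assuming labels $y \in \{-1, +1\}$; the function names are illustrative.

```python
import numpy as np

def klr_loss(pred, y, theta, lam):
    """KLR loss of Eq. (27): logistic term for a label y in {-1, +1} plus L2 regularizer."""
    return np.log1p(np.exp(-y * pred)) + lam * float(theta @ theta)

def klr_dloss(pred, y):
    """Derivative of the logistic term with respect to the prediction theta^T z(x)."""
    return -y / (1.0 + np.exp(y * pred))
```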
Fig. 1 compares the performance of our algorithm with two benchmarks. Since Algorithm 1 can be viewed as an extension of [1] to non-complete graphs, the first benchmark for comparison is the OMKL algorithm from [1]. There is only a small difference between the performance of OMKL and that of our gossiped OMKL, although OMKL requires many more inter-node communications. This is despite the fact that OMKL is run over a complete topology, whereas the gossiped OMKL is executed over the worst-case scenario, i.e., a path topology, which includes a much smaller number of connections and as a result requires much less communication.
Gossiped OMKL can also be considered an extension of [5] to multi-kernel learning, which justifies using the single-kernel SGD from [5] as our second benchmark. Gossiped OMKL using three kernels clearly outperforms the single-kernel approach, labeled Koloskova in Fig. 1.
Although not shown in the figure, we have observed that, for the values of $\eta$ and $\gamma$ chosen in Fig. 1, or smaller values, the choice of topology does not affect the performance of our algorithm. To observe the effect of topology on algorithm performance, larger step sizes $\eta$ and $\gamma$ are chosen in Fig. 2. The figure shows simulations using the Credit-Card dataset, without quantization, on three different topologies: the complete graph, the ring, and the path. We can observe that more densely connected communication graphs lead to better performance. We have also tested the algorithm for other numbers of quantization levels that satisfy the condition in (12), and all of them perform as well as the non-quantized version, with negligible differences in the loss function.
References
- [1] Y. Shen, S. Karimi-Bidhendi, and H. Jafarkhani, “Distributed and quantized online multi-kernel learning,” IEEE Transactions on Signal Processing, vol. 69, pp. 5496–5511, 2021.
- [2] S. Hong and J. Chae, “Distributed online learning with multiple kernels,” IEEE Transactions on Neural Networks and Learning Systems, 2021.
- [3] S. Savazzi, M. Nicoli, M. Bennis, S. Kianoush, and L. Barbieri, “Opportunities of federated learning in connected, cooperative, and automated industrial systems,” IEEE Communications Magazine, vol. 59, no. 2, pp. 16–21, 2021.
- [4] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings et al., “Advances and open problems in federated learning,” Foundations and Trends® in Machine Learning, vol. 14, no. 1–2, pp. 1–210, 2021.
- [5] A. Koloskova, S. U. Stich, and M. Jaggi, “Decentralized stochastic optimization and gossip algorithms with compressed communication,” 36th International Conference on Machine Learning, ICML 2019, vol. 2019-June, pp. 6088–6111, 2019.
- [6] S. C. H. Hoi, R. Jin, P. Zhao, and T. Yang, “Online multiple kernel classification,” Machine Learning, vol. 90, no. 2, p. 289–316, Feb 2013. [Online]. Available: https://doi.org/10.1007/s10994-012-5319-2
- [7] D. Sahoo, S. C. Hoi, and B. Li, “Online multiple kernel regression,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. New York New York USA: ACM, Aug 2014, p. 293–302. [Online]. Available: https://dl.acm.org/doi/10.1145/2623330.2623712
- [8] C. Tang, Z. Li, J. Wang, X. Liu, W. Zhang, and E. Zhu, “Unified one-step multi-view spectral clustering,” IEEE Transactions on Knowledge and Data Engineering, p. 1–1, 2022.
- [9] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, ser. Adaptive computation and machine learning. MIT Press, 2002.
- [10] Y. Shen, T. Chen, and G. B. Giannakis, “Random feature-based online multi-kernel learning in environments with unknown dynamics,” Journal of Machine Learning Research, vol. 20, 2019.
- [11] A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” in Advances in Neural Information Processing Systems, vol. 20. Curran Associates, Inc., 2007. [Online]. Available: https://proceedings.neurips.cc/paper/2007/hash/013a006f03dbc5392effeb8f18fda755-Abstract.html
- [12] L. Xiao and S. Boyd, “Fast linear iterations for distributed averaging,” Systems & Control Letters, vol. 53, no. 1, pp. 65–78, 2004. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167691104000398
- [13] S. Boyd, P. Diaconis, and L. Xiao, “Fastest mixing Markov chain on a graph,” SIAM Review, vol. 46, no. 4, pp. 667–689, 2004.
- [14] B. Gharesifard and J. Cortés, “When does a digraph admit a doubly stochastic adjacency matrix?” in Proceedings of the 2010 American Control Conference. IEEE, 2010, pp. 2440–2445.
- [15] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- [16] A. Koloskova, T. Lin, S. U. Stich, and M. Jaggi, “Decentralized deep learning with arbitrary communication compression,” in International Conference on Learning Representations, 2019.
- [17] G. Rätsch, T. Onoda, and K.-R. Müller, “Soft margins for AdaBoost,” Machine Learning, vol. 42, no. 3, pp. 287–320, 2001.
- [18] I.-C. Yeh and C.-h. Lien, “The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients,” Expert systems with applications, vol. 36, no. 2, pp. 2473–2480, 2009.
- [19] D. Dua and C. Graff, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml
- [20] Y. LeCun, C. Cortes, and C. J. C. Burges. (1998) The MNIST database of handwritten digits. [Online]. Available: http://yann.lecun.com/exdb/mnist/