Fusing Multiple Algorithms for Heterogeneous Online Learning

\NameDarshan Gadginmath \Email[email protected]
\NameShivanshu Tripathi \Email[email protected]
\NameFabio Pasqualetti \Email[email protected]
\addrUniversity of California Riverside This material is based upon work supported in part by awards ONR-N00014-19-1-2264, ARO W911NF-20-2-0267, and AFOSRFA9550-19-1-0235.

Abstract

This study addresses the challenge of online learning in contexts where agents accumulate disparate data, face resource constraints, and use different local algorithms. This paper introduces the Switched Online Learning Algorithm (SOLA), designed to solve the heterogeneous online learning problem by amalgamating updates from diverse agents through a dynamic switching mechanism contingent upon their respective performance and available resources. We theoretically analyze the design of the selecting mechanism to ensure that the regret of SOLA is bounded. Our findings show that the number of changes in selection needs to be bounded by a parameter dependent on the performance of the different local algorithms. Additionally, two test cases are presented to emphasize the effectiveness of SOLA, first on an online linear regression problem and then on an online classification problem with the MNIST dataset.

1 Introduction

Multi-agent learning frequently involves scenarios in which agents gather disparate data at varying rates, collectively seeking to address an online optimization problem. Some instances include collaborative localization, search-and-rescue operations, coverage control, etc. Compounding the complexity of these scenarios is the constraint of limited data processing and computational capabilities, and the need for time-sensitive decision-making. Conventionally, either heterogeneous data or gradients are pooled [Zhang et al.(2018)Zhang, Cisse, Dauphin, and Lopez-Paz, Esfandiari et al.(2021)Esfandiari, Tan, Jiang, Balu, Herron, Hegde, and Sarkar, Aketi et al.(2023)Aketi, Kodge, and Roy], or heterogeneity is ignored and a distributed learning algorithm is deployed [Le Bars et al.(2023)Le Bars, Bellet, Tommasi, Lavoie, and Kermarrec, Zhu et al.(2021)Zhu, Xu, Liu, and Jin]. The former approach raises privacy concerns, while the latter proves suboptimal due to the inherent heterogeneity of the data. Effectively addressing the heterogeneity of data and resource constraints in an online learning problem remains an unresolved challenge.

An alternative approach to mitigate the challenges posed by data heterogeneity and resource constraints is to employ distinct algorithms tailored to the specific characteristics of the data, computation, and communication resources. Nevertheless, adopting distinct algorithms introduces the risk of underutilizing the available data. In this study, we present a systematic method to integrate updates provided by distinct algorithms to solve the heterogeneous online learning problem.

1.1 Problem Statement

We seek to solve the following online minimization problem:

\displaystyle\min_{x}F(x,\bigcup\limits_{i}^{M}\mathcal{D}^{i}(t)).

(1)

Here, $x\in^{n}$ is a parameter that needs to be collectively estimated by agents $i\in\{1,2,\dots,M\}$ . The data gathered by each agent up to time $t$ is denoted by $\mathcal{D}^{i}(t)$ . More precisely, let $T^{i}\in$ denote the set of time instances when agent $i$ collects new data. Then $T^{i}=\{t^{i}_{1},t^{i}_{2},\dots\}$ , where $t^{i}_{j}$ denotes the $j^{th}$ round of sampling by agent $i$ . We use the set $T$ to denote all the time instances when new data is available. That is, $T=\bigcup_{i}~{}T^{i}=\{t_{1},t_{2},\dots\}$ . We drop the superscript $i$ to denote a time instance when new data is acquired by any agent. The elements of $T$ satisfy $t_{1}\leq t_{2}\leq\dots\bar{t}$ , where $\bar{t}=\max_{i,j}t^{i}_{j}$ . Let the samples collected by agent $i$ at the $j^{th}$ round be denoted by $s(t^{i}_{j})$ . The data collected by agent $i$ until time $t$ is

\displaystyle\mathcal{D}^{i}(t)=\bigcup\limits_{\begin{subarray}{c}t^{i}_{j}\in T^{i},t^{i}_{j}\leq t\end{subarray}}s(t^{i}_{j}).

(2)

Every agent $i$ is constrained by their computation power and communication capabilities. Hence, they need to employ a local algorithm $\mathcal{A}^{i}$ with their local data to solve the online optimization problem. For instance, a single agent can employ a centralized algorithm such as gradient descent (GD), stochastic gradient descent (SGD), or batch-wise GD [Dixit et al.(2019)Dixit, Bedi, Tripathi, and Rajawat]. An agent who is composed of a system of smaller units can collectively employ distributed algorithms such as Decentralized SGD or Federated Learning depending on the availability of resources. As an example, consider the following situation: an e-commerce entity equipped with data centers strategically situated across varied geographical locations, adopts an asynchronous data acquisition methodology from online users for the purpose of targeted advertisement display. In this scenario, each center employs a localized algorithm to process its specific dataset, owing to limitations in both computational and communicative capabilities.

Agents may choose to delay their decision-making process in order to accumulate an ample amount of data or computational resources necessary for solving problem (1). However, this approach may be suboptimal, given that decisions are frequently time-sensitive. A practical example is that of naval vessels mapping the sea for adversarial entities. Vessels positioned at varying distances from the shoreline collect sonar data, yet their computational and communicative capabilities are constrained by their respective locations. Specifically, vessels in closer proximity to the coast benefit from superior computational resources, albeit with diminished data quality, as elucidated by [Ferla and Jensen(2002)]. In this scenario, waiting to gather sufficient resources can be extremely dangerous. Therefore, the updates from different algorithms need to be fused in an online fashion as soon as updates are available to solve the problem (1).

1.2 Related Work

Distributed online optimization has been extensively studied as detailed in [McMahan(2017), Hoi et al.(2021)Hoi, Sahoo, Lu, and Zhao, Li et al.(2023b)Li, Xie, and Li]. However, these studies do not consider cooperatively using different algorithms in a constrained time-sensitive setting. Popular algorithms such as decentralized SGD and Federated Learning in the presence of asynchronous agents were studied in [Jiang et al.(2021)Jiang, Zhang, Gu, and Zhu, Chen et al.(2020)Chen, Ning, Slawski, and Rangwala]. In [Jiang et al.(2021)Jiang, Zhang, Gu, and Zhu], decentralized SGD is proposed with asynchronous agents but it does not incorporate agents running different algorithms and it also requires extensive communication between agents. Asynchronous online Federated Learning [Chen et al.(2020)Chen, Ning, Slawski, and Rangwala] requires the presence of a coordinator, and still does not fuse different algorithms. Model fusion has received attention in supervised learning [Li et al.(2023a)Li, Peng, Zhang, Ding, Hu, and Shen]. Model fusion in online learning case has received significantly less interest due to its complexities [Foster et al.(2015)Foster, Rakhlin, and Sridharan, Hoang et al.(2019)Hoang, Hoang, Low, and How, Cutkosky(2019)]. These works typically consider selecting models from several algorithms at every time step. Particularly, [Cutkosky(2019)] shows that algorithms with bounded regrets can be fused simply by averaging the parameters and still maintain bounded regret. However, in our case, we seek to fuse the updates from different agents by employing only a single agent at any given time. Therefore, it is unclear how agents with different data and resources could cooperatively solve problem (1) by running their own local algorithm.

1.3 Contributions

We provide an algorithm called the Switched Online Learning Algorithm (SOLA) to solve Problem (1) by fusing the updates from agents running different algorithms. We solve the considered problem by switching between the agents and fusing their updates based on their performance. We provide a sufficient condition to guarantee a bound on the regret of SOLA based on the rate at which different algorithms are chosen. To this end, we model SOLA as a switched dynamical system and ensure its contractivity. We numerically analyze the performance of our algorithm for the online linear regression problem and also the online classification problem with the MNIST dataset¹¹1Code repository: https://github.com/Shivanshu-007/Heterogeneous-online-optimization .

2 Switched Online Learning Algorithm (SOLA)

In this section, we describe the proposed Switched Online Learning Algorithm (SOLA). The input to any local algorithm $\mathcal{A}^{i}$ at time $t$ is the data $\mathcal{D}^{i}(t)\in^{m(t)\times p}$ , and the parameter $x\in^{n}$ . Note that the dimension of $\mathcal{D}^{i}(t)$ is dependent on time as each agent acquires new data over time. The number of samples is given by $m(t)$ and the number of features is $p$ . Henceforth, we simply use the variable $k$ to denote the discrete instances $t_{k}\in T$ , i.e. $k\in\{1,2,\dots,|T|\}$ . The update provided by algorithm $\mathcal{A}^{i}$ is given by the map $\mathcal{A}^{i}(x,\mathcal{D}^{i}(k)):^{n}\times^{m(k)\times p}\rightarrow^{n}$ . At any instance $k$ , the selecting signal $\sigma(k):\{1,2,...,|T|\}\rightarrow\{1,2,\dots,M\}$ selects an agent $i$ if the agent has new data. Agent $\sigma(k)$ uses its local algorithm $\mathcal{A}^{\sigma(k)}$ to update the parameter $x(k-1)$ . The update provided by $\mathcal{A}^{\sigma(k)}$ is used to solve the problem (1) as

\displaystyle x(k)=\alpha(k)\ \mathcal{A}^{\sigma(k)}(x(k-1),\mathcal{D}^{\sigma(k)}(k))+(1-\alpha(k))\ x(k-1).

(3)

We call $\alpha(k)\in[0,1]$ as the fusing variable. We introduce it to smoothly incorporate new updates from the chosen local algorithms $\mathcal{A}^{\sigma(k)}$ . The fusing variable depends on the performance of the algorithm of the local algorithms. Let us define a performance metric as $P(x,\mathcal{D}):^{n}\times^{m\times p}\rightarrow_{\geq 0}$ . Common performance metrics for classification problems are precision, recall, true positive rate, etc. [Hand(2012)]. In regression problems, performance is often computed as the inverse of the norm of error, or the trace of the inverse of error covariance [Kay(1993)]. We are particularly interested in metrics that have higher values to signify better performance. Given the performance metric, the fusing variable $\alpha(k)$ is defined as:

\displaystyle\alpha(k)

\displaystyle=\frac{P(x(k^{+}),\mathcal{D}^{\sigma(k)}(k))}{P(x(k^{+}),\mathcal{D}^{\sigma(k)}(k))+P(x(k-1),\mathcal{D}^{\sigma(k-1)}(k-1))},

(4)

where $x(k^{+})=\mathcal{A}^{\sigma(k)}(x(k-1),D^{\sigma(k)}(k))$ . Note that $\alpha(k)$ is using the performance of $\mathcal{A}^{\sigma(k)}$ to incorporate its update. If the performance is poor on the local data $\mathcal{D}^{\sigma(k)}(k)$ , the fusing variable $\alpha(k)$ is closer to $0$ , whereas $1-\alpha(k)$ is closer to 1. This implies that the former parameter $x(k-1)$ has a higher influence on the updated parameter $x(k)$ . Similarly, if the performance of $\mathcal{A}^{\sigma(k)}$ is superior to that of $x(k)$ , more weight is given to $x(k^{+})$ . We describe the proposed algorithm SOLA in detail in Algorithm 1.

Remark 2.1.

(Naive switching): The design of the fusing variable $\alpha(k)$ is crucial to SOLA because the updates from $\mathcal{A}^{\sigma(k)}$ can be vastly different from $x(k)$ . For instance, by setting $\alpha(k)=1$ , the update of one agent acts as the input to the subsequent agents. This is a naive method of switching between agents without any consideration for the performance of each agent. Switching naively between different agents may cause abrupt and large changes in the parameters, which may be undesirable. This is illustrated in the online regression problem represented in Figure 1. SOLA chooses between two agents running centralized GD and decentralized SGD with five sub-units. Figure 1 compares the naive case when $\alpha(k)=1$ , and the case when $\alpha(k)$ is dependent on the performance of the local algorithms. We see frequent jumps in the parameter $x(k)$ resulting in a chattering behavior with an improper choice of $\alpha(k)$ . However, when $0\leq\alpha(k)\leq 1$ , we see an improvement in the performance and the chattering behavior is absent. The detailed simulation setting is provided in Section 4.1.

Refer to caption — Figure 1: Online linear regression with SOLA comparing a naive choice of the fusing variable $\alpha_{t}=1$ and a fusing variable that incorporates the performance of each agent. It is evident that a naive choice of the fusing variable results in frequent jumps of the parameter $x(t)$ .

Remark 2.2.

(Choice of performance metric): SOLA uses updates from different agents for a general online optimization problem. Therefore, the optimal choice of a performance metric is dependent on the cost function, data and the constraints of each agent. For instance, consider the online linear regression problem. The recursive least squares algorithm in [Kay(1993)] incorporates data arriving incrementally to solve an online linear regression problem. Although the algorithm operates with only a single agent, it incorporates incoming updates by weighting them with the inverse of the error covariance. It can be shown that the choice of using the inverse of the error covariance in the recursive least-squares algorithm produces the Best Linear Unbiased Estimator (BLUE).

Algorithm 1 Switched Learning Algorithm

Initialize:

x(0),~{}\sigma(0),~{}\mathcal{D}^{\sigma(0)}(0)

.\SetAlgoLined

1. For

k=1

|T|

2. Select agent

i

using switching signal,

\sigma(k)=i

3. Agent

\sigma(k)

receives

x(k-1)

and

P(x(k-1),\mathcal{D}^{\sigma}(k-1)(k-1))

from

\sigma(k-1)

4. Update parameter

x(k-1)

using algorithm

\mathcal{A}^{\sigma(k)}

x(k^{+})=\mathcal{A}^{\sigma(k)}(x(k-1),D_{\sigma(k)}(k))

5. Compute performance metric

P(x(k^{+})),D_{\sigma(k)}(k))

and fusing variable

\alpha(k)

(4).

6. Incorporate update:

x(k)=\alpha(k)\mathcal{A}^{\sigma(k)}(x(k-1),\mathcal{D}^{\sigma(k)}(k))+(1-\alpha(k))x(k-1)

3 Design of Selecting Signal

The selecting signal $\sigma(k)$ determines the choice of subsystem for the update, and in turn, determines the regret of SOLA. In this section, we analyze the effect of the switching signal on the regret of the Switched Learning Algorithm. Let us denote any algorithm employed to solve problem (1) as $\mathcal{A}$ . The regret of algorithm $\mathcal{A}$ is denoted by $\mathcal{R}_{\mathcal{A}}$ and is defined for $K\geq 1$ as

\displaystyle\mathcal{R}_{\mathcal{A}}(K)

\displaystyle=\sum_{k=1}^{K}F(x(k),\mathcal{D}(k))-\sum_{k=1}^{T}F(x^{*},\mathcal{D}(k)).

(5)

Here $x(k)$ refers to the parameters provided by algorithm $\mathcal{A}$ at time $k$ using data $\mathcal{D}(k)$ , and $x^{*}$ is the optimal parameter given all the data a priori. It is essential for an algorithm to have bounded regret as it compares the performance of the algorithm to an optimal choice. We now show that SOLA has bounded regret under the right conditions of the selecting signal $\sigma(k)$ .

We analyze the regret achieved by SOLA by viewing the algorithm as a dynamical system. Optimization algorithms have received extensive attention from the view of dynamical systems [Elisseeff et al.(2005)Elisseeff, Evgeniou, Pontil, and Kaelbing, Shikhman and Stein(2009), Ross and Bagnell(2011), Saha et al.(2012)Saha, Jain, and Tewari, Sontag(2022), Cisneros-Velarde and Bullo(2022), Kozachkov et al.(2022)Kozachkov, Wensing, and Slotine]. Particularly, algorithms that are stable have been shown to have bounded regret [Ross and Bagnell(2011), Saha et al.(2012)Saha, Jain, and Tewari]. A useful tool to study these optimization algorithms and stability is contraction theory [Cisneros-Velarde and Bullo(2022), Kozachkov et al.(2022)Kozachkov, Wensing, and Slotine]. A contracting algorithm is defined as follows.

Definition 3.1.

(Contracting algorithm, [Lohmiller and Slotine(1998)]) Let the updates provided by an algorithm $\mathcal{A}$ be given by the dynamical system

\displaystyle x(k)=\mathcal{A}(x(k-1),\mathcal{D}(k))

(6)

where $x(k)$ and $\mathcal{D}(k)$ are the parameter and data at time $k$ , respectively. The differential dynamics of the algorithm is then given by

\displaystyle\delta_{x(k)}=\frac{\partial\mathcal{A}(x(k-1),\mathcal{D}(k))}{\partial x(k-1)}\delta_{x(k-1)}.

The associated distance for the differential dynamics is denoted by

	$\displaystyle V(x(k))$	$\displaystyle=\delta_{x(k)}^{\top}M(x(k))\delta_{x(k)},$
	$\displaystyle M(x(k))$	$\displaystyle=\left(\frac{\partial\mathcal{A}(x(k),\mathcal{D}(t))}{\partial x(k)}\right)^{\top}\frac{\partial\mathcal{A}(x(k),\mathcal{D}(t))}{\partial x(k)}.$

Here $M(x(k))$ is a symmetric positive-definite matrix function, and it is uniformly bounded as $\gamma_{1}I\leq M(x(k))\leq\gamma_{2}I$ , for all $x(k)$ . The algorithm $\mathcal{A}$ is said to be contracting if

\displaystyle V(x(k))\leq\beta V(x(k-1)),

(7)

where $0<\beta\leq 1$ , is the rate of contraction.

We show that any contracting algorithm achieves bounded regret in Appendix A. Given that contracting algorithms achieve bounded regret, it is sufficient to ensure that SOLA is a contracting algorithm by designing the switching signal $\sigma(k)$ . We make the following assumptions for our regret analysis.

(A1)

$F(x,\mathcal{D})$ is $\ell$ -convex in $x$ : $\nabla^{2}F(x,\mathcal{D})>\ell I$ .
(A2)

Every local algorithm $\mathcal{A}^{i}$ is contracting with a rate $\beta^{i}$ . Further, for each pair of algorithms $(i,j)$ , there exists $\mu^{ij}>1$ such that the distance of differential dynamics is bounded as $V^{i}(x(k))=\mu^{ij}V^{j}(x(k))$ .

Assumption (A1) is a commonly used in regret analysis for online problems [McMahan(2017)]. However, (A2) needs more careful attention as it comes from the perspective of dynamical systems. Several algorithms such as gradient descent [Cisneros-Velarde and Bullo(2022)], SGD and decentralized SGD algorithms have been proven to be contractive [Boffi and Slotine(2020)]. When these algorithms are used to solve a common online learning problem, assumption (A2) captures the relationship between the differential dynamics of the local algorithms. Particularly, it provides an upper bound on the relative difference between the performance of the various local algorithms. We introduce $\bar{\mu}$ and $\bar{\beta}$ which will be useful for our regret analysis.

\displaystyle\bar{\mu}=\mathrm{arg}\max\limits_{ij}\mu^{ij},\quad\bar{\beta}=\mathrm{arg}\max\limits_{i}\beta^{i}

The variables $\bar{\mu}$ corresponds to the largest difference in differential dynamics and $\bar{\beta}$ corresponds to the slowest contracting rate. The following theorem addresses the design of the selecting signal such that SOLA achieves bounded regret.

Theorem 3.2.

(SOLA achieves bounded regret) For a selecting signal $\sigma(k)$ , let the number of switches between different agents be $N(k_{1},k_{2})$ over any horizon $\left[k_{1},k_{2}\right]$ , where $k_{2}>k_{1}$ . That is,

\displaystyle N(k_{1},k_{2})

\displaystyle=\sum\limits_{k=k_{1}+1}^{k_{2}}r(k),\quad\mathrm{where~{}}r(k)=\begin{cases}0,\quad\sigma(k)=\sigma(k-1),\\ 1,\quad\mathrm{otherwise}.\end{cases}

(8)

Then, if the number of switches satisfies

\displaystyle N(k_{1},k_{2})

\displaystyle\leq N_{0}+\frac{k_{2}-k_{1}}{\tau},

(9)

where $N_{0}>0$ is a constant, and $\tau=-\frac{\ln(\bar{\mu})}{\ln(\bar{\beta})}$ , then SOLA achieves bounded regret: $\mathcal{R}(K)\leq\epsilon(K)$ , where $\epsilon(K)$ is a decreasing function in $K$ .

We refer the reader to Appendix B for the proof of Theorem 3.2. In the literature related to switched systems [Liberzon(2003)], $N_{0}$ is typically referred to as the chatter bound and $\tau$ is referred to as the average dwell time.

Remark 3.3.

(Effect of cost function, data, and algorithm on the number of switches) It is important to note that $\tau$ captures the effect of the cost function, the data and the characteristics of the different local algorithms on the admissible switching signal in Theorem 3.2. For instance, poor data or a slow learning rate for the agent $i$ results in high values of $\beta^{i}$ . This leads to a higher value of $\bar{\beta}$ , which in turn means that the number of switches $N(k_{1},k_{2})$ needs to be small. Conversely, algorithms that learn fast allow for fast switching. Another aspect of $\tau$ is the similarity of the behavior of local algorithms captured through Assumption (A2). If the different local algorithms behave similarly, $\bar{\mu}\approx 1$ . This leads to a small value of $\tau$ . Conversely, if the algorithms behave very differently $\bar{\mu}$ is larger which restricts the frequency of switching local algorithms.

4 Numerical results

In this section, we show the effectiveness of SOLA for online linear regression and online classification using the MNIST dataset.

4.1 Online linear regression

We conduct experiments for online linear regression with a synthetic dataset. The agents acquire data $\big{(}A^{i}(k),B^{i}(k)\big{)}$ , where $A^{i}(k)\in^{1\times m(k)}$ , $B^{i}(k)\in^{3\times m(k)}$ , and $x\in^{3\times 1}$ . The online linear regression problem is

\displaystyle\min\limits_{x}\left|\left|A^{\sigma(k)}(k)-x^{\top}~{}B^{\sigma(k)}(k)\right|\right|^{2}_{2}.

In this experiment, the number of agents $M=2$ . One agent acquires data sampled as $A^{1}(k)={x^{*}}^{\top}B^{1}(k)+\zeta^{1}(k)$ , where $\zeta^{1}(k)\sim\mathcal{N}(0,3I)$ . The other agent collects data $A^{2}(k)={x^{*}}^{\top}B^{2}(k)+\zeta^{2}(k)$ , where $\zeta^{2}(k)\sim\mathcal{N}(0,30I)$ . Here, $B^{1}(k),B^{2}(k)\sim\mathcal{N}(0,0.5I).$ Figure 2 compares the performance of SOLA for different settings. The selecting signal $\sigma(k)$ periodically chooses between the two agents every ten instances, i.e. $N(k,k+10)=1$ . In Figure 2, we compare the performance of SOLA in two different settings, (i) Agent 1 has one sub-unit that uses centralized gradient descent, and agent 2 uses decentralized SGD with 5 sub-units, (ii) Agent 1 has 5 sub-units that perform FedAvg [McMahan et al.(2017)McMahan, Moore, Ramage, Hampson, and y Arcas] and agent 2 uses decentralized SGD. Lastly, we compare the performance of only agent 2 using decentralized SGD without SOLA. From Figure 2, it is evident SOLA with GD and decentralized SGD performs the best, whereas SOLA with FedAvg with decentralized SGD converges slower. Decentralized SGD by itself converges slower and has a higher error.

In Figure 2(b), we compare SOLA with $M=3$ . The three agents perform centralized GD, decentralized GD with five sub-units and FedAvg with five sub-units. The data used by the agents is the same as mentioned above. It is evident that a naive choice of the fusing variable not only causes more error but also leads to chattering. SOLA has a higher error when $M=3$ as compared to when $M=3$ . This is because fusing more agents requires a good choice of the fusing variable and data of good quality.

4.2 Online classification

In this experiment, we consider the online classification problem with $M=2$ and the cost is the cross-entropy loss. Here, one agent samples data from the MNIST dataset and employs centralized gradient descent. It uses a neural network that has one hidden layer with 128 neurons. The second agent performs decentralized SGD with sub-units having the same neural network architecture as agent 1. We compare the performance of SOLA for different numbers of sub-units for the second agent. In the first case, there are five sub-units and each sub-unit acquires only two labels. For example, the first sub-unit receives labels $0$ and $1$ , the second sub-unit receives labels $2$ and $3$ , and so on. In the second case, there are ten sub-units and each sub-unit receives only one distinct label. Further, every sub-unit is restricted to possess only $128$ images (reducing computation). The images sampled by the sub-units are from the MNIST dataset but marred with Gaussian noise of variance $0.5$ and zero mean. Figures 3(a) and (b) show the accuracy and entropy loss of SOLA on testing data over time. We denote the number of sub-units as $N$ in Figure 3. We observed that the performance metric $P(x,\mathcal{D})=\frac{1}{F(x,\mathcal{D})}$ gave the best performance for both accuracy and loss. The selecting signal periodically chooses between agents every five instances, i.e. $N(k,k+5)=1$ . It can be seen that SOLA still achieves an accuracy close to eighty-two percent for both cases of five and ten sub-units. However, the naive fusing choice with five sub-units has lower accuracy and more chattering.

5 Conclusions and Future Work

In this work, we considered the scenario where agents with different data and resources use different local algorithms to solve an online learning problem. Our proposed algorithm, SOLA, provides a way to systematically fuse the updates from different algorithms and ensure that regret is bounded. We also numerically analyze the performance of SOLA for different online learning scenarios. Future directions include the case of dynamically changing data distributions, tighter regret bounds and the adversarial case where byzantine agents provide malicious updates.

Appendix A Contracting Optimizers Achieve Bounded Regret

The connection between the stability of learning algorithms and bounded regret for online learning problems has been studied in [Poggio et al.(2011)Poggio, Voinea, and Rosasco, Ross and Bagnell(2011), Saha et al.(2012)Saha, Jain, and Tewari]. Particularly, in [Poggio et al.(2011)Poggio, Voinea, and Rosasco], the notion of online stability is defined as follows:

Definition A.1.

(Online stability) An algorithm $\mathcal{A}$ is said to be online stable if

\displaystyle\mathbb{E}\left|\left|F(x^{\mathcal{A}}(k),\mathcal{D}(k))-F(x^{\mathcal{A}}(k-1),\mathcal{D}(k-1))\right|\right|\leq\epsilon_{os}(k),\quad\forall t,

(10)

where $\epsilon_{os}(k)\rightarrow 0$ as $k\rightarrow\infty$ .

The notion of online stability captures the fact that for an online stable algorithm, the change in cost incurred between any time instances is bounded by a non-increasing function $\epsilon_{os}(k)$ . Importantly, Theorem 18 in [Ross and Bagnell(2011)] shows that online-stable algorithms achieve bounded regret where $\epsilon(K)\leq\sum_{k=1}^{K}\epsilon_{os}(k)$ . Also, [Poggio et al.(2011)Poggio, Voinea, and Rosasco] shows that iterative gradient-based methods such as GD, and SGD achieve online stability. We first model any iterative gradient descent algorithm $\mathcal{A}$ as a perturbated gradient descent.

\displaystyle x(t)

\displaystyle=\mathcal{A}(x(t-1),\mathcal{D})=x(t-1)-\eta~{}(\nabla F(x(t-1),\mathcal{D})+\xi(k)).

(11)

The perturbation to the true gradient is $\xi(k)\sim\mathcal{N}(0,\Sigma(\nabla F(x(k-1)))$ and $\eta$ is a constant learning rate. Further, the covariance of the perturbation $\Sigma(\nabla F(x(k)))$ satisfies $\lim\limits_{t\rightarrow\infty}\Sigma(\nabla F(x(k)))=\bm{0}$ , where $\bm{0}$ denotes the zero matrix. This model is commonly used in the analysis of algorithms such as SGD [Jastrzebski et al.(2017)Jastrzebski, Kenton, Arpit, Ballas, Fischer, Bengio, and Storkey, Mandt et al.(2017)Mandt, Hoffman, and Blei, Li and Orabona(2019)] and decentralized SGD [Boffi and Slotine(2020)]. The dynamics of any single sub-unit of decentralized SGD or FedAvg can be expressed with (11). Further, the average parameter of all the sub-units also follows the same dynamics, however, the noise characteristics of $\xi$ differ. Here, we show that perturbed iterative gradient-based algorithms are contracting as given by Definition 3.1, and are are online stable.

Theorem A.2.

(Gradient-based algorithms are contractive and achieve online stablility) Consider any iterative gradient-based algorithm $\mathcal{A}$ which updates the parameter $x$ as given by (11). If an algorithm $\mathcal{A}$ is contracting by Definition 3.1 in expectation, i.e.,

\displaystyle\mathbb{E}_{\xi}\left[V(x(k))\right]\leq\beta~{}\mathbb{E}_{\xi}\left[V(x(k-1))\right],

then $\mathcal{A}$ achieves online stability.

Proof A.3.

The differential dynamics of the system (11) is given by

	$\displaystyle\delta_{x(k)}$	$\displaystyle=P(x(k-1))\delta_{x(k-1)}-\eta\delta_{\xi(k-1)},$		(12)
	$\displaystyle P(x(k))$	$\displaystyle=\frac{\partial\big{(}x(k)-\eta~{}\nabla F(x(k))\big{)}}{\partial x(k)}=I-\eta\nabla^{2}F(x(k))$		(13)

Here $\xi$ is treated as an external signal. For the Euclidean metric $M(x(k))=I$ , the distance of the differential dynamics is given by

\displaystyle V(x(k))=\delta_{x(k-1)}P(x(k-1))^{2}\delta_{x(k-1)}-2\eta\delta_{\xi(k-1)}^{\top}P(x(k-1))\delta_{x(k-1)}+\eta^{2}||\delta_{\xi(k-1)}||^{2}.

(14)

Note that $P(x(k))<I-\eta\ell I$ due to the $\ell$ -convexity of $F$ . Therefore,

\displaystyle V(x(k))<(1-\eta\ell)^{2}V(x(k-1))-2\eta\delta_{\xi(k-1)}^{\top}P(x(k-1))\delta_{x(k-1)}+\eta^{2}||\delta_{\xi(k-1)}||^{2}.

(15)

Taking the expectation with respect to the perturbation $\xi(t)$ ,

\displaystyle\mathbb{E}[V(x(k))]

\displaystyle<(1-\eta\ell)^{2}\mathbb{E}[V(x(k-1))]+\eta^{2}\mathrm{Var}[\xi(k-1)].

(16)

If the learning rate $\eta$ ensures that

\displaystyle(1-\eta\ell)^{2}\mathbb{E}[V(x(k))]+\eta^{2}\mathrm{Var}[\xi(k-1)]\leq\mathbb{E}[V(x(k))],

(17)

then $\mathcal{A}$ is contractive. The proof of online stability follows along the same lines as Theorem 2 of [Poggio et al.(2011)Poggio, Voinea, and Rosasco]. The cost $F(x(t+1))$ can be expressed using the first two derivatives by Taylor expansion. The stability of the perturbed gradient can be then used to bound the change in cost as given by Definition A.1.

Appendix B Proof of Theorem 3.2

First, consider SOLA with the fusing variable $\alpha(k)=1$ . Then,

\displaystyle x(k)

\displaystyle=\mathcal{A}^{\sigma(k-1)}(x(k-1),\mathcal{D}(k-1)).

(18)

By assumption (A2), we have that between any two instances $k_{1}$ and $k_{2}$

	$\displaystyle V^{\sigma(k_{2})}(x(k_{2}))$	$\displaystyle\leq\bar{\beta}V^{\sigma(k_{2})}(x(k_{2}-1))\leq\bar{\mu}\bar{\beta}V^{\sigma(k_{2}-1)}(x(k_{2}-1))$
		$\displaystyle\leq{\bar{\mu}}^{N(k_{1},k_{2})}{\bar{\beta}}^{(k_{2}-k_{1})}V^{\sigma(k_{1})}x(k_{1})=e^{N(k_{1},k_{2})\ln\bar{\mu}}~{}e^{(k_{2}-k_{1})\ln\bar{\beta}}~{}V^{\sigma(k_{1})}x(k_{1})$

To ensure that SOLA is a contraction, we need that

\displaystyle N(k_{1},k_{2})\ln u^{*}+(k_{2}-k_{1})\ln\bar{\beta}

\displaystyle\leq 0

To admit at least one switch if $\frac{(k_{2}-k_{1})}{\tau}<1$ , we introduce $N_{0}>0$ . Therefore, when the number of switches $N(k_{1},k_{2})$ satisfies

\displaystyle N(k_{1},k_{2})

\displaystyle\leq N_{0}+\frac{(k_{2}-k_{1})}{\tau},

(19)

SOLA is a contracting optimizer that achieves online stability. Further, when $0\leq\alpha(k)\leq 1$ , SOLA uses a convex combination of the update by the local algorithm $\mathcal{A}^{\sigma(k)}$ and the parameter at the previous time instance $x(k-1)$ . The convex combination of contracting systems also results in a contracting system as shown in [Lohmiller and Slotine(1998)]. Hence, the overall SOLA algorithm achieves online stability.

References

[Aketi et al.(2023)Aketi, Kodge, and Roy] S. A. Aketi, S. Kodge, and K. Roy. Neighborhood gradient mean: An efficient decentralized learning method for non-iid data. Transactions on Machine Learning Research, 2023.
[Boffi and Slotine(2020)] N. M. Boffi and J.-J. E. Slotine. A continuous-time analysis of distributed stochastic gradient. Neural computation, 32(1):36–96, 2020.
[Chen et al.(2020)Chen, Ning, Slawski, and Rangwala] Y. Chen, Y. Ning, M. Slawski, and H. Rangwala. Asynchronous online federated learning for edge devices with non-iid data. In 2020 IEEE International Conference on Big Data (Big Data), pages 15–24. IEEE, 2020.
[Cisneros-Velarde and Bullo(2022)] P. Cisneros-Velarde and F. Bullo. A contraction theory approach to optimization algorithms from acceleration flows. In International Conference on Artificial Intelligence and Statistics, pages 1321–1335. PMLR, 2022.
[Cutkosky(2019)] A. Cutkosky. Combining online learning guarantees. In Conference on Learning Theory, pages 895–913. PMLR, 2019.
[Dixit et al.(2019)Dixit, Bedi, Tripathi, and Rajawat] Rishabh Dixit, Amrit Singh Bedi, Ruchi Tripathi, and Ketan Rajawat. Online learning with inexact proximal online gradient descent algorithms. IEEE Transactions on Signal Processing, 67(5):1338–1352, 2019.
[Elisseeff et al.(2005)Elisseeff, Evgeniou, Pontil, and Kaelbing] A. Elisseeff, T. Evgeniou, M. Pontil, and L. P. Kaelbing. Stability of randomized learning algorithms. Journal of Machine Learning Research, 6(1), 2005.
[Esfandiari et al.(2021)Esfandiari, Tan, Jiang, Balu, Herron, Hegde, and Sarkar] Y. Esfandiari, S. Y. Tan, Z. Jiang, A. Balu, E. Herron, C. Hegde, and S. Sarkar. Cross-gradient aggregation for decentralized learning from non-iid data. In International Conference on Machine Learning, pages 3036–3046. PMLR, 2021.
[Ferla and Jensen(2002)] C. M. Ferla and F. B. Jensen. Are current environmental databases adequate for sonar predictions in shallow water? In Impact of Littoral Environmental Variability of Acoustic Predictions and Sonar Performance, pages 555–562. Springer, 2002.
[Foster et al.(2015)Foster, Rakhlin, and Sridharan] D. J. Foster, A. Rakhlin, and K. Sridharan. Adaptive online learning. Advances in Neural Information Processing Systems, 28, 2015.
[Hand(2012)] D. J. Hand. Assessing the performance of classification methods. International Statistical Review, 80(3):400–414, 2012.
[Hoang et al.(2019)Hoang, Hoang, Low, and How] T. N. Hoang, Q. M. Hoang, K. H. Low, and J. How. Collective online learning of gaussian processes in massive multi-agent systems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7850–7857, 2019.
[Hoi et al.(2021)Hoi, Sahoo, Lu, and Zhao] S. C. H. Hoi, D. Sahoo, J. Lu, and P. Zhao. Online learning: A comprehensive survey. Neurocomputing, 459:249–289, 2021.
[Jastrzebski et al.(2017)Jastrzebski, Kenton, Arpit, Ballas, Fischer, Bengio, and Storkey] S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Three factors influencing minima in sgd. arXiv preprint arXiv:1711.04623, 2017.
[Jiang et al.(2021)Jiang, Zhang, Gu, and Zhu] J. Jiang, W. Zhang, J. Gu, and W. Zhu. Asynchronous decentralized online learning. Advances in Neural Information Processing Systems, 34:20185–20196, 2021.
[Kay(1993)] S. M. Kay. Fundamentals of statistical signal processing: estimation theory. Prentice-Hall, Inc., 1993.
[Kozachkov et al.(2022)Kozachkov, Wensing, and Slotine] L. Kozachkov, P. Wensing, and J. J. E. Slotine. Generalization as dynamical robustness–the role of riemannian contraction in supervised learning. Transactions on Machine Learning Research, 2022.
[Le Bars et al.(2023)Le Bars, Bellet, Tommasi, Lavoie, and Kermarrec] B. Le Bars, A. Bellet, M. Tommasi, E. Lavoie, and A.M. Kermarrec. Refined convergence and topology learning for decentralized sgd with heterogeneous data. In International Conference on Artificial Intelligence and Statistics, pages 1672–1702. PMLR, 2023.
[Li et al.(2023a)Li, Peng, Zhang, Ding, Hu, and Shen] W. Li, Yong Peng, Miao Zhang, Liang Ding, Han Hu, and Li Shen. Deep model fusion: A survey. arXiv preprint arXiv:2309.15698, 2023a.
[Li and Orabona(2019)] X. Li and F. Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes. In The 22nd international conference on artificial intelligence and statistics, pages 983–992. PMLR, 2019.
[Li et al.(2023b)Li, Xie, and Li] X. Li, L. Xie, and N. Li. A survey on distributed online optimization and online games. Annual Reviews in Control, 56:100904, 2023b.
[Liberzon(2003)] D. Liberzon. Switching in systems and control, volume 190. Springer, 2003.
[Lohmiller and Slotine(1998)] W. Lohmiller and J.-J. E. Slotine. On contraction analysis for non-linear systems. Automatica, 34(6):683–696, 1998.
[Mandt et al.(2017)Mandt, Hoffman, and Blei] S. Mandt, M. D. Hoffman, and D. M. Blei. Stochastic gradient descent as approximate bayesian inference. arXiv preprint arXiv:1704.04289, 2017.
[McMahan et al.(2017)McMahan, Moore, Ramage, Hampson, and y Arcas] B. McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
[McMahan(2017)] H. B. McMahan. A survey of algorithms and analysis for adaptive online learning. The Journal of Machine Learning Research, 18(1):3117–3166, 2017.
[Poggio et al.(2011)Poggio, Voinea, and Rosasco] T. Poggio, S. Voinea, and L. Rosasco. Online learning, stability, and stochastic gradient descent. arXiv preprint arXiv:1105.4701, 2011.
[Ross and Bagnell(2011)] S. Ross and J. A. Bagnell. Stability conditions for online learnability. arXiv preprint arXiv:1108.3154, 2011.
[Saha et al.(2012)Saha, Jain, and Tewari] A. Saha, P. Jain, and A. Tewari. The interplay between stability and regret in online learning. arXiv preprint arXiv:1211.6158, 2012.
[Shikhman and Stein(2009)] V. Shikhman and O. Stein. Constrained optimization: projected gradient flows. Journal of optimization theory and applications, 140:117–130, 2009.
[Sontag(2022)] E. D. Sontag. Remarks on input to state stability of perturbed gradient flows, motivated by model-free feedback control learning. Systems & Control Letters, 161:105138, 2022.
[Zhang et al.(2018)Zhang, Cisse, Dauphin, and Lopez-Paz] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.
[Zhu et al.(2021)Zhu, Xu, Liu, and Jin] H. Zhu, J. Xu, S. Liu, and Y. Jin. Federated learning on non-iid data: A survey. Neurocomputing, 465:371–390, 2021.