This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Decentralized Federated Learning with Model Caching on Mobile Agents

Xiaoyu Wang1, Guojun Xiong2, Houwei Cao3, Jian Li2, Yong Liu1
1New York University, 2Stony Brook University, 3New York Institute of Technology
{wang.xiaoyu, yongliu}@nyu.edu, {guojun.xiong, jian.li.3}@stonybrook.edu,
[email protected]
Abstract

Federated Learning (FL) aims to train a shared model using data and computation power on distributed agents coordinated by a central server. Decentralized FL (DFL) utilizes local model exchange and aggregation between agents to reduce the communication and computation overheads on the central server. However, when agents are mobile, the communication opportunity between agents can be sporadic, largely hindering the convergence and accuracy of DFL. In this paper, we study delay-tolerant model spreading and aggregation enabled by model caching on mobile agents. Each agent stores not only its own model, but also models of agents encountered in the recent past. When two agents meet, they exchange their own models as well as the cached models. Local model aggregation works on all models in the cache. We theoretically analyze the convergence of DFL with cached models, explicitly taking into account the model staleness introduced by caching. We design and compare different model caching algorithms for different DFL and mobility scenarios. We conduct detailed case studies in a vehicular network to systematically investigate the interplay between agent mobility, cache staleness, and model convergence. In our experiments, cached DFL converges quickly, and significantly outperforms DFL without caching.

1 Introduction

1.1 Federated Learning on Mobile Agents

Federated learning (FL) is a type of distributed machine learning (ML) that prioritizes data privacy [mcmahan2017communication]. The traditional FL involves a central server that connects with a large number of agents. The agents retain their data and do not share them with the server. During each communication round, the server sends the current global model to the agents, and a small subset of agents are chosen to update the global model by running stochastic gradient descent (SGD) [robbins1951stochastic] for multiple iterations on their local data. The central server then aggregates the updated parameters to obtain the new global model. FL is a natural fit for the emerging Internet-of-Things (IoT) systems, where each IoT device not only can sense its surrounding environment to collect local data, but also is equipped with computation resources for local model training, as well as communication interfaces to interact with a central server for model aggregation. Many IoT devices are mobile, ranging from mobile phones/tablets, autonomous cars/drones, to self-navigating robots, etc. In the recent research efforts on smart connected vehicles, there has been a focus on integrating vehicle-to-everything (V2X) networks with Machine Learning (ML) tools and distributed decision making [barbieri2022decentralized], particularly in the area of computer vision tasks such as traffic light and signal recognition, road condition sensing, intelligent obstacle avoidance, and intelligent road routing, etc. With FL, vehicles locally train deep ML models and upload only the model parameters (i.e., neural network weights and biases) to the central server. This approach not only reduces bandwidth consumption, as the size of model parameters is much smaller than the size of raw image/video data, but also leverages computing power on vehicles, and protects user privacy.

However, FL on mobile agents still faces communication and computation challenges. The movements of mobile agents, especially at high speed, lead to fast-changing channel conditions on the wireless connections between mobile agents and the central FL server, resulting in long latency in FL [niknam2020federated]. Battery-powered mobile agents also have limited power budget for long-range wireless communications. Meanwhile, the non-i.i.d data distributions on mobile agents make it difficult for local models to converge. As a result, FL on mobile agents to obtain an optimal global model remains an open challenge. Decentralized federated learning (DFL) has emerged as a potential solution where local model aggregations are conducted between neighboring mobile agents using local device-to-device (D2D) communications with high bandwidth, low latency and low power consumption [10251949] . Preliminary studies have demonstrated that decentralized FL algorithms have the potential to significantly reduce the high communication costs associated with centralized FL. However, blindly applying model aggregation algorithms, such as FedSDG [shokri2015privacy] and FedAvg [mcmahan2017communication], developed for centralized FL to distributed FL cannot achieve fast convergence and high model accuracy [liu2022distributed].

1.2 Delay-Tolerant Model Communication and Aggregation Through Caching

D2D model communication between a pair of mobile agents is possible only if they are within each other’s transmission ranges. If mobile agents only meet with each others sporadically, there will not be enough model aggregation opportunity for fast convergence. In addition, with non-i.i.d data distributions on agents, if an agent only meets with agents from a small cluster, there is no way for the agent to interact with models trained by data samples outside of its cluster, leading to disaggregated local models that cannot perform well on the global data distribution. It is therefore essential to achieve fast and even model spreading using limited D2D communication opportunities among mobile agents. A similar problem was studied in the context of Mobile Ad hoc Network (MANET), where wireless communication between mobile nodes are sporadic. The efficiency of data dissemination in MANET can be significantly improved by Delay-Tolerant Networking (DTN) [burleigh2003delay]: a mobile node caches data it received from nodes it met in the past; when meeting with a new node, it not only transfers its own data, but also the cached data of other nodes. Essentially, node mobility forms a new “communication" channel through which cached data are transported through node movement in physical space. It is worth noting that, due to multi-hop caching-and-relay, DTN transmission incurs longer delay than D2D direct transmission. Data staleness can be controlled by caching and relay algorithms to match the target application’s delay tolerance.

Table 1: Notations and Terminologies.
Notation Description
NN Number of agents
TT Number of global epochs
[N][N] Set of integers {1,,N}\{1,...,N\}
KK Number of local updates
xi(t)x_{i}(t) Model in the ttht^{th} epoch on agent ii
x(t)x(t) Global Model in the ttht^{th} epoch, x(t)=𝔼i[N][xi(t)]x(t)=\mathbb{E}_{i\in[N]}[x_{i}(t)]
xi(t,k)x_{i}(t,k) Model initialized from xtx_{t}, after kk-th local update on agent ii
x~i(t)\tilde{x}_{i}(t) Model xi(t)x_{i}(t) after local updates
𝒟i\mathcal{D}^{i} Dataset on the ii-th agent
α\alpha Aggregation weight
tτt-\tau Staleness
τmax\tau_{max} Tolerance of staleness in cache
||||||\cdot|| All the norms in the paper are l2l_{2}-norms

Motivated by DTN, we propose delay-tolerant DFL communication and aggregation enabled by model caching on mobile agents. To realize DTN-like model spreading, each mobile agent stores not only its own local model, but also local models received from other agents in the recent history. Whenever it meets another agent, it transfers its own model as well as the cached models to the agent through high-speed D2D communication. Local model aggregation on an agent works on all its cached models, mimicking a local parameter server. Compared with DFL without caching, DTN-like model spreading can push local models faster and more evenly to the whole network; aggregating all cached models can facilitate more balanced learning than pairwise model aggregation. While DFL model caching sounds promising, it also faces a new challenge of model staleness: a cached model from an agent is not the current model on that agent, with the staleness determined by the mobility patterns, as well as the model spreading and caching algorithms. Using stale models in model aggregation may slow down or even deviate model convergence.

The key challenge we want to address in this paper is how to design cached model spreading and aggregation algorithms to achieve fast convergence and high accuracy in DFL on mobile agents. Towards this goal, we make the following contributions:

  1. 1.

    We develop a new decentralized FL framework that utilizes model caching on mobile agents to realize delay-tolerant model communication and aggregation;

  2. 2.

    We theoretically analyze the convergence of aggregation with cached models, explicitly taking into account the model staleness;

  3. 3.

    We design and compare different model caching algorithms for different DFL and mobility scenarios.

  4. 4.

    We conduct a detailed case study on vehicular network to systematically investigate the interplay between agent mobility, cache staleness, and convergence of model aggregation. Our experimental results demonstrate that our cached DFL converges quickly and significantly outperforms DFL without caching.

2 Background

2.1 Global Training Objective

Similar to the standard federated learning problem, the overall objective of mobile DFL is to learn a single global statistical model from data stored on tens to potentially millions of mobile agents. The overall goal is to find the optimal model weights xdx^{*}\in\mathbb{R}^{d} to minimize the global loss function:

minxF(x),whereF(x)=1Ni[N]𝔼zi𝒟if(x;zi),\min_{x}F(x),\,\text{where}\,F(x)=\frac{1}{N}\sum_{i\in[N]}\mathbb{E}_{z^{i}\sim\mathcal{D}^{i}}{f(x;z^{i})}, (1)

where NN denotes the total number of mobile agents, and each agent has its own local dataset, i.e., 𝒟i𝒟j,ij\mathcal{D}^{i}\neq\mathcal{D}^{j},\forall i\neq j. And ziz^{i} is sampled from the local data 𝒟i\mathcal{D}^{i}.

2.2 DFL Training with Local Model Caching

All agents participate in DFL training over TT global epochs. At the beginning of the ttht^{th} epoch, agent ii’s local model is xi(t)x_{i}(t). After KK steps of SGD to solve the following optimization problem with a regularized loss function:

minx𝔼ziDif(x;zi)+ρ2xxi(t)2,\min_{x}\mathbb{E}_{z^{i}\sim D^{i}}f(x;z^{i})+\frac{\rho}{2}||x-x_{i}(t)||^{2},

agent ii obtains an updated local model x~i(t)\tilde{x}_{i}(t). Meanwhile, during the ttht^{th} epoch, driven by their mobility patterns, each agent meets and exchanges models with other agents. Other than its own model, agent ii also stores models it received from other agents encountered in the recent history in its local cache 𝒞i(t)\mathcal{C}_{i}(t). When two agents meet, they not only exchange their own local models, but also share their cached models with each others to maximize the efficiency of DTN-like model spreading. The models received by agent ii will be used to update its model cache 𝒞i(t)\mathcal{C}_{i}(t), following different cache update algorithms, such as LRU update method (Algorithm 2) or Group-based LRU update method, which will be described in details later. As the cache size of each agent is limited, it is important to design an efficient cache update rule in order to maximize the caching benefit.

After cache updating, each agent conducts local model aggregation using all the cached models with customized aggregation weights {αj(0,1)}\{\alpha_{j}\in(0,1)\} to get the updated local model xi(t+1)x_{i}(t+1) for epoch t+1t+1. In our simulation, we take the aggregation weight as αj=(nj/j𝒞i(t)nj)\alpha_{j}=(n_{j}/\sum_{j\in\mathcal{C}_{i}(t)}n_{j}), where njn_{j} is the number of samples on agent jj.

The whole process repeats until the end of TT global epochs. The detailed algorithm is shown in Algorithm 1. zkiz^{i}_{k} are randomly drawn local data samples on agent ii for the kk-th local update, and η\eta is the learning rate.

Algorithm 1 Cached Decentralized Federated Learning (DFedCache)

Input: T Global epochs, K local updates and {xi(0)}i=1N\{x_{i}(0)\}^{N}_{i=1}, Tolerance of staleness τmax\tau_{max}

1:
2:function Local Update(xi(t)x_{i}(t)):
3:     xi(t,0)=xi(t)x_{i}(t,0)=x_{i}(t)
4:     Define gx(t)(x;z)=f(x;z)+ρ2xx(t)2g_{x(t)}(x;z)=f(x;z)+\frac{\rho}{2}||x-x(t)||^{2}
5:     for k=1,2,,Kk=1,2,\ldots,K do
6:         Randomly sample zki𝒟iz^{i}_{k}\sim\mathcal{D}^{i}
7:         xi(t,k)=xi(t,k1)ηgxi(t)(xi(t,k1);zki)x_{i}(t,k)=x_{i}(t,k-1)-\eta\nabla g_{x_{i}(t)}(x_{i}(t,k-1);z^{i}_{k})
8:     end for
9:     return x~i(t)=xi(t,K)\tilde{x}_{i}(t)=x_{i}(t,K)
10:end function
1:
2:function Model Aggregation(𝒞i(t)\mathcal{C}_{i}(t)):
3:     xi(t+1)=j𝒞i(t)αjx~j(τ)x_{i}(t+1)=\sum_{j\in\mathcal{C}_{i}(t)}\alpha_{j}\tilde{x}_{j}(\tau)
4:     return xi(t+1)x_{i}(t+1)
5:end function
1: Main Process:
2:for t=0,1,T1t=0,1,\ldots T-1 do
3:     for i=0,1,,N1i=0,1,\ldots,N-1 do
4:         x~i(t)\tilde{x}_{i}(t)\leftarrowLocal Update(xi(t)x_{i}(t))
5:         𝒞i(t)\mathcal{C}_{i}(t)\leftarrowCache Update(𝒞i(t1),τmax\mathcal{C}_{i}(t-1),\tau_{max})
6:         xi(t+1)x_{i}(t+1)\leftarrowModel Aggregation(𝒞i(t)\mathcal{C}_{i}(t))
7:     end for
8:end for

Output: {xi(T)}i=1N\{x_{i}(T)\}^{N}_{i=1}

Remark 1.

Note the NN agents communicate with each others in a mobile D2D network. D2D communication can only happen between an agent and its neighbors within a short range (e.g. several hundred meters). Since agent locations are constantly changing (for instance, vehicles continuously move along the road network of a city), D2D network topology is dynamic and can be sparse at any given epoch. To ensure the eventual model convergence, the union graph of D2D networks over multiple epochs should be strongly-connected for efficient DTN-like model spreading. We also assume D2D communications are non-blocking, carried by short-distance high-throughput communication methods such as mmWave or WiGig, which has enough capacity to complete the exchange of cached models before the agents go out of each other’s communication ranges.

Remark 2.

Intuitively, comparing to the DFL without cache (e.g. DeFedAvg [sun2022decentralized]), where each agent can only get new model by averaging with another model, cached DFL uses more models (delayed version) for aggregation, thus bring more underlying information from dataset on more agents. Although DFedCache introduces stale models, it can benefit model convergence, especially in highly heterogeneous data distribution scenarios. Overall, our cached DFL framework allows each agent to act as a local proxy for Centralized Federated Learning with delayed cached models, thus speedup the convergence especially in highly heterogeneous data distribution scenarios, where are challenging for the traditional DFL to converge.

Remark 3.

As mentioned above, our approach inevitably introduces stale models. Intuitively, larger staleness results in greater error in the global model. For the cached models with large staleness tτt-\tau, we could set a threshold τmax\tau_{max} to kick old models out of model spreading, which is described in the cache update algorithms. The practical value for τmax\tau_{max} should be related to the cache capacity and communication frequency between agents. In our experimental results, we choose τmax\tau_{max} to be 10 or 20 epochs. Results and analysis about the effect of different τmax\tau_{max} can be found in results section 5.

3 Convergence Analysis

We now theoretically investigate the impact of caching, especially the staleness of cached models, on DFL model convergence. We introduce some definitions and assumptions.

Definition 1 (Smoothness).

A differentiable function ff is LL-smooth if for x,y,\forall x,y, f(y)f(x)f(x),yx+L2yx2f(y)-f(x)\leq\langle\nabla f(x),y-x\rangle+\frac{L}{2}||y-x||^{2}, where L>0L>0.

Definition 2 (Bounded Variance).

There exists constant ς>0\varsigma>0 such that the global variability of the local gradient of the loss function is bounded Fj(x)F(x)2ς2,j[N],xd||\nabla F_{j}(x)-\nabla F(x)||^{2}\leq\varsigma^{2},\forall j\in[N],x\in\mathbb{R}^{d}.

Definition 3 (L-Lipschitz Continuous Gradient).

There exists a constant L>0L>0, such that F(x)F(x)Lxx,x,xd.||\nabla F(x)-\nabla F(x^{\prime})||\leq L^{\prime}||x-x^{\prime}||,\forall x,x^{\prime}\in\mathbb{R}^{d}.

Theorem 4.

Assume that FF is LL-smooth and convex, and each agent executes KK local updates before meeting and exchanging models, after that, then does model aggregation. We also assume bounded staleness τ<τmax\tau<\tau_{max}, as the kick-out threshold. Furthermore, we assume, xd,i[N]\forall x\in\mathbb{R}^{d},i\in[N], and z𝒟i,f(x;z)2V,gx(x;z)2V,xd\forall z\sim\mathcal{D}^{i},||\nabla f(x;z)||^{2}\leq V,||\nabla g_{x^{\prime}}(x;z)||^{2}\leq V,\forall x^{\prime}\in\mathbb{R}^{d}, and F\nabla F satisfies L-Lipschitz Continuous Gradient. For any small constant ϵ>0\epsilon>0, if we take ρ>0\rho>0, and satisfying (1+2ρ+ϵ)V+(ρ2ρ2)x(t,k1)x(t)20,x(t,k1),x(t)-(1+2\rho+\epsilon)V+(\rho^{2}-\frac{\rho}{2})||x(t,k-1)-x(t)||^{2}\geq 0,\forall x(t,k-1),x(t), after TT global epochs, Algorithm 1 converges to a critical point:

mint=0T1𝔼F(x(t))2\displaystyle\min_{t=0}^{T-1}\mathbb{E}||\nabla F(x(t))||^{2} τmaxϵηC1KT𝔼[F(x(0))F(xM(T)(T)]+𝒪(ηρK2ϵC1)\displaystyle\leq\frac{\tau_{max}}{\epsilon\eta C_{1}KT}\mathbb{E}[F(x(0))-F(x_{M(T)}(T)]+\mathcal{O}(\frac{\eta\rho K^{2}}{\epsilon C_{1}})
𝒪(τmaxϵηC1KT)+𝒪(ηρK2ϵC1).\displaystyle\leq\mathcal{O}(\frac{\tau_{max}}{\epsilon\eta C_{1}KT})+\mathcal{O}(\frac{\eta\rho K^{2}}{\epsilon C_{1}}). (2)

3.1 Proof Sketch

We now highlight the key ideas and challenges behind our convergence proof.

Step 1: Similar to Theorem 1 in DBLP:journals/corr/abs-1903-03934, we bound the expected cost reduction after KK steps of local updates on the jj-th agent, j[N]\forall j\in[N], as

𝔼[F(x~j(t))F(xj(t))]=𝔼[F(xj(t,K))F(xj(t,0))]ηϵk=0K1𝔼F(xj(t,k))2+η2𝒪(ρK3V).\begin{split}\mathbb{E}[F(\tilde{x}_{j}(t))-F(x_{j}(t))]&=\mathbb{E}[F(x_{j}(t,K))-F(x_{j}(t,0))]\\ &\leq-\eta\epsilon\sum_{k=0}^{K-1}\mathbb{E}||\nabla F(x_{j}(t,k))||^{2}+\eta^{2}\mathcal{O}(\rho K^{3}V).\end{split}

Step 2: For any epoch tt, we find the index M(t)M(t) of the agent whose model is the “worst", i.e., M(t)=argmaxj[N]{F(xj(t))}M(t)=\arg\max_{j\in[N]}\{F(x_{j}(t))\}, and the “worst" model on all agents over the time period of [tτmax+1,t][t-\tau_{max}+1,t] as

𝒯(t,τmax)=argmaxt[tτmax+1,t]{F(xM(t)(t))}.\mathcal{T}(t,\tau_{max})=\arg\max_{t\in[t-\tau_{max}+1,t]}\{F(x_{M(t)}(t))\}.

Step 3: We bound the cost reduction of the “worst" model at epoch t+1t+1 from the “worst" model in the time period of [tτmax+1,t][t-\tau_{max}+1,t], i.e., the worst possible model that can be stored in some agent’s cache at time tt, as:

𝔼[F(xM(t+1)(t+1))\displaystyle\mathbb{E}[F\left(x_{M(t+1)}(t+1)\right) F(xM(𝒯(t,τmax))(𝒯(t,τmax)))]\displaystyle-F\left(x_{M(\mathcal{T}(t,\tau_{max}))}(\mathcal{T}(t,\tau_{max}))\right)]
ϵηC1Kminτ=ttτmax+1F(x(τ))2+η2𝒪(ρK3V).\displaystyle\leq-\epsilon\eta C_{1}\mathrlap{K}\min_{\tau=t}^{t-\tau_{max}+1}||\nabla F(x(\tau))||^{2}+\eta^{2}\mathcal{O}(\rho K^{3}V). (3)

Step 4: We iteratively construct a time sequence {T0,T1,T2,,TNT}{0,1,,T1}\{T^{\prime}_{0},T^{\prime}_{1},T^{\prime}_{2},...,T^{\prime}_{N_{T}}\}\subseteq\{0,1,...,T-1\} in the backward fashion so that

TNT\displaystyle T^{\prime}_{N_{T}} =T1;\displaystyle=T-1;
Ti\displaystyle T^{\prime}_{i} =𝒯(Ti+1,τmax)1,1iNT1;\displaystyle=\mathcal{T}(T^{\prime}_{i+1},\tau_{max})-1,\quad 1\leq i\leq N_{T}-1;
T0\displaystyle T^{\prime}_{0} =0.\displaystyle=0.

Step 5: Applying inequality (3.1) at all time instances {T0,T1,T2,,TNT}\{T^{\prime}_{0},T^{\prime}_{1},T^{\prime}_{2},...,T^{\prime}_{N_{T}}\}, after T global epochs, we have,

mint=0T1𝔼\displaystyle\min_{t=0}^{T-1}\mathbb{E} [F(x(t))2]1NTt=T0TNTminτ=ttτmax+1F(x(τ))2\displaystyle[||\nabla F(x(t))||^{2}]\leq\frac{1}{N_{T}}\sum_{t=T^{\prime}_{0}}^{T^{\prime}_{N_{T}}}\min_{\tau=t}^{t-\tau_{max}+1}||\nabla F(x(\tau))||^{2}
1ϵηKC1NTt=T0TNT𝔼[F(xM(𝒯(t,τmax))(𝒯(t,τmax)))F(xM(t+1)(t+1))]+𝒪(ηρK2VϵC1)\displaystyle\leq\frac{1}{\epsilon\eta KC_{1}N_{T}}\sum_{t=T^{\prime}_{0}}^{T^{\prime}_{N_{T}}}\mathbb{E}[F(x_{M(\mathcal{T}(t,\tau_{max}))}(\mathcal{T}(t,\tau_{max})))-F(x_{M(t+1)}(t+1))]+\mathcal{O}(\frac{\eta\rho K^{2}V}{\epsilon C_{1}})
1ϵηKC1NT𝔼[F(x(0))F(xM(T)(T)]+𝒪(ηρK2VϵC1)\displaystyle\leq\frac{1}{\epsilon\eta KC_{1}N_{T}}\mathbb{E}[F(x(0))-F(x_{M(T)}(T)]+\mathcal{O}(\frac{\eta\rho K^{2}V}{\epsilon C_{1}})
𝒪(τmaxϵηKC1T)+𝒪(ηρK2ϵC1).\displaystyle\leq\mathcal{O}(\frac{\tau_{max}}{\epsilon\eta KC_{1}T})+\mathcal{O}(\frac{\eta\rho K^{2}}{\epsilon C_{1}}). (4)

Step 6: With the results in (3.1) and by leveraging Theorem 1 in yang2021achieving, Algorithm 1 converges to a critical point after TT global epochs.

4 Realization of DFedCache

4.1 Datasets

We conduct experiments using three standard FL benchmark datasets: MNIST [deng2012mnist], FashionMNIST [xiao2017fashionmnistnovelimagedataset], CIFAR-10 [krizhevsky2009learning] on N=100N=100 vehicles, with CNN, CNN and ResNet-18 [he2016deep] as models respectively. We design three different data distribution settings: non-i.i.d, i.i.d, Dirichlet distribution method. In extreme non-i.i.d, we use a setting similar to su2022boost, data points in training set are sorted by labels and then evenly divided into 200 shards, with each shards containing 1–2 labels out of 10 labels. Then 200 shards are randomly assigned to 100 vehicles unevenly: 10% vehicles receive 4 shards, 20% vehicles receive 3 shards, 30% vehicles receive 2 shards and the rest 40% receive 1 shards. For i.i.d, we randomly allocate all the training data points to 100 vehicles. For Dirichlet distribution, we follow the setting in xiong2024deprl, to take a heterogeneous allocation by sampling piDirN(π)p_{i}\sim Dir_{N}(\pi), where π\pi is the parameter of Dirichlet distribution. We take π=0.5\pi=0.5 in our following experiments.

4.2 Evaluation Setup

The baseline algorithm is Decentralized FedAvg [sun2022decentralized], which implements simple decentralized federated optimization. For convenience, we name Decentralized FedAvg as DFL in the following results. We set batch size to 64 in all experiments. For MNIST and FashionMNIST, we use 60k data points for training and 10k data points for testing. For CIFAR-10, we use 50k data points for training and 10k data points for testing. Different from training set partition, we do not split the testset. For MNIST and FashionMNIST, we test local models of 100 vehicles on the 10k data points of the whole test set and get the average test accuracy for the evaluation metric. What’s more, for CIFAR-10, due to the computing overhead, we sample 1000 data points from test set for each vehicle and use the average test accuracy of 100 vehicles as the evaluation metric. For all the experiments, we train for 1,000 global epochs, and implement early stop when the average test accuracy stops increasing for at least 20 epochs. For MNIST and FashionMNIST experiments, we use 10 compute nodes, each with 10 CPUs, to simulate DFL on 100 vehicles. CIFAR-10 results are obtained from 1 compute node with 5 CPUs and 1 A100 NVIDIA GPU. We implement all algorithms in PyTorch [paszke2017automatic] on Python3.

Algorithm 2 LRU Model Cache Update (LRU Update)

Input: Current cache 𝒞i(t)\mathcal{C}_{i}(t), Cache of agent jj: 𝒞j(t)\mathcal{C}_{j}(t), Model xj(t)x_{j}(t) from agent jj, Current time tt, Cache size 𝒞max\mathcal{C}_{max}, Tolerance of staleness τmax\tau_{max}

1: Main Process:
2:for xk(τ)𝒞i(t)x_{k}(\tau)\in\mathcal{C}_{i}(t) or 𝒞j(t)\mathcal{C}_{j}(t) do
3:     if tττmaxt-\tau\geq\tau_{max} then
4:         Remove xk(τ)x_{k}(\tau) from related cache 𝒞i(t)\mathcal{C}_{i}(t) or 𝒞j(t)\mathcal{C}_{j}(t)
5:     end if
6:end for
7:Add or replace xj(t)x_{j}(t) into 𝒞i(t)\mathcal{C}_{i}(t)
8:for xk(τ)𝒞j(t)x_{k}(\tau)\in\mathcal{C}_{j}(t) do
9:     if xk𝒞i(t)x_{k}\notin\mathcal{C}_{i}(t) then
10:         Add xk(τ)x_{k}(\tau) into 𝒞i(t)\mathcal{C}_{i}(t)
11:     else
12:         Retrieve xk(τ)𝒞i(t)x_{k}(\tau^{\prime})\in\mathcal{C}_{i}(t)
13:         if τ>τ\tau>\tau^{\prime} then
14:              Replace xk(τ)x_{k}(\tau^{\prime}) with xk(τ)x_{k}(\tau) into 𝒞i(t)\mathcal{C}_{i}(t)
15:         end if
16:     end if
17:end for
18:Sort models in 𝒞i(t)\mathcal{C}_{i}(t) in descending order of τ\tau.
19:Keep the first 𝒞max\mathcal{C}_{max} models in 𝒞i(t)\mathcal{C}_{i}(t) then get 𝒞i(t+1)\mathcal{C}_{i}(t+1)

Output: 𝒞i(t+1)\mathcal{C}_{i}(t+1)

4.3 Optimization Method

We use SGD as the optimizer and set the initial learning rate η=0.1\eta=0.1 for all experiments, and use learning rate scheduler named ReduceLROnPlateau from PyTorch, to automatically adjust the learning rate for each training.

4.4 Mobile DFL Simulation

Manhattan Mobility Model maps are generated from the real road data of Manhattan from INRIX111© 2024 INRIX, Inc.,”INRIX documentation,” INRIX documentation https://docs.inrix.com/., which is shown in Fig.1. We follow the setting of Manhattan Mobility Model from bai2003important: a vehicle is allowed to move along the grid of horizontal and vertical streets on the map. At an intersection, the vehicle can turn left, right or go straight. This choice is probabilistic: the probability of moving on the same street is 0.5, and the probabilities of turn into one of the rest roads are equal. For example, at an intersection with 3 more roads to turn into, then the probability to turn into each road is 0.1667. Follow the setting from su2022boost: each vehicle is equipped with DSRC and mmWave communication components, can communicate to other vehicles within the communication range of 100 meters, while the default velocity of each vehicle is 13.89 m/s. In our experiment, we set up the number of local updates K=10K=10, and the interval between each epoch is 120120 seconds. Within each global epoch, vehicles train their models on local datasets while moving on the road and updating their cache by uploading and downloading models with other encountered vehicles.

Refer to caption
Figure 1: Manhattan Mobility Model Map. The dots represent the intersections while the edges between nodes represent road in Manhattan.

4.5 LRU Cache Update

Algorithm 2, which we name as LRU method for convenience, is the basic cache update method we proposed. Basically, the LRU updating rule aims to fetch the most-up-to-date models and keep as many of them in the cache as possible. At line 13, 18, the metric for making choice among models is the timestamp of models, which is defined as the epoch time when the model was received from the original vehicle, rather than received from the cache. What’s more, to fully utilize the caching mechanism, a vehicle not only fetches other vehicles’ own trained models, but also models in their caches. For instance, at epoch tt, vehicle ii can directly fetch model xj(t)x_{j}(t) from vehicle jj, at line 7, and also fetch models xk(τ)𝒞j(t)x_{k}(\tau)\in\mathcal{C}_{j}(t) from the cache of vehicle jj, at line 8. This way, each vehicle can not only fetch the models directly from its neighbours, but also indirectly from its neighbours’ neighbours, thus boosting the spreading of the underlying data information from different vehicles, and improving the DFL convergence speed, especially with heterogeneous data distribution. Additionally, at line 2, before updating cache, models with staleness tττmaxt-\tau\geq\tau_{max} will be removed from each vehicle’s cache.

Refer to caption
Figure 2: DFL with Caching vs. DFL without Caching.

5 Numerical Results

5.1 Caching vs. Non-caching

To evaluate the performance of DFedCache, we compare the DFL with LRU Cache, Centralized FL (CFL) and DFL on MNSIT, FashionMNIST, CIFAR-10 with three distributions: non-i.i.d, i.i.d, Dirichlet with π=0.5\pi=0.5. For LRU update, we take the cache size as 10 for MNIST and FashionMNIST, and 3 for CIFAR-10 and τmax=5\tau_{max}=5 based on practical considerations. Given a speed of 13.89m/s13.89m/s and and a communication distance of 100m100m, the communication window of two agents driving in opposite directions could be limited. Additionally, considering the size of chosen models and communication overhead, the above cache sizes were chosen. From the results in Fig. 2, we can see that DFedCache boosts the convergence and outperforms non-caching method (DFL) and gains performance much closer to CFL in all the cases, especially in non-i.i.d. scenarios.

5.2 Impact of Cache Size

Then we evaluate the performance gains with different cache sizes from 1 to 30 and τmax=10\tau_{max}=10, on MNIST and FashionMNIST in Fig. 3. LRU can benefit more from larger cache sizes, especially in non-i.i.d scenarios, as aggregation with more cached models gets closer to CFL and training with global data distribution.

Refer to caption
Figure 3: DFL with LRU at Different Cache Sizes.

5.3 Impact of Model Staleness

As mentioned in the previous sections, one drawback of model caching is introducing stale models into aggregation, so it is very important to choose a proper staleness tolerance τmax\tau_{max}. First, we statistically calculate the relation between the average number and the average age of cached models at different τmax\tau_{max} from 1 to 20, when epoch time is 30s, 60s, 120s, with unlimited cache size, in Table 2. We can see that, with the fixed epoch time, as the τmax\tau_{max} increases, the average number and average age of cached models increase, approximately linearly. It’s not hard to understand, as every epoch each agent can fetch a limited number of models directly from other agents. Increasing the staleness tolerance τmax\tau_{max} will increase the number of cached models, as well as the age of cached models. What’s more, we can see that the communication frequency or the moving speed of agents will also impact the average age of cached models, as the faster an agent moves, the more models it can fetch within each epoch, which we will further discuss later.

Table 2: Average number and average age of cached models with different τmax\tau_{max} and different epoch time: 30s, 60s, 120s. Columns for different τmax\tau_{max}, and rows for different epoch time. There are two sub-rows for each epoch time, with the first sub-row be the average number of cached models, the second sub-row be their average age.
τmax\tau_{max} 1 2 3 4 5 10 20
30s num 0.8549 1.6841 2.6805 3.8584 5.2054 14.8520 44.8272
age 0.0000 0.4904 1.0489 1.6385 2.2461 5.4381 11.5735
60s num 1.6562 3.8227 6.4792 10.3889 14.0749 43.0518 90.2576
age 0.000 0.5593 1.1604 1.8075 2.4396 5.5832 9.7954
120s num 3.6936 9.9149 18.5016 31.3576 35.1958 90.2469 98.5412
age 0.0000 0.6189 1.2715 1.9194 1.5081 4.7131 5.2357

And we also compare the performance of DFL with LRU at different τmax\tau_{max} on MNIST under non-i.i.d and i.i.d in Fig.4. Here we pick the epoch time 30s for a clear view. First, for non-i.i.d, the larger τmax\tau_{max} brings faster convergence at the beginning, which is consistent with our previous conclusion, as it allows more cached models which bring more benefits than harm to the training with non-i.i.d. However, for i.i.d scenario, larger τmax\tau_{max} stops improving the performance of models, as introducing more cached models hardly benefits training in i.i.d, and the model staleness will prevent the convergence. What’s more, if we zoom in the final converging phase, which is amplified in the bottom of each figure, we observe that larger τmax\tau_{max} will reduce the final converged accuracy in both non-i.i.d and i.i.d scenarios, as it introduces more staleness. Even with large τmax\tau_{max} and large staleness, LRU method still achieves similar or even better performance than DFL in non-i.i.d. scenarios.

Refer to caption
Figure 4: Impact of τmax\tau_{max} on Model Convergence.

5.4 Mobility’s Impact on Convergence

Vehicle mobility directly determines the frequency and efficiency of cache-based model spreading and aggregation. So we evaluate the performance of cached DFL at different vehicle speeds. We fix cache size at 10, τmax=10\tau_{max}=10 with non-i.i.d. data distribution. We take the our previous speed v=v0=13.89m/sv=v_{0}=13.89m/s and K=30K=30 local updates as the base, named as speedup x1. To speedup, vv increases while KK reduces to keep the fair comparison under the same wall clock. For instance, for speedup x3, v=3v0v=3v_{0} and K=10K=10. Results in Fig. 5 shows that when the mobility speed increases, although the number of local updates decreases, the spread of all models in the whole vehicle network is boosted, thus disseminating local models more quickly among all vehicles, leading to faster model convergence.

Refer to caption
Figure 5: Convergence at Different Mobility Speed.

5.5 Experiments with Grouped Mobility Patterns and Data Distributions

In practice, vehicle mobility patterns and local data distributions may naturally form groups. For example, a vehicle may mostly move within its home area, and collect data specific to that area. Vehicles within the same area meet with each others frequently, but have very similar data distributions. A fresh model of a vehicle in the same area is not as valuable as a slightly older model of a vehicle from another area. So model caching should not just consider model freshness, but should also take into account the coverage of group-based data distributions. For those scenarios, we develop a Group-based (GB) caching algorithm as Algorithm 3. More specifically, knowing that there are mm distribution groups, one should maintain balanced presences of models from different groups. One straightforward extension of the LRU caching algorithm is to partition the cache into mm sub-caches, one for each group. Each sub-cache is updated by local models from its associated group, using the LRU policy.

Algorithm 3 Group Based Cache Update (GB Update)

Input: Current cache 𝒞i(t)\mathcal{C}_{i}(t), Cache of device jj: 𝒞j(t)\mathcal{C}_{j}(t), Model xj(t)x_{j}(t) from device jj, Current time tt, Cache size 𝒞max\mathcal{C}_{max}, Tolerance of staleness τmax\tau_{max}, Device group mapping list 𝒜={𝒜1,𝒜2,,𝒜n}\mathcal{A}=\{\mathcal{A}_{1},\mathcal{A}_{2},...,\mathcal{A}_{n}\}, here 𝒜i{1,2,,m},i[N]\mathcal{A}_{i}\in\{1,2,...,m\},\forall i\in[N]. Group cache size list ={r1,r2,,rm}\mathcal{R}=\{r_{1},r_{2},...,r_{m}\}, here r𝒜ir_{\mathcal{A}_{i}} is number of cache slots reserved for models from devices belong to group 𝒜i\mathcal{A}_{i}.
Here the model x(i,t)x(i,t^{\prime}) is local updated at time tt^{\prime}, based on car ii’s dataset.

1:
2:function Group_prune_cache(𝒞i(t),,𝒜\mathcal{C}_{i}(t),\mathcal{R},\mathcal{A}):
3:     Create empty lists of group 1,2,,m,\mathcal{L}_{1},\mathcal{L}_{2},...,\mathcal{L}_{m},
4:     for xk(τ)𝒞i(t)x_{k}(\tau)\in\mathcal{C}_{i}(t) do
5:         Put xk(τ)x_{k}(\tau) into 𝒜k\mathcal{L}_{\mathcal{A}_{k}}
6:     end for
7:     Empty 𝒞i(t)\mathcal{C}_{i}(t)
8:     for k=1,2,,mk=1,2,...,m do
9:         Sort models in k\mathcal{L}_{k} by τ\tau descending.
10:         Take and put first rkr_{k} models in k\mathcal{L}_{k} into 𝒞i(t)\mathcal{C}_{i}(t)
11:     end for
12:     return 𝒞i(t+1)\mathcal{C}_{i}(t+1)
13:end function
1:Main Process:
2:for xk(τ)𝒞i(t)x_{k}(\tau)\in\mathcal{C}_{i}(t) or 𝒞j(t)\mathcal{C}_{j}(t) do
3:     if tττmaxt-\tau\geq\tau_{max} then
4:         Remove xk(τ)x_{k}(\tau) from related cache 𝒞i(t)\mathcal{C}_{i}(t) or 𝒞j(t)\mathcal{C}_{j}(t)
5:     end if
6:end for
7:Add or replace xj(t)x_{j}(t) into 𝒞i(t)\mathcal{C}_{i}(t)
8:for xk(τ)𝒞j(t)x_{k}(\tau)\in\mathcal{C}_{j}(t) do
9:     if xk𝒞i(t)x_{k}\notin\mathcal{C}_{i}(t) then
10:         Add xk(τ)x_{k}(\tau) into 𝒞i(t)\mathcal{C}_{i}(t)
11:     else
12:         Retrieve xk(τ)𝒞i(t)x_{k}(\tau^{\prime})\in\mathcal{C}_{i}(t)
13:         if τ>τ\tau>\tau^{\prime} then
14:              Replace xk(τ)x_{k}(\tau) with xk(τ)x_{k}(\tau^{\prime}) into 𝒞i(t)\mathcal{C}_{i}(t)
15:         end if
16:     end if
17:end for
18:𝒞i(t+1)\mathcal{C}_{i}(t+1)\leftarrow GROUP_PRUNE_CACHE(𝒞i(t),,𝒜\mathcal{C}_{i}(t),\mathcal{R},\mathcal{A})

Output: 𝒞i(t+1)\mathcal{C}_{i}(t+1)

We now conduct a case study for group-based cache update. As shown in Fig 1, the whole Manhattan road network is divided into 3 areas, downtown, mid-town and uptown. Each area has 30 area-restricted vehicles that randomly moves within that area, and 3 or 4 free vehicles that can move into any area. We set 4 different area-related data distributions: Non-overlap, 1-overlap, 2-overlap, 3-overlap. nn-overlap means the number of shared labels between areas is nn. We use the same non-i.i.d shards method in the previous section to allocate data points to the vehicles in the same area. On each vehicle, we evenly divide the cache for the three areas. We evaluate our proposed GB cache method on FashionMNIST. As shown in Fig. 6, while vanilla LRU converges faster at the very beginning, it cannot outperform DFL at last. However, the GB update method can solve the problem of LRU update and outperform DFL under different overlap settings.

Refer to caption
Figure 6: Group-based LRU Cache Update Performance under Different Data Distribution Overlaps.

5.6 Discussions

In general, the convergence rate of DFedCache outperforms non-caching method (DFL), especially for non-i.i.d. distributions. Larger cache size and smaller τmax\tau_{max} make DFedCache closer to the performance of CFL. Empirically, there is a trade-off between the age and number of cached models, it is critical to choose a proper τmax\tau_{max} to control the staleness. What’s more, the choice of τmax\tau_{max} should take into account the diversity data distributions on agents. In general, with non-i.i.d data distributions, the benefits of increasing the number of cached models can outweigh the damages caused by model staleness; while when data distributions are close to i.i.d, it is better to use a small number of fresh models than a large number of stale models. Similarly conclusions can also be drawn from the results of area-restricted vehicles. What’s more, in a system of moving agents, the mobility will also have big impact on the training, usually higher speed and communication frequency improve the model convergence and accuracy.

6 Related Work

Decentralized FL (DFL) has been increasingly applied in vehicular networks, leveraging existing frameworks like vehicle-to-vehicle (V2V) communication [yuan2024decentralized]. V2V FL facilitates knowledge sharing among vehicles and has been explored in various studies [samarakoon2019distributed, pokhrel2020decentralized, yu2020proactive, chen2021bdfl, barbieri2022decentralized, su2022boost]. samarakoon2019distributed studied optimized joint power and resource allocation for ultra-reliable low-latency communication (URLLC) using FL.

su2022boost introduced DFL with Diversified Data Sources to address data diversity issues in DFL, improving model accuracy and convergence speed in vehicular networks. None of the previous studies explored model caching on vehicles. Convergence of asynchronous federated optimization was studied in DBLP:journals/corr/abs-1903-03934. Their analysis focused on pairwise model aggregation between an agent and the parameter server, does not cover decentralized model aggregation with stale cached models in our proposed framework.

7 Conclusion & Future Work

In this paper, we developed a novel decentralized federated learning framework that leverages on model caching on mobile agents for fast and even model spreading. We theoretically analyzed the convergence of DFL with cached models. Through extensive case studies in a vehicle network, we demonstrated that our algorithm significantly outperforms DFL without model caching, especially for agents with non-i.i.d data distributions. We employed only simple model caching and aggregation algorithms in the current study. We will investigate more refined model caching and aggregation algorithms customized for different agent mobility patterns and non-i.i.d. data distributions.

References

  • Bai et al. [2003] Fan Bai, Narayanan Sadagopan, and Ahmed Helmy. Important: A framework to systematically analyze the impact of mobility on performance of routing protocols for adhoc networks. In IEEE INFOCOM 2003. Twenty-second Annual Joint Conference of the IEEE Computer and Communications Societies (IEEE Cat. No. 03CH37428), volume 2, pages 825–835. IEEE, 2003.
  • Barbieri et al. [2022] Luca Barbieri, Stefano Savazzi, Mattia Brambilla, and Monica Nicoli. Decentralized federated learning for extended sensing in 6g connected vehicles. Vehicular Communications, 33:100396, 2022.
  • Burleigh et al. [2003] Scott Burleigh, Adrian Hooke, Leigh Torgerson, Kevin Fall, Vint Cerf, Bob Durst, Keith Scott, and Howard Weiss. Delay-tolerant networking: an approach to interplanetary internet. IEEE Communications Magazine, 41(6):128–136, 2003.
  • Chen et al. [2021] Jin-Hua Chen, Min-Rong Chen, Guo-Qiang Zeng, and Jia-Si Weng. Bdfl: A byzantine-fault-tolerance decentralized federated learning method for autonomous vehicle. IEEE Transactions on Vehicular Technology, 70(9):8639–8652, 2021.
  • Deng [2012] Li Deng. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE signal processing magazine, 29(6):141–142, 2012.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Krizhevsky [2009] A Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, University of Tront, 2009.
  • Liu et al. [2022] Ji Liu, Jizhou Huang, Yang Zhou, Xuhong Li, Shilei Ji, Haoyi Xiong, and Dejing Dou. From distributed machine learning to federated learning: A survey. Knowledge and Information Systems, 64(4):885–917, 2022.
  • Martínez Beltrán et al. [2023] Enrique Tomás Martínez Beltrán, Mario Quiles Pérez, Pedro Miguel Sánchez Sánchez, Sergio López Bernal, Gérôme Bovet, Manuel Gil Pérez, Gregorio Martínez Pérez, and Alberto Huertas Celdrán. Decentralized federated learning: Fundamentals, state of the art, frameworks, trends, and challenges. IEEE Communications Surveys & Tutorials, 25(4):2983–3013, 2023. doi: 10.1109/COMST.2023.3315746.
  • McMahan et al. [2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
  • Niknam et al. [2020] Solmaz Niknam, Harpreet S Dhillon, and Jeffrey H Reed. Federated learning for wireless communications: Motivation, opportunities, and challenges. IEEE Communications Magazine, 58(6):46–51, 2020.
  • Paszke et al. [2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
  • Pokhrel and Choi [2020] Shiva Raj Pokhrel and Jinho Choi. A decentralized federated learning approach for connected autonomous vehicles. In 2020 IEEE Wireless Communications and Networking Conference Workshops (WCNCW), pages 1–6. IEEE, 2020.
  • Robbins and Monro [1951] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
  • Samarakoon et al. [2019] Sumudu Samarakoon, Mehdi Bennis, Walid Saad, and Mérouane Debbah. Distributed federated learning for ultra-reliable low-latency vehicular communications. IEEE Transactions on Communications, 68(2):1146–1159, 2019.
  • Shokri and Shmatikov [2015] Reza Shokri and Vitaly Shmatikov. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, pages 1310–1321, 2015.
  • Su et al. [2022] Dongyuan Su, Yipeng Zhou, and Laizhong Cui. Boost decentralized federated learning in vehicular networks by diversifying data sources. In 2022 IEEE 30th International Conference on Network Protocols (ICNP), pages 1–11. IEEE, 2022.
  • Sun et al. [2022] Tao Sun, Dongsheng Li, and Bao Wang. Decentralized federated averaging. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • Xiao et al. [2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017. URL https://arxiv.org/abs/1708.07747.
  • Xie et al. [2019] Cong Xie, Sanmi Koyejo, and Indranil Gupta. Asynchronous federated optimization. CoRR, abs/1903.03934, 2019. URL http://arxiv.org/abs/1903.03934.
  • Xiong et al. [2024] Guojun Xiong, Gang Yan, Shiqiang Wang, and Jian Li. Deprl: Achieving linear convergence speedup in personalized decentralized learning with shared representations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 16103–16111, 2024.
  • Yang et al. [2021] Haibo Yang, Minghong Fang, and Jia Liu. Achieving linear speedup with partial worker participation in non-iid federated learning. Proceedings of ICLR, 2021.
  • Yu et al. [2020] Zhengxin Yu, Jia Hu, Geyong Min, Han Xu, and Jed Mills. Proactive content caching for internet-of-vehicles based on peer-to-peer federated learning. In 2020 IEEE 26th International Conference on Parallel and Distributed Systems (ICPADS), pages 601–608. IEEE, 2020.
  • Yuan et al. [2024] Liangqi Yuan, Ziran Wang, Lichao Sun, S Yu Philip, and Christopher G Brinton. Decentralized federated learning: A survey and perspective. IEEE Internet of Things Journal, 2024.
  • Zhang et al. [2019] Michael Zhang, James Lucas, Jimmy Ba, and Geoffrey E Hinton. Lookahead optimizer: k steps forward, 1 step back. Advances in neural information processing systems, 32, 2019.

Appendix A Convergence Analysis

We now theoretically investigate the impact of caching, especially the staleness of cached models, on DFL model convergence. We introduce some definitions and assumptions.

Definition 4 (Smoothness).

A differentiable function ff is LL-smooth if for x,y,\forall x,y, f(y)f(x)f(x),yx+L2yx2f(y)-f(x)\leq\langle\nabla f(x),y-x\rangle+\frac{L}{2}||y-x||^{2}, where L>0L>0.

Definition 5 (Bounded Variance).

There exists a constant ς>0\varsigma>0 such that the global variability of the local gradient of the loss function is bounded Fj(x)F(x)2ς2,j[N],xd||\nabla F_{j}(x)-\nabla F(x)||^{2}\leq\varsigma^{2},\forall j\in[N],x\in\mathbb{R}^{d}.

Definition 6 (L-Lipschitz Continuous Gradient).

There exists a constant L>0L>0, such that F(x)F(x)Lxx,x,xd.||\nabla F(x)-\nabla F(x^{\prime})||\leq L^{\prime}||x-x^{\prime}||,\forall x,x^{\prime}\in\mathbb{R}^{d}.

Theorem 5.

Assume that FF is LL-smooth and convex, and each agent executes KK local updates before meeting and exchanging models, after that, then does model aggregation. We also assume bounded staleness τ<τmax\tau<\tau_{max}, as the kick-out threshold. Furthermore, we assume, xd,i[N]\forall x\in\mathbb{R}^{d},i\in[N], and z𝒟i,f(x;z)2V,gx(x;z)2V,xd\forall z\sim\mathcal{D}^{i},||\nabla f(x;z)||^{2}\leq V,||\nabla g_{x^{\prime}}(x;z)||^{2}\leq V,\forall x^{\prime}\in\mathbb{R}^{d}, and F\nabla F satisfies L-Lipschitz Continuous Gradient. For any small constant ϵ>0\epsilon>0, if we take ρ>0\rho>0, and satisfying (1+2ρ+ϵ)V+(ρ2ρ2)x(t,k1)x(t)20,x(t,k1),x(t)-(1+2\rho+\epsilon)V+(\rho^{2}-\frac{\rho}{2})||x(t,k-1)-x(t)||^{2}\geq 0,\forall x(t,k-1),x(t), after TT global epochs, Algorithm 1 converges to a critical point:

mint=0T1𝔼F(x(t))2τmaxϵηC1KT𝔼[F(x(0))F(xM(T)(T)]+𝒪(ητmaxK2ϵC1T)𝒪(τmaxϵηC1KT)+𝒪(ηρK2ϵC1).\displaystyle\begin{split}\min_{t=0}^{T-1}\mathbb{E}||\nabla F(x(t))||^{2}\leq&\frac{\tau_{max}}{\epsilon\eta C_{1}KT}\mathbb{E}[F(x(0))-F(x_{M(T)}(T)]+\mathcal{O}(\frac{\eta\tau_{max}K^{2}}{\epsilon C_{1}T})\\ \leq&\mathcal{O}(\frac{\tau_{max}}{\epsilon\eta C_{1}KT})+\mathcal{O}(\frac{\eta\rho K^{2}}{\epsilon C_{1}}).\end{split}

A.1 Proof

Similar to Theorem 1 in [DBLP:journals/corr/abs-1903-03934], we can get the boundary before and after KK times local updates on jj-th device, j[N]\forall j\in[N]:

𝔼[F(x~j(t))F(xj(t))]\displaystyle\mathbb{E}[F(\tilde{x}_{j}(t))-F(x_{j}(t))] =𝔼[F(xj(t,K))F(xj(t,0))]\displaystyle=\mathbb{E}[F(x_{j}(t,K))-F(x_{j}(t,0))]
ηϵk=0K1𝔼F(xj(t,k))2+η2𝒪(ρK3V).\displaystyle\leq-\eta\epsilon\sum_{k=0}^{K-1}\mathbb{E}||\nabla F(x_{j}(t,k))||^{2}+\eta^{2}\mathcal{O}(\rho K^{3}V). (5)

For any epoch tt, we find the index M(t)M(t) of the agent whose model is the “worst", i.e., M(t)=argmaxj[N]{F(xj(t))}M(t)=\arg\max_{j\in[N]}\{F(x_{j}(t))\}, then we can get,

𝔼[F\displaystyle\mathbb{E}[F (xi(t+1))F(x)]𝔼[F(xM(t+1)(t+1))F(x)]\displaystyle(x_{i}(t+1))-F(x_{*})]\leq\mathbb{E}[F(x_{M(t+1)}(t+1))-F(x^{*})]
\displaystyle\leq 𝔼[F(1|CM(t+1)(t)|j,τCM(t+1)(t)x~j(τ))]F(x)\displaystyle\mathbb{E}\Big{[}F\Big{(}\frac{1}{|C_{M(t+1)}(t)|}\sum_{j,\tau\in C_{M(t+1)}(t)}\tilde{x}_{j}(\tau)\Big{)}\Big{]}-F(x^{*})
\displaystyle\leq 𝔼[1|CM(t+1)(t)|j,τCM(t+1)(t)F(x~j(τ))]F(x)\displaystyle\mathbb{E}\Big{[}\frac{1}{|C_{M(t+1)}(t)|}\sum_{j,\tau\in C_{M(t+1)}(t)}F(\tilde{x}_{j}(\tau))\Big{]}-F(x^{*})
\displaystyle\leq 𝔼[1|CM(t+1)(t)|j,τCM(t+1)(t)F(xj(τ))]F(x)+η2𝒪(ρK3V)\displaystyle\mathbb{E}\Big{[}\frac{1}{|C_{M(t+1)}(t)|}\sum_{j,\tau\in C_{M(t+1)}(t)}F(x_{j}(\tau))\Big{]}-F(x^{*})+\eta^{2}\mathcal{O}(\rho K^{3}V)
ϵη|CM(t+1)(t)|j,τCM(t+1)(t)k=0K1𝔼F(xj(τ,k))2\displaystyle-\frac{\epsilon\eta}{|C_{M(t+1)}(t)|}\sum_{j,\tau\in C_{M(t+1)}(t)}\sum_{k=0}^{K-1}\mathbb{E}||\nabla F(x_{j}(\tau,k))||^{2}
\displaystyle\leq 𝔼[1|CM(t+1)(t)|τ=tτmax+1t(xj(τ)CM(t)(t)xj(τ)xM(τ)(τ)F(xj(τ))+xM(τ)(τ)CM(t)(t)F(xM(τ)(τ)))]\displaystyle\mathbb{E}\Big{[}\frac{1}{|C_{M(t+1)}(t)|}\sum_{\tau=t-\tau_{max}+1}^{t}\mathrlap{\Big{(}}\sum_{\begin{subarray}{c}x_{j}(\tau)\in C_{M(t)}(t)\\ x_{j}(\tau)\neq x_{M(\tau)}(\tau)\end{subarray}}\mathllap{F(}x_{j}(\tau))+\sum_{x_{M(\tau)}(\tau)\in C_{M(t)}(t)}\mathllap{F(}x_{M(\tau)}(\tau))\Big{)}\Big{]}
FF(x)+η2𝒪(ρK3V).\displaystyle-\nabla F-F(x^{*})+\eta^{2}\mathcal{O}(\rho K^{3}V). (6)

here F=ϵη|CM(t+1)(t)|j,τCM(t+1)(t)k=0K1𝔼F(xj(τ,k))2\nabla F=\frac{\epsilon\eta}{|C_{M(t+1)}(t)|}\sum_{j,\tau\in C_{M(t+1)}(t)}\sum_{k=0}^{K-1}\mathbb{E}||\nabla F(x_{j}(\tau,k))||^{2}.
We can easy know, for given τ,jM(τ)\tau,\forall j\neq M(\tau), we have F(xM(τ)(τ))>F(xj(τ))F(x_{M(\tau)}(\tau))>F(x_{j}(\tau)), then,

𝔼[xj(τ)CM(t)(t)xj(τ)xM(τ)(τ)F(xj(τ))]=(xj(τ)CM(t)(t)xj(τ)xM(τ)(τ)1)h(τ)F(xM(τ)(τ)).\mathbb{E}\Big{[}\sum_{\begin{subarray}{c}x_{j}(\tau)\in C_{M(t)}(t)\\ x_{j}(\tau)\neq x_{M(\tau)}(\tau)\end{subarray}}F(x_{j}(\tau))\Big{]}=\Big{(}\sum_{\begin{subarray}{c}x_{j}(\tau)\in C_{M(t)}(t)\\ x_{j}(\tau)\neq x_{M(\tau)}(\tau)\end{subarray}}\cdot 1\Big{)}h(\tau)F(x_{M(\tau)}(\tau)). (7)

Here 0<h(τ)<10<h(\tau)<1, so we rearrange the equation above, we can get:

𝔼[F\displaystyle\mathbb{E}[F (xM(t+1)(t+1))F(x)]\displaystyle(x_{M(t+1)}(t+1))-F(x^{*})]
\displaystyle\leq 𝔼[τ=ttτmax+1xj(τ)CM(t)(t)xj(τ)xM(τ)(τ)h(τ)+xM(τ)(τ)CM(t)(t)1|CM(t+1)(t)|F(xM(τ)(τ))]\displaystyle\mathbb{E}\Big{[}\sum_{\tau=t}^{t-\tau_{max}+1}\frac{\sum_{\begin{subarray}{c}x_{j}(\tau)\in C_{M(t)}(t)\\ x_{j}(\tau)\neq x_{M(\tau)}(\tau)\end{subarray}}\cdot h(\tau)+\sum_{x_{M(\tau)}(\tau)\in C_{M(t)}(t)}\cdot 1}{|C_{M(t+1)}(t)|}F(x_{M(\tau)}(\tau))\Big{]}
FF(x)+η2𝒪(ρK3V)\displaystyle-\nabla F-F(x^{*})+\eta^{2}\mathcal{O}(\rho K^{3}V)
=\displaystyle= 𝔼[τ=ttτmax+1xj(τ)CM(t)(t)xj(τ)xM(τ)(τ)h(τ)+xM(τ)(τ)CM(t)(t)1|CM(t+1)(t)|F(xM(τ)(τ))]\displaystyle\mathbb{E}\Big{[}\sum_{\tau=t}^{t-\tau_{max}+1}\frac{\sum_{\begin{subarray}{c}x_{j}(\tau)\in C_{M(t)}(t)\\ x_{j}(\tau)\neq x_{M(\tau)}(\tau)\end{subarray}}\cdot h(\tau)+\sum_{x_{M(\tau)}(\tau)\in C_{M(t)}(t)}\cdot 1}{|C_{M(t+1)}(t)|}F(x_{M(\tau)}(\tau))\Big{]}
FF(x)+η2𝒪(ρK3V)\displaystyle-\nabla F-F(x^{*})+\eta^{2}\mathcal{O}(\rho K^{3}V)
=\displaystyle= 𝔼[τ=ttτmax+1xj(τ)CM(t)(t)xj(τ)xM(τ)(τ)h(τ)+xM(τ)(τ)CM(t)(t)1τ=ttτmax+1(xj(τ)CM(t)(t)xj(τ)xM(τ)(τ)1+xM(τ)(τ)CM(t)(t)1)F(xM(τ)(τ))]\displaystyle\mathbb{E}\Big{[}\sum_{\tau=t}^{t-\tau_{max}+1}\frac{\sum_{\begin{subarray}{c}x_{j}(\tau)\in C_{M(t)}(t)\\ x_{j}(\tau)\neq x_{M(\tau)}(\tau)\end{subarray}}\cdot h(\tau)+\sum_{x_{M(\tau)}(\tau)\in C_{M(t)}(t)}\cdot 1}{\sum_{\tau=t}^{t-\tau_{max}+1}(\sum_{\begin{subarray}{c}x_{j}(\tau)\in C_{M(t)}(t)\\ x_{j}(\tau)\neq x_{M(\tau)}(\tau)\end{subarray}}\cdot 1+\sum_{x_{M(\tau)}(\tau)\in C_{M(t)}(t)}\cdot 1)}F(x_{M(\tau)}(\tau))\Big{]}
FF(x)+η2𝒪(ρK3V).\displaystyle-\nabla F-F(x^{*})+\eta^{2}\mathcal{O}(\rho K^{3}V). (8)

We define the “worst" model on all agents over the time period of [tτmax+1,t][t-\tau_{max}+1,t] as

𝒯(t,τmax)=argmaxt[tτmax+1,t]{F(xM(t)(t))}.\mathcal{T}(t,\tau_{max})=\arg\max_{t\in[t-\tau_{max}+1,t]}\{F(x_{M(t)}(t))\}.

Then we can get,

𝔼[F\displaystyle\mathbb{E}[F (xM(t+1)(t+1))F(x)]\displaystyle(x_{M(t+1)}(t+1))-F(x^{*})]
\displaystyle\leq 𝔼[τ=ttτmax+1xj(τ)CM(t)(t)xj(τ)xM(τ)(τ)h(τ)+xM(τ)(τ)CM(t)(t)1τ=ttτmax+1(xj(τ)CM(t)(t)xj(τ)xM(τ)(τ)1+xM(τ)(τ)CM(t)(t)1)\displaystyle\mathbb{E}\Big{[}\sum_{\tau=t}^{t-\tau_{max}+1}\frac{\sum_{\begin{subarray}{c}x_{j}(\tau)\in C_{M(t)}(t)\\ x_{j}(\tau)\neq x_{M(\tau)}(\tau)\end{subarray}}\cdot h(\tau)+\sum_{x_{M(\tau)}(\tau)\in C_{M(t)}(t)}\cdot 1}{\sum_{\tau=t}^{t-\tau_{max}+1}(\sum_{\begin{subarray}{c}x_{j}(\tau)\in C_{M(t)}(t)\\ x_{j}(\tau)\neq x_{M(\tau)}(\tau)\end{subarray}}\cdot 1+\sum_{x_{M(\tau)}(\tau)\in C_{M(t)}(t)}\cdot 1)}
F(x(𝒯(t,τmax))(𝒯(t,τmax)))]FF(x)+η2𝒪(ρK3V)\displaystyle\cdot F(x_{(\mathcal{T}(t,\tau_{max}))}(\mathcal{T}(t,\tau_{max})))\Big{]}-\nabla F-F(x^{*})+\eta^{2}\mathcal{O}(\rho K^{3}V)
=\displaystyle= 𝔼[(1τ=ttτmax+1xj(τ)CM(t)(t)xj(τ)xM(τ)(τ)(1h(τ))|CM(t+1)(t)|)F(x(𝒯(t,τmax))(𝒯(t,τmax)))]\displaystyle\mathbb{E}\Big{[}\Big{(}1-\sum_{\tau=t}^{t-\tau_{max}+1}\frac{\sum_{\begin{subarray}{c}x_{j}(\tau)\in C_{M(t)}(t)\\ x_{j}(\tau)\neq x_{M(\tau)}(\tau)\end{subarray}}\cdot(1-h(\tau))}{|C_{M(t+1)}(t)|}\Big{)}F(x_{(\mathcal{T}(t,\tau_{max}))}(\mathcal{T}(t,\tau_{max})))\Big{]}
FF(x)+η2𝒪(ρK3V)\displaystyle-\nabla F-F(x^{*})+\eta^{2}\mathcal{O}(\rho K^{3}V)
=\displaystyle= 𝔼[(1ξ(t,τmax))F(x(𝒯(t,τmax))(𝒯(t,τmax)))]FF(x)+η2𝒪(ρK3V)\displaystyle\mathbb{E}\Big{[}\Big{(}1-\xi(t,\tau_{max})\Big{)}F(x_{(\mathcal{T}(t,\tau_{max}))}(\mathcal{T}(t,\tau_{max})))\Big{]}-\nabla F-F(x^{*})+\eta^{2}\mathcal{O}(\rho K^{3}V)
\displaystyle\leq 𝔼[F(x(𝒯(t,τmax))(𝒯(t,τmax)))]FF(x)+η2𝒪(ρK3V).\displaystyle\mathbb{E}\Big{[}F(x_{(\mathcal{T}(t,\tau_{max}))}(\mathcal{T}(t,\tau_{max})))\Big{]}-\nabla F-F(x^{*})+\eta^{2}\mathcal{O}(\rho K^{3}V). (9)

Here, ξ(t,τmax)=τ=ttτmax+1xj(τ)CM(t)(t)xj(τ)xM(τ)(τ)(1h(τ))|CM(t+1)(t)|[0,1)\xi(t,\tau_{max})=\sum_{\tau=t}^{t-\tau_{max}+1}\frac{\sum_{\begin{subarray}{c}x_{j}(\tau)\in C_{M(t)}(t)\\ x_{j}(\tau)\neq x_{M(\tau)}(\tau)\end{subarray}}\cdot(1-h(\tau))}{|C_{M(t+1)}(t)|}\in[0,1), so we have,

𝔼[F(xM(t+1)(t+1))F(x(𝒯(t,τmax))(𝒯(t,τmax)))]F+η2𝒪(ρK3V).\mathbb{E}[F(x_{M(t+1)}(t+1))-F(x_{(\mathcal{T}(t,\tau_{max}))}(\mathcal{T}(t,\tau_{max})))]\leq-\nabla F+\eta^{2}\mathcal{O}(\rho K^{3}V). (10)

Then,

𝔼[F\displaystyle\mathbb{E}[F (xM(t+1)(t+1))F(x(𝒯(t,τmax))(𝒯(t,τmax)))]\displaystyle(x_{M(t+1)}(t+1))-F(x_{(\mathcal{T}(t,\tau_{max}))}(\mathcal{T}(t,\tau_{max})))]
F+η2𝒪(ρK3V)\displaystyle\leq-\nabla F+\eta^{2}\mathcal{O}(\rho K^{3}V)
=ϵη|CM(t+1)(t)|j,τCM(t+1)(t)k=0K1F(xj(τ,k))2+η2𝒪(ρK3V).\displaystyle=-\frac{\epsilon\eta}{|C_{M(t+1)}(t)|}\sum_{j,\tau\in C_{M(t+1)}(t)}\sum_{k=0}^{K-1}||\nabla F(x_{j}(\tau,k))||^{2}+\eta^{2}\mathcal{O}(\rho K^{3}V). (11)

By taking use of the property of L-Lipschitz Continuous Gradient, we can get,

F(xj(τ,k))F(xj(τ,k1))\displaystyle||\nabla F(x_{j}(\tau,k))-\nabla F(x_{j}(\tau,k-1))|| Lxj(τ,k)xj(τ,k1)\displaystyle\leq L^{\prime}||x_{j}(\tau,k)-x_{j}(\tau,k-1)||
ηLgxj(τ)(xj(τ,k1))\displaystyle\leq\eta L^{\prime}||\nabla g_{x_{j}(\tau)}(x_{j}(\tau,k-1))||
ηLV.\displaystyle\leq\eta L^{\prime}\sqrt{V}. (12)

Then, we can get, k[1,K]\forall k\in[1,K], F(xj(τ,k))2F(xj(τ,k1))2+η2L2V||\nabla F(x_{j}(\tau,k))||^{2}\geq||\nabla F(x_{j}(\tau,k-1))||^{2}+\eta^{2}L^{\prime 2}V, then

F(xj(τ,k))2F(xj(τ))2+η2L2kV.||\nabla F(x_{j}(\tau,k))||^{2}\geq||\nabla F(x_{j}(\tau))||^{2}+\eta^{2}L^{\prime 2}kV. (13)

So we have,

j,τCM(t+1)(t)k=0K1\displaystyle\sum_{j,\tau\in C_{M(t+1)}(t)}\sum_{k=0}^{K-1} F(xj(τ,k))2\displaystyle||\nabla F(x_{j}(\tau,k))||^{2}
j,τCM(t+1)(t)KF(xj(τ))2+η2L2K(K1)V\displaystyle\geq\sum_{j,\tau\in C_{M(t+1)}(t)}K||\nabla F(x_{j}(\tau))||^{2}+\eta^{2}L^{\prime 2}K(K-1)V
j,τCM(t+1)(t)KF(xj(τ))2+η2𝒪(K2V).\displaystyle\geq\sum_{j,\tau\in C_{M(t+1)}(t)}K||\nabla F(x_{j}(\tau))||^{2}+\eta^{2}\mathcal{O}(K^{2}V). (14)

We assume the ratio of gradient expectation on the global distribution to gradient expectation on local data distribution of any device is bounded, i.e.,

𝔼[F(xj(t))2]C1𝔼[F(x(t))2],\displaystyle\mathbb{E}[||\nabla F(x_{j}(t))||^{2}]\geq C_{1}\mathbb{E}[||\nabla F(x(t))||^{2}], (15)

So we get,

j,τCM(t+1)(t)\displaystyle\sum_{j,\tau\in C_{M(t+1)}(t)} k=0K1𝔼[F(xj(τ,k))2]\displaystyle\sum_{k=0}^{K-1}\mathbb{E}[||\nabla F(x_{j}(\tau,k))||^{2}]
\displaystyle\geq C1K|CM(t+1)(t)|minτ=tτmax+1t𝔼[F(x(τ))2]\displaystyle C_{1}K|C_{M(t+1)}(t)|\min_{\tau=t-\tau_{max}+1}^{t}\mathbb{E}[||\nabla F(x(\tau))||^{2}]
+η2|CM(t+1)(t)|𝒪(K2V),\displaystyle+\eta^{2}|C_{M(t+1)}(t)|\mathcal{O}(K^{2}V), (16)

Inequality (A.1) becomes:

𝔼\displaystyle\mathbb{E} [F(xM(t+1)(t+1))F(x(𝒯(t,τmax))(𝒯(t,τmax)))]\displaystyle[F(x_{M(t+1)}(t+1))-F(x_{(\mathcal{T}(t,\tau_{max}))}(\mathcal{T}(t,\tau_{max})))]
ϵηC1Kminτ=tτmax+1tF(x(τ))2+η2𝒪(ρK3V).\displaystyle\leq-\epsilon\eta C_{1}K\min_{\tau=t-\tau_{max}+1}^{t}||\nabla F(x(\tau))||^{2}+\eta^{2}\mathcal{O}(\rho K^{3}V). (17)

By rearranging the terms, we have,

minτ=tτmax+1tF(x(τ))2\displaystyle\min_{\tau=t-\tau_{max}+1}^{t}||\nabla F(x(\tau))||^{2}\leq (1ϵηC1K)𝔼[F(x(𝒯(t,τmax))(𝒯(t,τmax)))\displaystyle(\frac{1}{\epsilon\eta C_{1}K})\mathbb{E}[F(x_{(\mathcal{T}(t,\tau_{max}))}(\mathcal{T}(t,\tau_{max})))
F(xM(t+1)(t+1))]+𝒪(ηρK2VC1ϵ).\displaystyle-F(x_{M(t+1)}(t+1))]+\mathcal{O}(\frac{\eta\rho K^{2}V}{C_{1}\epsilon}). (18)

We iteratively construct a time sequence {T0,T1,T2,,TNT}{0,1,,T1}\{T^{\prime}_{0},T^{\prime}_{1},T^{\prime}_{2},...,T^{\prime}_{N_{T}}\}\subseteq\{0,1,...,T-1\} in the backward fashion so that

TNT\displaystyle T^{\prime}_{N_{T}} =T1;\displaystyle=T-1;
Ti\displaystyle T^{\prime}_{i} =𝒯(Ti+1,τmax)1,1iNT1;\displaystyle=\mathcal{T}(T^{\prime}_{i+1},\tau_{max})-1,\quad 1\leq i\leq N_{T}-1;
T0\displaystyle T^{\prime}_{0} =0.\displaystyle=0.

From the equation above, we can also get the inequality about NTN_{T},

TτmaxNTT.\frac{T}{\tau_{max}}\leq N_{T}\leq T. (19)

Applying inequality (A.1) at all time instances {T0,T1,T2,,TNT}\{T^{\prime}_{0},T^{\prime}_{1},T^{\prime}_{2},...,T^{\prime}_{N_{T}}\}, after T global epochs, we have,

mint=0T1𝔼[\displaystyle\min_{t=0}^{T-1}\mathbb{E}[ ||F(x(t))||2]1NTt=T0TNminτ=tτmax+1t||F(x(τ))||2\displaystyle||\nabla F(x(t))||^{2}]\leq\frac{1}{N_{T}}\sum_{t=T^{\prime}_{0}}^{T^{\prime}_{N}}\min_{\tau=t-\tau_{max}+1}^{t}||\nabla F(x(\tau))||^{2}
\displaystyle\leq 1ϵηC1KNTt=T0TN𝔼[F(xMT(t,τmax)(T(t,τmax)))F(xM(t+1)(t+1))]+𝒪(ηρK2VϵC1)\displaystyle\frac{1}{\epsilon\eta C_{1}KN_{T}}\sum_{t=T^{\prime}_{0}}^{T^{\prime}_{N}}\mathbb{E}[F(x_{M_{T(t,\tau_{max})}}(T(t,\tau_{max})))-F(x_{M(t+1)}(t+1))]+\mathcal{O}(\frac{\eta\rho K^{2}V}{\epsilon C_{1}})
\displaystyle\leq 1ϵηC1KNT𝔼[F(xM(T0)(T0))F(xM(TNT+1)(TNT+1))]+𝒪(ηρK2VϵC1)\displaystyle\frac{1}{\epsilon\eta C_{1}KN_{T}}\mathbb{E}[F(x_{M(T^{\prime}_{0})}(T^{\prime}_{0}))-F(x_{M(T^{\prime}_{N_{T}}+1)}(T^{\prime}_{N_{T}}+1))]+\mathcal{O}(\frac{\eta\rho K^{2}V}{\epsilon C_{1}})
\displaystyle\leq 1ϵηC1KNT𝔼[F(xM(0)(0))F(xM(T)(T)]+𝒪(ηρK2VϵC1)\displaystyle\frac{1}{\epsilon\eta C_{1}KN_{T}}\mathbb{E}[F(x_{M(0)}(0))-F(x_{M(T)}(T)]+\mathcal{O}(\frac{\eta\rho K^{2}V}{\epsilon C_{1}})
\displaystyle\leq 1ϵηC1KNT𝔼[F(x(0))F(xM(T)(T)]+𝒪(ηρK2VϵC1)\displaystyle\frac{1}{\epsilon\eta C_{1}KN_{T}}\mathbb{E}[F(x(0))-F(x_{M(T)}(T)]+\mathcal{O}(\frac{\eta\rho K^{2}V}{\epsilon C_{1}})
\displaystyle\leq τmaxϵηC1KT𝔼[F(x(0))F(xM(T)(T)]+𝒪(ηρK2VϵC1)\displaystyle\frac{\tau_{max}}{\epsilon\eta C_{1}KT}\mathbb{E}[F(x(0))-F(x_{M(T)}(T)]+\mathcal{O}(\frac{\eta\rho K^{2}V}{\epsilon C_{1}})
\displaystyle\leq τmaxϵηC1KT𝔼[F(x(0))F(x(T)]+𝒪(ηρK2VϵC1)\displaystyle\frac{\tau_{max}}{\epsilon\eta C_{1}KT}\mathbb{E}[F(x(0))-F(x(T)]+\mathcal{O}(\frac{\eta\rho K^{2}V}{\epsilon C_{1}})
\displaystyle\leq 𝒪(τmaxϵηC1KT)+𝒪(ηρK2ϵC1).\displaystyle\mathcal{O}(\frac{\tau_{max}}{\epsilon\eta C_{1}KT})+\mathcal{O}(\frac{\eta\rho K^{2}}{\epsilon C_{1}}). (20)

With the results in (A.1) and by leveraging Theorem 1 in yang2021achieving, Algorithm 1 converges to a critical point after TT global epochs.

Appendix B Experimental details

B.1 Data distribution in Group-based LRU Cache Experiment

B.1.1 nn-overlap

For MNIST, FashionMNIST with 10 classes labels, we allocate labels into 3 groups as:
Non-overlap:
Area 1: (0,1,2,3), Area 2: (4,5,6), Area 3: (7,8,9)
1-overlap:
Area 1: (9,0,1,2,3), Area 2: (3,4,5,6), Area 3: (6,7,8,9)
2-overlap:
Area 1: (8,9,0,1,2,3), Area 2: (2,3,4,5,6), Area 3: (5,6,7,8,9)
3-overlap:
Area 1: (7,8,9,0,1,2,3), Area 2: (1,2,3,4,5,6), Area 3: (4,5,6,7,8,9)

B.2 Computing infrastructure

For MNIST and FashionMNIST experiments, we use 10 computing nodes, each with 10 CPUs, model: Lenovo SR670, to simulate DFL on 100 vehicles. CIFAR-10 results are obtained from 1 compute node with 5 CPUs and 1 A100 NVIDIA GPU, model: SD650-N V2. They are all using Ubuntu-20.04.1 system with pre-installed openmpi/intel/4.0.5 module, and 100 Gb/s network speed.

B.3 Results with variation

As our metric is the average test accuracy over 100 devices, we also provide the variation of the average test accuracy in our experimental results as following,

Refer to caption
Figure 7: DFL with Caching vs. DFL without Caching with Variation.
Refer to caption
Figure 8: DFL with LRU at Different Cache Sizes with Variation.
Refer to caption
Figure 9: Impact of τmax\tau_{max} on Model Convergence with Variation.
Table 3: Average number and average age of cached models with different τmax\tau_{max} and different epoch time: 30s, 60s, 120s. Columns for different τmax\tau_{max}, and rows for different epoch time. There are four sub-rows for each epoch time, with the first two sub-rows be the average number of cached models and variation, the second two sub-rows be their average age and variation.
τmax\tau_{max} 1 2 3 4 5 10 20
30s num 0.8549 1.6841 2.6805 3.8584 5.2054 14.8520 44.8272
varnvar_{n} 0.0321 0.0903 0.1941 0.3902 0.6986 6.2270 42.8144
age 0.0000 0.4904 1.0489 1.6385 2.2461 5.4381 11.5735
varavar_{a} 0.0000 0.0052 0.0116 0.0203 0.0320 0.1577 1.0520
60s num 1.6562 3.8227 6.4792 10.3889 14.0749 43.0518 90.2576
varnvar_{n} 0.0865 0.3594 0.9925 2.824 5.1801 32.8912 55.5829
age 0.000 0.5593 1.1604 1.8075 2.4396 5.5832 9.7954
varavar_{a} 0.0000 0.0035 0.0089 0.0188 0.0251 0.1448 0.8375
120s num 3.6936 9.9149 18.5016 31.3576 35.1958 90.2469 98.5412
varnvar_{n} 0.3166 2.3825 5.9696 21.7408 56.8980 28.8958 29.9676
age 0.0000 0.6189 1.2715 1.9194 1.5081 4.7131 5.2357
varavar_{a} 0.0000 0.0028 0.0058 0.0212 0.9742 0.1249 0.1808
Refer to caption
Figure 10: Relation between average number and average age of cached models with different τmax\tau_{max} (number close to the dot), with unlimited cache size.
Refer to caption
Figure 11: relation between average number and average age of cached models with variation of average age.
Refer to caption
Figure 12: relation between average number and average age of cached models with variation of average number.
Refer to caption
Figure 13: Convergence at Different Mobility Speed with Variation.
Refer to caption
Figure 14: Group-based LRU Cache Update Performance under Different Data Distribution Overlaps with Variation.
Refer to caption
Figure 15: Group-based LRU Cache Update Performance under Different Data Distribution Overlaps with Variation.

B.4 Final (hyper-)parameters in experiments

learning rate η=0.1\eta=0.1, batch size = 64. For all the experiments, we train for 1,000 global epochs, and implement early stop when the average test accuracy stops increasing for at least 20 epochs. Without specific explanation, we take the following parameters as the final default parameters in our experiments. Local epoch K=10K=10, vehicle speed v=13.59m/sv=13.59m/s, communication distance 100m100m, communication interval between each epoch 120s120s, τmax=10\tau_{max}=10, number of devices N=100N=100, cache size 1010.

B.5 NN models

We choose the models from the classical repository 222 Federated-Learning-PyTorch , Implementation of Communication-Efficient Learning of Deep Networks from Decentralized Data © 2024 GitHub, Inc.https://github.com/AshwinRJ/Federated-Learning-PyTorch.: we run convolutional neural network (CNN) as shown in Table 4, for MNIST, convolutional neural network (CNN) as shown in Table 5, for FashionMNIST. Inspired by [zhang2019lookahead], we choose ResNet-18 for Cifar-10, as shown in Table 6.

Table 4: CNN Architecture for MNIST
Layer Type Size
Convolution + ReLU 5×5×105\times 5\times 10
Max Pooling 2×22\times 2
Convolution + ReLU + Dropout 5×5×205\times 5\times 20
Dropout(2D) p=0.5p=0.5
Max Pooling 2×22\times 2
Fully Connected + ReLU 320×50320\times 50
Dropout p=0.5p=0.5
Fully Connected 50×1050\times 10
Table 5: CNN Architecture for FashionMNIST
Layer Type Size
Convolution + BatchNorm + ReLU 5×5×165\times 5\times 16
Max Pooling 2×22\times 2
Convolution + BatchNorm + ReLU 5×5×325\times 5\times 32
Max Pooling 2×22\times 2
Fully Connected 7×7×32×107\times 7\times 32\times 10
Table 6: ResNet-18 Architecture
Layer Type Output Size
Convolution + BatchNorm + ReLU 32×32×6432\times 32\times 64
Residual Block 1 (2x BasicBlock) 32×32×6432\times 32\times 64
Residual Block 2 (2x BasicBlock) 16×16×12816\times 16\times 128
Residual Block 3 (2x BasicBlock) 8×8×2568\times 8\times 256
Residual Block 4 (2x BasicBlock) 4×4×5124\times 4\times 512
Average Pooling 1×1×5121\times 1\times 512
Fully Connected 1×1×101\times 1\times 10