Decentralized Federated Learning with Model Caching on Mobile Agents

Xiaoyu Wang¹, Guojun Xiong², Houwei Cao³, Jian Li², Yong Liu¹
¹New York University, ²Stony Brook University, ³New York Institute of Technology
{wang.xiaoyu, yongliu}@nyu.edu, {guojun.xiong, jian.li.3}@stonybrook.edu,
[email protected]

Abstract

Federated Learning (FL) aims to train a shared model using data and computation power on distributed agents coordinated by a central server. Decentralized FL (DFL) utilizes local model exchange and aggregation between agents to reduce the communication and computation overheads on the central server. However, when agents are mobile, the communication opportunity between agents can be sporadic, largely hindering the convergence and accuracy of DFL. In this paper, we study delay-tolerant model spreading and aggregation enabled by model caching on mobile agents. Each agent stores not only its own model, but also models of agents encountered in the recent past. When two agents meet, they exchange their own models as well as the cached models. Local model aggregation works on all models in the cache. We theoretically analyze the convergence of DFL with cached models, explicitly taking into account the model staleness introduced by caching. We design and compare different model caching algorithms for different DFL and mobility scenarios. We conduct detailed case studies in a vehicular network to systematically investigate the interplay between agent mobility, cache staleness, and model convergence. In our experiments, cached DFL converges quickly, and significantly outperforms DFL without caching.

1 Introduction

1.1 Federated Learning on Mobile Agents

Federated learning (FL) is a type of distributed machine learning (ML) that prioritizes data privacy [mcmahan2017communication]. The traditional FL involves a central server that connects with a large number of agents. The agents retain their data and do not share them with the server. During each communication round, the server sends the current global model to the agents, and a small subset of agents are chosen to update the global model by running stochastic gradient descent (SGD) [robbins1951stochastic] for multiple iterations on their local data. The central server then aggregates the updated parameters to obtain the new global model. FL is a natural fit for the emerging Internet-of-Things (IoT) systems, where each IoT device not only can sense its surrounding environment to collect local data, but also is equipped with computation resources for local model training, as well as communication interfaces to interact with a central server for model aggregation. Many IoT devices are mobile, ranging from mobile phones/tablets, autonomous cars/drones, to self-navigating robots, etc. In the recent research efforts on smart connected vehicles, there has been a focus on integrating vehicle-to-everything (V2X) networks with Machine Learning (ML) tools and distributed decision making [barbieri2022decentralized], particularly in the area of computer vision tasks such as traffic light and signal recognition, road condition sensing, intelligent obstacle avoidance, and intelligent road routing, etc. With FL, vehicles locally train deep ML models and upload only the model parameters (i.e., neural network weights and biases) to the central server. This approach not only reduces bandwidth consumption, as the size of model parameters is much smaller than the size of raw image/video data, but also leverages computing power on vehicles, and protects user privacy.

However, FL on mobile agents still faces communication and computation challenges. The movements of mobile agents, especially at high speed, lead to fast-changing channel conditions on the wireless connections between mobile agents and the central FL server, resulting in long latency in FL [niknam2020federated]. Battery-powered mobile agents also have limited power budget for long-range wireless communications. Meanwhile, the non-i.i.d data distributions on mobile agents make it difficult for local models to converge. As a result, FL on mobile agents to obtain an optimal global model remains an open challenge. Decentralized federated learning (DFL) has emerged as a potential solution where local model aggregations are conducted between neighboring mobile agents using local device-to-device (D2D) communications with high bandwidth, low latency and low power consumption [10251949] . Preliminary studies have demonstrated that decentralized FL algorithms have the potential to significantly reduce the high communication costs associated with centralized FL. However, blindly applying model aggregation algorithms, such as FedSDG [shokri2015privacy] and FedAvg [mcmahan2017communication], developed for centralized FL to distributed FL cannot achieve fast convergence and high model accuracy [liu2022distributed].

1.2 Delay-Tolerant Model Communication and Aggregation Through Caching

D2D model communication between a pair of mobile agents is possible only if they are within each other’s transmission ranges. If mobile agents only meet with each others sporadically, there will not be enough model aggregation opportunity for fast convergence. In addition, with non-i.i.d data distributions on agents, if an agent only meets with agents from a small cluster, there is no way for the agent to interact with models trained by data samples outside of its cluster, leading to disaggregated local models that cannot perform well on the global data distribution. It is therefore essential to achieve fast and even model spreading using limited D2D communication opportunities among mobile agents. A similar problem was studied in the context of Mobile Ad hoc Network (MANET), where wireless communication between mobile nodes are sporadic. The efficiency of data dissemination in MANET can be significantly improved by Delay-Tolerant Networking (DTN) [burleigh2003delay]: a mobile node caches data it received from nodes it met in the past; when meeting with a new node, it not only transfers its own data, but also the cached data of other nodes. Essentially, node mobility forms a new “communication" channel through which cached data are transported through node movement in physical space. It is worth noting that, due to multi-hop caching-and-relay, DTN transmission incurs longer delay than D2D direct transmission. Data staleness can be controlled by caching and relay algorithms to match the target application’s delay tolerance.

Table 1: Notations and Terminologies.

Notation	Description
$N$	Number of agents
$T$	Number of global epochs
$[N]$	Set of integers $\{1,...,N\}$
$K$	Number of local updates
$x_{i}(t)$	Model in the $t^{th}$ epoch on agent $i$
$x(t)$	Global Model in the $t^{th}$ epoch, $x(t)=\mathbb{E}_{i\in[N]}[x_{i}(t)]$
$x_{i}(t,k)$	Model initialized from $x_{t}$ , after $k$ -th local update on agent $i$
$\tilde{x}_{i}(t)$	Model $x_{i}(t)$ after local updates
$\mathcal{D}^{i}$	Dataset on the $i$ -th agent
$\alpha$	Aggregation weight
$t-\tau$	Staleness
$\tau_{max}$	Tolerance of staleness in cache
$\|\|\cdot\|\|$	All the norms in the paper are $l_{2}$ -norms

Motivated by DTN, we propose delay-tolerant DFL communication and aggregation enabled by model caching on mobile agents. To realize DTN-like model spreading, each mobile agent stores not only its own local model, but also local models received from other agents in the recent history. Whenever it meets another agent, it transfers its own model as well as the cached models to the agent through high-speed D2D communication. Local model aggregation on an agent works on all its cached models, mimicking a local parameter server. Compared with DFL without caching, DTN-like model spreading can push local models faster and more evenly to the whole network; aggregating all cached models can facilitate more balanced learning than pairwise model aggregation. While DFL model caching sounds promising, it also faces a new challenge of model staleness: a cached model from an agent is not the current model on that agent, with the staleness determined by the mobility patterns, as well as the model spreading and caching algorithms. Using stale models in model aggregation may slow down or even deviate model convergence.

The key challenge we want to address in this paper is how to design cached model spreading and aggregation algorithms to achieve fast convergence and high accuracy in DFL on mobile agents. Towards this goal, we make the following contributions:

1.

We develop a new decentralized FL framework that utilizes model caching on mobile agents to realize delay-tolerant model communication and aggregation;
2.

We theoretically analyze the convergence of aggregation with cached models, explicitly taking into account the model staleness;
3.

We design and compare different model caching algorithms for different DFL and mobility scenarios.
4.

We conduct a detailed case study on vehicular network to systematically investigate the interplay between agent mobility, cache staleness, and convergence of model aggregation. Our experimental results demonstrate that our cached DFL converges quickly and significantly outperforms DFL without caching.

2 Background

2.1 Global Training Objective

Similar to the standard federated learning problem, the overall objective of mobile DFL is to learn a single global statistical model from data stored on tens to potentially millions of mobile agents. The overall goal is to find the optimal model weights $x^{*}\in\mathbb{R}^{d}$ to minimize the global loss function:

\min_{x}F(x),\,\text{where}\,F(x)=\frac{1}{N}\sum_{i\in[N]}\mathbb{E}_{z^{i}\sim\mathcal{D}^{i}}{f(x;z^{i})},

(1)

where $N$ denotes the total number of mobile agents, and each agent has its own local dataset, i.e., $\mathcal{D}^{i}\neq\mathcal{D}^{j},\forall i\neq j$ . And $z^{i}$ is sampled from the local data $\mathcal{D}^{i}$ .

2.2 DFL Training with Local Model Caching

All agents participate in DFL training over $T$ global epochs. At the beginning of the $t^{th}$ epoch, agent $i$ ’s local model is $x_{i}(t)$ . After $K$ steps of SGD to solve the following optimization problem with a regularized loss function:

\min_{x}\mathbb{E}_{z^{i}\sim D^{i}}f(x;z^{i})+\frac{\rho}{2}||x-x_{i}(t)||^{2},

agent $i$ obtains an updated local model $\tilde{x}_{i}(t)$ . Meanwhile, during the $t^{th}$ epoch, driven by their mobility patterns, each agent meets and exchanges models with other agents. Other than its own model, agent $i$ also stores models it received from other agents encountered in the recent history in its local cache $\mathcal{C}_{i}(t)$ . When two agents meet, they not only exchange their own local models, but also share their cached models with each others to maximize the efficiency of DTN-like model spreading. The models received by agent $i$ will be used to update its model cache $\mathcal{C}_{i}(t)$ , following different cache update algorithms, such as LRU update method (Algorithm 2) or Group-based LRU update method, which will be described in details later. As the cache size of each agent is limited, it is important to design an efficient cache update rule in order to maximize the caching benefit.

After cache updating, each agent conducts local model aggregation using all the cached models with customized aggregation weights $\{\alpha_{j}\in(0,1)\}$ to get the updated local model $x_{i}(t+1)$ for epoch $t+1$ . In our simulation, we take the aggregation weight as $\alpha_{j}=(n_{j}/\sum_{j\in\mathcal{C}_{i}(t)}n_{j})$ , where $n_{j}$ is the number of samples on agent $j$ .

The whole process repeats until the end of $T$ global epochs. The detailed algorithm is shown in Algorithm 1. $z^{i}_{k}$ are randomly drawn local data samples on agent $i$ for the $k$ -th local update, and $\eta$ is the learning rate.

Algorithm 1 Cached Decentralized Federated Learning (DFedCache)

Input: T Global epochs, K local updates and $\{x_{i}(0)\}^{N}_{i=1}$ , Tolerance of staleness $\tau_{max}$

2:function Local Update(

x_{i}(t)

x_{i}(t,0)=x_{i}(t)

4: Define

g_{x(t)}(x;z)=f(x;z)+\frac{\rho}{2}||x-x(t)||^{2}

5: for

k=1,2,\ldots,K

6: Randomly sample

z^{i}_{k}\sim\mathcal{D}^{i}

x_{i}(t,k)=x_{i}(t,k-1)-\eta\nabla g_{x_{i}(t)}(x_{i}(t,k-1);z^{i}_{k})

8: end for

9: return

\tilde{x}_{i}(t)=x_{i}(t,K)

10:end function

2:function Model Aggregation(

\mathcal{C}_{i}(t)

x_{i}(t+1)=\sum_{j\in\mathcal{C}_{i}(t)}\alpha_{j}\tilde{x}_{j}(\tau)

4: return

x_{i}(t+1)

5:end function

1: Main Process:

2:for

t=0,1,\ldots T-1

3: for

i=0,1,\ldots,N-1

\tilde{x}_{i}(t)\leftarrow

Local Update(

x_{i}(t)

)

\mathcal{C}_{i}(t)\leftarrow

Cache Update(

\mathcal{C}_{i}(t-1),\tau_{max}

)

x_{i}(t+1)\leftarrow

Model Aggregation(

\mathcal{C}_{i}(t)

)

7: end for

8:end for

Output: $\{x_{i}(T)\}^{N}_{i=1}$

Remark 1.

Note the $N$ agents communicate with each others in a mobile D2D network. D2D communication can only happen between an agent and its neighbors within a short range (e.g. several hundred meters). Since agent locations are constantly changing (for instance, vehicles continuously move along the road network of a city), D2D network topology is dynamic and can be sparse at any given epoch. To ensure the eventual model convergence, the union graph of D2D networks over multiple epochs should be strongly-connected for efficient DTN-like model spreading. We also assume D2D communications are non-blocking, carried by short-distance high-throughput communication methods such as mmWave or WiGig, which has enough capacity to complete the exchange of cached models before the agents go out of each other’s communication ranges.

Remark 2.

Intuitively, comparing to the DFL without cache (e.g. DeFedAvg [sun2022decentralized]), where each agent can only get new model by averaging with another model, cached DFL uses more models (delayed version) for aggregation, thus bring more underlying information from dataset on more agents. Although DFedCache introduces stale models, it can benefit model convergence, especially in highly heterogeneous data distribution scenarios. Overall, our cached DFL framework allows each agent to act as a local proxy for Centralized Federated Learning with delayed cached models, thus speedup the convergence especially in highly heterogeneous data distribution scenarios, where are challenging for the traditional DFL to converge.

Remark 3.

As mentioned above, our approach inevitably introduces stale models. Intuitively, larger staleness results in greater error in the global model. For the cached models with large staleness $t-\tau$ , we could set a threshold $\tau_{max}$ to kick old models out of model spreading, which is described in the cache update algorithms. The practical value for $\tau_{max}$ should be related to the cache capacity and communication frequency between agents. In our experimental results, we choose $\tau_{max}$ to be 10 or 20 epochs. Results and analysis about the effect of different $\tau_{max}$ can be found in results section 5.

3 Convergence Analysis

We now theoretically investigate the impact of caching, especially the staleness of cached models, on DFL model convergence. We introduce some definitions and assumptions.

Definition 1 (Smoothness).

A differentiable function $f$ is $L$ -smooth if for $\forall x,y,$ $f(y)-f(x)\leq\langle\nabla f(x),y-x\rangle+\frac{L}{2}||y-x||^{2}$ , where $L>0$ .

Definition 2 (Bounded Variance).

There exists constant $\varsigma>0$ such that the global variability of the local gradient of the loss function is bounded $||\nabla F_{j}(x)-\nabla F(x)||^{2}\leq\varsigma^{2},\forall j\in[N],x\in\mathbb{R}^{d}$ .

Definition 3 (L-Lipschitz Continuous Gradient).

There exists a constant $L>0$ , such that $||\nabla F(x)-\nabla F(x^{\prime})||\leq L^{\prime}||x-x^{\prime}||,\forall x,x^{\prime}\in\mathbb{R}^{d}.$

Theorem 4.

Assume that $F$ is $L$ -smooth and convex, and each agent executes $K$ local updates before meeting and exchanging models, after that, then does model aggregation. We also assume bounded staleness $\tau<\tau_{max}$ , as the kick-out threshold. Furthermore, we assume, $\forall x\in\mathbb{R}^{d},i\in[N]$ , and $\forall z\sim\mathcal{D}^{i},||\nabla f(x;z)||^{2}\leq V,||\nabla g_{x^{\prime}}(x;z)||^{2}\leq V,\forall x^{\prime}\in\mathbb{R}^{d}$ , and $\nabla F$ satisfies L-Lipschitz Continuous Gradient. For any small constant $\epsilon>0$ , if we take $\rho>0$ , and satisfying $-(1+2\rho+\epsilon)V+(\rho^{2}-\frac{\rho}{2})||x(t,k-1)-x(t)||^{2}\geq 0,\forall x(t,k-1),x(t)$ , after $T$ global epochs, Algorithm 1 converges to a critical point:

	$\displaystyle\min_{t=0}^{T-1}\mathbb{E}\|\|\nabla F(x(t))\|\|^{2}$	$\displaystyle\leq\frac{\tau_{max}}{\epsilon\eta C_{1}KT}\mathbb{E}[F(x(0))-F(x_{M(T)}(T)]+\mathcal{O}(\frac{\eta\rho K^{2}}{\epsilon C_{1}})$
		$\displaystyle\leq\mathcal{O}(\frac{\tau_{max}}{\epsilon\eta C_{1}KT})+\mathcal{O}(\frac{\eta\rho K^{2}}{\epsilon C_{1}}).$		(2)

3.1 Proof Sketch

We now highlight the key ideas and challenges behind our convergence proof.

Step 1: Similar to Theorem 1 in DBLP:journals/corr/abs-1903-03934, we bound the expected cost reduction after $K$ steps of local updates on the $j$ -th agent, $\forall j\in[N]$ , as

\begin{split}\mathbb{E}[F(\tilde{x}_{j}(t))-F(x_{j}(t))]&=\mathbb{E}[F(x_{j}(t,K))-F(x_{j}(t,0))]\\ &\leq-\eta\epsilon\sum_{k=0}^{K-1}\mathbb{E}||\nabla F(x_{j}(t,k))||^{2}+\eta^{2}\mathcal{O}(\rho K^{3}V).\end{split}

Step 2: For any epoch $t$ , we find the index $M(t)$ of the agent whose model is the “worst", i.e., $M(t)=\arg\max_{j\in[N]}\{F(x_{j}(t))\}$ , and the “worst" model on all agents over the time period of $[t-\tau_{max}+1,t]$ as

\mathcal{T}(t,\tau_{max})=\arg\max_{t\in[t-\tau_{max}+1,t]}\{F(x_{M(t)}(t))\}.

Step 3: We bound the cost reduction of the “worst" model at epoch $t+1$ from the “worst" model in the time period of $[t-\tau_{max}+1,t]$ , i.e., the worst possible model that can be stored in some agent’s cache at time $t$ , as:

	$\displaystyle\mathbb{E}[F\left(x_{M(t+1)}(t+1)\right)$	$\displaystyle-F\left(x_{M(\mathcal{T}(t,\tau_{max}))}(\mathcal{T}(t,\tau_{max}))\right)]$
		$\displaystyle\leq-\epsilon\eta C_{1}\mathrlap{K}\min_{\tau=t}^{t-\tau_{max}+1}\|\|\nabla F(x(\tau))\|\|^{2}+\eta^{2}\mathcal{O}(\rho K^{3}V).$		(3)

Step 4: We iteratively construct a time sequence $\{T^{\prime}_{0},T^{\prime}_{1},T^{\prime}_{2},...,T^{\prime}_{N_{T}}\}\subseteq\{0,1,...,T-1\}$ in the backward fashion so that

		$\displaystyle T^{\prime}_{N_{T}}$	$\displaystyle=T-1;$
		$\displaystyle T^{\prime}_{i}$	$\displaystyle=\mathcal{T}(T^{\prime}_{i+1},\tau_{max})-1,\quad 1\leq i\leq N_{T}-1;$
		$\displaystyle T^{\prime}_{0}$	$\displaystyle=0.$

Step 5: Applying inequality (3.1) at all time instances $\{T^{\prime}_{0},T^{\prime}_{1},T^{\prime}_{2},...,T^{\prime}_{N_{T}}\}$ , after T global epochs, we have,

$\displaystyle\min_{t=0}^{T-1}\mathbb{E}$	$\displaystyle[\|\|\nabla F(x(t))\|\|^{2}]\leq\frac{1}{N_{T}}\sum_{t=T^{\prime}_{0}}^{T^{\prime}_{N_{T}}}\min_{\tau=t}^{t-\tau_{max}+1}\|\|\nabla F(x(\tau))\|\|^{2}$
	$\displaystyle\leq\frac{1}{\epsilon\eta KC_{1}N_{T}}\sum_{t=T^{\prime}_{0}}^{T^{\prime}_{N_{T}}}\mathbb{E}[F(x_{M(\mathcal{T}(t,\tau_{max}))}(\mathcal{T}(t,\tau_{max})))-F(x_{M(t+1)}(t+1))]+\mathcal{O}(\frac{\eta\rho K^{2}V}{\epsilon C_{1}})$
	$\displaystyle\leq\frac{1}{\epsilon\eta KC_{1}N_{T}}\mathbb{E}[F(x(0))-F(x_{M(T)}(T)]+\mathcal{O}(\frac{\eta\rho K^{2}V}{\epsilon C_{1}})$
	$\displaystyle\leq\mathcal{O}(\frac{\tau_{max}}{\epsilon\eta KC_{1}T})+\mathcal{O}(\frac{\eta\rho K^{2}}{\epsilon C_{1}}).$	(4)

Step 6: With the results in (3.1) and by leveraging Theorem 1 in yang2021achieving, Algorithm 1 converges to a critical point after $T$ global epochs.

4 Realization of DFedCache

4.1 Datasets

We conduct experiments using three standard FL benchmark datasets: MNIST [deng2012mnist], FashionMNIST [xiao2017fashionmnistnovelimagedataset], CIFAR-10 [krizhevsky2009learning] on $N=100$ vehicles, with CNN, CNN and ResNet-18 [he2016deep] as models respectively. We design three different data distribution settings: non-i.i.d, i.i.d, Dirichlet distribution method. In extreme non-i.i.d, we use a setting similar to su2022boost, data points in training set are sorted by labels and then evenly divided into 200 shards, with each shards containing 1–2 labels out of 10 labels. Then 200 shards are randomly assigned to 100 vehicles unevenly: 10% vehicles receive 4 shards, 20% vehicles receive 3 shards, 30% vehicles receive 2 shards and the rest 40% receive 1 shards. For i.i.d, we randomly allocate all the training data points to 100 vehicles. For Dirichlet distribution, we follow the setting in xiong2024deprl, to take a heterogeneous allocation by sampling $p_{i}\sim Dir_{N}(\pi)$ , where $\pi$ is the parameter of Dirichlet distribution. We take $\pi=0.5$ in our following experiments.

4.2 Evaluation Setup

The baseline algorithm is Decentralized FedAvg [sun2022decentralized], which implements simple decentralized federated optimization. For convenience, we name Decentralized FedAvg as DFL in the following results. We set batch size to 64 in all experiments. For MNIST and FashionMNIST, we use 60k data points for training and 10k data points for testing. For CIFAR-10, we use 50k data points for training and 10k data points for testing. Different from training set partition, we do not split the testset. For MNIST and FashionMNIST, we test local models of 100 vehicles on the 10k data points of the whole test set and get the average test accuracy for the evaluation metric. What’s more, for CIFAR-10, due to the computing overhead, we sample 1000 data points from test set for each vehicle and use the average test accuracy of 100 vehicles as the evaluation metric. For all the experiments, we train for 1,000 global epochs, and implement early stop when the average test accuracy stops increasing for at least 20 epochs. For MNIST and FashionMNIST experiments, we use 10 compute nodes, each with 10 CPUs, to simulate DFL on 100 vehicles. CIFAR-10 results are obtained from 1 compute node with 5 CPUs and 1 A100 NVIDIA GPU. We implement all algorithms in PyTorch [paszke2017automatic] on Python3.

Algorithm 2 LRU Model Cache Update (LRU Update)

Input: Current cache $\mathcal{C}_{i}(t)$ , Cache of agent $j$ : $\mathcal{C}_{j}(t)$ , Model $x_{j}(t)$ from agent $j$ , Current time $t$ , Cache size $\mathcal{C}_{max}$ , Tolerance of staleness $\tau_{max}$

1: Main Process:

2:for

x_{k}(\tau)\in\mathcal{C}_{i}(t)

\mathcal{C}_{j}(t)

3: if

t-\tau\geq\tau_{max}

then

4: Remove

x_{k}(\tau)

from related cache

\mathcal{C}_{i}(t)

\mathcal{C}_{j}(t)

5: end if

6:end for

7:Add or replace

x_{j}(t)

into

\mathcal{C}_{i}(t)

8:for

x_{k}(\tau)\in\mathcal{C}_{j}(t)

9: if

x_{k}\notin\mathcal{C}_{i}(t)

then

10: Add

x_{k}(\tau)

into

\mathcal{C}_{i}(t)

11: else

12: Retrieve

x_{k}(\tau^{\prime})\in\mathcal{C}_{i}(t)

13: if

\tau>\tau^{\prime}

then

14: Replace

x_{k}(\tau^{\prime})

with

x_{k}(\tau)

into

\mathcal{C}_{i}(t)

15: end if

16: end if

17:end for

18:Sort models in

\mathcal{C}_{i}(t)

in descending order of

\tau

19:Keep the first

\mathcal{C}_{max}

models in

\mathcal{C}_{i}(t)

then get

\mathcal{C}_{i}(t+1)

Output: $\mathcal{C}_{i}(t+1)$

4.3 Optimization Method

We use SGD as the optimizer and set the initial learning rate $\eta=0.1$ for all experiments, and use learning rate scheduler named ReduceLROnPlateau from PyTorch, to automatically adjust the learning rate for each training.

4.4 Mobile DFL Simulation

Manhattan Mobility Model maps are generated from the real road data of Manhattan from INRIX¹¹1© 2024 INRIX, Inc.,”INRIX documentation,” INRIX documentation https://docs.inrix.com/., which is shown in Fig.1. We follow the setting of Manhattan Mobility Model from bai2003important: a vehicle is allowed to move along the grid of horizontal and vertical streets on the map. At an intersection, the vehicle can turn left, right or go straight. This choice is probabilistic: the probability of moving on the same street is 0.5, and the probabilities of turn into one of the rest roads are equal. For example, at an intersection with 3 more roads to turn into, then the probability to turn into each road is 0.1667. Follow the setting from su2022boost: each vehicle is equipped with DSRC and mmWave communication components, can communicate to other vehicles within the communication range of 100 meters, while the default velocity of each vehicle is 13.89 m/s. In our experiment, we set up the number of local updates $K=10$ , and the interval between each epoch is $120$ seconds. Within each global epoch, vehicles train their models on local datasets while moving on the road and updating their cache by uploading and downloading models with other encountered vehicles.

Refer to caption — Figure 1: Manhattan Mobility Model Map. The dots represent the intersections while the edges between nodes represent road in Manhattan.

4.5 LRU Cache Update

Algorithm 2, which we name as LRU method for convenience, is the basic cache update method we proposed. Basically, the LRU updating rule aims to fetch the most-up-to-date models and keep as many of them in the cache as possible. At line 13, 18, the metric for making choice among models is the timestamp of models, which is defined as the epoch time when the model was received from the original vehicle, rather than received from the cache. What’s more, to fully utilize the caching mechanism, a vehicle not only fetches other vehicles’ own trained models, but also models in their caches. For instance, at epoch $t$ , vehicle $i$ can directly fetch model $x_{j}(t)$ from vehicle $j$ , at line 7, and also fetch models $x_{k}(\tau)\in\mathcal{C}_{j}(t)$ from the cache of vehicle $j$ , at line 8. This way, each vehicle can not only fetch the models directly from its neighbours, but also indirectly from its neighbours’ neighbours, thus boosting the spreading of the underlying data information from different vehicles, and improving the DFL convergence speed, especially with heterogeneous data distribution. Additionally, at line 2, before updating cache, models with staleness $t-\tau\geq\tau_{max}$ will be removed from each vehicle’s cache.

5 Numerical Results

5.1 Caching vs. Non-caching

To evaluate the performance of DFedCache, we compare the DFL with LRU Cache, Centralized FL (CFL) and DFL on MNSIT, FashionMNIST, CIFAR-10 with three distributions: non-i.i.d, i.i.d, Dirichlet with $\pi=0.5$ . For LRU update, we take the cache size as 10 for MNIST and FashionMNIST, and 3 for CIFAR-10 and $\tau_{max}=5$ based on practical considerations. Given a speed of $13.89m/s$ and and a communication distance of $100m$ , the communication window of two agents driving in opposite directions could be limited. Additionally, considering the size of chosen models and communication overhead, the above cache sizes were chosen. From the results in Fig. 2, we can see that DFedCache boosts the convergence and outperforms non-caching method (DFL) and gains performance much closer to CFL in all the cases, especially in non-i.i.d. scenarios.

5.2 Impact of Cache Size

Then we evaluate the performance gains with different cache sizes from 1 to 30 and $\tau_{max}=10$ , on MNIST and FashionMNIST in Fig. 3. LRU can benefit more from larger cache sizes, especially in non-i.i.d scenarios, as aggregation with more cached models gets closer to CFL and training with global data distribution.

5.3 Impact of Model Staleness

As mentioned in the previous sections, one drawback of model caching is introducing stale models into aggregation, so it is very important to choose a proper staleness tolerance $\tau_{max}$ . First, we statistically calculate the relation between the average number and the average age of cached models at different $\tau_{max}$ from 1 to 20, when epoch time is 30s, 60s, 120s, with unlimited cache size, in Table 2. We can see that, with the fixed epoch time, as the $\tau_{max}$ increases, the average number and average age of cached models increase, approximately linearly. It’s not hard to understand, as every epoch each agent can fetch a limited number of models directly from other agents. Increasing the staleness tolerance $\tau_{max}$ will increase the number of cached models, as well as the age of cached models. What’s more, we can see that the communication frequency or the moving speed of agents will also impact the average age of cached models, as the faster an agent moves, the more models it can fetch within each epoch, which we will further discuss later.

Table 2: Average number and average age of cached models with different

\tau_{max}

and different epoch time: 30s, 60s, 120s. Columns for different

\tau_{max}

, and rows for different epoch time. There are two sub-rows for each epoch time, with the first sub-row be the average number of cached models, the second sub-row be their average age.

	$\tau_{max}$	1	2	3	4	5	10	20
30s	num	0.8549	1.6841	2.6805	3.8584	5.2054	14.8520	44.8272
30s	age	0.0000	0.4904	1.0489	1.6385	2.2461	5.4381	11.5735
60s	num	1.6562	3.8227	6.4792	10.3889	14.0749	43.0518	90.2576
60s	age	0.000	0.5593	1.1604	1.8075	2.4396	5.5832	9.7954
120s	num	3.6936	9.9149	18.5016	31.3576	35.1958	90.2469	98.5412
120s	age	0.0000	0.6189	1.2715	1.9194	1.5081	4.7131	5.2357

And we also compare the performance of DFL with LRU at different $\tau_{max}$ on MNIST under non-i.i.d and i.i.d in Fig.4. Here we pick the epoch time 30s for a clear view. First, for non-i.i.d, the larger $\tau_{max}$ brings faster convergence at the beginning, which is consistent with our previous conclusion, as it allows more cached models which bring more benefits than harm to the training with non-i.i.d. However, for i.i.d scenario, larger $\tau_{max}$ stops improving the performance of models, as introducing more cached models hardly benefits training in i.i.d, and the model staleness will prevent the convergence. What’s more, if we zoom in the final converging phase, which is amplified in the bottom of each figure, we observe that larger $\tau_{max}$ will reduce the final converged accuracy in both non-i.i.d and i.i.d scenarios, as it introduces more staleness. Even with large $\tau_{max}$ and large staleness, LRU method still achieves similar or even better performance than DFL in non-i.i.d. scenarios.

5.4 Mobility’s Impact on Convergence

Vehicle mobility directly determines the frequency and efficiency of cache-based model spreading and aggregation. So we evaluate the performance of cached DFL at different vehicle speeds. We fix cache size at 10, $\tau_{max}=10$ with non-i.i.d. data distribution. We take the our previous speed $v=v_{0}=13.89m/s$ and $K=30$ local updates as the base, named as speedup x1. To speedup, $v$ increases while $K$ reduces to keep the fair comparison under the same wall clock. For instance, for speedup x3, $v=3v_{0}$ and $K=10$ . Results in Fig. 5 shows that when the mobility speed increases, although the number of local updates decreases, the spread of all models in the whole vehicle network is boosted, thus disseminating local models more quickly among all vehicles, leading to faster model convergence.

5.5 Experiments with Grouped Mobility Patterns and Data Distributions

In practice, vehicle mobility patterns and local data distributions may naturally form groups. For example, a vehicle may mostly move within its home area, and collect data specific to that area. Vehicles within the same area meet with each others frequently, but have very similar data distributions. A fresh model of a vehicle in the same area is not as valuable as a slightly older model of a vehicle from another area. So model caching should not just consider model freshness, but should also take into account the coverage of group-based data distributions. For those scenarios, we develop a Group-based (GB) caching algorithm as Algorithm 3. More specifically, knowing that there are $m$ distribution groups, one should maintain balanced presences of models from different groups. One straightforward extension of the LRU caching algorithm is to partition the cache into $m$ sub-caches, one for each group. Each sub-cache is updated by local models from its associated group, using the LRU policy.

Algorithm 3 Group Based Cache Update (GB Update)

Input: Current cache $\mathcal{C}_{i}(t)$ , Cache of device $j$ : $\mathcal{C}_{j}(t)$ , Model $x_{j}(t)$ from device $j$ , Current time $t$ , Cache size $\mathcal{C}_{max}$ , Tolerance of staleness $\tau_{max}$ , Device group mapping list $\mathcal{A}=\{\mathcal{A}_{1},\mathcal{A}_{2},...,\mathcal{A}_{n}\}$ , here $\mathcal{A}_{i}\in\{1,2,...,m\},\forall i\in[N]$ . Group cache size list $\mathcal{R}=\{r_{1},r_{2},...,r_{m}\}$ , here $r_{\mathcal{A}_{i}}$ is number of cache slots reserved for models from devices belong to group $\mathcal{A}_{i}$ .
Here the model $x(i,t^{\prime})$ is local updated at time $t^{\prime}$ , based on car $i$ ’s dataset.

2:function Group_prune_cache(

\mathcal{C}_{i}(t),\mathcal{R},\mathcal{A}

3: Create empty lists of group

\mathcal{L}_{1},\mathcal{L}_{2},...,\mathcal{L}_{m},

4: for

x_{k}(\tau)\in\mathcal{C}_{i}(t)

5: Put

x_{k}(\tau)

into

\mathcal{L}_{\mathcal{A}_{k}}

6: end for

7: Empty

\mathcal{C}_{i}(t)

8: for

k=1,2,...,m

9: Sort models in

\mathcal{L}_{k}

\tau

descending.

10: Take and put first

r_{k}

models in

\mathcal{L}_{k}

into

\mathcal{C}_{i}(t)

11: end for

12: return

\mathcal{C}_{i}(t+1)

13:end function

1:Main Process:

2:for

x_{k}(\tau)\in\mathcal{C}_{i}(t)

\mathcal{C}_{j}(t)

3: if

t-\tau\geq\tau_{max}

then

4: Remove

x_{k}(\tau)

from related cache

\mathcal{C}_{i}(t)

\mathcal{C}_{j}(t)

5: end if

6:end for

7:Add or replace

x_{j}(t)

into

\mathcal{C}_{i}(t)

8:for

x_{k}(\tau)\in\mathcal{C}_{j}(t)

9: if

x_{k}\notin\mathcal{C}_{i}(t)

then

10: Add

x_{k}(\tau)

into

\mathcal{C}_{i}(t)

11: else

12: Retrieve

x_{k}(\tau^{\prime})\in\mathcal{C}_{i}(t)

13: if

\tau>\tau^{\prime}

then

14: Replace

x_{k}(\tau)

with

x_{k}(\tau^{\prime})

into

\mathcal{C}_{i}(t)

15: end if

16: end if

17:end for

18:

\mathcal{C}_{i}(t+1)\leftarrow

GROUP_PRUNE_CACHE(

\mathcal{C}_{i}(t),\mathcal{R},\mathcal{A}

)

Output: $\mathcal{C}_{i}(t+1)$

We now conduct a case study for group-based cache update. As shown in Fig 1, the whole Manhattan road network is divided into 3 areas, downtown, mid-town and uptown. Each area has 30 area-restricted vehicles that randomly moves within that area, and 3 or 4 free vehicles that can move into any area. We set 4 different area-related data distributions: Non-overlap, 1-overlap, 2-overlap, 3-overlap. $n$ -overlap means the number of shared labels between areas is $n$ . We use the same non-i.i.d shards method in the previous section to allocate data points to the vehicles in the same area. On each vehicle, we evenly divide the cache for the three areas. We evaluate our proposed GB cache method on FashionMNIST. As shown in Fig. 6, while vanilla LRU converges faster at the very beginning, it cannot outperform DFL at last. However, the GB update method can solve the problem of LRU update and outperform DFL under different overlap settings.

5.6 Discussions

In general, the convergence rate of DFedCache outperforms non-caching method (DFL), especially for non-i.i.d. distributions. Larger cache size and smaller $\tau_{max}$ make DFedCache closer to the performance of CFL. Empirically, there is a trade-off between the age and number of cached models, it is critical to choose a proper $\tau_{max}$ to control the staleness. What’s more, the choice of $\tau_{max}$ should take into account the diversity data distributions on agents. In general, with non-i.i.d data distributions, the benefits of increasing the number of cached models can outweigh the damages caused by model staleness; while when data distributions are close to i.i.d, it is better to use a small number of fresh models than a large number of stale models. Similarly conclusions can also be drawn from the results of area-restricted vehicles. What’s more, in a system of moving agents, the mobility will also have big impact on the training, usually higher speed and communication frequency improve the model convergence and accuracy.

6 Related Work

Decentralized FL (DFL) has been increasingly applied in vehicular networks, leveraging existing frameworks like vehicle-to-vehicle (V2V) communication [yuan2024decentralized]. V2V FL facilitates knowledge sharing among vehicles and has been explored in various studies [samarakoon2019distributed, pokhrel2020decentralized, yu2020proactive, chen2021bdfl, barbieri2022decentralized, su2022boost]. samarakoon2019distributed studied optimized joint power and resource allocation for ultra-reliable low-latency communication (URLLC) using FL.

su2022boost introduced DFL with Diversified Data Sources to address data diversity issues in DFL, improving model accuracy and convergence speed in vehicular networks. None of the previous studies explored model caching on vehicles. Convergence of asynchronous federated optimization was studied in DBLP:journals/corr/abs-1903-03934. Their analysis focused on pairwise model aggregation between an agent and the parameter server, does not cover decentralized model aggregation with stale cached models in our proposed framework.

7 Conclusion & Future Work

In this paper, we developed a novel decentralized federated learning framework that leverages on model caching on mobile agents for fast and even model spreading. We theoretically analyzed the convergence of DFL with cached models. Through extensive case studies in a vehicle network, we demonstrated that our algorithm significantly outperforms DFL without model caching, especially for agents with non-i.i.d data distributions. We employed only simple model caching and aggregation algorithms in the current study. We will investigate more refined model caching and aggregation algorithms customized for different agent mobility patterns and non-i.i.d. data distributions.

References

Bai et al. [2003] Fan Bai, Narayanan Sadagopan, and Ahmed Helmy. Important: A framework to systematically analyze the impact of mobility on performance of routing protocols for adhoc networks. In IEEE INFOCOM 2003. Twenty-second Annual Joint Conference of the IEEE Computer and Communications Societies (IEEE Cat. No. 03CH37428), volume 2, pages 825–835. IEEE, 2003.
Barbieri et al. [2022] Luca Barbieri, Stefano Savazzi, Mattia Brambilla, and Monica Nicoli. Decentralized federated learning for extended sensing in 6g connected vehicles. Vehicular Communications, 33:100396, 2022.
Burleigh et al. [2003] Scott Burleigh, Adrian Hooke, Leigh Torgerson, Kevin Fall, Vint Cerf, Bob Durst, Keith Scott, and Howard Weiss. Delay-tolerant networking: an approach to interplanetary internet. IEEE Communications Magazine, 41(6):128–136, 2003.
Chen et al. [2021] Jin-Hua Chen, Min-Rong Chen, Guo-Qiang Zeng, and Jia-Si Weng. Bdfl: A byzantine-fault-tolerance decentralized federated learning method for autonomous vehicle. IEEE Transactions on Vehicular Technology, 70(9):8639–8652, 2021.
Deng [2012] Li Deng. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE signal processing magazine, 29(6):141–142, 2012.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
Krizhevsky [2009] A Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, University of Tront, 2009.
Liu et al. [2022] Ji Liu, Jizhou Huang, Yang Zhou, Xuhong Li, Shilei Ji, Haoyi Xiong, and Dejing Dou. From distributed machine learning to federated learning: A survey. Knowledge and Information Systems, 64(4):885–917, 2022.
Martínez Beltrán et al. [2023] Enrique Tomás Martínez Beltrán, Mario Quiles Pérez, Pedro Miguel Sánchez Sánchez, Sergio López Bernal, Gérôme Bovet, Manuel Gil Pérez, Gregorio Martínez Pérez, and Alberto Huertas Celdrán. Decentralized federated learning: Fundamentals, state of the art, frameworks, trends, and challenges. IEEE Communications Surveys & Tutorials, 25(4):2983–3013, 2023. doi: 10.1109/COMST.2023.3315746.
McMahan et al. [2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
Niknam et al. [2020] Solmaz Niknam, Harpreet S Dhillon, and Jeffrey H Reed. Federated learning for wireless communications: Motivation, opportunities, and challenges. IEEE Communications Magazine, 58(6):46–51, 2020.
Paszke et al. [2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
Pokhrel and Choi [2020] Shiva Raj Pokhrel and Jinho Choi. A decentralized federated learning approach for connected autonomous vehicles. In 2020 IEEE Wireless Communications and Networking Conference Workshops (WCNCW), pages 1–6. IEEE, 2020.
Robbins and Monro [1951] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
Samarakoon et al. [2019] Sumudu Samarakoon, Mehdi Bennis, Walid Saad, and Mérouane Debbah. Distributed federated learning for ultra-reliable low-latency vehicular communications. IEEE Transactions on Communications, 68(2):1146–1159, 2019.
Shokri and Shmatikov [2015] Reza Shokri and Vitaly Shmatikov. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, pages 1310–1321, 2015.
Su et al. [2022] Dongyuan Su, Yipeng Zhou, and Laizhong Cui. Boost decentralized federated learning in vehicular networks by diversifying data sources. In 2022 IEEE 30th International Conference on Network Protocols (ICNP), pages 1–11. IEEE, 2022.
Sun et al. [2022] Tao Sun, Dongsheng Li, and Bao Wang. Decentralized federated averaging. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
Xiao et al. [2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017. URL https://arxiv.org/abs/1708.07747.
Xie et al. [2019] Cong Xie, Sanmi Koyejo, and Indranil Gupta. Asynchronous federated optimization. CoRR, abs/1903.03934, 2019. URL http://arxiv.org/abs/1903.03934.
Xiong et al. [2024] Guojun Xiong, Gang Yan, Shiqiang Wang, and Jian Li. Deprl: Achieving linear convergence speedup in personalized decentralized learning with shared representations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 16103–16111, 2024.
Yang et al. [2021] Haibo Yang, Minghong Fang, and Jia Liu. Achieving linear speedup with partial worker participation in non-iid federated learning. Proceedings of ICLR, 2021.
Yu et al. [2020] Zhengxin Yu, Jia Hu, Geyong Min, Han Xu, and Jed Mills. Proactive content caching for internet-of-vehicles based on peer-to-peer federated learning. In 2020 IEEE 26th International Conference on Parallel and Distributed Systems (ICPADS), pages 601–608. IEEE, 2020.
Yuan et al. [2024] Liangqi Yuan, Ziran Wang, Lichao Sun, S Yu Philip, and Christopher G Brinton. Decentralized federated learning: A survey and perspective. IEEE Internet of Things Journal, 2024.
Zhang et al. [2019] Michael Zhang, James Lucas, Jimmy Ba, and Geoffrey E Hinton. Lookahead optimizer: k steps forward, 1 step back. Advances in neural information processing systems, 32, 2019.

Appendix A Convergence Analysis

We now theoretically investigate the impact of caching, especially the staleness of cached models, on DFL model convergence. We introduce some definitions and assumptions.

Definition 4 (Smoothness).

A differentiable function $f$ is $L$ -smooth if for $\forall x,y,$ $f(y)-f(x)\leq\langle\nabla f(x),y-x\rangle+\frac{L}{2}||y-x||^{2}$ , where $L>0$ .

Definition 5 (Bounded Variance).

There exists a constant $\varsigma>0$ such that the global variability of the local gradient of the loss function is bounded $||\nabla F_{j}(x)-\nabla F(x)||^{2}\leq\varsigma^{2},\forall j\in[N],x\in\mathbb{R}^{d}$ .

Definition 6 (L-Lipschitz Continuous Gradient).

There exists a constant $L>0$ , such that $||\nabla F(x)-\nabla F(x^{\prime})||\leq L^{\prime}||x-x^{\prime}||,\forall x,x^{\prime}\in\mathbb{R}^{d}.$

Theorem 5.

\displaystyle\begin{split}\min_{t=0}^{T-1}\mathbb{E}||\nabla F(x(t))||^{2}\leq&\frac{\tau_{max}}{\epsilon\eta C_{1}KT}\mathbb{E}[F(x(0))-F(x_{M(T)}(T)]+\mathcal{O}(\frac{\eta\tau_{max}K^{2}}{\epsilon C_{1}T})\\ \leq&\mathcal{O}(\frac{\tau_{max}}{\epsilon\eta C_{1}KT})+\mathcal{O}(\frac{\eta\rho K^{2}}{\epsilon C_{1}}).\end{split}

A.1 Proof

Similar to Theorem 1 in [DBLP:journals/corr/abs-1903-03934], we can get the boundary before and after $K$ times local updates on $j$ -th device, $\forall j\in[N]$ :

	$\displaystyle\mathbb{E}[F(\tilde{x}_{j}(t))-F(x_{j}(t))]$	$\displaystyle=\mathbb{E}[F(x_{j}(t,K))-F(x_{j}(t,0))]$
		$\displaystyle\leq-\eta\epsilon\sum_{k=0}^{K-1}\mathbb{E}\|\|\nabla F(x_{j}(t,k))\|\|^{2}+\eta^{2}\mathcal{O}(\rho K^{3}V).$		(5)

For any epoch $t$ , we find the index $M(t)$ of the agent whose model is the “worst", i.e., $M(t)=\arg\max_{j\in[N]}\{F(x_{j}(t))\}$ , then we can get,

$\displaystyle\mathbb{E}[F$	$\displaystyle(x_{i}(t+1))-F(x_{})]\leq\mathbb{E}[F(x_{M(t+1)}(t+1))-F(x^{})]$
$\displaystyle\leq$	$\displaystyle\mathbb{E}\Big{[}F\Big{(}\frac{1}{\|C_{M(t+1)}(t)\|}\sum_{j,\tau\in C_{M(t+1)}(t)}\tilde{x}_{j}(\tau)\Big{)}\Big{]}-F(x^{*})$
$\displaystyle\leq$	$\displaystyle\mathbb{E}\Big{[}\frac{1}{\|C_{M(t+1)}(t)\|}\sum_{j,\tau\in C_{M(t+1)}(t)}F(\tilde{x}_{j}(\tau))\Big{]}-F(x^{*})$
$\displaystyle\leq$	$\displaystyle\mathbb{E}\Big{[}\frac{1}{\|C_{M(t+1)}(t)\|}\sum_{j,\tau\in C_{M(t+1)}(t)}F(x_{j}(\tau))\Big{]}-F(x^{*})+\eta^{2}\mathcal{O}(\rho K^{3}V)$
	$\displaystyle-\frac{\epsilon\eta}{\|C_{M(t+1)}(t)\|}\sum_{j,\tau\in C_{M(t+1)}(t)}\sum_{k=0}^{K-1}\mathbb{E}\|\|\nabla F(x_{j}(\tau,k))\|\|^{2}$
$\displaystyle\leq$	$\displaystyle\mathbb{E}\Big{[}\frac{1}{\|C_{M(t+1)}(t)\|}\sum_{\tau=t-\tau_{max}+1}^{t}\mathrlap{\Big{(}}\sum_{\begin{subarray}{c}x_{j}(\tau)\in C_{M(t)}(t)\\ x_{j}(\tau)\neq x_{M(\tau)}(\tau)\end{subarray}}\mathllap{F(}x_{j}(\tau))+\sum_{x_{M(\tau)}(\tau)\in C_{M(t)}(t)}\mathllap{F(}x_{M(\tau)}(\tau))\Big{)}\Big{]}$
	$\displaystyle-\nabla F-F(x^{*})+\eta^{2}\mathcal{O}(\rho K^{3}V).$	(6)

here $\nabla F=\frac{\epsilon\eta}{|C_{M(t+1)}(t)|}\sum_{j,\tau\in C_{M(t+1)}(t)}\sum_{k=0}^{K-1}\mathbb{E}||\nabla F(x_{j}(\tau,k))||^{2}$ .
We can easy know, for given $\tau,\forall j\neq M(\tau)$ , we have $F(x_{M(\tau)}(\tau))>F(x_{j}(\tau))$ , then,

\mathbb{E}\Big{[}\sum_{\begin{subarray}{c}x_{j}(\tau)\in C_{M(t)}(t)\\ x_{j}(\tau)\neq x_{M(\tau)}(\tau)\end{subarray}}F(x_{j}(\tau))\Big{]}=\Big{(}\sum_{\begin{subarray}{c}x_{j}(\tau)\in C_{M(t)}(t)\\ x_{j}(\tau)\neq x_{M(\tau)}(\tau)\end{subarray}}\cdot 1\Big{)}h(\tau)F(x_{M(\tau)}(\tau)).

(7)

Here $0<h(\tau)<1$ , so we rearrange the equation above, we can get:

$\displaystyle\mathbb{E}[F$	$\displaystyle(x_{M(t+1)}(t+1))-F(x^{*})]$
$\displaystyle\leq$	$\displaystyle\mathbb{E}\Big{[}\sum_{\tau=t}^{t-\tau_{max}+1}\frac{\sum_{\begin{subarray}{c}x_{j}(\tau)\in C_{M(t)}(t)\\ x_{j}(\tau)\neq x_{M(\tau)}(\tau)\end{subarray}}\cdot h(\tau)+\sum_{x_{M(\tau)}(\tau)\in C_{M(t)}(t)}\cdot 1}{\|C_{M(t+1)}(t)\|}F(x_{M(\tau)}(\tau))\Big{]}$
	$\displaystyle-\nabla F-F(x^{*})+\eta^{2}\mathcal{O}(\rho K^{3}V)$
$\displaystyle=$	$\displaystyle\mathbb{E}\Big{[}\sum_{\tau=t}^{t-\tau_{max}+1}\frac{\sum_{\begin{subarray}{c}x_{j}(\tau)\in C_{M(t)}(t)\\ x_{j}(\tau)\neq x_{M(\tau)}(\tau)\end{subarray}}\cdot h(\tau)+\sum_{x_{M(\tau)}(\tau)\in C_{M(t)}(t)}\cdot 1}{\|C_{M(t+1)}(t)\|}F(x_{M(\tau)}(\tau))\Big{]}$
	$\displaystyle-\nabla F-F(x^{*})+\eta^{2}\mathcal{O}(\rho K^{3}V)$
$\displaystyle=$	$\displaystyle\mathbb{E}\Big{[}\sum_{\tau=t}^{t-\tau_{max}+1}\frac{\sum_{\begin{subarray}{c}x_{j}(\tau)\in C_{M(t)}(t)\\ x_{j}(\tau)\neq x_{M(\tau)}(\tau)\end{subarray}}\cdot h(\tau)+\sum_{x_{M(\tau)}(\tau)\in C_{M(t)}(t)}\cdot 1}{\sum_{\tau=t}^{t-\tau_{max}+1}(\sum_{\begin{subarray}{c}x_{j}(\tau)\in C_{M(t)}(t)\\ x_{j}(\tau)\neq x_{M(\tau)}(\tau)\end{subarray}}\cdot 1+\sum_{x_{M(\tau)}(\tau)\in C_{M(t)}(t)}\cdot 1)}F(x_{M(\tau)}(\tau))\Big{]}$
	$\displaystyle-\nabla F-F(x^{*})+\eta^{2}\mathcal{O}(\rho K^{3}V).$	(8)

We define the “worst" model on all agents over the time period of $[t-\tau_{max}+1,t]$ as

\mathcal{T}(t,\tau_{max})=\arg\max_{t\in[t-\tau_{max}+1,t]}\{F(x_{M(t)}(t))\}.

Then we can get,

$\displaystyle\mathbb{E}[F$	$\displaystyle(x_{M(t+1)}(t+1))-F(x^{*})]$
$\displaystyle\leq$	$\displaystyle\mathbb{E}\Big{[}\sum_{\tau=t}^{t-\tau_{max}+1}\frac{\sum_{\begin{subarray}{c}x_{j}(\tau)\in C_{M(t)}(t)\\ x_{j}(\tau)\neq x_{M(\tau)}(\tau)\end{subarray}}\cdot h(\tau)+\sum_{x_{M(\tau)}(\tau)\in C_{M(t)}(t)}\cdot 1}{\sum_{\tau=t}^{t-\tau_{max}+1}(\sum_{\begin{subarray}{c}x_{j}(\tau)\in C_{M(t)}(t)\\ x_{j}(\tau)\neq x_{M(\tau)}(\tau)\end{subarray}}\cdot 1+\sum_{x_{M(\tau)}(\tau)\in C_{M(t)}(t)}\cdot 1)}$
	$\displaystyle\cdot F(x_{(\mathcal{T}(t,\tau_{max}))}(\mathcal{T}(t,\tau_{max})))\Big{]}-\nabla F-F(x^{*})+\eta^{2}\mathcal{O}(\rho K^{3}V)$
$\displaystyle=$	$\displaystyle\mathbb{E}\Big{[}\Big{(}1-\sum_{\tau=t}^{t-\tau_{max}+1}\frac{\sum_{\begin{subarray}{c}x_{j}(\tau)\in C_{M(t)}(t)\\ x_{j}(\tau)\neq x_{M(\tau)}(\tau)\end{subarray}}\cdot(1-h(\tau))}{\|C_{M(t+1)}(t)\|}\Big{)}F(x_{(\mathcal{T}(t,\tau_{max}))}(\mathcal{T}(t,\tau_{max})))\Big{]}$
	$\displaystyle-\nabla F-F(x^{*})+\eta^{2}\mathcal{O}(\rho K^{3}V)$
$\displaystyle=$	$\displaystyle\mathbb{E}\Big{[}\Big{(}1-\xi(t,\tau_{max})\Big{)}F(x_{(\mathcal{T}(t,\tau_{max}))}(\mathcal{T}(t,\tau_{max})))\Big{]}-\nabla F-F(x^{*})+\eta^{2}\mathcal{O}(\rho K^{3}V)$
$\displaystyle\leq$	$\displaystyle\mathbb{E}\Big{[}F(x_{(\mathcal{T}(t,\tau_{max}))}(\mathcal{T}(t,\tau_{max})))\Big{]}-\nabla F-F(x^{*})+\eta^{2}\mathcal{O}(\rho K^{3}V).$	(9)

Here, $\xi(t,\tau_{max})=\sum_{\tau=t}^{t-\tau_{max}+1}\frac{\sum_{\begin{subarray}{c}x_{j}(\tau)\in C_{M(t)}(t)\\ x_{j}(\tau)\neq x_{M(\tau)}(\tau)\end{subarray}}\cdot(1-h(\tau))}{|C_{M(t+1)}(t)|}\in[0,1)$ , so we have,

\mathbb{E}[F(x_{M(t+1)}(t+1))-F(x_{(\mathcal{T}(t,\tau_{max}))}(\mathcal{T}(t,\tau_{max})))]\leq-\nabla F+\eta^{2}\mathcal{O}(\rho K^{3}V).

(10)

Then,

$\displaystyle\mathbb{E}[F$	$\displaystyle(x_{M(t+1)}(t+1))-F(x_{(\mathcal{T}(t,\tau_{max}))}(\mathcal{T}(t,\tau_{max})))]$
	$\displaystyle\leq-\nabla F+\eta^{2}\mathcal{O}(\rho K^{3}V)$
	$\displaystyle=-\frac{\epsilon\eta}{\|C_{M(t+1)}(t)\|}\sum_{j,\tau\in C_{M(t+1)}(t)}\sum_{k=0}^{K-1}\|\|\nabla F(x_{j}(\tau,k))\|\|^{2}+\eta^{2}\mathcal{O}(\rho K^{3}V).$	(11)

By taking use of the property of L-Lipschitz Continuous Gradient, we can get,

$\displaystyle\|\|\nabla F(x_{j}(\tau,k))-\nabla F(x_{j}(\tau,k-1))\|\|$	$\displaystyle\leq L^{\prime}\|\|x_{j}(\tau,k)-x_{j}(\tau,k-1)\|\|$
	$\displaystyle\leq\eta L^{\prime}\|\|\nabla g_{x_{j}(\tau)}(x_{j}(\tau,k-1))\|\|$
	$\displaystyle\leq\eta L^{\prime}\sqrt{V}.$	(12)

Then, we can get, $\forall k\in[1,K]$ , $||\nabla F(x_{j}(\tau,k))||^{2}\geq||\nabla F(x_{j}(\tau,k-1))||^{2}+\eta^{2}L^{\prime 2}V$ , then

||\nabla F(x_{j}(\tau,k))||^{2}\geq||\nabla F(x_{j}(\tau))||^{2}+\eta^{2}L^{\prime 2}kV.

(13)

So we have,

$\displaystyle\sum_{j,\tau\in C_{M(t+1)}(t)}\sum_{k=0}^{K-1}$	$\displaystyle\|\|\nabla F(x_{j}(\tau,k))\|\|^{2}$
	$\displaystyle\geq\sum_{j,\tau\in C_{M(t+1)}(t)}K\|\|\nabla F(x_{j}(\tau))\|\|^{2}+\eta^{2}L^{\prime 2}K(K-1)V$
	$\displaystyle\geq\sum_{j,\tau\in C_{M(t+1)}(t)}K\|\|\nabla F(x_{j}(\tau))\|\|^{2}+\eta^{2}\mathcal{O}(K^{2}V).$	(14)

We assume the ratio of gradient expectation on the global distribution to gradient expectation on local data distribution of any device is bounded, i.e.,

\displaystyle\mathbb{E}[||\nabla F(x_{j}(t))||^{2}]\geq C_{1}\mathbb{E}[||\nabla F(x(t))||^{2}],

(15)

So we get,

$\displaystyle\sum_{j,\tau\in C_{M(t+1)}(t)}$	$\displaystyle\sum_{k=0}^{K-1}\mathbb{E}[\|\|\nabla F(x_{j}(\tau,k))\|\|^{2}]$
$\displaystyle\geq$	$\displaystyle C_{1}K\|C_{M(t+1)}(t)\|\min_{\tau=t-\tau_{max}+1}^{t}\mathbb{E}[\|\|\nabla F(x(\tau))\|\|^{2}]$
	$\displaystyle+\eta^{2}\|C_{M(t+1)}(t)\|\mathcal{O}(K^{2}V),$	(16)

Inequality (A.1) becomes:

	$\displaystyle\mathbb{E}$	$\displaystyle[F(x_{M(t+1)}(t+1))-F(x_{(\mathcal{T}(t,\tau_{max}))}(\mathcal{T}(t,\tau_{max})))]$
		$\displaystyle\leq-\epsilon\eta C_{1}K\min_{\tau=t-\tau_{max}+1}^{t}\|\|\nabla F(x(\tau))\|\|^{2}+\eta^{2}\mathcal{O}(\rho K^{3}V).$		(17)

By rearranging the terms, we have,

	$\displaystyle\min_{\tau=t-\tau_{max}+1}^{t}\|\|\nabla F(x(\tau))\|\|^{2}\leq$	$\displaystyle(\frac{1}{\epsilon\eta C_{1}K})\mathbb{E}[F(x_{(\mathcal{T}(t,\tau_{max}))}(\mathcal{T}(t,\tau_{max})))$
		$\displaystyle-F(x_{M(t+1)}(t+1))]+\mathcal{O}(\frac{\eta\rho K^{2}V}{C_{1}\epsilon}).$		(18)

We iteratively construct a time sequence $\{T^{\prime}_{0},T^{\prime}_{1},T^{\prime}_{2},...,T^{\prime}_{N_{T}}\}\subseteq\{0,1,...,T-1\}$ in the backward fashion so that

		$\displaystyle T^{\prime}_{N_{T}}$	$\displaystyle=T-1;$
		$\displaystyle T^{\prime}_{i}$	$\displaystyle=\mathcal{T}(T^{\prime}_{i+1},\tau_{max})-1,\quad 1\leq i\leq N_{T}-1;$
		$\displaystyle T^{\prime}_{0}$	$\displaystyle=0.$

From the equation above, we can also get the inequality about $N_{T}$ ,

\frac{T}{\tau_{max}}\leq N_{T}\leq T.

(19)

Applying inequality (A.1) at all time instances $\{T^{\prime}_{0},T^{\prime}_{1},T^{\prime}_{2},...,T^{\prime}_{N_{T}}\}$ , after T global epochs, we have,

$\displaystyle\min_{t=0}^{T-1}\mathbb{E}[$	$\displaystyle\|\|\nabla F(x(t))\|\|^{2}]\leq\frac{1}{N_{T}}\sum_{t=T^{\prime}_{0}}^{T^{\prime}_{N}}\min_{\tau=t-\tau_{max}+1}^{t}\|\|\nabla F(x(\tau))\|\|^{2}$
$\displaystyle\leq$	$\displaystyle\frac{1}{\epsilon\eta C_{1}KN_{T}}\sum_{t=T^{\prime}_{0}}^{T^{\prime}_{N}}\mathbb{E}[F(x_{M_{T(t,\tau_{max})}}(T(t,\tau_{max})))-F(x_{M(t+1)}(t+1))]+\mathcal{O}(\frac{\eta\rho K^{2}V}{\epsilon C_{1}})$
$\displaystyle\leq$	$\displaystyle\frac{1}{\epsilon\eta C_{1}KN_{T}}\mathbb{E}[F(x_{M(T^{\prime}_{0})}(T^{\prime}_{0}))-F(x_{M(T^{\prime}_{N_{T}}+1)}(T^{\prime}_{N_{T}}+1))]+\mathcal{O}(\frac{\eta\rho K^{2}V}{\epsilon C_{1}})$
$\displaystyle\leq$	$\displaystyle\frac{1}{\epsilon\eta C_{1}KN_{T}}\mathbb{E}[F(x_{M(0)}(0))-F(x_{M(T)}(T)]+\mathcal{O}(\frac{\eta\rho K^{2}V}{\epsilon C_{1}})$
$\displaystyle\leq$	$\displaystyle\frac{1}{\epsilon\eta C_{1}KN_{T}}\mathbb{E}[F(x(0))-F(x_{M(T)}(T)]+\mathcal{O}(\frac{\eta\rho K^{2}V}{\epsilon C_{1}})$
$\displaystyle\leq$	$\displaystyle\frac{\tau_{max}}{\epsilon\eta C_{1}KT}\mathbb{E}[F(x(0))-F(x_{M(T)}(T)]+\mathcal{O}(\frac{\eta\rho K^{2}V}{\epsilon C_{1}})$
$\displaystyle\leq$	$\displaystyle\frac{\tau_{max}}{\epsilon\eta C_{1}KT}\mathbb{E}[F(x(0))-F(x(T)]+\mathcal{O}(\frac{\eta\rho K^{2}V}{\epsilon C_{1}})$
$\displaystyle\leq$	$\displaystyle\mathcal{O}(\frac{\tau_{max}}{\epsilon\eta C_{1}KT})+\mathcal{O}(\frac{\eta\rho K^{2}}{\epsilon C_{1}}).$	(20)

With the results in (A.1) and by leveraging Theorem 1 in yang2021achieving, Algorithm 1 converges to a critical point after $T$ global epochs.

Appendix B Experimental details

B.1 Data distribution in Group-based LRU Cache Experiment

B.1.1 $n$ -overlap

For MNIST, FashionMNIST with 10 classes labels, we allocate labels into 3 groups as:
Non-overlap:
Area 1: (0,1,2,3), Area 2: (4,5,6), Area 3: (7,8,9)
1-overlap:
Area 1: (9,0,1,2,3), Area 2: (3,4,5,6), Area 3: (6,7,8,9)
2-overlap:
Area 1: (8,9,0,1,2,3), Area 2: (2,3,4,5,6), Area 3: (5,6,7,8,9)
3-overlap:
Area 1: (7,8,9,0,1,2,3), Area 2: (1,2,3,4,5,6), Area 3: (4,5,6,7,8,9)

B.2 Computing infrastructure

For MNIST and FashionMNIST experiments, we use 10 computing nodes, each with 10 CPUs, model: Lenovo SR670, to simulate DFL on 100 vehicles. CIFAR-10 results are obtained from 1 compute node with 5 CPUs and 1 A100 NVIDIA GPU, model: SD650-N V2. They are all using Ubuntu-20.04.1 system with pre-installed openmpi/intel/4.0.5 module, and 100 Gb/s network speed.

B.3 Results with variation

As our metric is the average test accuracy over 100 devices, we also provide the variation of the average test accuracy in our experimental results as following,

Table 3: Average number and average age of cached models with different

\tau_{max}

and different epoch time: 30s, 60s, 120s. Columns for different

\tau_{max}

, and rows for different epoch time. There are four sub-rows for each epoch time, with the first two sub-rows be the average number of cached models and variation, the second two sub-rows be their average age and variation.

	$\tau_{max}$	1	2	3	4	5	10	20
30s	num	0.8549	1.6841	2.6805	3.8584	5.2054	14.8520	44.8272
	$var_{n}$	0.0321	0.0903	0.1941	0.3902	0.6986	6.2270	42.8144
	age	0.0000	0.4904	1.0489	1.6385	2.2461	5.4381	11.5735
	$var_{a}$	0.0000	0.0052	0.0116	0.0203	0.0320	0.1577	1.0520
60s	num	1.6562	3.8227	6.4792	10.3889	14.0749	43.0518	90.2576
	$var_{n}$	0.0865	0.3594	0.9925	2.824	5.1801	32.8912	55.5829
	age	0.000	0.5593	1.1604	1.8075	2.4396	5.5832	9.7954
	$var_{a}$	0.0000	0.0035	0.0089	0.0188	0.0251	0.1448	0.8375
120s	num	3.6936	9.9149	18.5016	31.3576	35.1958	90.2469	98.5412
	$var_{n}$	0.3166	2.3825	5.9696	21.7408	56.8980	28.8958	29.9676
	age	0.0000	0.6189	1.2715	1.9194	1.5081	4.7131	5.2357
	$var_{a}$	0.0000	0.0028	0.0058	0.0212	0.9742	0.1249	0.1808

B.4 Final (hyper-)parameters in experiments

learning rate $\eta=0.1$ , batch size = 64. For all the experiments, we train for 1,000 global epochs, and implement early stop when the average test accuracy stops increasing for at least 20 epochs. Without specific explanation, we take the following parameters as the final default parameters in our experiments. Local epoch $K=10$ , vehicle speed $v=13.59m/s$ , communication distance $100m$ , communication interval between each epoch $120s$ , $\tau_{max}=10$ , number of devices $N=100$ , cache size $10$ .

B.5 NN models

We choose the models from the classical repository ²²2 Federated-Learning-PyTorch , Implementation of Communication-Efficient Learning of Deep Networks from Decentralized Data © 2024 GitHub, Inc.https://github.com/AshwinRJ/Federated-Learning-PyTorch.: we run convolutional neural network (CNN) as shown in Table 4, for MNIST, convolutional neural network (CNN) as shown in Table 5, for FashionMNIST. Inspired by [zhang2019lookahead], we choose ResNet-18 for Cifar-10, as shown in Table 6.

Table 4: CNN Architecture for MNIST

Layer Type	Size
Convolution + ReLU	$5\times 5\times 10$
Max Pooling	$2\times 2$
Convolution + ReLU + Dropout	$5\times 5\times 20$
Dropout(2D)	$p=0.5$
Max Pooling	$2\times 2$
Fully Connected + ReLU	$320\times 50$
Dropout	$p=0.5$
Fully Connected	$50\times 10$

Table 5: CNN Architecture for FashionMNIST

Layer Type	Size
Convolution + BatchNorm + ReLU	$5\times 5\times 16$
Max Pooling	$2\times 2$
Convolution + BatchNorm + ReLU	$5\times 5\times 32$
Max Pooling	$2\times 2$
Fully Connected	$7\times 7\times 32\times 10$

Table 6: ResNet-18 Architecture

Layer Type	Output Size
Convolution + BatchNorm + ReLU	$32\times 32\times 64$
Residual Block 1 (2x BasicBlock)	$32\times 32\times 64$
Residual Block 2 (2x BasicBlock)	$16\times 16\times 128$
Residual Block 3 (2x BasicBlock)	$8\times 8\times 256$
Residual Block 4 (2x BasicBlock)	$4\times 4\times 512$
Average Pooling	$1\times 1\times 512$
Fully Connected	$1\times 1\times 10$

$\displaystyle\|\|\nabla F(x_{j}(\tau,k))-\nabla F(x_{j}(\tau,k-1))\|\|$	$\displaystyle\leq L^{\prime}\|\|x_{j}(\tau,k)-x_{j}(\tau,k-1)\|\|$
	$\displaystyle\leq\eta L^{\prime}\|\|\nabla g_{x_{j}(\tau)}(x_{j}(\tau,k-1))\|\|$
	$\displaystyle\leq\eta L^{\prime}\sqrt{V}.$	(12)

$\displaystyle\sum_{j,\tau\in C_{M(t+1)}(t)}\sum_{k=0}^{K-1}$	$\displaystyle\|\|\nabla F(x_{j}(\tau,k))\|\|^{2}$
	$\displaystyle\geq\sum_{j,\tau\in C_{M(t+1)}(t)}K\|\|\nabla F(x_{j}(\tau))\|\|^{2}+\eta^{2}L^{\prime 2}K(K-1)V$
	$\displaystyle\geq\sum_{j,\tau\in C_{M(t+1)}(t)}K\|\|\nabla F(x_{j}(\tau))\|\|^{2}+\eta^{2}\mathcal{O}(K^{2}V).$	(14)

Decentralized Federated Learning with Model Caching on Mobile Agents

Abstract

1 Introduction

1.1 Federated Learning on Mobile Agents

1.2 Delay-Tolerant Model Communication and Aggregation Through Caching

2 Background

2.1 Global Training Objective

2.2 DFL Training with Local Model Caching

Remark 1.

Remark 2.

Remark 3.

3 Convergence Analysis

Definition 1 (Smoothness).

Definition 2 (Bounded Variance).

Definition 3 (L-Lipschitz Continuous Gradient).

Theorem 4.

3.1 Proof Sketch

4 Realization of DFedCache

4.1 Datasets

4.2 Evaluation Setup

4.3 Optimization Method

4.4 Mobile DFL Simulation

4.5 LRU Cache Update

5 Numerical Results

5.1 Caching vs. Non-caching

5.2 Impact of Cache Size

5.3 Impact of Model Staleness

5.4 Mobility’s Impact on Convergence

5.5 Experiments with Grouped Mobility Patterns and Data Distributions

5.6 Discussions

6 Related Work

7 Conclusion & Future Work

References

Appendix A Convergence Analysis

Definition 4 (Smoothness).

Definition 5 (Bounded Variance).

Definition 6 (L-Lipschitz Continuous Gradient).

Theorem 5.

A.1 Proof

Appendix B Experimental details

B.1 Data distribution in Group-based LRU Cache Experiment

B.1.1 nn-overlap

B.2 Computing infrastructure

B.3 Results with variation

B.4 Final (hyper-)parameters in experiments

B.5 NN models

B.1.1 $n$ -overlap