

Federated $K$-Means Clustering via Dual Decomposition-based Distributed Optimization

Vassilios Yfantis, Achim Wagner, Martin Ruskowski. V. Yfantis and M. Ruskowski are with the Chair of Machine Tools and Control Systems, Rheinland-Pfälzische Technische Universität Kaiserslautern-Landau (e-mail: [email protected]). A. Wagner and M. Ruskowski are with the German Research Center for Artificial Intelligence. The authors would like to thank Vinit Hegiste for providing feedback on the manuscript. The research was funded by a project supported by the Federal Ministry for Economic Affairs and Climate Action (BMWK) on the basis of a decision by the German Bundestag.
Abstract

The use of distributed optimization in machine learning can be motivated either by the resulting preservation of privacy or the increase in computational efficiency. On the one hand, training data might be stored across multiple devices. Training a global model within a network where each node only has access to its confidential data requires the use of distributed algorithms. Even if the data is not confidential, sharing it might be prohibitive due to bandwidth limitations. On the other hand, the ever-increasing amount of available data leads to large-scale machine learning problems. By splitting the training process across multiple nodes, its efficiency can be significantly increased. This paper aims to demonstrate how dual decomposition can be applied for distributed training of $K$-means clustering problems. After an overview of distributed and federated machine learning, the mixed-integer quadratically constrained programming-based formulation of the $K$-means clustering training problem is presented. The training can be performed in a distributed manner by splitting the data across different nodes and linking these nodes through consensus constraints. Finally, the performance of the subgradient method, the bundle trust method, and the quasi-Newton dual ascent algorithm is evaluated on a set of benchmark problems. While the mixed-integer programming-based formulation of the clustering problems suffers from weak integer relaxations, the presented approach can potentially be used to enable an efficient solution in the future, both in a central and distributed setting.

Index Terms:
Distributed Optimization, Federated Learning, Dual Decomposition, Clustering.

I Introduction

Training a machine learning model of any kind on a large set of data usually involves the solution of a challenging optimization problem. If the underlying data set becomes too large, it might not be possible to solve the resulting optimization problem in a reasonable amount of time. Distributed optimization methods can aid in rendering the optimization problem tractable through the use of multiple computational resources. Peteiro-Barral and Guijarro-Berdiñas provide an overview of methods for distributed machine learning [1]. To train a global model in a distributed manner, a consensus has to be established between the involved nodes and their underlying optimization problems. Forero et al. [2, 3] and Georgopoulos and Hasler [4] demonstrate the distributed training of machine learning models using consensus-based distributed optimization. Tsianos et al. discuss practical issues with a consensus-based approach that arise from the difference between synchronous and asynchronous communication [5]. Nedić provides an overview of distributed gradient methods for convex training problems [6], while Verbraeken et al. give a general survey of distributed machine learning [7].

(a) Simple architecture for federated learning.
(b) Federated learning of node clusters.
Figure 1: Examples of federated learning architectures.

While computational performance remains an issue for many machine learning problems, the increase in computing power and in the efficiency of optimization algorithms can render many challenging problems tractable. However, the inability to share data due to confidentiality reasons still necessitates the use of distributed algorithms. Fig. 1(a) shows a setting in which training data is stored across two different nodes. Each node can use its local data to train an individual machine learning model. By including a coordination layer the two training processes can be guided in a way that a global model is trained, without the need to share confidential data. Distributed training of a global model without sharing individual training data is often referred to as federated optimization or federated learning [8]. If the underlying optimization problems are still hard to solve, or if the training data is heterogeneous, the training process can be further divided into subproblems. Fig. 1(b) depicts the situation in which models of different node clusters are trained in a distributed manner which in turn are again coordinated to obtain a global model. This approach is often referred to as clustered federated learning [9].

Most algorithms for federated learning involve an averaging step of the model parameters of the individual nodes [10]. Federated learning methods have been applied in the context of manufacturing [11], healthcare [12], mobile devices [13], and smart city sensing [14]. Li et al. [15] and Liu et al. [16] provide surveys on federated learning while Chamikara et al. examine the privacy aspects related to external attacks [17]. Applying federated learning to heterogeneous data sets can lead to the deterioration of the model quality of individual nodes in regard to their training data, which might hinder their willingness to participate in such a setting. This issue is addressed through personalized federated learning [18, 19].

$K$-means clustering describes an unsupervised machine learning problem in which a set of observations/data is divided into $K$ disjoint clusters according to a similarity measure [20]. Clustering problems can be found in many practical applications such as image segmentation [21], customer market segmentation [22], or the identification of similar operating points in a production plant [23]. While $K$-means clustering is a well-studied problem, federated clustering, i.e., clustering of data across multiple nodes, has not been studied extensively in the literature yet. Dennis et al. present a one-shot federated clustering algorithm with heterogeneous data, where the clustering problem of each node is solved by Lloyd’s heuristic algorithm [24, 25]. Kumar et al. apply federated averaging to pre-trained models on separate devices and present an update strategy when new data points are added [26]. Li et al. address the security aspect of federated clustering by encoding the data of each node and applying Lloyd’s algorithm to the encoded data [27]. Stallmann and Wilbik extend fuzzy $c$-means clustering, i.e., a clustering problem where each data point can be assigned to multiple clusters, to a federated setting [28]. A similar approach to federated $c$-means clustering was previously presented by Pedrycz [29]. Wang et al. use model averaging and gradient sharing for federated clustering of data in a smart grid [30].

A common feature of the federated clustering approaches described in the literature is the use of a heuristic algorithm to solve the individual clustering problems. In this paper, mixed-integer programming is used to solve the individual clustering problems and a dual decomposition-based distributed optimization approach is employed to coordinate the solutions of the different nodes. While an averaging step is still performed to obtain feasible primal solutions, the use of duality enables the computation of valid lower bounds on the objective of the global clustering problem. It should be noted that the mixed-integer programming problems resulting from the $K$-means clustering problem are very hard to solve due to their weak integer relaxations. Thus, the approach presented in this paper can be regarded as preliminary work that can become a viable option with the continuous improvement of mixed-integer programming solvers [31, 32].

II Notation

We denote vectors with bold lower-case letters ($\mathbf{x}$) and matrices with bold upper-case letters ($\mathbf{X}$). The $j$th element of a vector $\mathbf{x}$ is denoted by $[\mathbf{x}]_{j}$. The Euclidean norm is denoted by $\|\cdot\|_{2}$. The value of a variable $\mathbf{x}$ in iteration $t$ is denoted by $\mathbf{x}^{(t)}$.

III $K$-Means clustering

This section presents the mixed-integer programming-based formulation of the $K$-means clustering training problem. The formulation is subsequently extended to the case of data stored across multiple nodes, which gives rise to a federated learning problem. Consensus constraints are used to couple the training problems of different nodes. These constraints can be dualized such that the federated learning problem can be solved via dual decomposition-based distributed optimization. Since the underlying optimization problem contains integrality constraints, it is not convex and thus strong duality does not hold. However, a feasible primal solution can be computed in each iteration through an averaging heuristic.

III-A MIQCP formulation

(a) Clustering of the complete data set.
(b) Clustering of data-subset 1.
(c) Clustering of data-subset 2.
Figure 2: Illustration of $K$-means clustering both in a centralized and a decentralized setting.

The goal of $K$-means clustering is to assign a set of observations $\mathbf{y}_{j}\in\mathbb{R}^{n_{\mathbf{y}}},\;j\in\mathcal{J}=\{1,\dots,|\mathcal{J}|\}$, to a set of clusters $\mathcal{K}=\{1,\dots,K\}$ and to compute the centroids of each cluster. The number of clusters is a hyper-parameter and is set a priori or in an iterative manner. This problem can be formulated as a mixed-integer nonlinear programming (MINLP) problem [33, 20],

\min_{w_{jk},\mathbf{m}_{k}} \; \sum_{j\in\mathcal{J}}\sum_{k\in\mathcal{K}}w_{jk}\cdot\|\mathbf{y}_{j}-\mathbf{m}_{k}\|_{2}^{2},  (1a)
s. t. \; \sum_{k\in\mathcal{K}}w_{jk}=1,\;\forall j\in\mathcal{J},  (1b)
w_{jk}\in\{0,1\},\;\forall j\in\mathcal{J},k\in\mathcal{K},\quad\mathbf{m}_{k}\in\mathbb{R}^{n_{\mathbf{y}}}\;\forall k\in\mathcal{K}.  (1c)

The binary variables $w_{jk}$ indicate if observation $\mathbf{y}_{j}$ is assigned to cluster $k$ and $\mathbf{m}_{k}$ is the centroid of cluster $k$. Constraint (1b) enforces that each observation is assigned to exactly one cluster, while the objective is to minimize the sum of the squared Euclidean distances of all observations to the centroids of their assigned clusters. Problem (1) is a nonconvex MINLP which is hard to solve. In practice it is more efficient to use a linearized formulation by introducing the variable $d_{jk}$, which describes the squared distance between an observation $j$ and the centroid of cluster $k$ [20],

\min_{w_{jk},d_{jk},\mathbf{m}_{k}} \; \sum_{j\in\mathcal{J}}\sum_{k\in\mathcal{K}}d_{jk}  (2a)
s. t. \; \sum_{k\in\mathcal{K}}w_{jk}=1,\;\forall j\in\mathcal{J},  (2b)
d_{jk}\geq\|\mathbf{y}_{j}-\mathbf{m}_{k}\|_{2}^{2}-M_{j}\cdot(1-w_{jk}),\;\forall j\in\mathcal{J},k\in\mathcal{K},  (2c)
w_{jk}\in\{0,1\},\;d_{jk}\geq 0,\;\forall j\in\mathcal{J},k\in\mathcal{K},\quad\mathbf{m}_{k}\in\mathbb{R}^{n_{\mathbf{y}}}\;\forall k\in\mathcal{K}.  (2d)

(2) is a mixed-integer quadratically constrained programming (MIQCP) problem with a convex integer relaxation. Constraint (2c) is an epigraph formulation of the squared Euclidean distance if observation $j$ is assigned to cluster $k$, i.e., when $w_{jk}=1$. Otherwise, the parameter $M_{j}$ has to be large enough so that the constraint is trivially satisfied for $w_{jk}=0$. In theory, a common big-M parameter can be used for all constraints described by (2c). However, the parameter should be chosen as small as possible to avoid weak integer relaxations. In the following, the big-M parameter is set as

M_{j}=\max_{\bm{\chi}\in\mathcal{Y}}\;\|\mathbf{y}_{j}-\bm{\chi}\|_{2}^{2},\;\forall j\in\mathcal{J},  (3a)
\mathcal{Y}=\{\mathbf{y}\in\mathbb{R}^{n_{\mathbf{y}}}\;|\;\min_{j\in\mathcal{J}}\,[\mathbf{y}_{j}]_{l}\leq[\mathbf{y}]_{l}\leq\max_{j\in\mathcal{J}}\,[\mathbf{y}_{j}]_{l},\;l=1,\dots,n_{\mathbf{y}}\}.  (3b)
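As an illustration, the box maximization in (3) can be evaluated in closed form: the maximum of $\|\mathbf{y}_{j}-\bm{\chi}\|_{2}^{2}$ over the bounding box of the data is attained at a corner of the box, so each coordinate contributes the larger of the squared distances to the lower and the upper bound. The following NumPy sketch (function name and array layout are our own choices, not taken from the paper) computes one $M_{j}$ per observation.

```python
import numpy as np

def big_m_parameters(Y: np.ndarray) -> np.ndarray:
    """Big-M values M_j according to eq. (3), one per observation.

    Y has shape (n_obs, n_y). For each observation y_j, the squared distance
    to a point in the bounding box is maximized coordinate-wise at either the
    lower or the upper bound of that coordinate.
    """
    lo = Y.min(axis=0)                                   # componentwise lower bounds of the box
    hi = Y.max(axis=0)                                   # componentwise upper bounds of the box
    per_dim = np.maximum((Y - lo) ** 2, (Y - hi) ** 2)   # worst case per coordinate
    return per_dim.sum(axis=1)                           # shape (n_obs,)
```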

Different approaches have been proposed to solve the clustering optimization problem. Bagirov and Yearwood present a heuristic method based on nonsmooth optimization [34], Aloise et al. propose a column generation algorithm [33], and Karmitsa et al. use a diagonal bundle method [35]. Fig. 2(a) illustrates the concept of $K$-means clustering. The unlabeled data (left) is split into 3 clusters according to their distance to the computed cluster centroids (crosses).

III-B Distributed consensus formulation

Problem (2) describes the case in which the entire data set is accessible from a single node. However, this might not always be the case, especially if the underlying data is confidential. In the following it is assumed that the data set is split across several nodes $\mathcal{I}=\{1,\dots,N_{s}\}$, with each node having access to the data-subset $\mathcal{J}_{i}\subset\mathcal{J}$. The MIQCP problem (2) can be extended to the case of multiple nodes,

\min_{w_{ijk},d_{ijk},\mathbf{m}_{k}} \; \sum_{i\in\mathcal{I}}\sum_{j\in\mathcal{J}_{i}}\sum_{k\in\mathcal{K}}d_{ijk}  (4a)
s. t. \; \sum_{k\in\mathcal{K}}w_{ijk}=1,\;\forall i\in\mathcal{I},j\in\mathcal{J}_{i},  (4b)
d_{ijk}\geq\|\mathbf{y}_{j}-\mathbf{m}_{k}\|_{2}^{2}-M_{j}\cdot(1-w_{ijk}),\;\forall i\in\mathcal{I},j\in\mathcal{J}_{i},k\in\mathcal{K},  (4c)
w_{ijk}\in\{0,1\},\;d_{ijk}\geq 0,\;\forall i\in\mathcal{I},j\in\mathcal{J}_{i},k\in\mathcal{K},\quad\mathbf{m}_{k}\in\mathbb{R}^{n_{\mathbf{y}}},\;\forall k\in\mathcal{K}.  (4d)

The goal of problem (4) is again to compute a set of cluster centroids $\mathbf{m}_{k}$ and to assign the observations of all nodes to these clusters. However, if the nodes cannot share their data, problem (4) cannot be solved in a centralized manner. A simple distributed approach would be to solve a separate clustering problem in each node $i$. This could lead to a situation as depicted in Fig. 2(b) and Fig. 2(c). If the data set is split across two nodes, each node can solve its own clustering problem. However, both nodes will compute different cluster centroids.
The goal of a federated learning approach is to train a global model, i.e., global cluster centroids in the case of $K$-means clustering, without sharing the local data between the nodes. To this end, each node $i$ can compute individual cluster centroids $\mathbf{m}_{ik}$,

\min_{w_{ijk},d_{ijk},\mathbf{m}_{ik}} \; \sum_{i\in\mathcal{I}}\sum_{j\in\mathcal{J}_{i}}\sum_{k\in\mathcal{K}}d_{ijk}  (5a)
s. t. \; \sum_{k\in\mathcal{K}}w_{ijk}=1,\;\forall i\in\mathcal{I},j\in\mathcal{J}_{i},  (5b)
d_{ijk}\geq\|\mathbf{y}_{j}-\mathbf{m}_{ik}\|_{2}^{2}-M_{j}\cdot(1-w_{ijk}),\;\forall i\in\mathcal{I},j\in\mathcal{J}_{i},k\in\mathcal{K},  (5c)
\mathbf{m}_{ik}=\mathbf{m}_{i^{\prime}k},\;\forall i\in\mathcal{I},i^{\prime}\in\mathcal{N}_{i},k\in\mathcal{K},  (5d)
w_{ijk}\in\{0,1\},\;d_{ijk}\geq 0,\;\forall i\in\mathcal{I},j\in\mathcal{J}_{i},k\in\mathcal{K},  (5e)
\mathbf{m}_{ik}\in\mathbb{R}^{n_{\mathbf{y}}},\;\forall i\in\mathcal{I},k\in\mathcal{K}.  (5f)

Since the goal is to obtain global cluster centroids, the individual cluster centroids are coupled through consensus constraints (5d), where $\mathcal{N}_{i}$ contains the set of neighboring nodes of node $i$. Problem (5) describes a set of $N_{s}$ subproblems coupled through the consensus constraints. In the following subsection dual variables are used to decouple the clustering problems of the different nodes.

IV Dual decomposition-based distributed clustering

This section presents how the consensus formulation (5) of the clustering problem can be decomposed by introducing dual variables. Dual decomposition can be applied to constraint-coupled optimization problems of the form

\min_{\mathbf{x}} \; \sum_{i\in\mathcal{I}}f_{i}(\mathbf{x}_{i}),  (6a)
s. t. \; \sum_{i\in\mathcal{I}}\mathbf{A}_{i}\mathbf{x}_{i}=\mathbf{b},  (6b)
\mathbf{x}_{i}\in\mathcal{X}_{i}.  (6c)

Equation (6) describes an optimization problem consisting of a set $\mathcal{I}=\{1,\dots,N_{s}\}$ of subproblems. The subproblems are coupled through the constraints (6b) and each one is described by individual variables $\mathbf{x}_{i}$ and constraints $\mathcal{X}_{i}$. Dual decomposition is based on the introduction of dual variables for the coupling constraints (6b) and the solution of the resulting dual optimization problem. The idea was first introduced by Everett [36] for problems involving shared limited resources. Problem (5) can also be rewritten as a general constraint-coupled optimization problem by defining the matrix $\mathbf{A}$ describing the connections between the different nodes. In the following only linear network topologies as depicted in Fig. 3 are considered. Note that the discussion in the remainder of this paper can be easily extended to different network topologies.

Figure 3: Illustration of a linear network topology and the resulting consensus constraints.

By defining the vector of stacked cluster centroids of each node $i$,

\hat{\mathbf{m}}_{i}\coloneqq\begin{bmatrix}\mathbf{m}_{i,1}\\ \vdots\\ \mathbf{m}_{i,K}\end{bmatrix}\in\mathbb{R}^{K\cdot n_{\mathbf{y}}},  (7)

the consensus constraints can be rewritten as

\hat{\mathbf{m}}_{1}-\hat{\mathbf{m}}_{2}=\mathbf{0},  (8a)
\hat{\mathbf{m}}_{2}-\hat{\mathbf{m}}_{3}=\mathbf{0},  (8b)
\quad\vdots
\hat{\mathbf{m}}_{N_{s}-1}-\hat{\mathbf{m}}_{N_{s}}=\mathbf{0}.  (8c)

Constraints (8) can subsequently be rewritten in matrix form

\underbrace{\begin{bmatrix}\mathbf{I}&-\mathbf{I}&\mathbf{0}&\cdots&\mathbf{0}&\mathbf{0}\\ \mathbf{0}&\mathbf{I}&-\mathbf{I}&\cdots&\mathbf{0}&\mathbf{0}\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ \mathbf{0}&\mathbf{0}&\mathbf{0}&\cdots&\mathbf{I}&-\mathbf{I}\end{bmatrix}}_{=:\mathbf{A}\in\mathbb{R}^{K\cdot n_{\mathbf{y}}\cdot(N_{s}-1)\times K\cdot n_{\mathbf{y}}\cdot N_{s}}}\cdot\begin{bmatrix}\hat{\mathbf{m}}_{1}\\ \hat{\mathbf{m}}_{2}\\ \hat{\mathbf{m}}_{3}\\ \vdots\\ \hat{\mathbf{m}}_{N_{s}}\end{bmatrix}=\mathbf{0},  (9)

or in a more compact way

\sum_{i\in\mathcal{I}}\mathbf{A}_{i}\hat{\mathbf{m}}_{i}=\mathbf{0}  (10)

with $\mathbf{A}_{i}\in\mathbb{R}^{K\cdot n_{\mathbf{y}}\cdot(N_{s}-1)\times K\cdot n_{\mathbf{y}}}$. By introducing dual variables $\bm{\lambda}\in\mathbb{R}^{K\cdot n_{\mathbf{y}}\cdot(N_{s}-1)}$ for the consensus constraints (10), the Lagrange function of problem (5) can be defined,

\mathcal{L}(w_{ijk},d_{ijk},\mathbf{m}_{ik},\bm{\lambda})=\sum_{i\in\mathcal{I}}\underbrace{\left(\sum_{j\in\mathcal{J}_{i}}\sum_{k\in\mathcal{K}}d_{ijk}+\bm{\lambda}^{T}\mathbf{A}_{i}\hat{\mathbf{m}}_{i}\right)}_{=:\mathcal{L}_{i}(w_{ijk},d_{ijk},\mathbf{m}_{ik},\bm{\lambda})}.  (11)

The minimization of the Lagrange function for a fixed value of the dual variables $\bm{\lambda}$ gives the corresponding value of the dual function,

d(\bm{\lambda})\coloneqq\min_{w_{ijk},d_{ijk},\mathbf{m}_{ik}} \; \sum_{i\in\mathcal{I}}\mathcal{L}_{i}(w_{ijk},d_{ijk},\mathbf{m}_{ik},\bm{\lambda})  (12)
s. t. \; \text{(5b), (5c), (5e), (5f)}.

The dual function has two important properties. First, the value of the dual function is always a lower bound on the solution of its corresponding primal problem, in this case, problem (5) [37]. The problem of finding the dual variables that result in the best lower bound is referred to as the dual optimization problem,

\max_{\bm{\lambda}}\;d(\bm{\lambda}).  (13)

The resulting dual problem can be solved in a distributed manner by solving the individual clustering problems for the current value of the dual variables,

\min_{w_{ijk},d_{ijk},\mathbf{m}_{ik}} \; \mathcal{L}_{i}(w_{ijk},d_{ijk},\mathbf{m}_{ik},\bm{\lambda})  (14a)
s. t. \; \sum_{k\in\mathcal{K}}w_{ijk}=1,\;\forall j\in\mathcal{J}_{i},  (14b)
d_{ijk}\geq\|\mathbf{y}_{j}-\mathbf{m}_{ik}\|_{2}^{2}-M_{j}\cdot(1-w_{ijk}),\;\forall j\in\mathcal{J}_{i},k\in\mathcal{K},  (14c)
w_{ijk}\in\{0,1\},\;d_{ijk}\geq 0,\;\forall j\in\mathcal{J}_{i},k\in\mathcal{K},\quad\mathbf{m}_{ik}\in\mathbb{R}^{n_{\mathbf{y}}}\;\forall k\in\mathcal{K}.  (14d)

Second, the dual function (12) is always concave, regardless of whether the primal problem is convex or not [37]. Therefore the dual problem (13) is a convex optimization problem. However, the dual function is usually nondifferentiable due to a changing set of active individual constraints, which means that problem (13) is a nonsmooth optimization problem [38]. The following subsections present some algorithms for the solution of the dual problem, namely the subgradient method, the bundle trust method, and the quasi-Newton dual ascent algorithm.
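For concreteness, the node subproblem (14) can be set up with Gurobi's Python interface, which is also the MIQCP solver used in the numerical experiments. The sketch below is a minimal illustration under our own assumptions: the function name, the argument layout, and the reshaping of $\bm{\lambda}^{T}\mathbf{A}_{i}$ into one row per centroid are modelling choices, not the authors' implementation.

```python
import gurobipy as gp
from gurobipy import GRB
import numpy as np

def solve_node_subproblem(Y_i, M_i, lam_Ai, K):
    """Minimal sketch of the node subproblem (14) for a single node i.

    Y_i:    (n_i, n_y) local observations of the node.
    M_i:    (n_i,) big-M values for the local observations.
    lam_Ai: (K, n_y) array such that lambda^T A_i m_hat_i equals
            sum_k lam_Ai[k] @ m_ik (our reshaping assumption).
    Returns the optimal objective L_i and the centroids m_ik.
    """
    n_i, n_y = Y_i.shape
    mdl = gp.Model("node_subproblem")
    mdl.Params.OutputFlag = 0
    w = mdl.addVars(n_i, K, vtype=GRB.BINARY, name="w")
    d = mdl.addVars(n_i, K, lb=0.0, name="d")
    m = mdl.addVars(K, n_y, lb=-GRB.INFINITY, name="m")
    # each local observation is assigned to exactly one cluster, cf. (14b)
    mdl.addConstrs((w.sum(j, "*") == 1 for j in range(n_i)), name="assign")
    # big-M epigraph constraints on the squared distances, cf. (14c)
    for j in range(n_i):
        for k in range(K):
            sq_dist = gp.quicksum((Y_i[j, l] - m[k, l]) * (Y_i[j, l] - m[k, l])
                                  for l in range(n_y))
            mdl.addQConstr(d[j, k] >= sq_dist - M_i[j] * (1 - w[j, k]))
    # Lagrangian objective: local distances plus the dual term lambda^T A_i m_hat_i
    dual_term = gp.quicksum(lam_Ai[k, l] * m[k, l]
                            for k in range(K) for l in range(n_y))
    mdl.setObjective(d.sum() + dual_term, GRB.MINIMIZE)
    mdl.optimize()
    centroids = np.array([[m[k, l].X for l in range(n_y)] for k in range(K)])
    return mdl.ObjVal, centroids
```

Summing the returned objective values over all nodes yields the value of the dual function (12) for the current dual variables.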

IV-A Subgradient method

Since the dual function is nondifferentiable, a gradient cannot be defined for every value of the dual variables. Instead, a subgradient can be used. A vector $\bm{\xi}\in\mathbb{R}^{n_{\bm{\chi}}}$ is a subgradient of a concave function $\phi(\bm{\chi})$ at a point $\bm{\chi}_{0}$ if

\phi(\bm{\chi})\leq\phi(\bm{\chi}_{0})+\bm{\xi}^{T}(\bm{\chi}-\bm{\chi}_{0})  (15)

for all $\bm{\chi}\in\operatorname{dom}\,\phi$. The set of all subgradients at a point $\bm{\chi}_{0}$ comprises the subdifferential $\partial\phi(\bm{\chi}_{0})$. Technically, equation (15) defines a supergradient; nevertheless, the term subgradient is commonly used in the literature for both convex and concave functions.
A subgradient of the dual function for a given value of the dual variables $\bm{\lambda}^{(t)}$ can be computed by evaluating the coupling constraints (10),

\mathbf{g}(\bm{\lambda}^{(t)})=\sum_{i\in\mathcal{I}}\mathbf{A}_{i}\hat{\mathbf{m}}_{i}(\bm{\lambda}^{(t)})\in\partial d(\bm{\lambda}^{(t)}),  (16)

where $\hat{\mathbf{m}}_{i}(\bm{\lambda}^{(t)})$ are the cluster centroids obtained by solving the individual clustering problems (14).
In the subgradient method the dual variables are updated in each iteration $t$ along the direction of the subgradient [39],

\bm{\lambda}^{(t+1)}=\bm{\lambda}^{(t)}+\alpha^{(t)}\mathbf{g}(\bm{\lambda}^{(t)}),  (17)

where $\alpha^{(t)}$ is a step size parameter. The step size parameter plays an important role in the convergence of the algorithm. If it is chosen too large, the algorithm might diverge, while a value that is too small might significantly slow down convergence. A common choice to adapt the step size throughout the iterations is

\alpha^{(t)}=\alpha^{(0)}/\sqrt{t},  (18)

with an initial step size $\alpha^{(0)}$ [40].
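A compact sketch of the resulting dual subgradient iteration is given below. The helper for the consensus matrices follows the linear topology of eq. (9); the callable solve_nodes, which returns the stacked centroids $\hat{\mathbf{m}}_{i}(\bm{\lambda})$ of all nodes (e.g., by solving (14) per node), is assumed to be provided, and all names are illustrative.

```python
import numpy as np

def consensus_matrices(N_s, dim):
    """A_i blocks for the linear topology of eq. (9); dim = K * n_y."""
    blocks = []
    for i in range(N_s):
        A_i = np.zeros(((N_s - 1) * dim, dim))
        if i < N_s - 1:                        # +I in block row i
            A_i[i * dim:(i + 1) * dim, :] = np.eye(dim)
        if i > 0:                              # -I in block row i-1
            A_i[(i - 1) * dim:i * dim, :] = -np.eye(dim)
        blocks.append(A_i)
    return blocks

def subgradient_method(solve_nodes, A_blocks, lam0, alpha0=0.5, t_max=150, eps_p=1e-2):
    """Dual subgradient iteration following eqs. (16)-(18)."""
    lam = lam0.copy()
    for t in range(1, t_max + 1):
        m_hat = solve_nodes(lam)                           # list of stacked centroid vectors
        g = sum(A @ m for A, m in zip(A_blocks, m_hat))    # subgradient, eq. (16)
        if np.linalg.norm(g) <= eps_p:                     # consensus reached, cf. eq. (38)
            break
        lam = lam + (alpha0 / np.sqrt(t)) * g              # update (17) with step size (18)
    return lam
```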

IV-B Bundle trust method

The subgradient method usually exhibits a slow rate of convergence, since using only information from the current subgradient may not provide an ascent direction for the algorithm. Bundle methods are generally more efficient, as they utilize multiple subgradients from previous iterations [41]. To this end, the data

\mathcal{B}^{(t)}=\{(\bm{\lambda}^{(l)},\mathbf{g}(\bm{\lambda}^{(l)}),d(\bm{\lambda}^{(l)}))\in\mathbb{R}^{n_{\bm{\lambda}}}\times\mathbb{R}^{n_{\bm{\lambda}}}\times\mathbb{R}\;|\;l=t-\tau+1,\dots,t\}  (19)

is stored in each iteration, where $n_{\bm{\lambda}}$ denotes the number of dual variables. $\mathcal{B}^{(t)}$ is referred to as a bundle and it contains the dual variables, subgradients, and values of the dual function from previous iterations. Since storing all information from all previous iterations might cause memory issues, only data from the previous $\tau$ iterations is used.
The idea of bundle methods is to use the collected information to construct a piece-wise linear over-approximation of the nonsmooth dual function $d(\bm{\lambda})$, a so-called cutting plane model,

\hat{d}^{(t)}(\bm{\lambda})\coloneqq\min_{l\in\{t-\tau+1,\dots,t\}}\{d(\bm{\lambda}^{(l)})+\mathbf{g}^{T}(\bm{\lambda}^{(l)})(\bm{\lambda}-\bm{\lambda}^{(l)})\}.  (20)

The approximation can be written in an equivalent form as

\hat{d}^{(t)}(\bm{\lambda})=\min_{l\in\{t-\tau+1,\dots,t\}}\{d(\bm{\lambda}^{(t)})+\mathbf{g}^{T}(\bm{\lambda}^{(l)})(\bm{\lambda}-\bm{\lambda}^{(t)})-\beta^{(l,t)}\},  (21)

with the linearization error

\beta^{(l,t)}=d(\bm{\lambda}^{(t)})-d(\bm{\lambda}^{(l)})-\mathbf{g}^{T}(\bm{\lambda}^{(l)})(\bm{\lambda}^{(t)}-\bm{\lambda}^{(l)}),\;\forall l\in\{t-\tau+1,\dots,t\}.  (22)

The update direction of the dual variables can then be computed by solving a direction-finding problem

\max_{\mathbf{s}\in\mathbb{R}^{n_{\bm{\lambda}}}} \; \hat{d}^{(t)}(\bm{\lambda}^{(t)}+\mathbf{s}),  (23a)
s. t. \; \|\mathbf{s}\|_{2}^{2}\leq\alpha^{(t)},  (23b)

where constraint (23b) represents a trust region. Therefore, this variant of the bundle method is referred to as the bundle trust method (BTM). Other variants include proximal bundle methods, where the trust region is replaced by a regularization term in the objective function [42]. Problem (23) is still a nonsmooth optimization problem and can be transformed into a smooth quadratic direction finding problem by using an epigraph formulation,

\max_{v\in\mathbb{R},\;\mathbf{s}\in\mathbb{R}^{n_{\bm{\lambda}}}} \; v,  (24a)
s. t. \; \|\mathbf{s}\|_{2}^{2}\leq\alpha^{(t)},  (24b)
\mathbf{g}^{T}(\bm{\lambda}^{(l)})\mathbf{s}-\beta^{(l,t)}\geq v,\;\forall l\in\{t-\tau+1,\dots,t\}.  (24c)

After computing a direction the dual variables are updated according to

\bm{\lambda}^{(t+1)}=\bm{\lambda}^{(t)}+\mathbf{s}^{(t)}.  (25)

Bundle methods are widely used in machine learning, as nonsmoothness is encountered in many training problems involving regularization terms [43]. Bundle methods can also be used to solve the clustering problem (2) [35]. However, note that in this paper the BTM algorithm is used to solve the nonsmooth dual problem (13).
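The direction-finding problem (24) is a small quadratically constrained program and can, for instance, also be solved with Gurobi. The following sketch assumes the bundle is passed as lists of subgradients and linearization errors; the function name and data layout are illustrative choices, not taken from the paper.

```python
import gurobipy as gp
from gurobipy import GRB
import numpy as np

def btm_direction(bundle_g, bundle_beta, alpha):
    """Sketch of the smooth direction-finding problem (24).

    bundle_g:    list of subgradients g(lambda^(l)) as 1-D numpy arrays.
    bundle_beta: matching linearization errors beta^(l,t) from eq. (22).
    alpha:       trust-region parameter alpha^(t).
    Returns the step s^(t) used in the dual update (25).
    """
    n = bundle_g[0].shape[0]
    mdl = gp.Model("btm_direction")
    mdl.Params.OutputFlag = 0
    s = mdl.addVars(n, lb=-GRB.INFINITY, name="s")
    v = mdl.addVar(lb=-GRB.INFINITY, name="v")
    # trust region (24b)
    mdl.addQConstr(gp.quicksum(s[j] * s[j] for j in range(n)) <= alpha)
    # one cutting-plane constraint (24c) per bundle element
    for g, beta in zip(bundle_g, bundle_beta):
        mdl.addConstr(gp.quicksum(g[j] * s[j] for j in range(n)) - beta >= v)
    mdl.setObjective(v, GRB.MAXIMIZE)
    mdl.optimize()
    return np.array([s[j].X for j in range(n)])
```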

IV-C Quasi-Newton dual ascent

Since the dual function is always concave it can be locally approximated by a quadratic function. Yfantis et al. recently proposed the quasi-Newton dual ascent (QNDA) algorithm that approximates the dual function by a quadratic function [44, 38],

d_{B}^{(t)}(\bm{\lambda})=\frac{1}{2}(\bm{\lambda}-\bm{\lambda}^{(t)})^{T}\mathbf{B}^{(t)}(\bm{\lambda}-\bm{\lambda}^{(t)})+\mathbf{g}^{T}(\bm{\lambda}^{(t)})(\bm{\lambda}-\bm{\lambda}^{(t)})+d(\bm{\lambda}^{(t)}).  (26)

This follows the idea of Newton methods, where the gradient and Hessian of the function are used within the approximation. However, due to the nonsmoothness of the dual function, the gradient and Hessian are not defined for every value of the dual variables. Instead, the gradient is replaced in eq. (26) by the subgradient and the Hessian is approximated by the matrix $\mathbf{B}^{(t)}$. The approximated Hessian can be updated in each iteration using a Broyden-Fletcher-Goldfarb-Shanno (BFGS) update,

\mathbf{B}^{(t)}=\mathbf{B}^{(t-1)}+\frac{\mathbf{y}^{(t)}\mathbf{y}^{(t),T}}{\mathbf{y}^{(t),T}\mathbf{s}^{(t)}}-\frac{\mathbf{B}^{(t-1)}\mathbf{s}^{(t)}\mathbf{s}^{(t),T}\mathbf{B}^{(t-1),T}}{\mathbf{s}^{(t),T}\mathbf{B}^{(t-1)}\mathbf{s}^{(t)}},  (27)

where

\mathbf{s}^{(t)}\coloneqq\bm{\lambda}^{(t)}-\bm{\lambda}^{(t-1)}  (28)

is the variation of the dual variables and

\mathbf{y}^{(t)}=\mathbf{g}(\bm{\lambda}^{(t)})-\mathbf{g}(\bm{\lambda}^{(t-1)})  (29)

is the variation of the subgradients.
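A direct NumPy transcription of the update (27) with the variations (28) and (29) could look as follows; the function name and argument order are our own.

```python
import numpy as np

def bfgs_update(B_prev, lam, lam_prev, g, g_prev):
    """BFGS update (27) of the approximated Hessian of the dual function."""
    s = (lam - lam_prev).reshape(-1, 1)   # variation of the dual variables, eq. (28)
    y = (g - g_prev).reshape(-1, 1)       # variation of the subgradients, eq. (29)
    Bs = B_prev @ s
    return (B_prev
            + (y @ y.T) / float(y.T @ s)
            - (Bs @ Bs.T) / float(s.T @ Bs))
```

In practice one would typically guard this update against a vanishing denominator $\mathbf{y}^{(t),T}\mathbf{s}^{(t)}$, e.g. by skipping or damping the update; this safeguard is a standard quasi-Newton practice and not a detail taken from the paper.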
The approximated dual function $d_{B}(\bm{\lambda})$ is differentiable, while the actual dual function is nonsmooth. This can lead to significant approximation errors and poor update directions. This issue can be addressed by utilizing the same information as in the BTM algorithm. However, instead of using the bundle to construct an over-approximator of the dual function, it is used to further constrain the update of the dual variables,

d_{B}^{(t)}(\bm{\lambda}^{(t+1)})\leq d(\bm{\lambda}^{(l)})+\mathbf{g}^{T}(\bm{\lambda}^{(l)})(\bm{\lambda}^{(t+1)}-\bm{\lambda}^{(l)}),\;\forall l\in\{t-\tau+1,\dots,t\}.  (30)

Constraints (30) are derived from the definition of the subgradient (15). A violation of these constraints would indicate that the updated dual variables $\bm{\lambda}^{(t+1)}$ are outside the range of validity of the approximated dual function. These constraints are referred to as bundle cuts and they can be summarized as

\mathcal{BC}^{(t)}=\{\bm{\lambda}\in\mathbb{R}^{n_{\bm{\lambda}}}\;|\;d_{B}^{(t)}(\bm{\lambda})\leq d(\bm{\lambda}^{(l)})+\mathbf{g}^{T}(\bm{\lambda}^{(l)})(\bm{\lambda}-\bm{\lambda}^{(l)}),\;\forall l\in\{t-\tau+1,\dots,t\}\}.  (31)

In the QNDA algorithm, the dual variables are updated in each iteration by solving the optimization problem

\bm{\lambda}^{(t+1)}=\arg\max_{\bm{\lambda}} \; d_{B}^{(t)}(\bm{\lambda}),  (32a)
s. t. \; \|\bm{\lambda}-\bm{\lambda}^{(t)}\|_{2}^{2}\leq\alpha^{(t)},  (32b)
\bm{\lambda}\in\mathcal{BC}^{(t)}.  (32c)

To avoid overly aggressive update steps, the same trust region (32b) as in the BTM algorithm is used.
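As an illustration of the master problem (32), the sketch below maximizes the quadratic model (26) subject to the trust region and the bundle cuts (31) using Gurobi. Since the cuts place the quadratic model on their left-hand side, the resulting quadratic constraints are in general nonconvex, which is why the sketch enables Gurobi's nonconvex option; this is our modelling choice, not a detail from the paper.

```python
import gurobipy as gp
from gurobipy import GRB
import numpy as np

def qnda_step(lam_t, d_t, g_t, B_t, bundle, alpha):
    """Sketch of the QNDA master problem (32).

    lam_t, d_t, g_t: current dual variables, dual value and subgradient.
    B_t:             current Hessian approximation from the BFGS update (27).
    bundle:          list of (lam_l, g_l, d_l) triples defining the cuts (31).
    alpha:           trust-region parameter alpha^(t).
    """
    n = lam_t.shape[0]
    mdl = gp.Model("qnda_master")
    mdl.Params.OutputFlag = 0
    mdl.Params.NonConvex = 2  # bundle cuts are nonconvex quadratic constraints
    lam = mdl.addVars(n, lb=-GRB.INFINITY, name="lam")
    delta = [lam[j] - lam_t[j] for j in range(n)]
    # quadratic model (26) of the dual function around lambda^(t)
    d_B = (0.5 * gp.quicksum(B_t[i, j] * delta[i] * delta[j]
                             for i in range(n) for j in range(n))
           + gp.quicksum(g_t[j] * delta[j] for j in range(n)) + d_t)
    # trust region (32b)
    mdl.addQConstr(gp.quicksum(delta[j] * delta[j] for j in range(n)) <= alpha)
    # bundle cuts (31)/(32c)
    for lam_l, g_l, d_l in bundle:
        cut = d_l + gp.quicksum(g_l[j] * (lam[j] - lam_l[j]) for j in range(n))
        mdl.addQConstr(d_B <= cut)
    mdl.setObjective(d_B, GRB.MAXIMIZE)
    mdl.optimize()
    return np.array([lam[j].X for j in range(n)])
```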

IV-D Primal heuristics

The following sections provide some additional heuristics related to the primal optimization problem (5), namely an averaging heuristic used to obtain feasible primal solutions, and the addition of symmetry-breaking constraints to the clustering problem.

IV-D1 Averaging heuristic

The $K$-means clustering problem involves integrality constraints and is therefore nonconvex. While the (optimal) value of the dual function (12) provides a lower bound on the optimal value of the primal problem (5), the feasibility of the primal problem is not guaranteed upon the convergence of a dual decomposition-based algorithm, i.e., the consensus constraints may not be satisfied. Nevertheless, in the case of $K$-means clustering it is straightforward to compute a feasible primal solution using an averaging step. In each iteration $t$ of a dual decomposition-based algorithm the coordinator communicates the dual variables $\bm{\lambda}^{(t)}$ to the nodes. The nodes in turn solve their clustering problems and communicate their computed cluster centroids $\hat{\mathbf{m}}_{i}(\bm{\lambda}^{(t)})$ to the coordinator. Based on this response the coordinator can compute the average of the primal variables, i.e., the average cluster centroids,

\overline{\mathbf{m}}_{k}(\bm{\lambda}^{(t)})=\frac{1}{N_{s}}\sum_{i\in\mathcal{I}}\mathbf{m}_{ik}(\bm{\lambda}^{(t)}),  (33)

which are then communicated back to the nodes. Using the mean cluster centroids, each node can compute its resulting primal objective value on its local data,

z_{i}(\bm{\lambda}^{(t)})=\sum_{j\in\mathcal{J}_{i}}\min_{k\in\mathcal{K}}\|\mathbf{y}_{j}-\overline{\mathbf{m}}_{k}(\bm{\lambda}^{(t)})\|_{2}^{2}.  (34)

The primal objective value can be used to compute the relative duality gap in each iteration,

\text{rel. DG}=100\cdot\left(1-\frac{d(\bm{\lambda}^{(t)})}{\sum_{i\in\mathcal{I}}z_{i}(\bm{\lambda}^{(t)})}\right).  (35)

Since the value of the dual function provides a lower bound on the optimal primal objective value the relative duality gap can be used to assess the distance of a found solution to the global optimum. The entire communication process between the coordinator and the nodes is illustrated in Fig. 4. Note that the average cluster centroids are only used to compute the duality gap. They do not influence the update of the dual variables.

(a) The coordinator sends the current dual variables to the nodes.
(b) The nodes compute their cluster centroids and send them to the coordinator.
(c) The coordinator computes the average cluster centroids and sends them to the nodes.
(d) The nodes compute their objectives based on the received average centroids.
Figure 4: Communication process between the coordinator and the nodes in iteration $t$.
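A NumPy sketch of the coordinator-side averaging (33) and of the gap computation (34)-(35) is given below; all names are illustrative and each per-node objective is evaluated on that node's local data.

```python
import numpy as np

def average_centroids(node_centroids):
    """Averaging step (33); node_centroids is a list of (K, n_y) arrays."""
    return np.mean(np.stack(node_centroids), axis=0)

def node_objective(Y_i, m_bar):
    """Primal objective (34) of one node for the averaged centroids m_bar."""
    sq_dists = ((Y_i[:, None, :] - m_bar[None, :, :]) ** 2).sum(axis=2)  # (n_i, K)
    return sq_dists.min(axis=1).sum()

def relative_duality_gap(dual_value, node_objectives):
    """Relative duality gap (35) in percent."""
    return 100.0 * (1.0 - dual_value / sum(node_objectives))
```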

IV-D2 Symmetry breaking constraints

The clustering problem (4) is highly symmetric, i.e., it contains multiple solutions with the same objective value. This is because the index assigned to a cluster does not influence the objective function. Fig. 5 illustrates the situation of two symmetric solutions.

Figure 5: Example of symmetric clustering solutions. In the two cases, the data points are assigned to different clusters without affecting the objective function.

This symmetry can lead to problems for the averaging heuristic presented in the previous section, as the computed cluster centroids of a single node can switch from one iteration to the next. For instance, while some points are assigned to cluster $k$ in iteration $t$, they could be assigned to cluster $k^{\prime}$ in iteration $t+1$ by switching the centroids of clusters $k$ and $k^{\prime}$ without affecting the objective.
To prevent this behavior, symmetry breaking constraints are added to the optimization problems of the nodes. In the first iteration, one of the nodes acts as the reference node, providing reference centroids $\overline{\mathbf{m}}_{k}^{\text{ref}}$. In the subsequent iterations the quadratic constraint

\|\mathbf{m}_{ik}-\overline{\mathbf{m}}_{k}^{\text{ref}}\|_{2}^{2}\leq\|\mathbf{m}_{ik^{\prime}}-\overline{\mathbf{m}}_{k}^{\text{ref}}\|_{2}^{2},\;\forall k,k^{\prime}\in\mathcal{K},  (36)

is added to each node $i$. This ensures that cluster $k$ of each node $i$ will be the one closest to the reference centroid $\overline{\mathbf{m}}_{k}^{\text{ref}}$. The choice of the node which provides the reference centroid can be performed arbitrarily, as it does not affect the optimization of the other nodes. Furthermore, the added constraint also does not affect the optimal objective value while rendering all symmetric solutions, except for one, infeasible.
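In a Gurobi model of the node subproblem, the symmetry-breaking constraints (36) could be added along the following lines. The helper assumes the centroid variables are stored in a dict m[k, l]; since the difference of two squared distances is a nonconvex quadratic expression, Gurobi's nonconvex setting may be required depending on the version. This is a sketch of one possible implementation, not the authors' code.

```python
import gurobipy as gp

def add_symmetry_breaking(mdl, m, m_ref, K, n_y):
    """Add the symmetry-breaking constraints (36) for one node.

    m:     gurobipy variables m[k, l] of the node's centroids.
    m_ref: (K, n_y) array of reference centroids of the reference node.
    """
    for k in range(K):
        dist_k = gp.quicksum((m[k, l] - m_ref[k, l]) * (m[k, l] - m_ref[k, l])
                             for l in range(n_y))
        for kp in range(K):
            if kp == k:
                continue
            dist_kp = gp.quicksum((m[kp, l] - m_ref[k, l]) * (m[kp, l] - m_ref[k, l])
                                  for l in range(n_y))
            # centroid k must stay closest to the k-th reference centroid
            mdl.addQConstr(dist_k <= dist_kp)
    mdl.Params.NonConvex = 2  # difference of quadratics is nonconvex (version-dependent)
```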

V Numerical analysis of distributed clustering problems

TABLE I: Parameter settings of the distributed optimization algorithms for the clustering benchmark problems.
Parameter | Value | Description | Algorithms
$\bm{\lambda}^{(0)}$ | $\mathbf{0}$ | initial dual variables | All
$\alpha^{(0)}$ | $0.5$ | initial step size/trust region parameter | All
$t_{\max}$ | $150$ | maximum number of iterations | All
$\epsilon_{p}$ | $10^{-2}$ | primal residual convergence tolerance | All
$\epsilon_{DG}$ | $0.25\;\%$ | relative duality gap tolerance | All
$\tau$ | $50$ | allowed age of data points | BTM, QNDA
$\mathbf{B}^{(0)}$ | $-\mathbf{I}$ | initial approximated Hessian | QNDA

The dual decomposition-based distributed clustering approach was evaluated on a set of benchmark problems of varying sizes. The data for each benchmark problem was generated randomly. First, initial cluster centroids $\mathbf{m}_{k}^{0}$ were generated, with $[\mathbf{m}_{k}^{0}]_{l}\in\mathcal{U}_{c}(-1,1),\;l=1,\dots,n_{\mathbf{y}}$. Then, for each cluster $k$ five random data points were added within a radius of 0.5 from the generated centroid. The parameters of the benchmark problems were varied as follows:

Number of nodes: $N_{s}\in\{2,3,4\}$,
Number of dimensions: $n_{\mathbf{y}}\in\{2,3,4\}$,
Number of clusters: $K\in\{3,4\}$.

Five benchmark problems were generated for each combination of nodes, dimensions, and clusters, resulting in a total of 90 benchmark problems. A benchmark problem is characterized by its number of nodes, the dimension of the data, and the number of clusters. For instance, problem $\text{3N2D4K}_{5}$ is the 5th benchmark problem comprised of 3 nodes with 2-dimensional data sorted into 4 clusters.
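One way to reproduce the described data generation (uniform initial centroids in $[-1,1]^{n_{\mathbf{y}}}$ and five points per cluster within a radius of 0.5) is sketched below with NumPy. The exact sampling scheme inside the radius is not specified in the paper, so the choice of uniform random directions with a random radius is an assumption.

```python
import numpy as np

def generate_benchmark(K, n_y, points_per_cluster=5, radius=0.5, seed=0):
    """Random benchmark data: K centroids plus clustered points around them."""
    rng = np.random.default_rng(seed)
    centroids = rng.uniform(-1.0, 1.0, size=(K, n_y))   # initial centroids m_k^0
    points = []
    for k in range(K):
        # random directions on the unit sphere, scaled by a random radius <= 0.5
        directions = rng.normal(size=(points_per_cluster, n_y))
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)
        radii = rng.uniform(0.0, radius, size=(points_per_cluster, 1))
        points.append(centroids[k] + radii * directions)
    return centroids, np.vstack(points)
```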
The benchmark problems were solved using the subgradient method, the bundle trust method, and the quasi-Newton dual ascent algorithm.

The initial step size (subgradient method)/trust region (BTM, QNDA) parameter was set to $\alpha^{(0)}=0.5$ and varied according to

\alpha^{(t)}=\alpha^{(0)}/\sqrt{t}.  (37)

The size of the bundle for BTM and QNDA was set to $\tau=50$ points. All algorithms were initialized with $\bm{\lambda}^{(0)}=\mathbf{0}$ and the initial approximated Hessian of the QNDA algorithm was set to the negative identity matrix. The algorithms were terminated either when the Euclidean norm of the primal residual

\|\mathbf{w}_{p}\|_{2}=\left\|\sum_{i\in\mathcal{I}}\mathbf{A}_{i}\hat{\mathbf{m}}_{i}\right\|_{2},  (38)

i.e., the violation of the consensus constraints, fell below a threshold of $\epsilon_{p}=10^{-2}$, or when the relative duality gap (35) reached a value of $\epsilon_{DG}=0.25\;\%$. The parameters used for the different algorithms are summarized in Tab. I. The MIQCP clustering problems of all nodes were solved using the commercial solver Gurobi [45] and the total computation time was computed as

T_{\text{comp}}=N_{\text{iter}}\cdot T_{\text{comm}}+\sum_{t=1}^{N_{\text{iter}}}\left(T_{\text{update}}^{(t)}+\max_{i\in\mathcal{I}}\;T_{\text{sub},i}^{(t)}\right),  (39)

where $N_{\text{iter}}$ is the number of required iterations, $T_{\text{comm}}=800\;\text{ms}$ is the required communication time between the coordinator and the subproblems, which is assumed to be constant, $T_{\text{update}}^{(t)}$ is the time required by the coordinator to update the dual variables in iteration $t$, and $T_{\text{sub},i}^{(t)}$ is the solution time for the clustering problem of node $i$ in iteration $t$.
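For completeness, the aggregation (39) amounts to the following small helper; the data layout of the recorded times is our own assumption.

```python
def total_computation_time(T_comm, T_update, T_sub):
    """Total computation time according to eq. (39).

    T_comm:   constant communication time per iteration (e.g. 0.8 s).
    T_update: list with the coordinator's update time per iteration.
    T_sub:    list of lists with the per-node solution times per iteration.
    """
    n_iter = len(T_update)
    return n_iter * T_comm + sum(T_update[t] + max(T_sub[t]) for t in range(n_iter))
```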

TABLE II: Summary of the results for the distributed optimization of the clustering benchmark problems, $\overline{t}$: mean number of iterations until termination, $\overline{\text{rel. DG}}$: mean relative duality gap upon termination (in %), $\overline{T_{\text{comp}}}$: mean computation time (in s).
Algorithm | $\overline{t}$ | $\overline{\text{rel. DG}}$ | $\overline{T_{\text{comp}}}$
SG | 136.75 | 2.27 | 996.28
BTM | 57.44 | 1.86 | 515.77
QNDA | 54.48 | 1.81 | 483.22

The results for the clustering benchmarks are summarized in Tab. II. Out of the examined algorithms, QNDA shows the best performance in terms of the required number of iterations and computation time as well as in terms of the achieved relative duality gap. The BTM algorithm shows similar performance in terms of the number of iterations and the achieved duality gap. However, in the case of distributed clustering, each iteration is costly due to the underlying MIQCP problems. Therefore, a slight performance increase in the number of iterations results in a substantial performance increase in terms of computation times. More detailed results for the clustering benchmarks are summarized in Tab. III in the appendix.

Figure 6: Evolution of the relative duality gap for benchmark problem $\text{2N2D4K}_{3}$.
(a) Node 1, Iteration 1
(b) Node 2, Iteration 1
(c) Node 1, Iteration 4
(d) Node 2, Iteration 4
Figure 7: Exemplary clusters in different iterations of the QNDA algorithm for benchmark problem $\text{2N2D4K}_{3}$.

Fig. 6 shows the evolution of the relative duality gap for benchmark problem $\text{2N2D4K}_{3}$. The subgradient method converges rather slowly. In comparison, the BTM and QNDA algorithms exhibit a faster rate of convergence. Between these two algorithms, BTM exhibits an oscillatory behavior before converging. In contrast, the QNDA algorithm does not exhibit oscillations and therefore converges earlier. Additionally, it should be noted that the QNDA algorithm achieves a relative duality gap of $0\;\%$, i.e., it converges to a proven global optimum.

Fig. 7 further illustrates the results. Fig. 7(a) and 7(b) show the results of the clustering in the first iteration, i.e., the individual global optima. Fig. 7(c) and 7(d) depict the solutions upon convergence of the QNDA algorithm. Each node computes the same cluster centroids corresponding to the globally optimal solution with respect to the entire data set, but not to the individual data sets. It is therefore possible to compute a global model locally in each node while only accessing local data.

VI Comparison to the central solution

Figure 8: Evolution of the relative duality gap of QNDA compared to the relative integrality gap of the central solution using Gurobi for benchmark problem $\text{4N4D4K}_{3}$.

As shown in the previous section, solving the MIQCP clustering problems is computationally expensive. This is due to the weak integer relaxation of problem (2), which means that the solution of the relaxed problem within the branch-and-bound algorithm is far away from the integer solution. This results in slow-moving relative integrality gaps and slow convergence of the solution algorithm. While the main motivation of the distributed clustering approach is the training of a global model without the exchange of local data, it can also be used to efficiently solve larger clustering problems. Fig. 8 depicts the evolution of the relative duality gap of the QNDA algorithm as well as the evolution of the relative integrality gap of Gurobi for the complete data set of benchmark problem $\text{4N4D4K}_{3}$. The clustering problems of the individual nodes were solved sequentially in the case of QNDA, also using Gurobi. While the relative gap of the central solution improves very slowly, the QNDA algorithm quickly converges to a solution close to the global optimum. Note that both relative gaps provide a proven worst-case distance to the global optimum. Hence, decomposing a large clustering problem into smaller subproblems and coordinating the solutions via a distributed optimization algorithm can offer significant performance improvements compared to a central solution.

VII Conclusion

This paper demonstrated how dual decomposition-based distributed optimization can be applied to the solution of clustering problems. The approach ensures privacy, i.e., enables federated learning, as each node only has access to its local data. A global model can still be obtained by coordinating the solutions of the individual clustering problems. Numerical tests on a large set of benchmark problems demonstrated that the QNDA algorithm outperforms the subgradient method and the BTM algorithm. Furthermore, the distributed optimization approach exhibited superior performance compared to a central solution approach. In the future, the developed algorithms can also be applied to other federated learning problems, like the distributed training of support vector machines.

References

  • [1] D. Peteiro-Barral and B. Guijarro-Berdiñas, “A survey of methods for distributed machine learning,” Progress in Artificial Intelligence, vol. 2, pp. 1–11, 2013.
  • [2] P. Forero, A. Cano, and G. Giannakis, “Consensus-based distributed support vector machines.” Journal of Machine Learning Research, vol. 11, no. 5, 2010.
  • [3] ——, “Distributed clustering using wireless sensor networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 4, pp. 707–724, 2011.
  • [4] L. Georgopoulos and M. Hasler, “Distributed machine learning in networks by consensus,” Neurocomputing, vol. 124, pp. 2–12, 2014.
  • [5] K. Tsianos, S. Lawlor, and M. Rabbat, “Consensus-based distributed optimization: Practical issues and applications in large-scale machine learning,” in 50th Annual Allerton Conference on Communication, Control, and Computing.   IEEE, 2012, pp. 1543–1550.
  • [6] A. Nedić, “Distributed gradient methods for convex machine learning problems in networks: Distributed optimization,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 92–101, 2020.
  • [7] J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and J. Rellermeyer, “A survey on distributed machine learning,” ACM Computing Surveys, vol. 53, no. 2, pp. 1–33, 2020.
  • [8] J. Konečnỳ, B. McMahan, D. Ramage, and P. Richtárik, “Federated optimization: Distributed machine learning for on-device intelligence,” arXiv preprint arXiv:1610.02527, 2016.
  • [9] A. Ghosh, J. Chung, D. Yin, and K. Ramchandran, “An efficient framework for clustered federated learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 19 586–19 597, 2020.
  • [10] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics.   PMLR, 2017, pp. 1273–1282.
  • [11] V. Hegiste, T. Legler, and M. Ruskowski, “Application of federated machine learning in manufacturing,” in 2022 International Conference on Industry 4.0 Technology (I4Tech).   IEEE, 2022, pp. 1–8.
  • [12] R. Antunes, C. André da Costa, A. Küderle, I. Yari, and B. Eskofier, “Federated learning for healthcare: Systematic review and architecture proposal,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 13, no. 4, pp. 1–23, 2022.
  • [13] W. Lim, N. Luong, D. Hoang, Y. Jiao, Y.-C. Liang, Q. Yang, D. Niyato, and C. Miao, “Federated learning in mobile edge networks: A comprehensive survey,” IEEE Communications Surveys & Tutorials, vol. 22, no. 3, pp. 2031–2063, 2020.
  • [14] J. Jiang, B. Kantarci, S. Oktug, and T. Soyata, “Federated learning in smart city sensing: Challenges and opportunities,” Sensors, vol. 20, no. 21, p. 6230, 2020.
  • [15] L. Li, Y. Fan, M. Tse, and K.-Y. Lin, “A review of applications in federated learning,” Computers & Industrial Engineering, vol. 149, p. 106854, 2020.
  • [16] J. Liu, J. Huang, Y. Zhou, X. Li, S. Ji, H. Xiong, and D. Dou, “From distributed machine learning to federated learning: A survey,” Knowledge and Information Systems, vol. 64, no. 4, pp. 885–917, 2022.
  • [17] M. Chamikara, P. Bertok, I. Khalil, D. Liu, and S. Camtepe, “Privacy preserving distributed machine learning with federated learning,” Computer Communications, vol. 171, pp. 112–125, 2021.
  • [18] V. Kulkarni, M. Kulkarni, and A. Pant, “Survey of personalization techniques for federated learning,” in 4th World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4).   IEEE, 2020, pp. 794–797.
  • [19] A. Tan, H. Yu, L. Cui, and Q. Yang, “Towards personalized federated learning,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • [20] C. Gambella, B. Ghaddar, and J. Naoum-Sawaya, “Optimization problems for machine learning: A survey,” European Journal of Operational Research, vol. 290, no. 3, pp. 807–828, 2021.
  • [21] N. Dhanachandra, K. Manglem, and Y. Chanu, “Image segmentation using k-means clustering algorithm and subtractive clustering algorithm,” Procedia Computer Science, vol. 54, pp. 764–771, 2015.
  • [22] T. Kansal, S. Bahuguna, V. Singh, and T. Choudhury, “Customer segmentation using k-means clustering,” in International Conference on Computational Techniques, Electronics and Mechanical Systems (CTEMS).   IEEE, 2018, pp. 135–139.
  • [23] K. Rahimi-Adli, P. Schiermoch, B. Beisheim, S. Wenzel, and S. Engell, “A model identification approach for the evaluation of plant efficiency,” in Computer Aided Chemical Engineering.   Elsevier, 2019, vol. 46, pp. 913–918.
  • [24] D. K. Dennis, T. Li, and V. Smith, “Heterogeneity for the win: One-shot federated clustering,” in International Conference on Machine Learning.   PMLR, 2021, pp. 2611–2620.
  • [25] S. Lloyd, “Least squares quantization in pcm,” IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982.
  • [26] H. Kumar, V. Karthik, and M. Nair, “Federated k-means clustering: A novel edge ai based approach for privacy preservation,” in 2020 IEEE International Conference on Cloud Computing in Emerging Markets (CCEM).   IEEE, 2020, pp. 52–56.
  • [27] S. Li, S. Hou, B. Buyukates, and S. Avestimehr, “Secure federated clustering,” arXiv, 2022. [Online]. Available: https://arxiv.org/abs/2205.15564
  • [28] M. Stallmann and A. Wilbik, “Towards federated clustering: A federated fuzzy $c$-means algorithm (FFCM),” arXiv, 2022. [Online]. Available: https://arxiv.org/abs/2201.07316
  • [29] W. Pedrycz, “Federated FCM: clustering under privacy requirements,” IEEE Transactions on Fuzzy Systems, vol. 30, no. 8, pp. 3384–3388, 2021.
  • [30] Y. Wang, M. Jia, N. Gao, L. Von Krannichfeldt, M. Sun, and G. Hug, “Federated clustering for electricity consumption pattern extraction,” IEEE Transactions on Smart Grid, vol. 13, no. 3, pp. 2425–2439, 2022.
  • [31] T. Achterberg and R. Wunderling, “Mixed integer programming: Analyzing 12 years of progress,” Facets of Combinatorial Optimization: Festschrift for Martin Grötschel, pp. 449–481, 2013.
  • [32] T. Koch, T. Berthold, J. Pedersen, and C. Vanaret, “Progress in mathematical programming solvers from 2001 to 2020,” EURO Journal on Computational Optimization, vol. 10, p. 100031, 2022.
  • [33] D. Aloise, P. Hansen, and L. Liberti, “An improved column generation algorithm for minimum sum-of-squares clustering,” Mathematical Programming, vol. 131, pp. 195–220, 2012.
  • [34] A. Bagirov and J. Yearwood, “A new nonsmooth optimization algorithm for minimum sum-of-squares clustering problems,” European Journal of Operational Research, vol. 170, no. 2, pp. 578–596, 2006.
  • [35] N. Karmitsa, A. Bagirov, and S. Taheri, “New diagonal bundle method for clustering problems in large data sets,” European Journal of Operational Research, vol. 263, no. 2, pp. 367–379, 2017.
  • [36] H. Everett, “Generalized Lagrange multiplier method for solving problems of optimum allocation of resources,” Operations Research, no. 11 (3), pp. 399–417, 1963.
  • [37] J. Nocedal and S. Wright, Numerical optimization.   Springer Science & Business Media, 2006.
  • [38] V. Yfantis, S. Wenzel, A. Wagner, M. Ruskowski, and S. Engell, “Hierarchical distributed optimization of constraint-coupled convex and mixed-integer programs using approximations of the dual function,” EURO Journal on Computational Optimization, vol. 11, p. 100058, 2023.
  • [39] N. Shor, Minimization methods for non-differentiable functions.   Springer Science & Business Media, 2012, vol. 3.
  • [40] D. P. Bertsekas, Nonlinear programming.   Athena Scientific, 1999.
  • [41] M. Mäkelä, “Survey of bundle methods for nonsmooth optimization,” Optimization Methods and Software, vol. 17, no. 1, pp. 1–29, 2002.
  • [42] A. Bagirov, N. Karmitsa, and M. Mäkelä, Introduction to Nonsmooth Optimization: Theory, Practice and Software.   Springer, 2014.
  • [43] Q. Le, A. Smola, and S. Vishwanathan, “Bundle methods for machine learning,” Advances in Neural Information Processing Systems, vol. 20, 2007.
  • [44] V. Yfantis and M. Ruskowski, “A hierarchical dual decomposition-based distributed optimization algorithm combining quasi-Newton steps and bundle methods,” in 30th Mediterranean Conference on Control and Automation (MED).   IEEE, 2022, pp. 31–36.
  • [45] Gurobi Optimization, LLC, “Gurobi Optimizer Reference Manual,” 2023. [Online]. Available: https://www.gurobi.com

Appendix A: Benchmark Problems

All benchmark problems can be found under: https://github.com/VaYf/Clustering-Benchmark-Problems

TABLE III: Results for the distributed optimization of the clustering benchmark problems, $\overline{t}$: mean number of performed iterations, $\overline{\text{rel. DG}}$: mean relative duality gap (in %), $\overline{T_{\text{comp}}}$: mean computation time (in s).
Clustering | SG $\overline{t}$ | SG $\overline{\text{rel. DG}}$ | SG $\overline{T_{\text{comp}}}$ | BTM $\overline{t}$ | BTM $\overline{\text{rel. DG}}$ | BTM $\overline{T_{\text{comp}}}$ | QNDA $\overline{t}$ | QNDA $\overline{\text{rel. DG}}$ | QNDA $\overline{T_{\text{comp}}}$
Mean | 136.75 | 2.27 | 996.28 | 57.44 | 1.86 | 515.77 | 54.48 | 1.81 | 483.22
2N2D3K | 126.0 | 1.94 | 166.08 | 68.0 | 1.84 | 95.01 | 63.2 | 1.82 | 92.57
2N2D4K | 113.2 | 0.89 | 431.02 | 64.4 | 0.8 | 348.56 | 62.2 | 0.73 | 345.59
2N3D3K | 120.0 | 1.8 | 223.69 | 71.6 | 1.6 | 167.72 | 62.2 | 1.58 | 150.11
2N3D4K | 115.6 | 0.27 | 782.18 | 13.6 | 0.1 | 93.92 | 5.0 | 0.06 | 34.76
2N4D3K | 91.6 | 0.31 | 184.21 | 36.2 | 0.16 | 108.0 | 32.8 | 0.12 | 97.89
2N4D4K | 90.4 | 0.25 | 965.26 | 8.6 | 0.08 | 93.7 | 4.8 | 0.08 | 47.19
3N2D3K | 138.4 | 7.01 | 404.59 | 123.2 | 6.14 | 424.47 | 121.0 | 6.21 | 412.26
3N2D4K | 150.0 | 4.7 | 879.48 | 62.8 | 3.99 | 751.33 | 64.2 | 3.98 | 747.3
3N3D3K | 150.0 | 2.33 | 301.63 | 67.0 | 1.76 | 160.05 | 64.0 | 1.71 | 151.79
3N3D4K | 150.0 | 0.82 | 1469.97 | 66.6 | 0.35 | 906.26 | 63.8 | 0.28 | 858.76
3N4D3K | 150.0 | 2.07 | 354.48 | 37.2 | 1.63 | 110.82 | 35.0 | 1.36 | 111.44
3N4D4K | 150.0 | 1.06 | 3295.09 | 37.8 | 0.56 | 820.42 | 37.2 | 0.46 | 731.33
4N2D3K | 150.0 | 5.01 | 311.78 | 103.2 | 3.66 | 262.33 | 94.6 | 3.61 | 249.28
4N2D4K | 150.0 | 7.58 | 1319.14 | 93.8 | 5.71 | 1346.59 | 93.8 | 5.68 | 1281.33
4N3D3K | 150.0 | 1.32 | 317.69 | 6.6 | 0.15 | 16.75 | 9.4 | 0.17 | 22.76
4N3D4K | 150.0 | 1.55 | 2786.47 | 65.8 | 0.53 | 2046.2 | 66.0 | 0.49 | 1901.69
4N4D3K | 150.0 | 1.57 | 441.69 | 35.0 | 0.42 | 122.26 | 37.4 | 0.41 | 129.74
4N4D4K | 150.0 | 1.7 | 2593.41 | 7.2 | 0.14 | 118.24 | 9.8 | 0.17 | 178.24