A Double Tracking Method for Optimization with Decentralized Generalized Orthogonality Constraints
Abstract
In this paper, we consider decentralized optimization problems with generalized orthogonality constraints, where both the objective function and the constraint exhibit a distributed structure. Such optimization problems, albeit ubiquitous in practical applications, remain unsolvable by existing algorithms in the presence of distributed constraints. To address this issue, we convert the original problem into an unconstrained penalty model by resorting to the recently proposed constraint-dissolving operator. However, this transformation compromises the essential property of separability in the resulting penalty function, rendering existing algorithms inapplicable. We overcome this difficulty by introducing a novel algorithm that tracks the gradient of the objective function and the Jacobian of the constraint mapping simultaneously. A global convergence guarantee is rigorously established, together with an iteration complexity. To substantiate the effectiveness and efficiency of our proposed algorithm, we present numerical results on both synthetic and real-world datasets.
1 Introduction
Rapid advances in data collection and processing capabilities have paved the way for the utilization of distributed systems in a large number of practical applications, such as dictionary learning [28], statistical inference [16], multi-agent control [15], and neural network training [42, 44]. The large scale and spatial/temporal disparity of data, coupled with the limitations in storage and computational resources, make centralized approaches infeasible or inefficient. Consequently, decentralized algorithms have been developed to solve an optimization problem through the collaboration of the agents, where the need to efficiently manage and process vast amounts of distributed data is paramount.
Given a distributed system of agents connected by a communication network, the focus of this paper is on the following decentralized optimization problem with the generalized orthogonality constraint:
$$\min_{X \in \mathbb{R}^{n \times p}} \ \sum_{i=1}^{d} f_i(X) \tag{1.1a}$$
$$\mathrm{s.\,t.}\ \ \sum_{i=1}^{d} X^{\top} B_i X = I_p, \tag{1.1b}$$
where is a continuously differentiable local function, is a symmetric matrix, and denotes the identity matrix. Moreover, for each , both and are privately owned by agent . For convenience, we denote , which is assumed to be positive definite throughout this paper. The feasible region of problem (1.1), denoted by , is an embedded submanifold of and is commonly referred to as the generalized Stiefel manifold [1]. It is noteworthy that, different from classical decentralized optimization problems, the constraint (1.1b) also has a distributed structure across the agents, which poses considerable challenges in solving (1.1) under the decentralized setting. Throughout this paper, we make the following blanket assumption.
Assumption 1.
Each is continuously differentiable, and its gradient is locally Lipschitz continuous in .
Problems of the form (1.1) arise widely in practical applications. Below, we briefly introduce an important application of (1.1) in statistics.
Canonical Correlation Analysis (CCA)
CCA is a fundamental and ubiquitous statistical tool that characterizes linear relationships between two sets of variables [19]. Let and be two datasets in the form of matrices, where and are the dimensions of the two datasets respectively, and is the number of samples in each of the two datasets. CCA aims to identify linear combinations of variables within each dataset to maximize their correlations. Mathematically, CCA can be equivalently formulated as the following optimization problem [17],
(1.2) | ||||
where and are two matrices generated by and , and denotes the trace of a given square matrix. Specifically, and have the following form,
respectively.
In this paper, we consider the distributed setting in which the samples contained in and are stored locally in locations, possibly having been collected and owned by different agents. Suppose each agent possesses samples and . Let and denote the local data of agent . Then the data matrices and can be divided into blocks respectively, namely, and . Under the aforementioned distributed setting, the optimization model (1.2) of CCA can be recast in the following form:
(1.3) | ||||
where and are given by
respectively. There are other optimization problems in statistics with structures similar to CCA. Interested readers can refer to the references [12, 17] for further details.
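Since the display equations above were lost in extraction, the following sketch assembles the matrices of the standard trace formulation of CCA, under the common convention that the cross-covariance occupies the off-diagonal blocks of one matrix and the auto-covariances the diagonal blocks of the other. The function name `cca_matrices` and the unnormalized covariance scaling are illustrative assumptions, not necessarily the paper's exact choices.

```python
import numpy as np

def cca_matrices(Xd, Yd):
    """Build the matrix pair of the trace formulation of CCA.

    With the two projections stacked into one variable Z, CCA is commonly
    written as  max tr(Z^T A Z)  s.t.  Z^T B Z = I_p,  where A carries the
    cross-covariance off-diagonal and B the auto-covariances on the
    diagonal (a standard reformulation; scaling may differ in the paper).
    """
    m, n = Xd.shape[0], Yd.shape[0]
    Sxy = Xd @ Yd.T                      # cross-covariance (unnormalized)
    A = np.zeros((m + n, m + n))
    A[:m, m:] = Sxy
    A[m:, :m] = Sxy.T
    B = np.zeros((m + n, m + n))
    B[:m, :m] = Xd @ Xd.T                # auto-covariance of first view
    B[m:, m:] = Yd @ Yd.T                # auto-covariance of second view
    return A, B
```

In the distributed setting of (1.3), each agent would form the same pair from its local samples, so that the global matrices are the sums of the local ones.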
1.1 Related Works
Decentralized optimization has experienced significant advancements in recent decades, particularly in the Euclidean space. Various algorithms have been proposed to tackle different types of problems, such as gradient-based algorithms [24, 41, 27, 31], primal-dual frameworks [29, 22, 8, 18], and second-order methods [5, 43, 13]. In general, these algorithms are only capable of handling scenarios where variables are restricted to a convex subset of the Euclidean space. Consequently, these algorithms cannot be directly applied to solve (1.1), where the feasible region is typically non-convex. Interested readers can refer to some recent surveys [25, 9] for more comprehensive information.
In order to solve decentralized optimization problems on manifolds, many algorithms have adopted geometric tools derived from Riemannian optimization, including tangent spaces and retraction operators. For instance, the algorithms delineated in [11, 14] extend the Riemannian gradient descent method [1] to the decentralized setting, which can be combined with gradient tracking techniques [32] to achieve exact convergence. Building on this foundation, Chen et al. [10] further introduce the decentralized Riemannian conjugate gradient method. Additionally, there are several algorithms devised to address nonsmooth optimization problems, such as the subgradient algorithm [34] and the proximal gradient algorithm [38]. It is crucial to underscore that the computation of tangent spaces and retraction operators requires complete information about the matrix , which is unattainable in a decentralized environment. This inherent limitation hinders the application of the aforementioned algorithms to optimization problems with decentralized generalized orthogonality constraints.
There is another class of methodologies [35, 37, 36, 33] that focuses on employing infeasible approaches, such as augmented Lagrangian methods, to tackle nonconvex manifold constraints. These algorithms do not require each iterate to strictly adhere to manifold constraints, thereby allowing the pursuit of global consensus directly in the Euclidean space. Compared to Riemannian optimization methods, this type of algorithm requires only a single round of communication per iteration to guarantee global convergence. However, it is noteworthy that these algorithms are tailored for Stiefel manifolds and cannot handle scenarios where the constraints themselves exhibit a distributed structure. Although we can draw inspiration from these algorithms to tackle the problem (1.1), the resulting penalty model remains challenging to solve. On the one hand, the penalty function usually forfeits the distributed structure inherent in the constraint, and is no longer separable across the agents. On the other hand, without full knowledge of the matrix , each agent cannot independently compute its own local gradients. Consequently, it is impossible to straightforwardly extend existing algorithms to solve the problem (1.1).
1.2 Contributions
In decentralized optimization, existing research mainly focuses on scenarios where the objective function exhibits a distributed structure. In problem (1.1), however, the generalized orthogonality constraint also exhibits such a structure, which makes the problem substantially more difficult to solve.
To develop a decentralized algorithm for the problem (1.1), we employ the constraint dissolving operator introduced in [40] to construct an exact penalty model, which cannot be solved by existing algorithms. To efficiently minimize our proposed penalty model, we devise an approximate direction for the gradient of the penalty function, which is composed of separable components, including the gradient of the objective function and the Jacobian of the constraint mapping. Then, we propose to track these two components simultaneously across the network to assemble them into the approximate direction. This double-tracking strategy is quite efficient for reaching consensus on the generalized Stiefel manifold.
Based on the aforementioned techniques, we develop a novel constraint dissolving algorithm with double tracking (CDADT). To the best of our knowledge, this is the first algorithm capable of solving optimization problems with decentralized generalized orthogonality constraints. Under rather mild conditions, we establish the global convergence of CDADT and provide its iteration complexity. Preliminary numerical results demonstrate the great potential of CDADT.
1.3 Organization
The rest of this paper is organized as follows. Section 2 presents some preliminaries related to the topic of this paper. In Section 3, we develop a constraint dissolving algorithm with double tracking to solve the problem (1.1). The convergence properties of the proposed algorithm are investigated in Section 4. Numerical results are presented in Section 5 to evaluate the performance of our algorithm. Finally, Section 6 concludes the paper with closing remarks and key insights.
2 Preliminaries
In this section, we first present fundamental notations and network settings considered in this paper. Following this, we revisit the first-order stationarity condition of (1.1) and introduce the concepts of Kurdyka-Łojasiewicz (KŁ) property and constraint dissolving operator.
2.1 Notations
The following notations are adopted throughout this paper. The Euclidean inner product of two matrices with the same size is defined as , where stands for the trace of a square matrix . The identity matrix is represented by . We define the symmetric part of a square matrix as . The Frobenius norm and 2-norm of a given matrix are denoted by and , respectively. The -th entry of a matrix is represented by . The notations and stand for the -dimensional vector of all ones and all zeros, respectively. The notations and represent the smallest and largest singular value of a matrix , respectively. The Kronecker product is denoted by . For any , we denote . We define the distance between a point and a set by . Given a differentiable function , the Euclidean gradient of with respect to is represented by . Further notations will be introduced wherever they occur.
2.2 Network Setting
We assume that the agents are connected by a communication network and can only exchange information with their immediate neighbors. The network captures the communication links diffusing information among the agents. Here, is composed of all the agents and represents the set of communication links. Throughout this paper, we make the following assumptions on the network.
Assumption 2.
The communication network is connected. Furthermore, there exists a mixing matrix associated with satisfying the following conditions.
- (i) is symmetric and nonnegative.
- (ii) .
- (iii) if and , and otherwise.
The conditions in Assumption 2 follow from standard assumptions in the literature [39, 25] and are dictated by the underlying network topology. According to the Perron-Frobenius Theorem [26], the eigenvalues of fall within the range , and hence,
$$\lambda := \Big\| W - \tfrac{1}{d}\,\mathbf{1}_d \mathbf{1}_d^{\top} \Big\|_2 < 1. \tag{2.1}$$
The parameter serves as a key indicator of the network connectivity and is instrumental in the analysis of decentralized methods. Generally speaking, the closer approaches , the poorer the network connectivity becomes.
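As a small illustration of this connectivity parameter, the sketch below computes the spectral quantity in (2.1) for a symmetric, doubly stochastic mixing matrix; the function name `connectivity_parameter` is illustrative.

```python
import numpy as np

def connectivity_parameter(W):
    """Return the spectral norm of W minus the averaging matrix, i.e.
    the second-largest eigenvalue magnitude of a symmetric, doubly
    stochastic mixing matrix W over d agents."""
    d = W.shape[0]
    return np.linalg.norm(W - np.ones((d, d)) / d, 2)
```

For a complete graph with uniform weights the parameter is zero (perfect mixing in one round), whereas a disconnected or non-mixing matrix such as the identity attains the worst value one.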
2.3 Stationarity
In this subsection, we delve into the first-order stationarity condition of the problem (1.1). Towards this end, we introduce some geometric concepts of Riemannian manifolds. For each point , the tangent space to at is referred to as . In this paper, we consider the Riemannian metric on that is induced from the inner product, i.e., . The corresponding Riemannian gradient of a smooth function is given by
which is nothing but the projection of onto under the metric . Finally, the first-order stationarity condition of the problem (1.1) can be stated as follows.
Definition 2.1.
A point is called a first-order stationary point of the problem (1.1) if it satisfies the following condition,
Since this paper focuses on infeasible algorithms for the problem (1.1), we introduce the following definition of -stationary point.
Definition 2.2.
A point is called a first-order -stationary point of the problem (1.1) if it satisfies the following condition,
where is the projection operator onto the generalized Stiefel manifold .
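To make the feasibility notion concrete, the following sketch maps a point near the manifold back onto the generalized Stiefel manifold via the B-polar factor. This is a standard feasibility-restoration map; whether it coincides with the exact projection operator used in Definition 2.2 is not claimed, and the function name is illustrative.

```python
import numpy as np

def to_generalized_stiefel(X, B):
    """Map X onto {X : X^T B X = I} via  X (X^T B X)^{-1/2}.

    Assumes X^T B X is symmetric positive definite (true whenever X has
    full column rank and B is positive definite). Near the manifold this
    map agrees with the projection up to higher-order terms."""
    M = X.T @ B @ X
    w, Q = np.linalg.eigh(M)                      # spectral decomposition
    M_inv_sqrt = Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T
    return X @ M_inv_sqrt
```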
2.4 Kurdyka-Łojasiewicz Property
A part of the convergence results developed in this paper falls in the scope of a general class of functions that satisfy the Kurdyka-Łojasiewicz (KŁ) property [23, 20]. Below, we introduce the basic elements to be used in the subsequent theoretical analysis.
For any , we denote by the class of all concave and continuous functions which satisfy the following conditions,
- (i) ;
- (ii) is continuously differentiable on and continuous at ;
- (iii) for any .
Now we define the KŁ property.
Definition 2.3.
Let be a proper and lower semicontinuous function and be the (limiting) subdifferential of .
- (i) The function is said to satisfy the KŁ property at if there exists a constant , a neighborhood of and a function , such that for any satisfying , the following KŁ inequality holds,
The function is called a desingularizing function of at .
- (ii) We say is a KŁ function if satisfies the KŁ property at each point of .
KŁ functions are ubiquitous in practical applications, covering a wealth of nonconvex nonsmooth functions. For example, tame functions constitute a wide class of KŁ functions, including semialgebraic and real subanalytic functions. We refer interested readers to [6, 2, 3, 4, 7] for more details.
2.5 Constraint Dissolving Operator
The constraint dissolving operator proposed in [40] offers a powerful technique to handle manifold constraints, which can be leveraged to construct an unconstrained penalty model for Riemannian optimization problems. Specifically, a constraint dissolving operator of the generalized orthogonality constraint (1.1b) is given by
Then solving the problem (1.1) can be converted into the unconstrained minimization of the following penalty function,
(2.2) |
where is a penalty parameter.
Xiao et al. [40] have proved that (1.1) and (2.2) share the same first-order stationary points, second-order stationary points, and local minimizers in a neighborhood of . However, it is intractable to solve the problem (2.2) under the decentralized setting. It is worth noting that, in the construction of the penalty function , the constraint (1.1b) is integrated into the original objective function and the quadratic penalty term, resulting in the loss of the separable structure. To the best of our knowledge, there are currently no algorithms equipped to tackle such problems. Therefore, in the next section, we will design a novel algorithm to solve the penalty model under the decentralized setting.
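As a concrete sketch of this construction, the snippet below implements a constraint dissolving operator of the form A(X) = X(3I − XᵀBX)/2, which is the form used in [40] for orthogonality constraints, together with a quadratic penalty. The penalty coefficient β/4 and the function names are assumptions for illustration; the exact operator and scaling in the paper's display (2.2) may differ.

```python
import numpy as np

def cdo(X, B):
    """Constraint dissolving operator for X^T B X = I_p, in the form
    A(X) = X (3 I - X^T B X) / 2 (the form of [40] for orthogonality
    constraints; the paper's exact operator may differ slightly).
    Key property: A(X) = X whenever X is feasible."""
    p = X.shape[1]
    return X @ (3.0 * np.eye(p) - X.T @ B @ X) / 2.0

def penalty(f, X, B, beta):
    """Penalty function h(X) = f(A(X)) + (beta/4) ||X^T B X - I||_F^2
    (penalty scaling assumed). On the manifold, h(X) = f(X)."""
    p = X.shape[1]
    viol = X.T @ B @ X - np.eye(p)
    return f(cdo(X, B)) + 0.25 * beta * np.linalg.norm(viol, 'fro') ** 2
```

On a feasible point the operator acts as the identity and the penalty term vanishes, which is the mechanism behind the equivalence of stationary points mentioned above.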
3 Algorithm Development
The purpose of this section is to develop an efficient decentralized algorithm to solve the penalty model (2.2). We first construct an approximate direction for the gradient of the penalty function, which is easier to evaluate under the decentralized setting. Then, we propose a double-tracking strategy to assemble the approximate direction across the whole network. The resulting algorithm is capable of reaching a consensus on the generalized Stiefel manifold.
3.1 Gradient Approximation
Under the conditions in Assumption 1, the penalty function in (2.2) is continuously differentiable, the gradient of which takes the following form:
Therefore, each agent cannot compute the local gradient individually, since the evaluation of requires access to over all the agents. In light of the property that whenever is not far away from , we propose to approximate the local gradient by . Hereafter, is denoted by for simplicity. As a result, we can obtain the following approximation of :
(3.1) |
where
and
The approximate direction of possesses the following desirable properties.
Lemma 3.1.
Let be a bounded region and be a positive constant. Then, if , we have
for any .
Proof.
For any , we have the inequalities and , and hence,
Then from the formulation of , it holds that
which implies that
Now it can be readily verified that
(3.2) | ||||
where the last inequality follows from the condition .
Since it holds that , we know that is positive definite and . Then straightforward calculations give rise to that
which further implies that
According to the local Lipschitz continuity of , there exists a constant such that
Moreover, we have
Combining the above relationship with (3.2) yields that
where the last inequality follows from the condition . The proof is completed. ∎
3.2 Double Tracking Strategy
In the construction of , each agent is able to compute the local gradient independently. However, the evaluation of still cannot be distributed across the agents since it is not separable. Nevertheless, we find that the components constituting exhibit a separable structure as follows,
(3.3) |
where
This observation inspires us to propose a double-tracking strategy. Specifically, we first track and separately across the whole network by resorting to the dynamic average consensus protocol [45]. Then, these two components are combined to assemble a global estimate of . It is worth mentioning that is exactly the Jacobian of the constraint mapping.
3.3 Algorithm Description
In this subsection, we describe the proposed algorithm to solve the problem (2.2). Hereafter, the notation represents the -th iterate of . Our algorithm introduces three auxiliary local variables , , and for each agent at the -th iteration. Specifically, and track and respectively, through the exchange of local information. In addition, aims at estimating the search direction based on the formulation (3.3). The key steps of our algorithm from the perspective of each agent are outlined below.
Step 1: Computing Search Direction.
We first compute an approximate search direction based on (3.3) as follows:
(3.4) | ||||
Step 2: Mixing Local Information.
To ensure that the local estimates ’s asymptotically converge to a common value, we leverage the following consensus protocol. Given the search directions ’s in the previous step, we update the local variable by
(3.5) |
where is a stepsize. The above procedure can be realized in a distributed manner.
Step 3: Tracking Gradient and Jacobian.
Finally, to guarantee that each and track the average of and respectively, we leverage the dynamic average consensus [45] technique. The resulting gradient and Jacobian tracking schemes read as follows,
(3.6) | ||||
(3.7) |
with and .
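The three steps above can be sketched as a generic double-tracking loop: each agent mixes its iterate with neighbors, while two trackers follow the network averages of the local gradients and of the local Jacobian components via dynamic average consensus. All names below are illustrative, and `direction` stands in for the assembly rule (3.4), whose exact formula is not reproduced here.

```python
import numpy as np

def cdadt_sketch(W, grads, jacs, direction, X0, alpha, iters):
    """Schematic double-tracking iteration (Steps 1-3).

    W         : d-by-d mixing matrix satisfying Assumption 2.
    grads[i]  : local gradient map of agent i.
    jacs[i]   : local Jacobian-component map of agent i.
    direction : assembles the search direction from (x_i, g_i, J_i).
    """
    d = len(grads)
    X = [X0.copy() for _ in range(d)]
    G = [grads[i](X[i]) for i in range(d)]   # gradient trackers
    J = [jacs[i](X[i]) for i in range(d)]    # Jacobian trackers
    for _ in range(iters):
        # Step 1: local search directions
        D = [direction(X[i], G[i], J[i]) for i in range(d)]
        # Step 2: mix with neighbors and take a step
        Xn = [sum(W[i, j] * X[j] for j in range(d)) - alpha * D[i]
              for i in range(d)]
        # Step 3: dynamic average consensus updates of both trackers
        G = [sum(W[i, j] * G[j] for j in range(d))
             + grads[i](Xn[i]) - grads[i](X[i]) for i in range(d)]
        J = [sum(W[i, j] * J[j] for j in range(d))
             + jacs[i](Xn[i]) - jacs[i](X[i]) for i in range(d)]
        X = Xn
    return X
```

On a simple separable quadratic with the Jacobian tracker switched off, this loop reduces to standard gradient tracking and drives all agents to the common minimizer.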
Then based on the aforementioned steps, we formally present the detailed algorithmic framework in Algorithm 1, named constraint dissolving algorithm with double tracking and abbreviated to CDADT.
In the rest of this subsection, we exhibit the compact form of Algorithm 1. For the sake of convenience, we denote , , , and . It can be readily verified that . The following notations are also used in the sequel.
- , , .
- , , .
- , , .
- , , .
- , , .
- , , .
By the formulation of the above notations, the main iteration loop of Algorithm 1 can be summarized in the following compact form.
Moreover, it is not difficult to check that, for any , the following relationships hold.
(3.8) |
4 Convergence Analysis
In this section, we present the convergence analysis of Algorithm 1. The global convergence guarantee is rigorously established under rather mild conditions, together with an iteration complexity.
4.1 Consensus and Tracking Errors
This subsection is devoted to building the upper bound of consensus errors and tracking errors. We start from the consensus error in the following lemma.
Lemma 4.1.
Proof.
By the update scheme in (3.5) and straightforward calculations, we can attain that
where is a constant. This completes the proof since and . ∎
Next, we proceed to bound the tracking errors and . To facilitate the narrative, we define the bounded region , where is a constant with .
Lemma 4.2.
Proof.
To begin with, straightforward manipulations lead to that
where is a constant. According to the local Lipschitz continuity of , it follows that
Moreover, it can be readily verified that
(4.1) |
which implies that
Combining the above three relationships, we can obtain the bound for . Furthermore, the bound for can be obtained by using the same technique, hence its proof is omitted for simplicity. ∎
The following lemma demonstrates that and are estimates of and for each agent , respectively.
Lemma 4.3.
Proof.
According to the local Lipschitz continuity of , it follows that
which further yields that
Hence, we can conclude that the first assertion of this lemma holds. The second assertion can be proved by using a similar argument. ∎
We conclude this subsection by showing that is an approximation of with the approximation error controlled by the consensus and tracking errors. For convenience, we denote two constants and to be used in the following lemma.
Lemma 4.4.
Proof.
To begin with, we have
By straightforward calculations, we can obtain that
And it can be readily verified that
Moreover, we have
Combining the above three relationships, we can acquire that
where , , , , and are five positive constants. For convenience, we further denote two constants and . Then according to Lemma 4.3, it follows that
(4.2) | ||||
It remains to show that
Combining the above two relationships, we complete the proof. ∎
4.2 Boundedness of Iterates
In this subsection, we aim to show that the iterate sequence generated by Algorithm 1 is restricted in a neighborhood of the feasible region. Moreover, the average of local variables is always restricted in the bounded region , which guarantees the usage of Lemma 3.1.
We first prove the following technical lemma.
Lemma 4.5.
Proof.
It follows from the relationship (3.8) that
where and satisfies
Since , we have and , and hence,
(4.3) | ||||
By virtue of Young's inequality, we can obtain that
where the last inequality results from the condition that . Moreover, we have
For convenience, we denote . According to the local Lipschitz continuity of , there exists a constant such that
which together with (4.3) implies that
Now we investigate the above relationship in the following two cases.
Case I: .
Since , we have
Case II: .
It can be readily verified that
where the last inequality follows from the conditions . Hence, we arrive at
Combining the above two cases together, we complete the proof. ∎
Based on Lemma 4.5, we can prove the main results of this subsection.
Proposition 4.6.
Proof.
We prove this proposition by mathematical induction. The argument (4.6) directly holds at the initial iteration as a result of the initialization. Now, we assume that the argument holds at iteration , and investigate the situation at iteration .
To begin with, it follows from Lemma 4.4 and the condition that
Hence, according to Lemma 4.5, we know that . Moreover, we have
where and are two constants. Then it can be readily verified that
Combining Lemma 4.1 with the condition leads to that
which together with the condition implies that
As a direct consequence of Lemma 4.2, we can proceed to show that
where the last inequality results from the conditions and . Furthermore, since , we have
Similarly, under the conditions and , we can show that
The proof is completed. ∎
4.3 Sufficient Descent
The purpose of this subsection is to evaluate the descent property of the sequence .
Lemma 4.7.
Proof.
According to Proposition 4.6, we know that the inclusion holds for any . Since is locally Lipschitz continuous, there exist two constants and such that
where the last inequality follows from the condition and the relationship
By virtue of Young's inequality, we can obtain that
where is a constant. Moreover, we have
Combining the above relationships, we finally arrive at the assertion of this lemma. ∎
The above lemma indicates that the sequence is not necessarily monotonically decreasing. To address this issue, we introduce the following merit function,
where is a constant. For convenience, we denote hereafter. The following proposition illustrates that the sequence satisfies a sufficient descent property, and hence, is monotonically decreasing.
Proposition 4.8.
Proof.
Combining Lemmas 4.4 and 4.7 gives rise to that
where the last inequality follows from the condition . Then according to Lemmas 4.1 and 4.2, it follows that
(4.9) | ||||
where is a constant. As a direct consequence of the relationship (4.2), we can proceed to show that
By virtue of the condition , we have
(4.10) | ||||
Then, by combining two relationships (4.9) and (4.10), it can be readily verified that
which together with Lemma 3.1 yields that
Since , we can obtain the desired sufficient descent property. The proof is completed. ∎
4.4 Global Convergence
Based on the sufficient descent property of , we can finally establish the global convergence guarantee of Algorithm 1 to a first-order stationary point of the problem (1.1). Moreover, the iteration complexity is also presented.
Theorem 4.9.
Suppose Assumptions 1 and 2 hold. Let the penalty parameter satisfy the conditions (4.4) and (4.7) and the stepsize satisfy the conditions (4.5) and (4.8). Then the sequence has at least one accumulation point. Moreover, for any accumulation point , there exists a first-order stationary point of the problem (1.1) such that . Finally, the following relationships hold, namely,
(4.11) |
(4.12) |
(4.13) |
where is a constant.
Proof.
According to Proposition 4.6, we know that the sequence is bounded. Then the lower boundedness of follows from the continuity of . Hence, there exists a constant such that
for any . It follows from Proposition 4.8 that the sequence is convergent and the following relationships hold,
(4.14) |
According to the Bolzano-Weierstrass theorem, the sequence has an accumulation point, say . Then the relationships in (4.14) imply that there exists a first-order stationary point of the problem (1.1) such that .
The global sublinear convergence rate in Theorem 4.9 guarantees that Algorithm 1 is able to return a first-order -stationary point in at most iterations. Since Algorithm 1 performs three rounds of communication per iteration, the total number of communication rounds required to obtain a first-order -stationary point is of the same order.
Remark 1.
Under the conditions in Theorem 4.9, the tracking errors also asymptotically converge to zero at a sublinear rate as follows,
4.5 Convergence Under Kurdyka-Łojasiewicz Property
In this subsection, we establish the convergence of Algorithm 1 when the problem (1.1) satisfies the KŁ property. In particular, for any , we prove that the entire sequence converges to a stationary point of the problem (1.1) with guaranteed asymptotic convergence rates.
To begin with, we define the following quantity,
Then it can be readily verified from Proposition 4.8 that
(4.15) |
where is a prefixed positive constant.
Next, we give a lower bound of by the norm of in the following lemma.
Lemma 4.10.
Proof.
To begin with, from straightforward calculations, we can obtain that
Noticing that and , we have
(4.16) | ||||
where . According to Proposition 4.6, it follows that , and hence,
(4.17) |
Combining (4.16) with (4.17), we can acquire that
which further implies that
(4.18) | ||||
Using a similar argument, we can proceed to prove that
(4.19) |
and
(4.20) |
Then from (4.18)-(4.20), we have
This completes the proof with chosen as . ∎
Let . The following lemma reveals that the distance between and can be controlled by .
Lemma 4.11.
Proof.
It follows from the equality (4.1) that
which further yields that
with . Moreover, we have
and
The above three inequalities imply that
which completes the proof with chosen as . ∎
With Lemma 4.10 and Lemma 4.11, we establish the convergence of the sequence generated by Algorithm 1 when is a KŁ function, as stated in the following theorem.
Theorem 4.12.
Proof.
According to Proposition 4.8, the sequence is nonincreasing and has a lower bound, which implies that the limit exists. Let be the set of all the accumulation points of the sequence . For any , we have
(4.21) |
Thus, is equal to the constant on . If there exists such that , the direct combination of Proposition 4.8 and Lemma 4.11 would imply that . Then a trivial induction shows the assertion of this theorem. Since is a nonincreasing sequence, it is clear from (4.21) that . Moreover, for any , there exists such that with . According to Lemma 5 in [7], we know that is a compact and connected set and
Hence, for any , there exists such that with . Summing up all these facts, we can obtain that belongs to the intersection of and for any . By Lemma 6 in [7], there exists a constant and a function such that
for any . Multiplying both sides of (4.15) by yields that
where the second inequality follows from the concavity of . Combining the above relationship with Lemma 4.10 and Lemma 4.11, we have
which further implies that
for any . Letting , we can attain that
Hence, the iterate sequence is a Cauchy sequence and therefore convergent, which implies that is also convergent. Finally, Theorem 4.9 guarantees that the limit point of has the form , where is a first-order stationary point of the problem (1.1). This completes the proof. ∎
When is a semialgebraic function, one important result is that the desingularizing function can be chosen to be of the form
where is a constant and is a parameter impacting the convergence rate. Let be the limit point of the sequence . Using the same line of analysis introduced in [2], we can obtain the following estimates of the convergence rate.
- (i) If , the sequence converges in a finite number of steps.
- (ii) If , there exist and such that
- (iii) If , there exists such that
5 Numerical Experiments
In this section, we conduct a series of numerical experiments to demonstrate the efficiency and effectiveness of CDADT, specifically focusing on the CCA problems (1.2). The corresponding experiments are performed on a workstation with dual Intel Xeon Gold 6242R CPU processors (at GHz) and GB of RAM under Ubuntu 20.04. The tested algorithms are implemented in the language with the communication realized by the package.
In the numerical experiments, the following three quantities are collected and recorded at each iteration as performance metrics.
- Stationarity violation: .
- Consensus error: .
- Feasibility violation: .
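Since the formulas of these metrics were lost in extraction, the sketch below gives plausible implementations of the consensus error and the feasibility violation; the exact normalizations used in the paper are assumptions, and the stationarity violation is omitted because it requires the projection operator of Definition 2.2.

```python
import numpy as np

def consensus_error(Xs):
    """Deviation of the local variables from their network average,
    sqrt(sum_i ||X_i - Xbar||_F^2) (normalization assumed)."""
    Xbar = sum(Xs) / len(Xs)
    return np.sqrt(sum(np.linalg.norm(X - Xbar, 'fro') ** 2 for X in Xs))

def feasibility_violation(Xbar, Bs):
    """||Xbar^T B Xbar - I_p||_F with B = sum_i B_i, measuring how far
    the average iterate is from the generalized Stiefel manifold."""
    B = sum(Bs)
    p = Xbar.shape[1]
    return np.linalg.norm(Xbar.T @ B @ Xbar - np.eye(p), 'fro')
```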
Furthermore, we generate the Metropolis constant edge weight matrix [29] as the mixing matrix for the tested networks.
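A Metropolis-style mixing matrix of this kind can be sketched as follows; the weight rule W_ij = 1/(1 + max(deg_i, deg_j)) on edges, with the diagonal absorbing the slack, is the standard Metropolis construction and is assumed to match the constant-edge-weight variant cited above only up to the exact constant.

```python
import numpy as np

def metropolis_weights(adj):
    """Metropolis mixing matrix for an undirected graph given by a 0/1
    adjacency matrix without self-loops: symmetric, doubly stochastic,
    and nonnegative, hence satisfying Assumption 2 on connected graphs."""
    d = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            if adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()   # diagonal fills the remaining mass
    return W
```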
5.1 Numerical Results on Synthetic Datasets
The first experiment is to evaluate the performance of CDADT on synthetic datasets. Specifically, in the CCA problem (1.2), the first data matrix (assuming without loss of generality) is generated by its (economy-form) singular value decomposition as follows,
(5.1) |
where both and are orthogonal matrices orthonormalized from randomly generated matrices, and is a diagonal matrix with diagonal entries
(5.2) |
for a parameter that determines the decay rate of the singular values of . The second data matrix is generated in a similar manner but with a different decay rate of the singular values. After construction, the columns of the data matrices and are uniformly distributed among the agents.
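The construction in (5.1) can be sketched as follows; since the diagonal entries in (5.2) were lost in extraction, a geometric decay law is assumed here purely for illustration, and the function name is hypothetical.

```python
import numpy as np

def synthetic_data(m, n, decay, rng):
    """Random m-by-n data matrix (m <= n) built from its economy-form
    SVD as in (5.1), with assumed geometric singular-value decay
    sigma_j = decay**(j-1); the paper's law (5.2) may differ."""
    U, _ = np.linalg.qr(rng.standard_normal((m, m)))   # left factor
    V, _ = np.linalg.qr(rng.standard_normal((n, m)))   # right factor
    s = decay ** np.arange(m)                          # decaying spectrum
    return U @ np.diag(s) @ V.T
```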
We test the performance of CDADT with different choices of penalty parameters on the Erdös-Rényi (ER) network. The data matrices and are randomly generated with , , , , and . And the CCA problem (1.2) is tested with and . The corresponding numerical results are provided in Figure 1, which presents the performance of CDADT with . It can be observed that the curves of the three performance metrics almost coincide with each other, which corroborates the robustness of CDADT to the penalty parameter over a wide range.
Furthermore, we conduct numerical tests to evaluate the impact of network topology on the performance of CDADT, focusing on ring networks, grid networks, and ER networks. Figure 2 illustrates the structures of these networks together with the corresponding values of the connectivity quantity defined in (2.1). The data matrices are randomly generated in the same manner as before, and the algorithmic parameters of CDADT are fixed across all networks. Figure 3 depicts the diminishing trend of the three performance metrics against the number of communication rounds on a logarithmic scale, with the different networks distinguished by colors. We observe that, as the network connectivity deteriorates (i.e., as the quantity defined in (2.1) approaches one), our algorithm requires more communication rounds to achieve the specified accuracy.
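To illustrate why poor connectivity slows consensus, the following sketch compares a common definition of the connectivity quantity (assumed here to be the operator norm of the mixing matrix minus the exact-averaging matrix) for a complete graph and a ring:

```python
import numpy as np

def connectivity_param(W):
    """Operator norm of W minus the exact-averaging matrix (a common
    definition of the quantity in (2.1); assumed here)."""
    m = W.shape[0]
    return np.linalg.norm(W - np.ones((m, m)) / m, 2)

m = 20
# Complete graph: one round of mixing averages exactly.
W_complete = np.ones((m, m)) / m
# Ring: each agent mixes only with its two neighbours.
W_ring = np.zeros((m, m))
for i in range(m):
    for j in (i - 1, i, i + 1):
        W_ring[i, j % m] = 1.0 / 3.0
# The ring's value is much closer to one, so information spreads slowly.
```

The complete graph attains the value zero (averaging finishes in one round), whereas the ring's value approaches one as the number of agents grows, matching the trend observed in Figure 3.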
5.2 Numerical Results on Real-world Datasets
Next, we conduct a numerical test to assess the effectiveness of CDADT on two real-world datasets, MNIST [21] and Mediamill [30]. Specifically, MNIST is a database of handwritten digits consisting of gray-scale images of size 28 × 28. Every image is split into left and right halves, which are used as the two views, each of dimension 392. We employ CCA to learn the correlated representations between the left and right halves of the images, using the full training set of MNIST, which contains 60,000 images. In the Mediamill dataset, each image is a representative keyframe of a video shot described by 120 features and annotated with 101 labels. We extract an initial subset of the samples to test the CCA problem, which is performed to explore the correlation structure between images and labels.
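The two-view construction for MNIST can be sketched as follows, using a random array as a stand-in for the actual images (splitting each 28 × 28 image into 28 × 14 halves yields two 392-dimensional views):

```python
import numpy as np

def make_views(images):
    """Split each 28x28 image into left/right halves and flatten,
    yielding two 392-dimensional views (one sample per column)."""
    left = images[:, :, :14].reshape(len(images), -1).T
    right = images[:, :, 14:].reshape(len(images), -1).T
    return left, right

rng = np.random.default_rng(0)
fake_images = rng.random((100, 28, 28))  # stand-in for real MNIST digits
X, Y = make_views(fake_images)
```

Because the halves of a digit are strongly correlated, CCA on these two views recovers directions that capture the shared structure of the strokes.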
For this test, we employ CDADT to solve the CCA problem (1.2) on the ER network with fixed algorithmic parameters. The corresponding numerical results are presented in Figure 4. It is noteworthy that the effectiveness of CDADT is not limited to synthetic datasets but extends to real-world applications. Moreover, CDADT demonstrates its potential to solve CCA problems over large-scale networks.
6 Conclusion
Decentralized optimization problems with generalized orthogonality constraints arise in various scientific and engineering applications. Existing algorithms for Riemannian optimization rely heavily on geometric tools of the underlying Riemannian manifold, such as tangent spaces and retraction operators. However, these algorithms cannot be applied to solve (1.3), where the manifold constraints possess an inherently distributed structure. To surmount this challenge, we employ the constraint-dissolving operator to build an exact penalty model of the original problem. Nevertheless, existing algorithms remain unsuitable for solving the resulting penalty model, as the penalty function is not entirely separable across the agents. To address these challenges, we develop an efficient decentralized algorithm based on a double-tracking strategy. To construct a descent direction of the penalty function, our algorithm not only tracks the gradient of the objective function but also maintains a global estimate of the Jacobian of the constraint mapping. The proposed algorithm is guaranteed to converge to a first-order stationary point under mild conditions. We validate its performance through numerical experiments on both synthetic and real-world datasets.
As for future work, we are interested in extending our algorithm to more general constraints so that it can find a wider range of applications. Moreover, it is worthwhile to investigate the performance of CDADT in stochastic and online settings.
Acknowledgement
The third author was supported in part by the National Natural Science Foundation of China (12125108, 12226008, 11991021, 11991020, 12021001, 12288201), Key Research Program of Frontier Sciences, Chinese Academy of Sciences (ZDBS-LY-7022), and CAS–Croucher Funding Scheme for Joint Laboratories “CAS AMSS–PolyU Joint Laboratory of Applied Mathematics: Nonlinear Optimization Theory, Algorithms and Applications”.
References
- Absil et al. [2008] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2008. ISBN 9781400830244.
- Attouch and Bolte [2009] Hedy Attouch and Jérôme Bolte. On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Mathematical Programming, 116:5–16, 2009.
- Attouch et al. [2010] Hédy Attouch, Jérôme Bolte, Patrick Redont, and Antoine Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.
- Attouch et al. [2013] Hedy Attouch, Jérôme Bolte, and Benar Fux Svaiter. Convergence of descent methods for semi-algebraic and tame problems: Proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Mathematical Programming, 137(1):91–129, 2013.
- Bajovic et al. [2017] Dragana Bajovic, Dusan Jakovetic, Natasa Krejic, and Natasa Krklec Jerinkic. Newton-like method with diagonal correction for distributed optimization. SIAM Journal on Optimization, 27(2):1171–1203, 2017.
- Bolte et al. [2007] Jérôme Bolte, Aris Daniilidis, and Adrian Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205–1223, 2007.
- Bolte et al. [2014] Jérôme Bolte, Shoham Sabach, and Marc Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1):459–494, 2014.
- Chang et al. [2015] Tsung-Hui Chang, Mingyi Hong, and Xiangfeng Wang. Multi-agent distributed optimization via inexact consensus ADMM. IEEE Transactions on Signal Processing, 63(2):482–497, 2015.
- Chang et al. [2020] Tsung-Hui Chang, Mingyi Hong, Hoi-To Wai, Xinwei Zhang, and Songtao Lu. Distributed learning in the nonconvex world: From batch data to streaming and beyond. IEEE Signal Processing Magazine, 37(3):26–38, 2020.
- Chen et al. [2023] Jun Chen, Haishan Ye, Mengmeng Wang, Tianxin Huang, Guang Dai, Ivor W Tsang, and Yong Liu. Decentralized Riemannian conjugate gradient method on the Stiefel manifold. arXiv:2308.10547, 2023.
- Chen et al. [2021] Shixiang Chen, Alfredo Garcia, Mingyi Hong, and Shahin Shahrampour. Decentralized Riemannian gradient descent on the Stiefel manifold. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 1594–1605. PMLR, 2021.
- Chen et al. [2010] Xin Chen, Changliang Zou, and R Dennis Cook. Coordinate-independent sparse sufficient dimension reduction and variable selection. The Annals of Statistics, 38(6):3696–3723, 2010.
- Daneshmand et al. [2021] Amir Daneshmand, Gesualdo Scutari, Pavel Dvurechensky, and Alexander Gasnikov. Newton method over networks is fast up to the statistical precision. In Proceedings of the 38th International Conference on Machine Learning, pages 2398–2409. PMLR, 2021.
- Deng and Hu [2023] Kangkang Deng and Jiang Hu. Decentralized projected Riemannian gradient method for smooth optimization on compact submanifolds. arXiv:2304.08241, 2023.
- Dimarogonas et al. [2011] Dimos V Dimarogonas, Emilio Frazzoli, and Karl H Johansson. Distributed event-triggered control for multi-agent systems. IEEE Transactions on Automatic Control, 57(5):1291–1297, 2011.
- Du and Wang [2024] Jinye Du and Qihua Wang. Empirical likelihood inference over decentralized networks. arXiv:2401.12836, 2024.
- Gao and Ma [2023] Sheng Gao and Zongming Ma. Sparse GCA and thresholded gradient descent. Journal of Machine Learning Research, 24(135):1–61, 2023.
- Hajinezhad and Hong [2019] Davood Hajinezhad and Mingyi Hong. Perturbed proximal primal–dual algorithm for nonconvex nonsmooth optimization. Mathematical Programming, 176(1):207–245, 2019.
- Hotelling [1936] Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3–4):321–377, 1936.
- Kurdyka [1998] Krzysztof Kurdyka. On gradients of functions definable in o-minimal structures. Annales de l’institut Fourier, 48(3):769–783, 1998.
- LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Ling et al. [2015] Qing Ling, Wei Shi, Gang Wu, and Alejandro Ribeiro. DLM: Decentralized linearized alternating direction method of multipliers. IEEE Transactions on Signal Processing, 63(15):4051–4064, 2015.
- Łojasiewicz [1963] Stanisław Łojasiewicz. Une propriété topologique des sous-ensembles analytiques réels. Les Équations aux Dérivées Partielles, 117:87–89, 1963.
- Nedić and Ozdaglar [2009] Angelia Nedić and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
- Nedić et al. [2018] Angelia Nedić, Alex Olshevsky, and Michael G Rabbat. Network topology and communication-computation tradeoffs in decentralized optimization. Proceedings of the IEEE, 106(5):953–976, 2018.
- Pillai et al. [2005] S Unnikrishna Pillai, Torsten Suel, and Seunghun Cha. The Perron–Frobenius theorem: Some of its applications. IEEE Signal Processing Magazine, 22(2):62–75, 2005.
- Qu and Li [2017] Guannan Qu and Na Li. Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(3):1245–1260, 2017.
- Raja and Bajwa [2015] Haroon Raja and Waheed U Bajwa. Cloud K-SVD: A collaborative dictionary learning algorithm for big, distributed data. IEEE Transactions on Signal Processing, 64(1):173–188, 2015.
- Shi et al. [2015] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
- Snoek et al. [2006] Cees GM Snoek, Marcel Worring, Jan C Van Gemert, Jan-Mark Geusebroek, and Arnold WM Smeulders. The challenge problem for automated detection of 101 semantic concepts in multimedia. In Proceedings of the 14th ACM International Conference on Multimedia, pages 421–430, 2006.
- Song et al. [2023] Zhuoqing Song, Lei Shi, Shi Pu, and Ming Yan. Optimal gradient tracking for decentralized optimization. Mathematical Programming, pages 1–53, 2023.
- Sun et al. [2022] Ying Sun, Gesualdo Scutari, and Amir Daneshmand. Distributed optimization based on gradient tracking revisited: Enhancing convergence rate via surrogation. SIAM Journal on Optimization, 32(2):354–385, 2022.
- Sun et al. [2024] Youbang Sun, Shixiang Chen, Alfredo Garcia, and Shahin Shahrampour. Global convergence of decentralized retraction-free optimization on the Stiefel manifold. arXiv:2405.11590, 2024.
- Wang et al. [2023] Jinxin Wang, Jiang Hu, Shixiang Chen, Zengde Deng, and Anthony Man-Cho So. Decentralized weakly convex optimization over the Stiefel manifold. arXiv:2303.17779, 2023.
- Wang and Liu [2022] Lei Wang and Xin Liu. Decentralized optimization over the Stiefel manifold by an approximate augmented Lagrangian function. IEEE Transactions on Signal Processing, 70:3029–3041, 2022.
- Wang and Liu [2023a] Lei Wang and Xin Liu. Smoothing gradient tracking for decentralized optimization over the Stiefel manifold with non-smooth regularizers. In Proceedings of the 62nd IEEE Conference on Decision and Control (CDC), pages 126–132. IEEE, 2023a.
- Wang and Liu [2023b] Lei Wang and Xin Liu. A variance-reduced stochastic gradient tracking algorithm for decentralized optimization with orthogonality constraints. Journal of Industrial and Management Optimization, 19(10):7753–7776, 2023b.
- Wang et al. [2024] Lei Wang, Le Bao, and Xin Liu. A decentralized proximal gradient tracking algorithm for composite optimization on Riemannian manifolds. arXiv:2401.11573, 2024.
- Xiao and Boyd [2004] Lin Xiao and Stephen Boyd. Fast linear iterations for distributed averaging. Systems & Control Letters, 53(1):65–78, 2004.
- Xiao et al. [2024] Nachuan Xiao, Xin Liu, and Kim-Chuan Toh. Dissolving constraints for Riemannian optimization. Mathematics of Operations Research, 49(1):366–397, 2024.
- Xu et al. [2015] Jinming Xu, Shanying Zhu, Yeng Chai Soh, and Lihua Xie. Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes. In Proceedings of the 54th IEEE Conference on Decision and Control (CDC), pages 2055–2060. IEEE, 2015.
- Yuan et al. [2021] Kun Yuan, Yiming Chen, Xinmeng Huang, Yingya Zhang, Pan Pan, Yinghui Xu, and Wotao Yin. DecentLaM: Decentralized momentum SGD for large-batch deep training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3029–3039, 2021.
- Zhang et al. [2021] Jiaojiao Zhang, Qing Ling, and Anthony Man-Cho So. A Newton tracking algorithm with exact linear convergence for decentralized consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 7:346–358, 2021.
- Zhang et al. [2024] Siyuan Zhang, Nachuan Xiao, and Xin Liu. Decentralized stochastic subgradient methods for nonsmooth nonconvex optimization. arXiv:2403.11565, 2024.
- Zhu and Martínez [2010] Minghui Zhu and Sonia Martínez. Discrete-time dynamic average consensus. Automatica, 46(2):322–329, 2010.