Controlling Continuous Relaxation for
Combinatorial Optimization
Abstract
Unsupervised learning (UL)-based solvers for combinatorial optimization (CO) train a neural network that generates a soft solution by directly optimizing the CO objective using a continuous relaxation strategy. These solvers offer several advantages over traditional methods and other learning-based methods, particularly for large-scale CO problems. However, UL-based solvers face two practical issues: (I) an optimization issue, where UL-based solvers are easily trapped at local optima, and (II) a rounding issue, where UL-based solvers require artificial post-learning rounding from the continuous space back to the original discrete space, undermining the robustness of the results. This study proposes a Continuous Relaxation Annealing (CRA) strategy, an effective rounding-free learning method for UL-based solvers. CRA introduces a penalty term that dynamically shifts from prioritizing continuous solutions, effectively smoothing the non-convexity of the objective function, to enforcing discreteness, eliminating artificial rounding. Experimental results demonstrate that CRA significantly enhances the performance of UL-based solvers, outperforming existing UL-based solvers and greedy algorithms in complex CO problems. Additionally, CRA effectively eliminates artificial rounding and accelerates the learning process.
1 Introduction
The objective of combinatorial optimization (CO) problems is to find the optimal solution from a discrete space, and these problems are fundamental in many real-world applications (Papadimitriou and Steiglitz, 1998). Most CO problems are NP-hard or NP-complete, making it challenging to solve large-scale problems within feasible computational time. Traditional methods frequently depend on heuristics to find approximate solutions; however, they require considerable insight into the specific problems. Alternatively, problems can be formulated as integer linear programs (ILP) and solved using ILP solvers. However, ILP lacks scalability when applied to problems with graph structures.
Recently, several studies have used machine learning methods to handle CO problems by learning heuristics. Most of these studies focus on supervised learning (SL)-based solvers (Hudson et al., 2021; Joshi et al., 2019; Gasse et al., 2019; Selsam et al., 2018; Khalil et al., 2016), which require optimal solutions to CO problems as supervision during training. However, obtaining optimal solutions is challenging in practice, and SL-based solvers often fail to generalize well (Yehuda et al., 2020). Reinforcement learning (RL)-based solvers (Yao et al., 2019; Chen and Tian, 2019; Yolcu and Póczos, 2019; Nazari et al., 2018; Khalil et al., 2017; Bello et al., 2016) avoid the need for optimal solutions but often suffer from notoriously unstable training due to poor gradient estimation and hard exploration (Mnih et al., 2015; Tang et al., 2017; Espeholt et al., 2018). Unsupervised learning (UL)-based solvers have recently attracted much attention (Schuetz et al., 2022a; Karalias and Loukas, 2020; Amizadeh et al., 2018). UL-based solvers follow a continuous relaxation approach, training a UL model to output a soft solution to the relaxed CO problem by directly optimizing a differentiable objective function, offering significantly stable and fast training even for large-scale CO problems. Notably, the physics-inspired GNN (PI-GNN) solver (Schuetz et al., 2022a) employs graph neural networks (GNN) to automatically learn instance-specific heuristics and performs on par with or outperforms existing solvers for CO problems with millions of variables, without requiring optimal solutions. While these solvers offer advantages over traditional and other machine learning-based solvers, they face two practical issues. The first issue is an optimization issue, where UL-based solvers are easily trapped at local optima.
Due to this issue, Angelini and Ricci-Tersenghi (2023) demonstrated that the PI-GNN solver (Schuetz et al., 2022a) could not achieve results comparable to those of the degree-based greedy algorithm (DGA) (Angelini and Ricci-Tersenghi, 2019) on maximum independent set (MIS) problems in random regular graphs (RRG). Wang and Li (2023) also pointed out the importance of using datasets or history, and of initializing the GNN with outputs from greedy solvers, to help the PI-GNN solver overcome optimization challenges. This issue is a crucial bottleneck to the applicability of this method across various real-world applications. The second issue relates to the inherent ambiguity of the continuous relaxation approach. This approach necessitates artificial rounding from the soft solution, which may include continuous values, back to the original discrete solution, potentially undermining the robustness of the results. While linear relaxation can provide an optimal solution for original discrete problems on bipartite graphs (Hoffman and Kruskal, 2010), it typically leads to solutions with values in {0, 1/2, 1}, a property known as half-integrality (Nemhauser and Trotter Jr, 1974), on which existing rounding methods (Schuetz et al., 2022b; Wang et al., 2022) completely lose their robustness. For NP-hard problems with graph structures, such as the MIS and MaxCut, semidefinite programming (SDP) relaxations have been proposed as effective approximation methods (Lovász, 1979; Goemans and Williamson, 1995). However, these approaches rely on rounding techniques, such as spectral clustering (Von Luxburg, 2007), to transform relaxed solutions into feasible ones, which often fail to obtain optimal solutions.
To address these issues, we propose the Continuous Relaxation Annealing (CRA). CRA introduces a penalty term to control the continuity and discreteness of the relaxed variables, with a parameter γ to regulate the intensity of this penalty term. When γ is negative, the relaxed variables tend to favor continuous solutions, whereas a positive γ biases them toward discrete values. This penalty term also effectively eliminates local optima. Moreover, a negative γ forces the loss function to approach a simple convex function, encouraging active exploration within the continuous space. CRA also includes an annealing process, where γ is gradually increased until the relaxed variables approach discrete values, eliminating the artificial rounding from the continuous to the original discrete space after learning. In this study, the solver that applies CRA to the PI-GNN solver is referred to as the CRA-PI-GNN solver. We also demonstrate the benefits of CRA through experiments on benchmark CO problems, including MIS, maximum cut (MaxCut), and diverse bipartite matching (DBM) problems across graphs of varying sizes and degrees. The experimental results show that CRA significantly enhances the performance of the PI-GNN solver, outperforming the original PI-GNN solver, other state-of-the-art learning-based baselines, and greedy algorithms. This improvement is achieved by directly optimizing each instance without any history, e.g., previous optimal solutions or information from other instances. Additionally, these experiments indicate that CRA accelerates the learning process of the PI-GNN solver. Notably, these results overcome the limitations pointed out by Angelini and Ricci-Tersenghi (2023); Wang and Li (2023), highlighting the further potential of UL-based solvers.
Notation
We use the shorthand expression [n] = {1, 2, …, n}, where n ∈ ℕ. I_N denotes the identity matrix, 1_N denotes the all-ones vector (1, …, 1)^⊤, and 0_N denotes the all-zeros vector (0, …, 0)^⊤. G = (V, E) represents an undirected graph, where V is the set of nodes with cardinality |V| = N, and E denotes the set of edges. For a graph G, A ∈ {0, 1}^{N×N} denotes the adjacency matrix, where A_{ij} = 0 if an edge (i, j) does not exist and A_{ij} = 1 if the edge is present.
2 Background
Combinatorial optimization
The goal of this study is to solve the following CO problem.
$$\min_{x} \; f(x; C) \quad \text{s.t.} \quad x \in \mathcal{X}(C) \qquad (1)$$
where C denotes instance-specific parameters, such as a graph G, and 𝒞 represents the set of all possible instances. f(·; C) denotes the cost function. Additionally, x = (x_1, …, x_N) ∈ {0, 1}^N is a binary vector to be optimized, and 𝒳(C) denotes the feasible solution space, typically defined by the following equality and inequality constraints: 𝒳(C) = {x ∈ {0, 1}^N | ∀i ∈ [I], g_i(x; C) ≤ 0, ∀j ∈ [J], h_j(x; C) = 0}.
Here, for i ∈ [I], g_i(x; C) ≤ 0 denotes an inequality constraint, and for j ∈ [J], h_j(x; C) = 0 denotes an equality constraint. Following UL-based solvers (Wang et al., 2022; Schuetz et al., 2022a; Karalias and Loukas, 2020), we reformulate the constrained problem into an equivalent unconstrained form using the penalty method (Smith et al., 1997):
$$l(x; C, \lambda) = f(x; C) + \sum_{i} \lambda_i v_i(x; C) \qquad (2)$$
Here, for each constraint, v_i(x; C) ≥ 0 is the penalty term, which increases when the constraint is violated. For example, the penalty terms can be defined as v_i(x; C) = max(0, g_i(x; C)) for inequality constraints and as the squared residual (h_j(x; C))^2 for equality constraints,
and λ = (λ_i) denotes the penalty parameters that control the trade-off between constraint satisfaction and cost optimization. Note that, as λ_i increases, the penalty for violating the corresponding constraint becomes more significant. In the following, we provide an example of this formulation.
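As a minimal illustration of the penalty method in Eq. (2), the sketch below uses v(x) = max(0, g(x)) for a single inequality constraint; the one-dimensional toy problem and the function names are our own illustrative choices, not from the paper's implementation.

```python
# Penalty method (Eq. 2): l(x; C, lam) = f(x; C) + sum_i lam_i * v_i(x; C),
# where v_i(x) = max(0, g_i(x)) activates only when g_i(x) <= 0 is violated.
def penalized_loss(x, f, gs, lams):
    return f(x) + sum(lam * max(0.0, g(x)) for g, lam in zip(gs, lams))

# Toy problem: minimize f(x) = x subject to g(x) = 1 - x <= 0 (i.e., x >= 1).
f = lambda x: x
g = lambda x: 1.0 - x
assert penalized_loss(2.0, f, [g], [10.0]) == 2.0    # feasible: no penalty added
assert penalized_loss(0.0, f, [g], [10.0]) == 10.0   # violated: lam * v(x) = 10
```

Larger λ shifts the unconstrained minimizer toward the feasible region, which is the trade-off the penalty parameters control.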
Example: MIS problem
The MIS problem is a fundamental NP-hard problem (Karp, 2010), defined as follows. Given an undirected graph G = (V, E), an independent set (IS) is a subset of nodes I ⊆ V in which no two nodes are adjacent. The MIS problem aims to find the largest IS. In this study, ρ denotes the IS density, defined as ρ = |I|/N. Following Schuetz et al. (2022a), a binary variable x_i is assigned to each node i ∈ V. The MIS problem can be formulated as follows:
$$l(x; G, \lambda) = -\sum_{i \in V} x_i + \lambda \sum_{(i,j) \in E} x_i x_j \qquad (3)$$
where the first term maximizes the number of nodes assigned the value 1, and the second term penalizes pairs of adjacent nodes that are both assigned 1, weighted by the penalty parameter λ.
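The MIS objective in Eq. (3) can be checked on a toy instance. The pure-Python sketch below (the function name and the 4-cycle graph are illustrative choices) shows how the λ-weighted term penalizes adjacent selections while the first term rewards larger sets.

```python
# Relaxed MIS objective: l(p; G, lam) = -sum_i p_i + lam * sum_{(i,j) in E} p_i * p_j.
# Works on binary assignments as a special case of the relaxed ones.
def mis_loss(p, edges, lam=2.0):
    cost = -sum(p)
    penalty = sum(p[i] * p[j] for i, j in edges)
    return cost + lam * penalty

# Toy 4-cycle 0-1-2-3-0: the two opposite nodes {0, 2} form a maximum IS.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
good = [1.0, 0.0, 1.0, 0.0]   # independent set  -> no penalty
bad  = [1.0, 1.0, 0.0, 0.0]   # adjacent pair    -> penalized
assert mis_loss(good, edges) == -2.0
assert mis_loss(bad, edges) == 0.0   # -2 cost + lam * 1 violation
```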
2.1 Unsupervised learning based solvers
Learning for CO problems involves training an algorithm A_θ parameterized by a neural network (NN), where θ denotes the parameters. For a given instance C, this algorithm generates a valid solution x = A_θ(C) and aims to minimize f(A_θ(C); C). Several approaches have been proposed to train A_θ. This study focuses on UL-based solvers, which do not use a labeled solution during training (Wang et al., 2022; Schuetz et al., 2022a; Karalias and Loukas, 2020; Amizadeh et al., 2018). In the following, we outline the details of the UL-based solvers.
The UL-based solvers employ a continuous relaxation strategy to train the NN. This strategy reformulates a CO problem into a continuous optimization problem by converting discrete variables into continuous ones. A typical example of continuous relaxation is expressed as follows:
$$\hat{l}(p; C, \lambda) = \hat{f}(p; C) + \sum_{i} \lambda_i \hat{v}_i(p; C),$$
where p = (p_1, …, p_N) ∈ [0, 1]^N represents the set of relaxed continuous variables, with each binary variable x_i relaxed to a continuous counterpart p_i ∈ [0, 1], and f̂ denotes the relaxed form of f such that f̂(x; C) = f(x; C) for x ∈ {0, 1}^N. The relation between each constraint v_i and its relaxation v̂_i is similar, meaning that v̂_i(x; C) = v_i(x; C) for x ∈ {0, 1}^N. Wang et al. (2022) and Schuetz et al. (2022a) parameterized the relaxed continuous variables by a NN, defined as p_θ(C) ∈ [0, 1]^N. In the following discussions, we denote p as p_θ(C) to make the parametrization of the relaxed variables explicit. Then, θ is optimized by directly minimizing the following label-independent function:
$$\min_{\theta} \; \hat{l}(p_\theta(C); C, \lambda) \qquad (4)$$
After training, the relaxed solution p_θ(C) is converted into discrete variables using artificial rounding, x_i = int(p_{θ,i}(C)), where int(·) rounds to 0 or 1 based on a threshold (Schuetz et al., 2022a), or alternatively, a greedy method (Wang et al., 2022). Two types of schemes for UL-based solvers have been developed based on this formulation.
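The post-learning rounding step can be sketched as simple thresholding. The function name and the 0.5 cutoff below are illustrative of the thresholding scheme; values hovering near 1/2 show why this step is fragile, which is the rounding issue discussed later.

```python
# Artificial rounding: map a soft solution p in [0,1]^N to binary values.
# A tiny perturbation around the threshold flips the rounded bit, which is
# exactly why half-integral soft solutions (p_i near 1/2) are problematic.
def round_solution(p, threshold=0.5):
    return [1 if p_i > threshold else 0 for p_i in p]

assert round_solution([0.9, 0.1, 0.4999, 0.5001]) == [1, 0, 0, 1]
```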
(Type I) Learning generalized heuristics from history/data
One approach, proposed by Karalias and Loukas (2020), aims to automatically learn effective heuristics from historical dataset instances and then apply these learned heuristics to a new instance C through inference. Note that this method assumes that either the training dataset is easily obtainable or that meaningful data augmentation is feasible. Specifically, given a set of training instances D = {C_μ}_{μ=1}^{M}, sampled independently and identically from a distribution P(C), the goal is to minimize the average loss function (1/M) Σ_{μ=1}^{M} l̂(p_θ(C_μ); C_μ, λ). However, this method does not guarantee quality for a test instance C. Even if the training instances are extensive and the test instance follows P(C), a low average loss does not guarantee a low l̂ for a specific C. To address this issue, Wang and Li (2023) introduced a meta-learning approach where NNs aim to provide good initializations for future instances rather than direct solutions.
(Type II) Learning effective heuristics on a specific single instance
Another approach, known as the PI-GNN solver (Schuetz et al., 2022a, b), automatically learns instance-specific heuristics for a single instance C by directly applying Eq. (4). This approach addresses CO problems on graphs, where C = G, and employs GNNs to parameterize the relaxed variables p_θ(G). Here, an L-layered GNN is trained to directly minimize l̂(p_θ(G); G, λ), taking as input the graph G and embedding vectors on its nodes, and outputting the relaxed solution p_θ(G). A detailed description of GNNs is provided in Appendix E.2. Note that this setting is applicable even when a training dataset is difficult to obtain. The overparameterization of the relaxed variables is expected to smooth the objective function by introducing additional parameters to the optimization problem, similar to the kernel method. However, minimizing Eq. (4) for a single instance can be time-consuming compared to an inference process. Nonetheless, for large-scale CO problems, this approach has been reported to outperform other solvers in terms of both computational time and solution quality (Schuetz et al., 2022a, b).
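To make the Type II pipeline concrete, the sketch below runs one schematic message-passing layer in pure Python that maps node features to a relaxed assignment in (0, 1). The scalar features, mixing weights, sigmoid output head, and function name are illustrative assumptions, not the GCN/GraphSAGE architectures used in the paper.

```python
import math

# One schematic GNN layer: mix each node's scalar feature with the mean of
# its neighbors' features, then squash to (0, 1) with a sigmoid so the
# output can be read as a relaxed assignment p in [0, 1]^N.
def message_passing_layer(h, adj, w_self=0.5, w_nbr=0.5):
    out = []
    for i, row in enumerate(adj):
        nbrs = [h[j] for j, a in enumerate(row) if a == 1]
        agg = sum(nbrs) / len(nbrs) if nbrs else 0.0
        z = w_self * h[i] + w_nbr * agg
        out.append(1.0 / (1.0 + math.exp(-z)))
    return out

adj = [[0, 1], [1, 0]]                       # a single edge between two nodes
p = message_passing_layer([1.0, -1.0], adj)  # opposite features cancel here
assert p == [0.5, 0.5]
```

In the actual solver, the layer weights are the trainable θ of Eq. (4) and are updated by backpropagating through l̂(p_θ(G); G, λ).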
Note that, while both UL-based solvers for multiple instances (Type I) and for individual instances (Type II) are valuable, this study focuses on advancing the latter: a UL-based solver for a single instance. Both types of solvers are applicable to cost functions that meet a particular requirement, due to their reliance on gradient-based algorithms to minimize Eq. (4).
Assumption 2.1 (Differentiable cost function).
The relaxed loss function l̂(p; C, λ) and its partial derivative ∂l̂(p; C, λ)/∂p are accessible during the optimization process.
This requirement encompasses nonlinear cost functions and many-body interactions, extending beyond simple two-body interactions.
3 Continuous relaxation annealing for UL-based solvers
In this section, we discuss the practical issues associated with UL-based solvers and then introduce continuous relaxation annealing (CRA) as a proposed solution.
3.1 Motivation: practical issues of UL-based solvers
UL-based solvers (Type II) (Schuetz et al., 2022a, b) are effective in addressing large-scale CO problems. However, these solvers present the following two practical issues, highlighted in several recent studies (Wang and Li, 2023; Angelini and Ricci-Tersenghi, 2023). Additionally, we numerically validate these issues; see Appendix F.1 for detailed results.
(I) Ambiguity in rounding method after learning
UL-based solvers employ a continuous relaxation strategy to train NNs and then convert the relaxed continuous variables into discrete binary values through artificial rounding, as discussed in Section 2.1. This inherent ambiguity in the continuous relaxation strategy often results in potential discrepancies between the optimal solutions of the original discrete CO problem and those of the relaxed continuous one. Continuous relaxation expands the solution space, often producing continuous values that lower the cost compared to an optimal binary value. Indeed, while linear relaxation can provide an optimal solution for discrete problems on bipartite graphs (Hoffman and Kruskal, 2010), it typically results in solutions with values in {0, 1/2, 1}, a property known as half-integrality (Nemhauser and Trotter Jr, 1974). Existing rounding methods (Schuetz et al., 2022b; Wang et al., 2022) often lose robustness in these scenarios. In practice, the PI-GNN solver often outputs values near 1/2, underscoring the limitations of current rounding techniques for UL-based solvers.
(II) Difficulty in optimizing NNs
Recently, Angelini and Ricci-Tersenghi (2023) demonstrated that the PI-GNN solver falls short of achieving results comparable to those of the degree-based greedy algorithm (DGA) (Angelini and Ricci-Tersenghi, 2019) when solving MIS problems on RRGs. Angelini and Ricci-Tersenghi (2023) further emphasized the importance of evaluating UL-based solvers on complex CO problems, where greedy algorithms typically perform worse. A representative example is the MIS problem on RRGs with a large constant degree d, where a clustering transition in the solution space creates barriers that impede optimization. Moreover, Wang and Li (2023) emphasized the importance of using training/historical datasets D, which contain various graphs, and of initialization using outputs from greedy solvers, such as DGA and RGA for MIS problems. Their numerical analysis indicated that the PI-GNN solver tends to get trapped in local optima when optimized directly for a single instance without leveraging a training dataset D. However, in a practical setting, a systematic method for generating or collecting training datasets that effectively avoid local optima remains unclear. Additionally, training on instances that do not contribute to escaping local optima is time-consuming. Therefore, it is crucial to develop an effective UL-based solver that can operate on a single instance without relying on training data D. Our numerical experiments, detailed in Appendix F.1, also confirmed this optimization issue: as problem complexity increases, the PI-GNN solver is often drawn into trivial local optima in certain problems. This entrapment results in prolonged plateaus that significantly slow down the learning process and, in especially challenging cases, can render learning entirely infeasible.
3.2 Continuous relaxation annealing
Penalty term to control discreteness and continuity
To address these issues, we propose a penalty term that controls the balance between discreteness and continuity in the relaxed variables, formulated as follows:
$$\hat{r}(p; C, \lambda, \gamma) = \hat{l}(p; C, \lambda) + \gamma \Phi(p), \qquad \Phi(p) = \sum_{i=1}^{N} \bigl(1 - (2p_i - 1)^{\alpha}\bigr) \qquad (5)$$
where γ ∈ ℝ is a penalty parameter and the even number α denotes a curve rate. When γ is negative, i.e., γ < 0, the relaxed variables tend to favor the continuous space, smoothing the non-convexity of the objective function due to the convexity of the penalty term γΦ(p). In contrast, when γ is positive, i.e., γ > 0, the relaxed variables tend to favor the discrete space, smoothing out continuous solutions into discrete ones. Formally, the following theorem holds as γ approaches ±∞.
Theorem 3.1.
Assuming the objective function l̂(p; C, λ) is bounded within the domain [0, 1]^N, as γ → +∞, the relaxed solutions converge to the original discrete solutions. Moreover, as γ → −∞, the loss function becomes convex, and the relaxed solution p = (1/2)·1_N is unique.
For the detailed proof, refer to Appendix B.1. Theorem 3.1 can be generalized to any concave function Φ that has a unique maximum at p = (1/2)·1_N and achieves its global minimum at every p ∈ {0, 1}^N; an example is the binary cross-entropy introduced by Sun et al. (2022); Sanokowski et al. (2024) for UL-based solvers (Type I). Additionally, the penalty term eliminates the stationary point described in Section 3.1, preventing convergence to a plateau. For UL-based solvers, the penalty term is expressed as follows:
$$\min_{\theta} \; \hat{r}(p_\theta(C); C, \lambda, \gamma) = \hat{l}(p_\theta(C); C, \lambda) + \gamma \Phi(p_\theta(C)) \qquad (6)$$
where p_θ(C) denotes the relaxed variables parameterized by the NN. According to Theorem 3.1, setting a sufficiently large value γ > 0 causes the relaxed variables to approach nearly discrete values. We can also generalize this penalty term to Potts-variable optimization, including coloring problems (Schuetz et al., 2022b), and to mixed-integer optimization; refer to Appendix C.1.
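As a sanity check on the penalty term of Eq. (5) as reconstructed above, the snippet below evaluates Φ and its per-coordinate gradient in pure Python; the function names are illustrative.

```python
# Penalty term Phi(p) = sum_i (1 - (2 p_i - 1)^alpha), with alpha even:
# maximal (= N) at p_i = 1/2 and exactly zero at binary p_i in {0, 1}.
def phi(p, alpha=2):
    return sum(1.0 - (2.0 * p_i - 1.0) ** alpha for p_i in p)

# d Phi / d p_i = -2 alpha (2 p_i - 1)^(alpha - 1): bounded by 2 alpha on
# [0, 1], so no gradient divergence at the boundary.
def phi_grad(p_i, alpha=2):
    return -2.0 * alpha * (2.0 * p_i - 1.0) ** (alpha - 1)

assert phi([0.5, 0.5]) == 2.0        # fully continuous: maximal penalty
assert phi([0.0, 1.0]) == 0.0        # fully discrete: zero penalty
assert abs(phi_grad(1.0)) <= 4.0     # gradient stays finite at p_i = 1
```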
Annealing penalty term

We propose an annealing strategy that gradually anneals the penalty parameter γ in Eq. (6). Initially, a negative value, i.e., γ < 0, is chosen to leverage the smoothing properties, facilitating broad exploration by smoothing the non-convexity of the objective function and eliminating the stationary point to avoid the plateau, as discussed in Section 3.1. Subsequently, the penalty parameter γ is gradually increased to a positive value, γ > 0, with each update of the trainable parameters (one epoch), until the penalty term approaches zero, i.e., Φ(p_θ(C)) ≈ 0, to automatically round the relaxed variables by smoothing out suboptimal continuous solutions oscillating between 0 and 1. A conceptual diagram of this annealing process is shown in Fig. 1.
Note that employing the binary cross-entropy is infeasible for UL-based solvers when γ > 0, as its gradient diverges at p_i = 0 or 1. Indeed, when γ > 0, most relaxed variables typically approach binary values, with a relatively small number of variables oscillating between 0 and 1. This gradient divergence makes the learning infeasible without additional techniques, such as gradient clipping. In contrast, the gradient of the penalty term in Eq. (5), ∂Φ/∂p_i = −2α(2p_i − 1)^{α−1}, is bounded within [−2α, 2α] for any p ∈ [0, 1]^N, preventing the gradient divergence issue. Additionally, by increasing α, the absolute value of the gradient near p_i = 1/2 becomes smaller, allowing control over the smoothing strength toward a discrete solution.
We also propose an early stopping strategy that monitors both the loss function and the penalty term, halting the annealing and learning processes when the penalty term approaches zero, i.e., Φ(p_θ(C)) ≈ 0. Various annealing schedules can be considered; in this study, we employ the following schedule: γ(t + 1) = γ(t) + ε, where the scheduling rate ε is a small constant, and t denotes the update iteration of the trainable parameters. We refer to the PI-GNN solver with this continuous relaxation annealing as the CRA-PI-GNN solver. Here, two additional hyperparameters are introduced: the initial scheduling value γ(0) and the scheduling rate ε. Numerical experiments suggest that better solutions are obtained when γ(0) is set to a small negative value and ε is kept low. An ablation study is presented in Appendix F.5.
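Putting the pieces together, the toy loop below sketches CRA-style annealing by gradient descent directly on the relaxed variables (no GNN), using the penalty Φ(p) = Σ_i (1 − (2p_i − 1)^α) as reconstructed earlier. The learning rate, schedule constants, clipping to [0, 1], and the separable toy loss are illustrative assumptions, not the paper's implementation.

```python
# CRA-style annealing sketch: descend on l(p) + gamma * Phi(p) while gamma
# grows from a negative value; stop once gamma > 0 and Phi(p) ~ 0, i.e.
# the relaxed variables have rounded themselves to binary values.
def anneal(p, grad_loss, epochs=2000, lr=0.05, gamma0=-1.0, eps=1e-2,
           alpha=2, tol=1e-3):
    gamma = gamma0
    for t in range(epochs):
        for i in range(len(p)):
            g_pen = -2.0 * alpha * (2.0 * p[i] - 1.0) ** (alpha - 1)
            p[i] -= lr * (grad_loss(p, i) + gamma * g_pen)
            p[i] = min(1.0, max(0.0, p[i]))          # keep p in [0, 1]
        phi = sum(1.0 - (2.0 * p_i - 1.0) ** alpha for p_i in p)
        if gamma > 0 and phi < tol:                  # early stopping
            break
        gamma += eps                                 # annealing schedule
    return p

# Toy separable loss l(p) = sum_i c_i * p_i with c = (1, -1): the optimum
# is the binary point p = (0, 1), which the annealed dynamics should reach.
c = [1.0, -1.0]
p = anneal([0.5, 0.5], lambda p, i: c[i])
assert [round(p_i) for p_i in p] == [0, 1]
```

While γ < 0 the penalty pulls the variables off the boundary and keeps the landscape smooth; once γ turns positive it pushes them to {0, 1}, after which the early-stopping check fires.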
4 Related Work
Previous works on UL-based solvers have addressed various problems, such as MaxCut problems (Yao et al., 2019) and traveling salesman problems (Hudson et al., 2021), using carefully tailored problem-specific objectives. Some studies have also explored constraint satisfaction problems (Amizadeh et al., 2018; Toenshoff et al., 2019), but applying these approaches to broader CO problems often requires problem-specific reductions. Karalias and Loukas (2020) proposed the Erdős Goes Neural (EGN) solver, a UL-based solver for general CO problems based on Erdős' probabilistic method. This solver generates solutions through an inference process using training instances. Subsequently, Wang et al. (2022) proposed an entry-wise concave continuous relaxation, broadening the EGN solver to a wider range of CO problems. In contrast, Schuetz et al. (2022a, b) proposed the PI-GNN solver, a UL-based solver for a single CO problem instance that automatically learns problem-specific heuristics during the training process. However, Angelini and Ricci-Tersenghi (2023); Boettcher (2023) pointed out optimization difficulties, showing that the PI-GNN solver failed to achieve results comparable to those of greedy algorithms. Wang and Li (2023) also raised optimization issues with the PI-GNN solver, emphasizing the importance of learning from training data and history to overcome local optima. They then proposed the Meta-EGN solver, a meta-learning approach that updates NN parameters for individual CO problem instances. Furthermore, to address these optimization issues, Lin et al. (2023); Sun et al. (2022); Sanokowski et al. (2024) proposed annealing strategies similar to simulated annealing (Kirkpatrick et al., 1983).
5 Experiments
We begin by evaluating the performance of the CRA-PI-GNN solver on MIS and MaxCut benchmark problems across multiple graphs of varying sizes, demonstrating that CRA effectively overcomes optimization challenges without relying on data/history D. We then extend the evaluation to DBM problems, showing applicability to more practical CO problems. For the objective functions and detailed explanations, refer to Appendix E.1.
5.1 Experimental settings
Baseline methods
In all experiments, the baseline methods include the PI-GNN solver (Schuetz et al., 2022a) as the direct baseline of a UL-based solver for a single instance. For the MIS problems, we also consider the random greedy algorithm (RGA) and DGA (Angelini and Ricci-Tersenghi, 2019) as heuristic baselines. For the MaxCut problems, the RUN-CSP solver (Toenshoff et al., 2019) is considered as an additional baseline, and a standard greedy algorithm and an SDP-based approximation algorithm (Goemans and Williamson, 1995) are considered as additional classical baselines. The parameters for the Goemans-Williamson (GW) approximation are all set according to the settings in Schuetz et al. (2022b). The implementation used the open-source CVXOPT solver with CVXPY (Mehta, 2019) as the modeling interface. Note that we do not consider UL-based solvers for learning generalized heuristics (Karalias and Loukas, 2020; Wang et al., 2022; Wang and Li, 2023), which rely on training instances D. The primary objective of this study is to evaluate whether the CRA-PI-GNN solver can surpass the performance of both the PI-GNN solver and greedy algorithms. However, for the MIS problem, the EGN solver (Karalias and Loukas, 2020) and Meta-EGN solver (Wang and Li, 2023) are also considered, to confirm that CRA can overcome the optimization issues without training instances.
Implementation
The objective of the numerical experiments is to compare the CRA-PI-GNN solver with the PI-GNN solver. Thus, we follow the same experimental configuration as the experiments in Schuetz et al. (2022a), employing a simple two-layer GCN and GraphSAGE (Hamilton et al., 2017) implemented with the Deep Graph Library (Wang et al., 2019); refer to Appendix D.1 for the detailed architectures.
We use the AdamW (Kingma and Ba, 2014) optimizer with a learning rate of and weight decay of . The GNNs are trained for up to epochs with early stopping, which monitors the summarized loss function and the entropy term with tolerance and patience epochs; further details are provided in Appendix D.2. We set the initial scheduling value γ(0) to for the MIS and matching problems and to for the MaxCut problems, with the scheduling rate ε and curve rate α in Eq. (6). These values are not necessarily optimal, and refining these parameters can lead to better solutions; refer to Appendix F.5 and Appendix F.6 for an ablation study of these parameters. Once the training process is complete, we apply projection heuristics to map the obtained soft solutions back to discrete solutions using simple thresholding, where for all i, we map p_i to 1 if it exceeds the threshold and to 0 otherwise. However, due to the early stopping in Section 3.2, the CRA-PI-GNN solver ensures that, for all benchmark CO problems, the soft solution at the end of the training process becomes 0 or 1 within 32-bit floating-point precision in PyTorch on GPU; thus, the result is robust against the choice of threshold, which we set to in our experiments. Additionally, no violations of the constraints were observed in our numerical experiments. Thus, all results presented in the following are feasible solutions.
Evaluation metrics
Following Karalias and Loukas (2020), we use the approximation ratio (ApR), formulated as ApR = f(x; C)/f(x*; C), where x* is the optimal solution. For the MIS problems, we evaluate the ApRs using the theoretical optimal cost (Barbier et al., 2013) and the independent set density relative to the theoretical results. For the MaxCut problems on RRGs, we adopt the cut ratio against the theoretical upper bound (Parisi, 1980; Dembo et al., 2017); see Appendix E.1 for details. All results for the MIS and MaxCut problems are summarized over 5 RRGs with different random seeds. For the MaxCut Gset problems, the ApR is calculated relative to the best known cost. For the DBM problems, we calculate the ApR against the global optimum, identified using the Gurobi 10.0.1 solver with default settings.
5.2 MIS problems
Degree dependency of solutions using CRA
First, we compare the performance of the PI-GNN and CRA-PI-GNN solvers using GCN, as in Schuetz et al. (2022a). Fig. 3 shows the independent set density as a function of the degree d obtained by the PI-GNN and CRA-PI-GNN solvers, compared with the theoretical results (Barbier et al., 2013). Across all degrees d, the CRA-PI-GNN solver outperformed the PI-GNN solver and approached the theoretical results, whereas the PI-GNN solver failed to find valid solutions, especially for larger degrees, as pointed out in previous studies (Angelini and Ricci-Tersenghi, 2023; Wang and Li, 2023).
Response to Angelini and Ricci-Tersenghi (2023) and Wang and Li (2023)

MIS problems on RRGs with large degrees are known to be hard (Barbier et al., 2013). As discussed in Section 3.1, Angelini and Ricci-Tersenghi (2023); Wang and Li (2023) have raised optimization concerns about UL-based solvers. However, we call these claims into question by substantially outperforming the heuristics DGA and RGA for the MIS problem on such graphs, without training/historical instances D, as shown in Table 1. A comparison with sampling-based solvers, RL-based solvers, SL-based solvers, Gurobi, and MIS-specific solvers is presented in Appendix F.2.
Acceleration of learning speed
We also compared the learning curves of the PI-GNN and CRA-PI-GNN solvers to confirm that the CRA-PI-GNN solver does not become trapped in the plateau discussed in Section 3.1. Fig. 5 shows the dynamics of the cost functions for the MIS problems across varying degrees. Across all degrees, the CRA-PI-GNN solver achieves a better solution with fewer epochs than the PI-GNN solver. Specifically, the PI-GNN solver becomes significantly slower due to getting trapped in the plateau even for graphs with low degrees. In contrast, the CRA-PI-GNN solver can effectively escape from plateaus through the smoothing and automatic rounding of the penalty term, starting from a negative parameter γ < 0.
Computational scaling

We next evaluate the computational scaling of the CRA-PI-GNN solver for MIS problems on large-scale RRGs in Fig. 1, following previous studies (Schuetz et al., 2022a; Wang and Li, 2023). Fig. 1 demonstrates a moderate super-linear scaling of the total computational time with the number of nodes for both GCN and GraphSAGE. This performance is nearly identical to that of the PI-GNN solver (Schuetz et al., 2022a) for problems on RRGs with lower degrees. It is important to note that the runtimes of the CRA-PI-GNN solver depend heavily on the optimizer for the GNNs and the annealing rate ε; thus, this scaling remains largely unchanged for problems other than the MIS on RRGs. Additionally, CRA demonstrates that the runtime remains nearly constant as graph order and density increase, indicating effective scalability to denser graphs, as presented in Appendix F.2.
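Scaling exponents for runtime curves like these are typically obtained by a linear fit in log-log space. The sketch below (synthetic runtimes and an illustrative function name, not the paper's measurements) recovers the exponent k of a power law t ≈ N^k from (size, time) pairs.

```python
import math

# Least-squares slope of log(time) vs log(size): if time ~ N^k, the
# fitted slope recovers the scaling exponent k.
def scaling_exponent(sizes, times):
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic runtimes generated exactly as t = N^1.4 -> slope ~= 1.4.
sizes = [10**3, 10**4, 10**5, 10**6]
times = [n ** 1.4 for n in sizes]
assert abs(scaling_exponent(sizes, times) - 1.4) < 1e-9
```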
5.3 MaxCut problem
Degree dependency of solutions
We first compare the performance of the PI-GNN and CRA-PI-GNN solvers with GCN, following Schuetz et al. (2022a). Fig. 3 shows the cut ratio as a function of the degree d, compared to the theoretical upper bound (Parisi, 1980; Dembo et al., 2017). Across all degrees d, the CRA-PI-GNN solver also outperforms the PI-GNN solver, approaching the theoretical upper bound. In contrast, the PI-GNN solver fails to find valid solutions for larger degrees, as in the case of the MIS problems in Section 5.2.
Standard MaxCut benchmark test

Following Schuetz et al. (2022a), we next conducted additional experiments on standard MaxCut benchmark instances from the publicly available Gset dataset (Ye, 2003), which is commonly used to evaluate MaxCut algorithms. Here, we provide benchmark results for seven distinct graphs with thousands of nodes, including Erdős–Rényi graphs with uniform edge probability, graphs in which the connectivity decays gradually with node index, regular toroidal graphs, and a very large Gset instance. Table 2 shows that, across all problems, the CRA-PI-GNN solver outperforms the PI-GNN and RUN-CSP solvers as well as the greedy algorithm. See Appendix 6 for the results on all other Gset instances, where CRA-PI-GNN consistently provides better results as well.
GRAPH | (NODES, EDGES) | GREEDY | SDP | RUN-CSP | PI-GNN | CRA |
---|---|---|---|---|---|---|
G14 | (, ) | |||||
G15 | (, ) | |||||
G22 | (, ) | |||||
G49 | (, ) | |||||
G50 | (, ) | |||||
G55 | (, ) | |||||
G70 | (, ) |
5.4 Diverse bipartite matching
To evaluate the applicability of the CRA-PI-GNN solver to more practical problems beyond standard graph benchmarks, we conducted experiments on DBM problems (Ferber et al., 2020; Mulamba et al., 2020; Mandi et al., 2022); refer to Appendix E.1 for details. This benchmark consists of distinct instances with varying properties, and each instance comprises nodes representing scientific publications, divided into two groups of nodes and . The optimization is formulated as follows:
where represents the likelihood of a link between each pair of nodes, and an indicator equals one if articles and share the same subject field and zero otherwise. The parameters represent the probability of pairs sharing their field and of unrelated pairs, respectively. As in Mandi et al. (2022), we explore two variations of this problem, with set to % and %, referred to as Matching-1 and Matching-2, respectively. In this experiment, we set and . Fig. 6 shows that the CRA-PI-GNN solver finds better solutions across all instances.
6 Conclusion
This study proposes the CRA strategy to address both the optimization and rounding issues in UL-based solvers. The CRA strategy introduces a penalty term that dynamically shifts from prioritizing continuous solutions, where the non-convexity of the objective function is effectively smoothed, to enforcing discreteness, thereby eliminating artificial rounding. Experimental results demonstrate that the CRA-PI-GNN solver significantly outperforms both the PI-GNN solver and greedy algorithms across various complex CO problems, including MIS, MaxCut, and DBM problems. The CRA approach not only enhances solution quality but also accelerates the learning process.
Limitation
In the numerical experiments, most hyperparameters were fixed to their default values, as outlined in Section 5.1, with minimal tuning. However, tuning may be necessary for certain problems or to further enhance performance.
References
- Papadimitriou and Steiglitz [1998] Christos H Papadimitriou and Kenneth Steiglitz. Combinatorial optimization: algorithms and complexity. Courier Corporation, 1998.
- Hudson et al. [2021] Benjamin Hudson, Qingbiao Li, Matthew Malencia, and Amanda Prorok. Graph neural network guided local search for the traveling salesperson problem. arXiv preprint arXiv:2110.05291, 2021.
- Joshi et al. [2019] Chaitanya K Joshi, Thomas Laurent, and Xavier Bresson. An efficient graph convolutional network technique for the travelling salesman problem. arXiv preprint arXiv:1906.01227, 2019.
- Gasse et al. [2019] Maxime Gasse, Didier Chételat, Nicola Ferroni, Laurent Charlin, and Andrea Lodi. Exact combinatorial optimization with graph convolutional neural networks. Advances in neural information processing systems, 32, 2019.
- Selsam et al. [2018] Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy Liang, Leonardo de Moura, and David L Dill. Learning a sat solver from single-bit supervision. arXiv preprint arXiv:1802.03685, 2018.
- Khalil et al. [2016] Elias Khalil, Pierre Le Bodic, Le Song, George Nemhauser, and Bistra Dilkina. Learning to branch in mixed integer programming. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
- Yehuda et al. [2020] Gal Yehuda, Moshe Gabel, and Assaf Schuster. It’s not what machines can learn, it’s what we cannot teach. In International conference on machine learning, pages 10831–10841. PMLR, 2020.
- Yao et al. [2019] Weichi Yao, Afonso S Bandeira, and Soledad Villar. Experimental performance of graph neural networks on random instances of max-cut. In Wavelets and Sparsity XVIII, volume 11138, pages 242–251. SPIE, 2019.
- Chen and Tian [2019] Xinyun Chen and Yuandong Tian. Learning to perform local rewriting for combinatorial optimization. Advances in Neural Information Processing Systems, 32, 2019.
- Yolcu and Póczos [2019] Emre Yolcu and Barnabás Póczos. Learning local search heuristics for boolean satisfiability. Advances in Neural Information Processing Systems, 32, 2019.
- Nazari et al. [2018] Mohammadreza Nazari, Afshin Oroojlooy, Lawrence Snyder, and Martin Takác. Reinforcement learning for solving the vehicle routing problem. Advances in neural information processing systems, 31, 2018.
- Khalil et al. [2017] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. Advances in neural information processing systems, 30, 2017.
- Bello et al. [2016] Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.
- Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015.
- Tang et al. [2017] Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. # exploration: A study of count-based exploration for deep reinforcement learning. Advances in neural information processing systems, 30, 2017.
- Espeholt et al. [2018] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In International conference on machine learning, pages 1407–1416. PMLR, 2018.
- Schuetz et al. [2022a] Martin JA Schuetz, J Kyle Brubaker, and Helmut G Katzgraber. Combinatorial optimization with physics-inspired graph neural networks. Nature Machine Intelligence, 4(4):367–377, 2022a.
- Karalias and Loukas [2020] Nikolaos Karalias and Andreas Loukas. Erdos goes neural: an unsupervised learning framework for combinatorial optimization on graphs. Advances in Neural Information Processing Systems, 33:6659–6672, 2020.
- Amizadeh et al. [2018] Saeed Amizadeh, Sergiy Matusevych, and Markus Weimer. Learning to solve circuit-sat: An unsupervised differentiable approach. In International Conference on Learning Representations, 2018.
- Angelini and Ricci-Tersenghi [2023] Maria Chiara Angelini and Federico Ricci-Tersenghi. Modern graph neural networks do worse than classical greedy algorithms in solving combinatorial optimization problems like maximum independent set. Nature Machine Intelligence, 5(1):29–31, 2023.
- Angelini and Ricci-Tersenghi [2019] Maria Chiara Angelini and Federico Ricci-Tersenghi. Monte carlo algorithms are very effective in finding the largest independent set in sparse random graphs. Physical Review E, 100(1):013302, 2019.
- Wang and Li [2023] Haoyu Wang and Pan Li. Unsupervised learning for combinatorial optimization needs meta-learning. arXiv preprint arXiv:2301.03116, 2023.
- Hoffman and Kruskal [2010] Alan J Hoffman and Joseph B Kruskal. Integral boundary points of convex polyhedra. 50 Years of Integer Programming 1958-2008: From the Early Years to the State-of-the-Art, pages 49–76, 2010.
- Nemhauser and Trotter Jr [1974] George L Nemhauser and Leslie E Trotter Jr. Properties of vertex packing and independence system polyhedra. Mathematical programming, 6(1):48–61, 1974.
- Schuetz et al. [2022b] Martin JA Schuetz, J Kyle Brubaker, Zhihuai Zhu, and Helmut G Katzgraber. Graph coloring with physics-inspired graph neural networks. Physical Review Research, 4(4):043131, 2022b.
- Wang et al. [2022] Haoyu Peter Wang, Nan Wu, Hang Yang, Cong Hao, and Pan Li. Unsupervised learning for combinatorial optimization with principled objective relaxation. Advances in Neural Information Processing Systems, 35:31444–31458, 2022.
- Lovász [1979] László Lovász. On the shannon capacity of a graph. IEEE Transactions on Information theory, 25(1):1–7, 1979.
- Goemans and Williamson [1995] Michel X Goemans and David P Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42(6):1115–1145, 1995.
- Von Luxburg [2007] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17:395–416, 2007.
- Smith et al. [1997] Alice E Smith, David W Coit, Thomas Baeck, David Fogel, and Zbigniew Michalewicz. Penalty functions. Handbook of evolutionary computation, 97(1):C5, 1997.
- Karp [2010] Richard M Karp. Reducibility among combinatorial problems. Springer, 2010.
- Sun et al. [2022] Haoran Sun, Etash K Guha, and Hanjun Dai. Annealed training for combinatorial optimization on graphs. arXiv preprint arXiv:2207.11542, 2022.
- Sanokowski et al. [2024] Sebastian Sanokowski, Wilhelm Berghammer, Sepp Hochreiter, and Sebastian Lehner. Variational annealing on graphs for combinatorial optimization. Advances in Neural Information Processing Systems, 36, 2024.
- Toenshoff et al. [2019] Jan Toenshoff, Martin Ritzert, Hinrikus Wolf, and Martin Grohe. Run-csp: unsupervised learning of message passing networks for binary constraint satisfaction problems. CoRR, abs/1909.08387, 2019.
- Boettcher [2023] Stefan Boettcher. Inability of a graph neural network heuristic to outperform greedy algorithms in solving combinatorial optimization problems. Nature Machine Intelligence, 5(1):24–25, 2023.
- Lin et al. [2023] Xi Lin, Zhiyuan Yang, Xiaoyuan Zhang, and Qingfu Zhang. Continuation path learning for homotopy optimization. In International Conference on Machine Learning, pages 21288–21311. PMLR, 2023.
- Kirkpatrick et al. [1983] Scott Kirkpatrick, C Daniel Gelatt Jr, and Mario P Vecchi. Optimization by simulated annealing. science, 220(4598):671–680, 1983.
- Mehta [2019] Hermish Mehta. Cvx graph algorithms. https://github.com/hermish/cvx-graph-algorithms, 2019.
- Hamilton et al. [2017] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017.
- Wang et al. [2019] Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, et al. Deep graph library: A graph-centric, highly-performant package for graph neural networks. arXiv preprint arXiv:1909.01315, 2019.
- Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Barbier et al. [2013] Jean Barbier, Florent Krzakala, Lenka Zdeborová, and Pan Zhang. The hard-core model on random graphs revisited. In Journal of Physics: Conference Series, volume 473, page 012021. IOP Publishing, 2013.
- Parisi [1980] Giorgio Parisi. A sequence of approximated solutions to the sk model for spin glasses. Journal of Physics A: Mathematical and General, 13(4):L115, 1980.
- Dembo et al. [2017] Amir Dembo, Andrea Montanari, and Subhabrata Sen. Extremal cuts of sparse random graphs. 2017.
- Ye [2003] Y. Ye. The gset dataset. https://web.stanford.edu/~yyye/yyye/Gset/, 2003.
- Ferber et al. [2020] Aaron Ferber, Bryan Wilder, Bistra Dilkina, and Milind Tambe. Mipaal: Mixed integer program as a layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 1504–1511, 2020.
- Mulamba et al. [2020] Maxime Mulamba, Jayanta Mandi, Michelangelo Diligenti, Michele Lombardi, Victor Bucarey, and Tias Guns. Contrastive losses and solution caching for predict-and-optimize. arXiv preprint arXiv:2011.05354, 2020.
- Mandi et al. [2022] Jayanta Mandi, Vıctor Bucarey, Maxime Mulamba Ke Tchomba, and Tias Guns. Decision-focused learning: through the lens of learning to rank. In International Conference on Machine Learning, pages 14935–14947. PMLR, 2022.
- Bayati et al. [2010] Mohsen Bayati, David Gamarnik, and Prasad Tetali. Combinatorial approach to the interpolation method and scaling limits in sparse random graphs. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 105–114, 2010.
- Coja-Oghlan and Efthymiou [2015] Amin Coja-Oghlan and Charilaos Efthymiou. On independent sets in random graphs. Random Structures & Algorithms, 47(3):436–486, 2015.
- Alidaee et al. [1994] Bahram Alidaee, Gary A Kochenberger, and Ahmad Ahmadian. 0-1 quadratic programming approach for optimum solutions of two scheduling problems. International Journal of Systems Science, 25(2):401–408, 1994.
- Neven et al. [2008] Hartmut Neven, Geordie Rose, and William G Macready. Image recognition with an adiabatic quantum computer i. mapping to quadratic unconstrained binary optimization. arXiv preprint arXiv:0804.4457, 2008.
- Deza and Laurent [1994] Michel Deza and Monique Laurent. Applications of cut polyhedra—ii. Journal of Computational and Applied Mathematics, 55(2):217–247, 1994.
- Sen et al. [2008] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93–93, 2008.
- Gilmer et al. [2017] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International conference on machine learning, pages 1263–1272. PMLR, 2017.
- Scarselli et al. [2008] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE transactions on neural networks, 20(1):61–80, 2008.
- Qiu et al. [2022] Ruizhong Qiu, Zhiqing Sun, and Yiming Yang. DIMES: A differentiable meta solver for combinatorial optimization problems. In Advances in Neural Information Processing Systems 35, 2022.
- Sun et al. [2023] Haoran Sun, Katayoon Goshvadi, Azade Nova, Dale Schuurmans, and Hanjun Dai. Revisiting sampling for combinatorial optimization. In International Conference on Machine Learning, pages 32859–32874. PMLR, 2023.
- Goshvadi et al. [2023] Katayoon Goshvadi, Haoran Sun, Xingchao Liu, Azade Nova, Ruqi Zhang, Will Grathwohl, Dale Schuurmans, and Hanjun Dai. Discs: A benchmark for discrete sampling. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 79035–79066. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/f9ad87c1ebbae8a3555adb31dbcacf44-Paper-Datasets_and_Benchmarks.pdf.
- Hoos and Stützle [2000] Holger H Hoos and Thomas Stützle. Satlib: An online resource for research on sat. Sat, 2000:283–292, 2000.
Appendix A Overview
This supplementary material provides extended explanations, implementation details, and additional results.
Appendix B Derivation
B.1 Proof of Theorem 3.1
First, we present three lemmas, and then we demonstrate Theorem 3.1 based on these lemmas.
Lemma B.1.
For any even natural number , the function defined on achieves its maximum value of when and its minimum value of when or .
Proof.
The derivative of relative to is , which is zero when . This is a point where the function is maximized because the second derivative . In addition, this function is concave and symmetric relative to because is an even natural number, i.e., , thereby achieving its minimum value of when or . ∎
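As a quick numerical check of Lemma B.1, the sketch below evaluates a per-variable penalty of the even-power form φ(x) = 1 − (2x − 1)^α on a grid; this closed form is our assumption for illustration, chosen to match the properties stated in the lemma:

```python
# Numerically check Lemma B.1, assuming the even-power penalty form
# phi(x) = 1 - (2x - 1)**alpha on [0, 1] (our reading of the lemma):
# maximum value 1 at x = 1/2, minimum value 0 at x = 0 or x = 1.

def phi(x: float, alpha: int) -> float:
    assert alpha % 2 == 0, "alpha must be an even natural number"
    return 1.0 - (2.0 * x - 1.0) ** alpha

alpha = 4
grid = [i / 1000 for i in range(1001)]
values = [phi(x, alpha) for x in grid]

best = max(range(len(grid)), key=lambda i: values[i])
assert grid[best] == 0.5                                   # maximized at x = 1/2
assert max(values) == 1.0                                  # maximum value is 1
assert phi(0.0, alpha) == 0.0 and phi(1.0, alpha) == 0.0   # minima at the ends
```

The same grid check passes for any even α, consistent with the symmetry argument in the proof.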
Lemma B.2.
For any even natural number , if , minimizing the penalty term enforces that, for all , is either or ; if , the penalty term enforces .
Proof.
From Lemma B.1, as , attains its minimal value when, for any , or . As , attains its minimal value when, for any , . ∎
Lemma B.3.
For any even number , is concave when and is a convex function when .
Proof.
Note that is separable across its components . Thus, it is sufficient to prove that each is concave or convex in , because the sum of concave (or convex) functions is also concave (or convex). To this end, we consider the second derivative of with respect to :
If , the second derivative is negative for all , and this completes the proof that is a concave function when is positive (and vice versa). ∎
Theorem B.4.
Under the assumption that the objective function is bounded within the domain , as , the soft solutions converge to the original solutions . In addition, as , the loss function becomes convex, and the soft solution is unique.
Proof.
As , the penalty term dominates the loss function . According to Lemma B.2, this penalty term forces the optimal solution to a binary vector whose components, for all , are either or , because any non-binary value results in an infinitely large penalty. This effectively restricts the feasible region to the vertices of the unit hypercube, which correspond to the binary vectors in . Thus, as , the solutions to the relaxed problem converge to those of the original problem. As , the penalty term also dominates the loss function, and the loss becomes convex by Lemma B.3. According to Lemma B.2, this penalty term forces the optimal solution . ∎
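A minimal one-dimensional illustration of the annealing mechanism in Theorem B.4 (a toy of our own construction, assuming the penalty form λ(1 − (2x − 1)^α) and a linear toy objective l(x) = −x): sweeping λ from negative to positive first smooths the landscape around x = 1/2 and then drives the relaxed variable to a binary value.

```python
# Toy 1-D illustration of the CRA annealing mechanism (our own sketch,
# assuming the penalty form lam * (1 - (2x - 1)**alpha) and the toy
# objective l(x) = -x). Sweeping lam from negative to positive first
# smooths the landscape (favoring x near 1/2), then enforces discreteness.

def grad_total(x: float, lam: float, alpha: int = 2) -> float:
    """Gradient of l(x) + lam * (1 - (2x - 1)**alpha) with l(x) = -x."""
    d_obj = -1.0
    d_pen = -2.0 * alpha * lam * (2.0 * x - 1.0) ** (alpha - 1)
    return d_obj + d_pen

x, lam, lr = 0.5, -2.0, 0.05
for _ in range(2000):
    x -= lr * grad_total(x, lam)
    x = min(max(x, 0.0), 1.0)   # keep the relaxed variable in [0, 1]
    lam += 0.01                 # anneal the penalty weight upward

assert x == 1.0  # the relaxed variable ends at a binary value
```

While λ is strongly negative the iterate hovers near 1/2; once λ turns positive, the even-power penalty pushes it to the nearest vertex, mirroring the rounding-free behavior the theorem describes.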
The theorem holds for the cross entropy penalty given by
(7)
in the UL-based solvers using data or history [Sun et al., 2022, Sanokowski et al., 2024], because it similarly satisfies Lemmas B.1, B.2, and B.3.
Corollary B.5.
Theorem B.4 holds for the following penalty term:
(8)
Appendix C Generalization of CRA
C.1 Generalization to Potts variable optimization
This section generalizes the penalty term introduced for binary variables to -Potts variables. A -Potts variable is represented via the Kronecker delta , which equals one whenever and zero otherwise, and a decision variable takes on different values, . For example, graph -coloring problems can be expressed as
f(x) = ∑_{(i,j)∈E} δ(x_i, x_j)    (9)
For Potts variables, the output of the GNN is and the penalty term can be generalized as follows.
(10)
Appendix D Additional implementation details
D.1 Architecture of GNNs
We describe the details of the GNN architectures used in all numerical experiments. The first convolutional layer takes -dimensional node embedding vectors, for each node, as input, yielding -dimensional feature vectors . Then, the ReLU function is applied as a component-wise nonlinear transformation. The second convolutional layer takes the -dimensional feature vectors, , as input, producing a -dimensional vector . Finally, a sigmoid function is applied to the -dimensional vector , and the output is the soft solution . As in [Schuetz et al., 2022a], we set or , and for both GCN and GraphSAGE .
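To make the two-layer pipeline above concrete, here is a dependency-free sketch of the forward pass (our own simplification using mean aggregation; the experiments themselves use DGL's GCN and GraphSAGE layers, whose exact implementations differ):

```python
import math
import random

random.seed(0)

def conv_layer(adj, H, W):
    """Simplified graph convolution: mean-aggregate neighbors (with a
    self-loop), then apply a linear map W. Shapes: H is n x d_in, W is
    d_in x d_out. An illustrative stand-in for a GCN/GraphSAGE layer."""
    d_in, d_out = len(W), len(W[0])
    out = []
    for v in range(len(H)):
        nbrs = adj[v] + [v]
        agg = [sum(H[u][i] for u in nbrs) / len(nbrs) for i in range(d_in)]
        out.append([sum(agg[i] * W[i][j] for i in range(d_in)) for j in range(d_out)])
    return out

def relu(H):
    return [[max(0.0, h) for h in row] for row in H]

# Tiny 4-node cycle; embedding dim d0 = 3, hidden dim d = 2, scalar output.
adj = [[1, 3], [0, 2], [1, 3], [0, 2]]
H0 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
W2 = [[random.uniform(-1, 1)] for _ in range(2)]

# conv -> ReLU -> conv -> sigmoid, mirroring the two-layer architecture above.
H2 = conv_layer(adj, relu(conv_layer(adj, H0, W1)), W2)
p = [1.0 / (1.0 + math.exp(-row[0])) for row in H2]  # soft solution in (0, 1)
assert len(p) == 4 and all(0.0 < pi < 1.0 for pi in p)
```

The final sigmoid guarantees a soft solution strictly inside (0, 1), which is exactly what the CRA penalty then anneals toward binary values.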
D.2 Training settings
For all numerical experiments, we use the AdamW [Kingma and Ba, 2014] optimizer with a learning rate of and weight decay of , and we train the GNNs for up to epochs with early stopping, set to an absolute tolerance of and patience of . As discussed in Schuetz et al. [2022a], the results depend on the initial values of the trainable parameters; thus, the GNNs are initialized with five different random seeds for each instance, and the best solution is selected.
Appendix E Experiment details
E.1 Benchmark problems
MIS problems
There are some theoretical results for MIS problems on RRGs with the node degree set to , where each node is connected to exactly other nodes. The MIS problem is a fundamental NP-hard problem [Karp, 2010] defined as follows. Given an undirected graph , an independent set (IS) is a subset of nodes where any two nodes in the set are not adjacent. The MIS problem attempts to find the largest IS, which is denoted . In this study, denotes the IS density, where . To formulate the problem, a binary variable is assigned to each node . Then the MIS problem is formulated as follows:
f(x) = −∑_{i∈V} x_i + λ ∑_{(i,j)∈E} x_i x_j    (11)
where the first term attempts to maximize the number of nodes assigned , and the second term penalizes adjacent nodes that are both marked , weighted by the penalty parameter . In our numerical experiments, we set , following Schuetz et al. [2022a], and no violations were observed, as in [Schuetz et al., 2022a]. First, for every , a specific value , dependent only on the degree , exists such that the independent set density converges to with high probability as approaches infinity [Bayati et al., 2010]. Second, a statistical mechanical analysis provides the typical MIS density , as shown in Fig. 3, and we clarify that for , the solution space of undergoes a clustering transition, which is associated with hardness in sampling [Barbier et al., 2013], because the clustering is likely to create relevant barriers that affect any algorithm searching for the MIS. Finally, the hardness is supported by analytical results in a large limit, which indicate that, while the maximum independent set is known to have density , to the best of our knowledge, no known algorithm can find an independent set density exceeding [Coja-Oghlan and Efthymiou, 2015].
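The penalized objective described above can be evaluated directly on a small graph; the sketch below assumes the standard QUBO form f(x) = −∑_i x_i + λ ∑_{(i,j)∈E} x_i x_j (our reconstruction; the paper's exact penalty weight is elided in the text):

```python
# Evaluate the MIS objective on a small graph, assuming the standard
# QUBO form  f(x) = -sum_i x_i + lam * sum_{(i,j) in E} x_i * x_j
# (our reconstruction; lam = 2.0 is a hypothetical choice).

def mis_objective(x, edges, lam=2.0):
    reward = -sum(x)                                     # reward chosen nodes
    penalty = lam * sum(x[i] * x[j] for i, j in edges)   # adjacent pair both chosen
    return reward + penalty

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # 4-cycle
assert mis_objective([1, 0, 1, 0], edges) == -2  # valid IS of size 2
assert mis_objective([1, 1, 0, 0], edges) == 0   # the violation cancels the reward
```

Any λ > 1 suffices here to make every constraint violation strictly unprofitable, which is why valid independent sets minimize the objective.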
MaxCut problems
The MaxCut problem is also a fundamental NP-hard problem [Karp, 2010] with practical applications in machine scheduling [Alidaee et al., 1994], image recognition [Neven et al., 2008], and electronic circuit layout design [Deza and Laurent, 1994]. Given a graph , a cut set is defined as a subset of the edge set between the node sets dividing . The MaxCut problem aims to find the maximum cut set, denoted . Here, the cut ratio is defined as , where is the cardinality of the cut set. To formulate this problem, each node is assigned a binary variable, where indicates that node belongs to , and indicates that the node belongs to . Here, holds if the edge is cut. As a result, we obtain the following:
f(x) = ∑_{(i,j)∈E} (2 x_i x_j − x_i − x_j)    (12)
This study has also focused on the MaxCut problems on -RRGs, for which several theoretical results have been established. Specifically, for each , the maximum cut ratio is given by , where with a high probability as approaches infinity [Parisi, 1980, Dembo et al., 2017]. Thus, we take as an upper bound for the maximum cut ratio in the large limit.
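The cut-counting identity underlying this formulation can be checked on a toy instance; the sketch below assumes the standard expression x_i + x_j − 2 x_i x_j per edge (our reconstruction of the objective):

```python
# Count cut edges via the identity x_i + x_j - 2*x_i*x_j, which equals 1
# exactly when the endpoints of edge (i, j) lie on different sides
# (our reconstruction of the standard MaxCut QUBO objective).

def cut_size(x, edges):
    return sum(x[i] + x[j] - 2 * x[i] * x[j] for i, j in edges)

edges = [(0, 1), (1, 2), (2, 0)]   # triangle: the best cut has 2 edges
assert cut_size([1, 0, 0], edges) == 2
assert cut_size([1, 1, 1], edges) == 0   # all nodes on one side: no cut edges
```

Minimizing the signed form in Eq. (12) is equivalent to maximizing this count, since each cut edge contributes −1 there.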
DBM problems
Here, the topologies are taken from the Cora citation network [Sen et al., 2008], where each node has bag-of-words features, and each edge represents likelihood, as predicted by a machine learning model. Mandi et al. [2022] focused on disjoint topologies within the given topology, and they created distinct instances with varying properties. Each instance comprises nodes representing scientific publications, divided into two groups of nodes and . The optimization task is to find the maximum matching, where diversity constraints ensure connections among papers in the same field and between papers of different fields. This is formulated using a penalty method as follows.
(13)
where represents the likelihood of a link between each pair of nodes, and an indicator equals one if articles and share the same subject field and zero otherwise. The parameters represent the probability of pairs sharing their field and of unrelated pairs, respectively. As in Mandi et al. [2022], we explore two variations of this problem, with set to % and %, referred to as Matching-1 and Matching-2, respectively. In this experiment, we set and .
E.2 GNNs
A GNN [Gilmer et al., 2017, Scarselli et al., 2008] is a specialized neural network for representation learning on graph-structured data. GNNs learn a vectorial representation of each node in two steps, i.e., the aggregate and combine steps. The aggregate step employs a permutation-invariant function to generate an aggregated node feature, and in the combine step, the aggregated node feature is passed through a trainable layer to generate a node embedding, known as "message passing" or the "readout phase." Formally, for a given graph , where a node feature is attached to each node , the GNN iteratively applies the following two steps. First, the aggregate step at the -th layer is defined as follows:
a_v^{(k)} = f_aggregate^{(k)}({ h_u^{(k−1)} : u ∈ N(v) })    (14)
where the neighborhood of is denoted , is the node feature of the neighborhood, and is the aggregated node feature of the neighborhood. Then, the combined step at each -th layer is defined as follows:
h_v^{(k)} = f_combine^{(k)}(h_v^{(k−1)}, a_v^{(k)})    (15)
where denotes the node representation at the -th layer. Here, the hyperparameters for the total number of layers and the intermediate vector dimension are determined empirically. Although numerous implementations of GNN architectures have been proposed to date, the most basic and widely used architecture is the GCN [Scarselli et al., 2008], which is given as follows:
h_v^{(k)} = σ( ∑_{u∈N(v)∪{v}} W^{(k)} h_u^{(k−1)} / √(d_u d_v) + b^{(k)} )    (16)
where and are trainable parameters, serves as a normalization factor, and is some component-wise nonlinear activation function, e.g., the sigmoid or ReLU function.
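A minimal per-node sketch of a GCN update of this kind, with scalar node features and symmetric degree normalization 1/√(d_u d_v) (our simplified reading; real implementations operate on feature matrices):

```python
import math

def gcn_update(adj, h, w, b):
    """One GCN-style update with scalar node features (a simplified
    illustration): aggregate the neighbors and self with symmetric
    degree normalization 1/sqrt(d_u * d_v), apply weight w and bias b,
    then a ReLU nonlinearity."""
    deg = [len(adj[v]) + 1 for v in range(len(adj))]   # +1 for the self-loop
    out = []
    for v in range(len(adj)):
        s = sum(h[u] / math.sqrt(deg[u] * deg[v]) for u in adj[v] + [v])
        out.append(max(0.0, w * s + b))
    return out

adj = [[1], [0, 2], [1]]        # path graph 0 - 1 - 2
h = [1.0, 1.0, 1.0]
out = gcn_update(adj, h, w=1.0, b=0.0)
assert len(out) == 3
assert out[0] == out[2]         # symmetric endpoints get identical features
```

The normalization keeps high-degree nodes from dominating the aggregation, which is the role the factor plays in Eq. (16).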
Appendix F Additional experiments
F.1 Numerical validation of practical issues presented in Section 3.1
In this section, we examine issue (I), concerning continuous relaxations, and issue (II), the difficulties of optimization, as pointed out by previous studies [Wang and Li, 2023, Angelini and Ricci-Tersenghi, 2023], for the NP-hard MIS and MaxCut problems. To this end, we conducted numerical experiments using the PI-GNN solver for MIS and MaxCut problems on RRGs with higher degrees. To ensure experimental impartiality, we adhered to the original settings of the PI-GNN solver [Schuetz et al., 2022b]. Refer to Appendix E for the detailed experimental settings. Fig. 7 (top) shows the solutions obtained by the PI-GNN solver as a function of the degree for the MIS and MaxCut problems with varying system sizes . These results indicate that finding independent and cut sets becomes infeasible as the RRG becomes denser. In addition, to clarify the reasons for these failures, we analyzed the dynamics of the cost function for MIS problems, with a specific focus on graphs with degrees and , as depicted in Fig. 7 (bottom). For the lower-degree case, the cost function passes over the plateau at , as shown in the histogram, eventually yielding a solution comparable to those presented by Schuetz et al. [2022a]. Conversely, for the higher-degree case, the cost function remains stagnant on the plateau, thereby failing to find any independent nodes. Interpreting this phenomenon, we hypothesize that the representation capacity of the GNN is sufficiently large, leading us to consider the optimization of and as a variational optimization problem relative to . In this case, satisfies the first-order variational optimality condition , which suggests a potential reason for absorption into the plateau. However, this does not reveal the conditions for convergence to the fixed point during the early learning stage or the conditions to escape from it.
Thus, an extensive theoretical evaluation through stability analysis remains an important topic for future work.

In summary, for UL-based solvers, minimizing can be challenging and unstable. In particular, the PI-GNN solver, a UL-based solver employing GNNs, fails to optimize due to local optima in complex CO problems on relatively dense graphs, where the performance of greedy algorithms also worsens. These issues can be a potential bottleneck for more practical and relatively dense problems, making it challenging to employ the PI-GNN solver confidently.
F.2 Additional results of MIS
Method | Type | ER-[700-800] | ER-[9000-11000] | ||
---|---|---|---|---|---|
ApR | Time | ApR | Time | ||
KaMIS | OR | 1.000 | 52.13m | 1.000 | 7.6h |
Gurobi | OR | 0.922 | 50.00m | N/A | N/A |
Intel | SL+TS | 0.865 | 20.00m | N/A | N/A |
SL+G | 0.777 | 6.06m | 0.746 | 5.02m | |
DGL | SL+TS | 0.830 | 22.71m | N/A | N/A |
LwD | RL+S | 0.918 | 6.33m | 0.907 | 7.56m |
DIMES | RL+G | 0.852 | 6.12m | 0.841 | 5.21m |
RL+S | 0.937 | 12.01m | 0.873 | 12.51m | |
iSCO | fewer steps | 0.998 | 1.38m | 0.990 | 9.38m |
more steps | 1.006 | 5.56m | 1.008 | 1.25h | |
CRA | UL-based | 0.928 | 47.30m | 0.963 | 1.03h |
We evaluate our method using the MIS benchmark dataset from recent studies [Goshvadi et al., 2023, Qiu et al., 2022], which includes graphs from SATLIB [Hoos and Stützle, 2000] and Erdős–Rényi graphs (ERGs) of varying sizes. Following Sun et al. [2023], our test set consists of SATLIB graphs, each containing between and clauses with up to nodes and edges, ERGs with to nodes each, and ERGs with to nodes each. We conducted numerical experiments on PQQA using four different configurations: parallel runs with or and shorter steps (3000 steps) or longer steps (30000 steps), similar to the approach in iSCO [Sun et al., 2023]. Table 3 presents the solution quality and runtime results. The results show that CRA, which optimizes the relaxed variables as an optimization of GNN parameters, takes extra time for smaller ER-[700-800] instances due to the smaller number of decision variables. However, for larger instances, CRA achieves results comparable to iSCO. Although limited space makes it difficult to present other benchmark results employed by iSCO, such as MaxCut and MaxClique, numerical experiments on these benchmarks also show that CRA is less effective for small problems. However, for larger problems, the results are comparable to or slightly inferior to those of iSCO.
We also investigated the relationship between the order of the graph and the solving time of the solver; the results are shown in Tables 4 and 5. The results demonstrate that the runtime remains nearly constant as graph order and density increase, indicating effective scalability with denser graphs.
Problem | ApR (CRA) | ApR (PI) | Time (CRA) | Time (PI) |
---|---|---|---|---|
0.95 | 0.78 | 108 (s) | 98 (s) | |
0.95 | 0.56 | 103 (s) | 92 (s) | |
0.94 | 0.00 | 102 (s) | 88 (s) | |
0.93 | 0.00 | 101 (s) | 82 (s) | |
0.92 | 0.00 | 102 (s) | 82 (s) | |
0.91 | 0.00 | 101 (s) | 91 (s) | |
0.91 | 0.00 | 101 (s) | 86 (s) | |
0.91 | 0.00 | 102 (s) | 93 (s) | |
0.93 | 0.77 | 436 (s) | 287 (s) | |
0.95 | 0.74 | 413 (s) | 280 (s) | |
0.95 | 0.00 | 419 (s) | 283 (s) | |
0.94 | 0.00 | 429 (s) | 293 (s) | |
0.94 | 0.00 | 418 (s) | 324 (s) | |
0.93 | 0.00 | 321 (s) | 302 (s) | |
0.92 | 0.00 | 321 (s) | 325 (s) | |
0.92 | 0.00 | 330 (s) | 305 (s) |
Problem | CRA (ApR) | PI (ApR) | Time (CRA) | Time (PI) | Time (KaMIS) |
---|---|---|---|---|---|
0.97 | 0.01 | 103 (s) | 98 (s) | 100 (s) | |
0.95 | 0.00 | 100 (s) | 98 (s) | 210 (s) | |
0.94 | 0.00 | 100 (s) | 92 (s) | 315 (s) | |
0.91 | 0.00 | 99 (s) | 88 (s) | 557 (s) | |
0.93 | 0.00 | 98 (s) | 82 (s) | 733 (s) | |
0.90 | 0.00 | 98 (s) | 82 (s) | 1000 (s) | |
0.92 | 0.00 | 99 (s) | 91 (s) | 1000 (s) | |
0.91 | 0.00 | 97 (s) | 86 (s) | 1000 (s) |
F.3 Additional results of Gset
We conducted experiments across the additional Gset collection to further validate that including CRA enhances PI-GNN results beyond those previously achievable; the results are shown in Table 6.
GRAPH | (NODES, EDGES) | GREEDY | SDP | RUN-CSP | PI-GNN | CRA |
---|---|---|---|---|---|---|
G1 | (, ) | |||||
G2 | (, ) | |||||
G3 | (, ) | |||||
G4 | (, ) | |||||
G5 | (, ) | |||||
G14 | (, ) | |||||
G15 | (, ) | |||||
G16 | (, ) | |||||
G17 | (, ) | |||||
G22 | (, ) | |||||
G23 | (, ) | |||||
G24 | (, ) | |||||
G25 | (, ) | |||||
G26 | (, ) | |||||
G35 | (, ) | |||||
G36 | (, ) | |||||
G37 | (, ) | |||||
G38 | (, ) | |||||
G43 | (, ) | |||||
G44 | (, ) | |||||
G45 | (, ) | |||||
G46 | (, ) | |||||
G47 | (, ) | |||||
G48 | (, ) | |||||
G49 | (, ) | |||||
G50 | (, ) | |||||
G51 | (, ) | |||||
G52 | (, ) | |||||
G53 | (, ) | |||||
G54 | (, ) | |||||
G55 | (, ) | |||||
G58 | (, ) | |||||
G60 | (, ) | |||||
G63 | (, ) | |||||
G70 | (, ) |
F.4 Additional results of TSP
We conducted additional experiments on several TSP problems from the TSPLIB dataset (http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/), presenting results that illustrate the -dependency.
Experiments calculated the ApR as the ratio of the optimal value to the CRA result, with the ApR representing the average and standard deviation over 3 seeds. The “–” symbol in PI-GNN denotes cases where most variables are continuous values and where no solution satisfying the constraint was found within the maximum number of epochs. The same GNN and optimizer settings were used as in the main text experiments in Section 5.1.
Table 7 shows that the CRA approach yielded solutions with an ApR exceeding 0.9 across various instances. Notably, for the burma14 problem, our method identified the globally optimal solution (3,323) multiple times. However, the attained solution quality can vary with the specific GNN architecture and problem structure, indicating that a more comprehensive ablation study could provide valuable insights in future work.


burma14 | ulysses22 | st70 | gr96 | |
ApR () | ||||
ApR () | ||||
ApR () | ||||
ApR () | ||||
ApR (PI) | – | – | – | |
Optimum | 3,323 | 7,013 | 675 | 55,209
F.5 Ablation over initial scheduling value and scheduling rate
We conducted an ablation study focusing on the initial scheduling value and the scheduling rate . This numerical experiment was conducted under the configuration described in Section 5 and Appendix E. Fig. 8 shows the IS density for MIS problems on a -RRG as a function of the initial scheduling value and the scheduling rate, using the CRA-PI-GNN solver with both GraphSage and GCN. As can be seen, smaller initial scheduling values and scheduling rates typically yield better solutions. However, the convergence time increases progressively as these values become smaller. In addition, GraphSage consistently produces better solutions, even with relatively large initial scheduling values and scheduling rates, which implies that the GNN architecture influences both the solution quality and the effective regions of these annealing hyperparameters.
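The two hyperparameters studied here can be read as the start point and slope of the penalty-weight schedule; below is a minimal sketch assuming a linear update λ ← λ + ε per epoch (our assumption based on the ablation text; the paper's exact schedule may differ):

```python
# Sketch of the penalty-weight schedule studied in this ablation. The
# linear update lam <- lam + eps is our assumption for illustration;
# lam0 is the "initial scheduling value" and eps the "scheduling rate".

def schedule(lam0: float, eps: float, epochs: int):
    lam, history = lam0, []
    for _ in range(epochs):
        history.append(lam)
        lam += eps
    return history

hist = schedule(lam0=-2.0, eps=1e-3, epochs=5000)
# Starts negative (smoothing phase) and ends positive (discreteness phase).
assert hist[0] < 0 < hist[-1]
```

A smaller lam0 or eps keeps the solver in the smoothing phase longer, which matches the observation that such settings improve solution quality at the cost of longer convergence.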
F.6 Ablation over curve rate
Next, we investigated the effect of varying the curvature in Eq. 5. Numerical experiments were performed on MIS problems with nodes and degrees of and , as well as MaxCut problems with nodes and degrees of and . The GraphSAGE architecture was employed, with other parameters set as in Section 5 and Appendix E. As shown in Fig. 9, we observed that consistently yielded the best scores across these problems.