
Promoting Generalization for Exact Solvers via Adversarial Instance Augmentation

Haoyang Liu, Yufei Kuang, Jie Wang, Xijun Li, Yongdong Zhang, and Feng Wu H. Liu, Y. Kuang, J. Wang, X. Li, Y. Zhang, and F. Wu are with: a) CAS Key Laboratory of Technology in GIPAS, University of Science and Technology of China, Hefei 230027, China; b) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei 230091, China. E-mail: [email protected], [email protected], [email protected], [email protected], [email protected]. X. Li is with Huawei Noah’s Ark Lab. E-mail: [email protected]. Manuscript received August 2023.
Abstract

Machine learning has been successfully applied to improve the efficiency of Mixed-Integer Linear Programming (MILP) solvers. However, learning-based solvers often suffer from severe performance degradation on unseen MILP instances—especially on large-scale instances from a perturbed environment—due to the limited diversity of training distributions. To tackle this problem, we propose Adversarial Instance Augmentation for Solver (AdaSolver), a novel approach that promotes data diversity for learning-based branching modules in branch-and-bound (B&B) solvers and does not require knowledge of the problem type to generate new instances. We use bipartite graph representations of MILP instances and obtain various perturbed instances to regularize the solver by augmenting the graph structures with a learned augmentation policy. The major technical contribution of AdaSolver is that we formulate the non-differentiable instance augmentation as a contextual bandit problem and adversarially train the learning-based solver and the augmentation policy, enabling efficient gradient-based training of the augmentation policy. To the best of our knowledge, AdaSolver is the first general and effective framework for understanding and improving the generalization of both imitation-learning-based (IL-based) and reinforcement-learning-based (RL-based) B&B solvers. Extensive experiments demonstrate that by producing various augmented instances, AdaSolver leads to remarkable efficiency improvements across various distributions.

Index Terms:
Mixed-Integer Linear Programming, Branching, Graph Neural Network, Generalization and Robustness.

1 Introduction

Combinatorial optimization (CO) problems are among the most fundamental optimization problems, arising widely in scheduling [1], chip design [2], route planning [3], and other industrial scenarios. Generally, solving CO problems is time-consuming and computationally expensive due to their NP-hard nature. In practice, a large family of CO problems can be formulated as Mixed-Integer Linear Programs (MILPs). Exact MILP solvers, such as the B&B solvers SCIP [4] and Gurobi [5], usually rely on hand-crafted heuristics that require considerable manual tuning and complex working flows [6, 7]. To speed up the solvers, recent attempts have leveraged fast-inference learning-based models to approximate some complex heuristics [8, 9, 10] or explore more effective heuristics in solvers [11, 12], and have achieved promising in-distribution efficiency improvements.

However, the generalization of learning-based solving models to out-of-distribution (OOD) scenarios remains a significant challenge for their real-world applications. First, the solvers are often trained on a given dataset and need to process instances that vary widely in scale in real-world scenarios. The learning-based solvers may thus process instances with much larger scales than the training instances in practice. Second, environment changes often perturb the instance structures. For example, in staff scheduling problems, emergencies may impose extra restrictions on the jobs. We then need to consider extra kinds of constraints representing restrictions that are not included in the previous MILP problems, leading to slight differences in the problem structures. Third, acquiring or generating more data from other distributions or with larger scales may not be feasible due to information security/privacy concerns or high data acquisition costs [13, 14]. For example, in a company’s daily job-shop scheduling problems, confidential business information, such as production capacity, can be estimated from the MILP instances. Thus, the company is likely to provide limited and anonymous training data to avoid privacy disclosure.

Existing work attempts to improve the generalization of learning-based solvers by generating instances from different distributions and then employing domain-randomization-type approaches [15, 16]. These approaches have severe limitations when it comes to anonymous datasets, since no prior knowledge of the problem type is available to run data generation algorithms. Therefore, this paper focuses on the restrictive practical setting where training data only comes from limited training domains, and acquiring a significant number of new instances within a short time is intractable. Experiments in such a setting have demonstrated that the solving time can be much longer than that of traditional heuristics under distributional shifts, especially on larger-scale MILP instances, which hinders the applicability of neural solvers to real-world scenarios.

In such a practical setting, we represent MILP instances as bipartite graphs [10] and propose to use graph data augmentation—a data manipulation technique—to promote the diversity of graph structures in the training distributions. Unlike domain randomization, data augmentation perturbs the graph structures and does not require additional training instances or information on problem types. Moreover, we theoretically justify that data augmentation can reduce the generalization bound to unseen distributions in Section 7. We employ a learnable function as an augmentation policy network to explore effective augmentation strategies. However, two challenges in the augmentation procedure remain to be addressed. First, it is intractable to train the augmentation policy with gradient-based methods since modifying the graph structures is generally a non-differentiable operation, posing challenges for backward propagation. Second, the augmentation policy network receives significantly less training data than the learning-based solver, since the augmentation policy takes MILP instances as training data whereas each augmented instance generates a whole sequence of branching samples as training data for the learning-based solver. The training of the augmentation policy thus lags behind that of the solver. These challenges of non-differentiability and unbalanced data require a practicable and sample-efficient training framework.

To address the aforementioned challenges, we propose Adversarial Instance Augmentation for Solver (AdaSolver), a robust optimization framework for learning-based B&B solvers. To the best of our knowledge, AdaSolver is the first general framework for understanding and improving the robustness and generalization ability of both IL-based and RL-based B&B solvers under real-world restrictions. Specifically, we focus on the competitive graph neural network (GNN) approach for learning branching heuristics, resulting in two variants, AdaSolver-IL and AdaSolver-RL, for the IL and RL settings. AdaSolver comprises two components: (1) a GNN branching policy for branching variable selection and (2) an adversarial augmentation network (augmenter) that simulates real-world perturbations by modifying the structures of MILPs’ bipartite graph representations and promoting structural diversity in the newly augmented instances. To train the augmenter, we formulate the adversarial graph augmentation as a contextual bandit problem, which offers two key advantages. First, the contextual bandit formulation enables gradient-based training of the augmenter that controls the non-differentiable graph structure augmentation. Second, under this formulation, we can employ a sample-efficient decision model adapted from [17] that learns fast with a limited amount of data. Furthermore, to make sure that the augmented instances do not deviate severely from the training distributions, we leverage a discriminator network to filter out bad instances and avoid disrupting the model training.

We conduct experiments on five benchmarks and compare our approach with the baselines on in-distribution and out-of-distribution testing instances. Our experiments demonstrate that AdaSolver significantly enhances solvers’ robustness to environmental changes and consistently improves the solving efficiency across various distributions.

2 Related Work

2.1 Machine Learning for Combinatorial Optimization

Learning to optimize CO problems has attracted increasing interest in the operations research and machine learning communities [13, 18]. Existing methods can be roughly classified into two categories. One focuses on solving specific CO problems in an end-to-end style, such as the Traveling Salesperson Problem (TSP) [19, 20, 21], the Satisfiability Problem (SAT) [22, 23], the Vehicle Routing Problem (VRP) [21, 24] and the Maximum Cut Problem (MCP) [25]. The other category employs learning-based methods to approximate heuristics in exact B&B solvers, such as branching [9, 10, 26, 27], node selection [8], and cut selection [28, 12]. Built on exact solvers, these methods enjoy optimality guarantees and can be applied to solve general MILP instances. This paper concentrates on the branching module in exact solvers, since the computational efficiency and the quality of the variables selected during branching have a great impact on the overall B&B solving efficiency [29].

Recently, researchers have paid more attention to the generalization and robustness of end-to-end neural solvers. For example, existing works improve the generalization of end-to-end neural solvers by curriculum learning [15, 30], group distributionally robust optimization [31], knowledge distillation [32], generative adversarial training [33, 16], and so on. These methods require knowledge of the problem type for instance generation, and most of them focus on routing problems since instance generation for these problems is not costly. [34] also proposes label-preserving augmentations tailored to SAT problems. However, due to the restrictive practical setting mentioned in the introduction, these approaches are not applicable to exact B&B solvers when the instances are anonymous or hard to generate. For popular exact CO solvers, [35] proposes an adversarial attack approach to evaluate solvers’ robustness. [36] aims to promote generalization across heterogeneous MILPs in the IL setting. Nonetheless, designing robust learning-based exact solvers in both the IL and RL settings remains largely unexplored.

2.2 Machine Learning for Branching

Existing works on learning to branch include imitation-learning-based (IL-based) and reinforcement-learning-based (RL-based) algorithms. IL-based methods assume a branching expert and leverage a machine learning model to mimic the branching strategy of this expert. For example, [9] uses a support vector machine (SVM) to imitate a strong branching expert; [10, 27] represent each branching state as a bipartite graph and employ GNNs to improve the branching performance; [36] mimics given optimal solutions as oracles with tree search. RL-based methods do not require a branching expert and explore the branch-and-bound trees to improve the heuristics automatically. For example, [37] first uses reinforcement learning for variable selection; other works use evolution strategies [38], retrospective branching trajectories [39], a superior Q-network trained with mixture data [40], and tree Markov decision processes [11] to further improve the performance of RL-based branching policies.

3 Background

3.1 Branch-and-Bound Algorithm

In practice, most CO problems can be formulated as Mixed-Integer Linear Programs (MILPs) of the form

$$\min_{\mathbf{x}\in\mathbb{R}^{n}}\ \mathbf{c}^{\top}\mathbf{x},\quad \text{ s.t. }\ \mathbf{A}\mathbf{x}\circ\mathbf{b},\ \ \mathbf{l}\leq\mathbf{x}\leq\mathbf{u},\ \ \mathbf{x}\in\mathbb{Z}^{p}\times\mathbb{R}^{n-p}, \qquad (1)$$

where $\mathbf{c}\in\mathbb{R}^{n}$, $\mathbf{A}\in\mathbb{R}^{m\times n}$, $\mathbf{b}\in\mathbb{R}^{m}$, $\circ\in\{\leq,=,\geq\}^{m}$, and $\mathbf{l}\in(\mathbb{R}\cup\{-\infty\})^{n}$ and $\mathbf{u}\in(\mathbb{R}\cup\{+\infty\})^{n}$ denote the lower and upper variable bounds, respectively. The branch-and-bound (B&B) algorithm solves MILP instances with a divide-and-conquer approach. B&B recursively performs the following steps. First, it solves the linear relaxation of a selected subproblem, starting with the original MILP. It identifies an integer variable $x_{i}$ with fractional value $x_{i}^{*}$ in the current LP solution and generates two subproblems by adding the constraints $x_{i}\leq\lfloor x_{i}^{*}\rfloor$ and $x_{i}\geq\lceil x_{i}^{*}\rceil$ to the current subproblem, respectively. B&B then chooses a new subproblem to explore next. As the algorithm proceeds, the subproblems form a binary search tree, where the root node represents the original MILP and the left and right children of a subproblem are the two new subproblems created in the previous step. If a subproblem is infeasible or does not contain optimal solutions, the corresponding subtree is pruned.
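For concreteness, the following minimal sketch implements the recursion above for a pure integer program, using SciPy's LP solver for the relaxations. The depth-first node selection and the most-fractional branching rule are illustrative assumptions only; production solvers such as SCIP use far more sophisticated heuristics.

```python
import math
import numpy as np
from scipy.optimize import linprog

def branch_and_bound(c, A_ub, b_ub, bounds, tol=1e-6):
    """Minimal depth-first B&B for min c^T x s.t. A_ub x <= b_ub, all variables integer."""
    best_obj, best_x = math.inf, None
    stack = [bounds]                                   # each node: a list of (lower, upper) bounds
    while stack:
        node_bounds = stack.pop()
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=node_bounds, method="highs")
        if not res.success or res.fun >= best_obj:
            continue                                   # prune infeasible or dominated subproblems
        frac = [i for i, v in enumerate(res.x) if abs(v - round(v)) > tol]
        if not frac:                                   # integral LP solution: new incumbent
            best_obj, best_x = res.fun, res.x
            continue
        i = max(frac, key=lambda j: abs(res.x[j] - round(res.x[j])))  # most-fractional rule
        lo, hi = node_bounds[i]
        left, right = list(node_bounds), list(node_bounds)
        left[i] = (lo, math.floor(res.x[i]))           # x_i <= floor(x_i*)
        right[i] = (math.ceil(res.x[i]), hi)           # x_i >= ceil(x_i*)
        stack += [left, right]
    return best_obj, best_x

# Tiny example: max x0 + x1 (written as a minimization) under two knapsack-style rows.
obj, x = branch_and_bound(np.array([-1.0, -1.0]),
                          np.array([[2.0, 1.0], [1.0, 3.0]]),
                          np.array([4.0, 5.0]),
                          [(0, None), (0, None)])
print(obj, x)
```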

A key factor influencing the efficiency of the B&B algorithm is how to select a fractional variable to branch on. So far, the strong branching policy [41] has been found to produce the smallest B&B trees among the existing heuristics. Strong branching selects a branching variable by evaluating the expected bound improvement for each candidate variable.

3.2 Instance Bipartite Graph and Branching Samples

We represent each MILP instance (1) as a weighted bipartite graph $G=(W\cup V,E)$ as follows. The vertex sets $W$ and $V$ represent the MILP’s constraint set and variable set, respectively. Each node in $W$ or $V$ has its own features $\mathbf{w}$ or $\mathbf{v}$ containing the information of the corresponding constraint or variable. The adjacency matrix $E$ is the constraint matrix $\mathbf{A}$. We let $\mathcal{G}$ denote the set of all such bipartite graphs, called instance bipartite graphs.

Given a MILP instance, the B&B solver recursively solves a sequence of subproblems, which are also MILPs. We can thus represent each subproblem as a bipartite graph as we did for the MILP instances. Furthermore, we would like to include some solver statistics in the features of these bipartite graphs, such as the reduced costs and incumbent values, to fully observe the solving states [10]. We call these observed bipartite graphs with solver statistics branching samples, denoted by $\mathbf{s}$.

Figure 1: A simple example of a MILP instance bipartite graph. The $i^{th}$ constraint node features $\mathbf{w}_{i}$ are of the form $(b_{i},\circ_{i})$. The $j^{th}$ variable node features $\mathbf{v}_{j}$ are $(c_{j},l_{j},u_{j},\tau_{j})$, where $\tau_{j}$ is an integer indicator such that $\tau_{j}=1$ if $x_{j}$ is an integer variable and $\tau_{j}=0$ otherwise. In practice, we prefer the more complex instance bipartite graph representations with hand-crafted features (please refer to the MILP Bipartite class in [42] for more details).
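As a minimal sketch of this representation (using only the simple features of Figure 1, not the richer hand-crafted features of [42]), the bipartite graph of instance (1) can be assembled from the raw problem data as follows; the numeric encoding of the constraint senses is an illustrative choice.

```python
import numpy as np

def milp_to_bipartite(c, A, b, senses, lb, ub, is_int):
    """Build the simple bipartite representation of Figure 1.

    Constraint node i carries (b_i, sense_i); variable node j carries (c_j, l_j, u_j, tau_j);
    the nonzeros of A define the (weighted) edges between the two node sets.
    """
    sense_code = {"<=": -1.0, "=": 0.0, ">=": 1.0}      # illustrative numeric encoding of o_i
    constraint_feats = np.array([[b_i, sense_code[s]] for b_i, s in zip(b, senses)])
    variable_feats = np.stack([c, lb, ub, is_int.astype(float)], axis=1)
    edge_index = np.argwhere(A != 0)                    # (constraint i, variable j) pairs
    edge_weight = A[A != 0]
    return constraint_feats, variable_feats, edge_index, edge_weight

# Toy instance: min x0 + 2*x1  s.t.  x0 + x1 >= 1,  x0 <= 3,  both variables integer.
W, V, E, w = milp_to_bipartite(
    c=np.array([1.0, 2.0]), A=np.array([[1.0, 1.0], [1.0, 0.0]]),
    b=np.array([1.0, 3.0]), senses=[">=", "<="],
    lb=np.zeros(2), ub=np.array([3.0, np.inf]), is_int=np.array([True, True]))
print(W.shape, V.shape, E.shape)  # (2, 2) (2, 4) (3, 2)
```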

4 IL- and RL-based Branching Formulation

The sequential branching decisions made during B&B can be formulated as a Markov decision process (MDP) [8], where the solver and branching module are the environment and agent, respectively.

We consider the solving process as a finite-horizon MDP, called the branching MDP, with maximal time $T$, specified by the tuple $(\mathcal{S},\mathcal{A},\mathcal{I},\mathcal{T},r)$. Here $\mathcal{S}$ represents the state space of branching samples, $\mathcal{A}$ is the action space consisting of candidate branching variables, and $r:\mathcal{S}\to\mathbb{R}$ is the reward function. The initialization function $\mathcal{I}:\mathbb{R}\to\mathcal{G}$ takes in a random noise $\xi$ and samples from $p(G)$ to produce the instance $G=\mathcal{I}(\xi)$, where we view the instance graph $G$ as the initial state $\mathbf{s}_{0}$. The transition function $\mathcal{T}:\mathcal{S}\times\mathcal{A}\to\mathcal{S}$ is deterministic since the branching process rigorously follows universal mathematical rules. A policy $\pi\in\Pi:\mathcal{S}\to\mathcal{A}$ maps an observed state to an action using the corresponding branching strategy.

At time $t$, the branching module observes a branching sample $\mathbf{s}_{t}$ corresponding to the current subproblem. The branching module then selects a variable $a_{t}$ from the branching candidates $\mathcal{A}$ according to the branching policy $\pi$. Finally, the solver splits the current subproblem using $a_{t}$ to generate two new subproblems and decides the next subproblem $\mathbf{s}_{t+1}=\mathcal{T}(\mathbf{s}_{t},a_{t})$ on which to perform branching.
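This interaction loop matches the interface exposed by the Ecole library; the snippet below only illustrates the branching MDP with a trivial placeholder policy, and its use of Ecole is our assumption rather than a description of the released implementation.

```python
import ecole

# Branching environment: states are NodeBipartite branching samples, actions are candidate variables.
env = ecole.environment.Branching(observation_function=ecole.observation.NodeBipartite())
instances = ecole.instance.SetCoverGenerator(n_rows=500, n_cols=1000)

obs, action_set, reward, done, info = env.reset(next(instances))  # s_0 built from the instance graph
while not done:
    action = action_set[0]                  # placeholder policy: take the first branching candidate
    obs, action_set, reward, done, info = env.step(action)        # deterministic transition T
```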

In practice, we only have access to a finite set of training instances $\{G_{i}\}_{i=1}^{n}$, and the testing instances may be drawn from an unknown environment $p^{\prime}(G)$ different from the training distribution $p(G)$. The distributional shifts influence the initialization function in the testing environment, which we denote by $\mathcal{I}^{\prime}$.

To describe the relationship between training and testing distributions, we first define a cost function $c:\mathcal{G}\times\mathcal{G}\to\mathbb{R}$ to measure the discrepancy between two instance graphs. We make the following assumptions. (A1) The transition functions in the training and testing environments remain the same since the solver follows rigorous mathematical rules and a fixed node selection policy. (A2) The transition function $\mathcal{T}$ is $\text{Lip}_{\mathcal{T}_{1}}$-Lipschitz in the first variable and $\text{Lip}_{\mathcal{T}_{2}}$-Lipschitz in the second variable, that is,

$$\|\mathcal{T}(\mathbf{s},a)-\mathcal{T}(\mathbf{s}^{\prime},a)\| \leq \text{Lip}_{\mathcal{T}_{1}}\|\mathbf{s}-\mathbf{s}^{\prime}\|,\quad \forall\,\mathbf{s},\mathbf{s}^{\prime}\text{ and }a;$$
$$\|\mathcal{T}(\mathbf{s},a)-\mathcal{T}(\mathbf{s},a^{\prime})\| \leq \text{Lip}_{\mathcal{T}_{2}}\|a-a^{\prime}\|,\quad \forall\,a,a^{\prime}\text{ and }\mathbf{s},$$

where the norm $\|\cdot\|$ is a general and simplified notation that can represent the norms in the state space or the action space. (A3) Each policy $\pi\in\Pi$ is Lipschitz continuous with Lipschitz constant $\text{Lip}_{\pi}$.

4.1 IL-based branching formulation

To solve the branching MDP mentioned above, we can employ imitation learning or reinforcement learning to train the branching agent. IL-based branching does not require the reward signal; instead, it assumes a strong branching expert with policy $\pi_{E}$ that is considered to be near optimal. A predefined dataset of branching samples and branching variables $\{(\mathbf{s}_{k},a_{E,k})\}_{k=1}^{K}$ is obtained by running the expert on the training instances $\{G_{i}\}_{i=1}^{n}$. Given the dataset, IL clones the behavior of the expert by enforcing the learned agent to make the same decision as the expert under the same observation,

$$\min_{\pi}\ \mathbb{E}_{G\sim p(G)}\,\mathbb{E}_{(\mathbf{s},a_{E})\sim D_{\pi_{E},G,\mathcal{T}}}\left[\ell(\pi(\mathbf{s}),a_{E})\right], \qquad (2)$$

where $\ell(\cdot,\cdot)$ is the loss function and $D_{\pi_{E},G,\mathcal{T}}$ is the joint distribution of the sequence of states and actions when running the expert $\pi_{E}$ on instance $G$, i.e.,

$$D_{\pi_{E},G,\mathcal{T}}=p(G)\,\pi_{E}(a_{0}|G)\prod_{t=1}^{T-1}p(\mathbf{s}_{t+1}|\mathbf{s}_{t},a_{t})\,\pi_{E}(a_{t}|\mathbf{s}_{t}),$$

which is kept fixed during training. To simplify the notation, we let $L(\pi,G,\mathcal{T}):=\mathbb{E}_{(\mathbf{s},a_{E})\sim D_{\pi_{E},G,\mathcal{T}}}\left[\ell(\pi(\mathbf{s}),a_{E})\right]$.
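A minimal behavior-cloning step for objective (2) is sketched below in PyTorch; the linear scorer standing in for the GNN policy and the data layout are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def imitation_step(policy_net, optimizer, states, expert_actions):
    """One gradient step of (2): cross-entropy between pi(s) and the expert's branching choice."""
    logits = policy_net(states)                      # (batch, n_candidates) scores over variables
    loss = F.cross_entropy(logits, expert_actions)   # l(pi(s), a_E)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: a linear "policy" scoring 10 candidate variables from 16-dimensional states.
policy = torch.nn.Linear(16, 10)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
states = torch.randn(32, 16)
expert_actions = torch.randint(0, 10, (32,))
print(imitation_step(policy, opt, states, expert_actions))
```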

4.2 RL-based branching formulation

In the RL setting, the agent explores the B&B tree using the current policy $\pi$ and improves the policy by minimizing the negative accumulated reward,

$$\min_{\pi}\ \mathbb{E}_{G\sim p(G)}\,\mathbb{E}_{\mathbf{s}\sim D_{\pi,G,\mathcal{T}}}\left[\sum_{t=0}^{T-1}-\gamma^{t}r(\mathbf{s}_{t})\right], \qquad (3)$$

where we note that the state distribution $D_{\pi,G,\mathcal{T}}$ depends on the policy $\pi$, the instance $G$ and the transition function $\mathcal{T}$. We let $R(\pi,G,\mathcal{T}):=\mathbb{E}_{\mathbf{s}\sim D_{\pi,G,\mathcal{T}}}\left[\sum_{t=0}^{T-1}-\gamma^{t}r(\mathbf{s}_{t})\right]$.
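As a rough illustration of objective (3) only, the REINFORCE-style surrogate below weights the log-probabilities of the taken branching actions by discounted returns; the paper's RL solver instead builds on the tree MDP formulation of [11], so this simplification is an assumption.

```python
import torch

def policy_gradient_loss(log_probs, rewards, gamma=0.99):
    """REINFORCE surrogate for (3): weight each log pi(a_t|s_t) by the discounted return-to-go."""
    returns, g = [], 0.0
    for r in reversed(rewards):              # accumulate discounted returns backwards in time
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    return -(torch.stack(log_probs) * returns).sum()   # minimizing this ascends the expected return

# Toy usage: three branching steps with fake log-probabilities and rewards.
log_probs = [torch.tensor(-0.5, requires_grad=True) for _ in range(3)]
loss = policy_gradient_loss(log_probs, rewards=[-1.0, -1.0, -1.0])
loss.backward()
print(loss.item())
```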

5 From Robust Optimization to Instance Augmentation

Given a branching policy $\pi$ and a training instance distribution $p(G)$, we formulate distributionally robust training as the following optimization problems:

$$\min_{\pi}\sup_{\tilde{p}}\ \mathbb{E}_{\tilde{G}\sim\tilde{p}(G)}L(\pi,\tilde{G},\mathcal{T}),\quad \text{s.t. }\ \mathcal{W}_{\mathcal{G}}(p,\tilde{p})\leq\rho; \qquad \text{(IL)}$$
$$\min_{\pi}\sup_{\tilde{p}}\ \mathbb{E}_{\tilde{G}\sim\tilde{p}(G)}R(\pi,\tilde{G},\mathcal{T}),\quad \text{s.t. }\ \mathcal{W}_{\mathcal{G}}(p,\tilde{p})\leq\rho, \qquad \text{(RL)}$$

where $\mathcal{W}_{\mathcal{G}}(p,\tilde{p})$ is the Wasserstein distance that measures the distance between the distributions $p(G)$ and $\tilde{p}(G)$. Specifically, we define $\mathcal{W}_{\mathcal{G}}(p,\tilde{p}):=\inf_{\mu}\mathbb{E}_{\mu}\left[c(G,\tilde{G})\right]$, where $\mu$ is a coupling of $p(G)$ and $\tilde{p}(G)$. We let $F(\pi,G,\mathcal{T})$ represent $L$ or $R$ to simplify the notation. However, computing the supremum over the environment variable is intractable, so we use a Lagrangian relaxation with a penalty $\beta$ instead:

$$\min_{\pi}\sup_{\tilde{p}}\ \mathbb{E}_{\tilde{G}\sim\tilde{p}(G)}F(\pi,\tilde{G},\mathcal{T})-\beta\,\mathcal{W}_{\mathcal{G}}(p,\tilde{p}).$$

[43] shows that we can define a surrogate loss function

$$\varphi(\pi,G,\mathcal{T}):=\sup_{\tilde{G}\in\mathcal{G}}\left[F(\pi,\tilde{G},\mathcal{T})-\beta\,c(G,\tilde{G})\right]$$

such that the gradient of the surrogate function satisfies

$$\nabla_{\theta}\varphi(\pi_{\theta},G,\mathcal{T})=\nabla_{\theta}F(\pi_{\theta},G^{*},\mathcal{T}),$$

where the augmented instance $G^{*}$ is

$$G^{*}=\operatorname*{arg\,max}_{\tilde{G}\in\mathcal{G}}\left[F(\pi_{\theta},\tilde{G},\mathcal{T})-\beta\,c(G,\tilde{G})\right].$$

This implies that to compute the gradient of $\varphi$, we only need to compute the gradient of the objective on the augmented instances $G^{*}$. Thus, we turn to finding an effective augmentation strategy.
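For illustration, if one already has a small pool of candidate augmentations, $G^{*}$ can be approximated by scoring each candidate with the penalized objective; both helper functions below are hypothetical placeholders, not part of AdaSolver's interface.

```python
def select_augmented_instance(policy, instance, candidates, loss_fn, cost_fn, beta=0.01):
    """Approximate G* = argmax_{G~} [F(pi, G~, T) - beta * c(G, G~)] over a finite candidate pool.

    `loss_fn(policy, graph)` estimates F from samples collected on the augmented instance and
    `cost_fn(g, g_tilde)` is the graph discrepancy c; both are assumed, hypothetical helpers.
    """
    scores = [loss_fn(policy, g) - beta * cost_fn(instance, g) for g in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores[best]
```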

We first specify the optimal policy $\pi^{*}$ under the testing distribution, which is the best-generalizing policy in the hypothesis policy class $\Pi$, i.e.,

$$\pi^{*}=\operatorname*{arg\,min}_{\pi\in\Pi}\ \mathcal{J}_{p^{\prime}}(\pi),\qquad \mathcal{J}_{p^{\prime}}(\pi)=\mathbb{E}_{G^{\prime}\sim p^{\prime}}\left[F(\pi,G^{\prime},\mathcal{T})\right]. \qquad (4)$$

In practice, we can only access a finite set of instances $\mathcal{D}=\{G_{i}\}_{i=1}^{n}$ sampled from the distribution to estimate the objective, and we obtain the best policy $\hat{\pi}$ on the dataset,

$$\hat{\pi}=\operatorname*{arg\,min}_{\pi\in\Pi}\ \hat{\mathcal{J}}_{p}(\pi), \qquad (5)$$

where the empirical risks are

$$\hat{\mathcal{J}}_{p}(\pi)=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{T}\sum_{t=0}^{T-1}\ell(\pi(\mathbf{s}^{i}_{t}),a^{i}_{E,t}); \qquad \text{(IL)}$$
$$\hat{\mathcal{J}}_{p}(\pi)=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{T}\sum_{t=0}^{T-1}-\gamma^{t}r(\mathbf{s}^{i}_{t}). \qquad \text{(RL)}$$

We then derive the objective using data augmentation. We denote the distribution of augmented instances by $\Phi(p)$ and the best policy on the augmented dataset by $\hat{\pi}_{\Phi}$,

$$\hat{\pi}_{\Phi}=\operatorname*{arg\,min}_{\pi\in\Pi}\ \hat{\mathcal{J}}_{\Phi(p)}(\pi), \qquad (6)$$

in which $\hat{\mathcal{J}}_{\Phi(p)}(\pi)$ is the empirical risk computed on the augmented instances.

6 AdaSolver

In this section, we present the robust optimization framework for learning-based exact MILP solvers using adversarial instance augmentation, namely Adversarial Instance Augmentation for the exact MILP Solver (AdaSolver).

Figure 2: Overview of AdaSolver framework. AdaSolver jointly optimizes a learning-based branching policy and an adversarial graph augmenter. In each training epoch, we transform a subset of MILP instances into instance bipartite graphs and send them to the augmenter. Then the augmenter decides which nodes and edges to mask. After that, we transform the augmented graphs back to instances. We also leverage a discriminator for data selection. To train the branching policy, we collect branching samples from the original and selected augmented MILP instances. The branching policy records the average loss over samples generated from each augmented instance and returns the loss to compute reward signals for the augmenter. For training efficiency, we use a replay buffer to perform experience replay for the augmenter.

6.1 Adversarial Augmentation

The overview of the proposed framework on the learn-to-branch task is depicted in Figure 2, which consists of two components: a learning-based branching policy network and an adversarial augmenter. The adversarial augmenter employs data augmentation to promote the diversity of graph structures in training data, while the learning-based branching policy network tries to minimize the training loss on samples from both the original and augmented instances to extract task-relevant features across distributions.

6.1.1 Instance Graph Augmentation Operators

Inspired by the fact that the MILP problem structures are encoded in the instance graph structures, we propose to augment the instances by masking variables, constraints, and edges in the instance bipartite graphs to promote the diversity of graph structures. These operators are also reasonable from a real-world perspective. Notice that many MILP problems arise from real-world scheduling scenarios. These problems often feature variables that represent available workers, machines, vehicles, or other material resources, and they feature constraints that formulate the restrictive scheduling relationships among these resources. Therefore, these simple and reliable augmentation operators can also simulate perturbations of real-world resources, i.e., certain materials become inaccessible or some restrictions are lifted. Moreover, recent studies have shown that GNNs struggle to process certain “foldable” MILPs, as stated in [44]. In real-world scenarios, CO instances are highly structured and symmetric, resulting in numerous such structures in the training data that GNNs are unable to distinguish. This issue exacerbates the challenges posed by the limited diversity of training data. By augmenting the graph structures, AdaSolver effectively breaks the symmetric structures in training instances and promotes the diversity of instance graph structures. Thus, instance augmentation can improve not only the generalization ability of the branching policy but also the sample efficiency, since we can use fewer instances to attain comparable performance, as we show in Section 8.3.

In practice, we control the usage of the augmentation operators and the masking proportion so that the augmented instances preserve a similar problem structure and computational hardness to the original instances, preventing harmful effects on training.
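A minimal NumPy sketch of the three masking operators on the bipartite representation is given below; the uniform sampling and the fixed masking proportions are only for illustration (AdaSolver learns which elements to mask), and the index remapping of surviving nodes is omitted for brevity.

```python
import numpy as np

def mask_instance_graph(constraint_feats, variable_feats, edge_index, edge_weight,
                        p_con=0.05, p_var=0.05, p_edge=0.05, seed=0):
    """Randomly mask a small fraction of constraints, variables, and edges (index remap omitted)."""
    rng = np.random.default_rng(seed)
    keep_con = rng.random(len(constraint_feats)) >= p_con
    keep_var = rng.random(len(variable_feats)) >= p_var
    keep_edge = rng.random(len(edge_index)) >= p_edge
    # An edge also disappears if either of its endpoints was masked.
    keep_edge &= keep_con[edge_index[:, 0]] & keep_var[edge_index[:, 1]]
    return (constraint_feats[keep_con], variable_feats[keep_var],
            edge_index[keep_edge], edge_weight[keep_edge])
```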

6.1.2 Contextual Bandit for Augmentation

The instance augmenter is a function $\Phi:\mathcal{G}\to\mathcal{G}$ mapping an instance bipartite graph $G$ to the augmented instance graph $\Phi(G)$ by applying augmentation operators to $G$. However, two challenges for instance augmentation remain to be addressed. First, applying structural augmentation to a bipartite graph is generally a non-differentiable process. It is thus intractable to train the adversarial augmenter, since the non-differentiable mapping $\Phi$ poses challenges for backward propagation and gradient-based updates fail. Second, the training efficiencies of the augmenter and the branching policy network differ. The training of the augmenter is much slower than that of the branching policy network since the augmenter receives much less training data. The adversarial augmenter operates on MILP instance graphs $\{G_{k}\}_{k}$ to obtain augmented instances $\{\Phi(G_{k})\}_{k}$. The expert policy then runs on each augmented instance to generate a sequence of branching samples containing solving states and actions $\{(\mathbf{s},\pi_{E}(\mathbf{s}))\}$ in the IL setting, or the agent runs on the augmented instances to collect samples $\{(\mathbf{s},a,r(\mathbf{s},a))\}$ in the RL setting to calibrate the branching policy. Consequently, the adversarial augmenter receives significantly less training data $\{\Phi(G_{k})\}_{k}$ than the branching policy. The issues of non-differentiability and unsatisfactory training efficiency require a sample-efficient training strategy under sparse supervision signals. Therefore, we resort to a contextual bandit version of the proximal policy optimization (PPO) algorithm [17].

To formulate graph augmentation as a contextual bandit problem, we view the solver as the environment and the adversarial augmenter as the agent. A contextual bandit problem can be represented by the tuple $(\mathcal{G},\mathcal{A}^{\text{CB}},r^{\text{CB}})$. To avoid confusion with the branching MDP, we use the superscript CB for the contextual bandit formulation. We specify the state space $\mathcal{G}$, the action space $\mathcal{A}^{\text{CB}}$ and the reward function $r^{\text{CB}}:\mathcal{G}\times\mathcal{A}^{\text{CB}}\to\mathbb{R}$ for a MILP instance as follows.

  • The state space $\mathcal{G}$, also known as the context space, is the set of graph representations of MILP instances.

  • An action $a^{\text{CB}}=(V^{\text{CB}},E^{\text{CB}},W^{\text{CB}})\in\mathcal{A}^{\text{CB}}$, where $V^{\text{CB}}$, $E^{\text{CB}}$ and $W^{\text{CB}}$ are the subsets of variables, edges, and constraints to mask, respectively. The agent applies action $a^{\text{CB}}$ to an instance graph $G$ to obtain an augmented instance graph $\Phi(G)$. After that, we collect branching samples on it.

  • The reward $r^{\text{CB}}(G,a^{\text{CB}})$ is defined as the average loss of the branching policy on the samples collected from the graph $\Phi(G)$ augmented from $G$.

We describe the training process for the augmenter policy network $\Phi_{\eta}$ with parameters $\eta$ under the contextual bandit formulation as follows. In practice, the augmenter policy network takes as input an instance graph and outputs the masking probability of each node and edge. We select an action and record its predicted probability, i.e., $a^{\text{CB}}=\operatorname*{arg\,max}_{a}\Phi_{\eta}(a|G)$. We leverage a learnable state-value function $V_{\alpha}:\mathcal{G}\to\mathbb{R}$ with parameters $\alpha$ to calculate the variance-reduced estimate of the advantage function $\hat{A}=r^{\text{CB}}-V_{\alpha}(G)$. To improve the training efficiency of the augmenter, we use a replay buffer $\mathcal{B}$ consisting of tuples $(G,\hat{A},\Phi_{\eta}(a^{\text{CB}}|G))$ under the historical augmentation policies to perform experience replay. We use $\lambda(\eta)=\frac{\Phi_{\eta_{\text{current}}}(a^{\text{CB}}|G)}{\Phi_{\eta_{\text{old}}}(a^{\text{CB}}|G)}$ to denote the probability ratio between the current policy and the old policies sampled from the buffer. Finally, we train the adversarial augmenter, which includes the policy network and the state-value function, by minimizing

$$L_{V}(\alpha)=\mathbb{E}_{\mathcal{B}}\left[\hat{A}^{2}\right] \quad\text{ and } \qquad (7)$$
$$L_{\Phi}(\eta)=\mathbb{E}_{\mathcal{B}}\left[\min\bigl(\lambda(\eta),\operatorname{clip}(\lambda(\eta),1-\epsilon,1+\epsilon)\bigr)\hat{A}\right],$$

where $\epsilon$ is the clip ratio. We show in Table VII that the sample-efficient bandit algorithm plays a vital role in the overall training process.
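A hedged sketch of the clipped update in (7), written in the spirit of PPO [17], is given below; the tensor layout of the replay-buffer tuples and the network interfaces are assumptions.

```python
import torch

def augmenter_update(new_log_probs, old_log_probs, rewards, values, clip_eps=0.2):
    """One clipped policy/value update from buffer tuples (G, A_hat, Phi_eta(a^CB|G))."""
    advantages = rewards - values                         # A_hat = r^CB - V_alpha(G)
    ratio = torch.exp(new_log_probs - old_log_probs)      # lambda(eta), from stored log-probs
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * advantages.detach(),
                             clipped * advantages.detach()).mean()   # clipped surrogate for L_Phi
    value_loss = (advantages ** 2).mean()                 # L_V(alpha) in (7)
    return policy_loss, value_loss

# Toy usage on a buffer of four augmented instances.
pl, vl = augmenter_update(torch.randn(4), torch.randn(4),
                          torch.rand(4), torch.rand(4, requires_grad=True))
print(pl.item(), vl.item())
```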

6.1.3 Discriminator for Data Augmentation

The adversarial augmenter tends to produce instances quite different from the original distribution, leading to some extremely easy or hard instances. This is because the training objective is to maximize the loss of the GNN on the augmented data, but the augmenter does not receive any restrictions on the augmentation. Thus, the augmenter may ignore the distribution information and just produce some hard OOD instances, which might break the MILP problem structures. To address this problem, we leverage a discriminator network $D_{\psi}$ to identify and remove those instances that severely deviate from the original distribution.

We instantiate the discriminator using a graph convolutional encoder and an MLP decoder. The graph encoder extracts the structural information of the MILP instance and the decoder predicts the probability that the instance comes from the original distribution. To train the discriminator, we minimize the objective

$$\min_{\psi}\ L_{D}(\psi)=\mathbb{E}_{G}\left[D_{\psi}(\Phi(G))\right]+\mathbb{E}_{G}\left[1-D_{\psi}(G)\right].$$

At inference time, the discriminator takes as input the augmented graph; the instance is used for training the GNN when $D(G)>0.5$ and discarded otherwise. Moreover, we use an additional regularized reward for the augmenter, i.e.,

$$r^{\text{CB}}=F(\pi_{\theta},\Phi(G),\mathcal{T})-\beta D(\Phi(G))$$

to control the augmentation, where $\beta=0.01$ is a constant.
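The filtering rule and the regularized bandit reward can be sketched as follows; the discriminator is passed in as an opaque callable, and this wrapper is an illustrative assumption rather than the paper's implementation.

```python
import torch

def filter_and_reward(discriminator, branching_loss, augmented_graph, beta=0.01):
    """Keep an augmented instance only if D(G) > 0.5, and regularize the bandit reward."""
    score = discriminator(augmented_graph)          # predicted probability of being in-distribution
    keep = score.item() > 0.5
    reward = branching_loss - beta * score.item()   # r^CB = F(pi_theta, Phi(G), T) - beta * D(Phi(G))
    return keep, reward

# Toy usage with a stub discriminator that always outputs 0.7.
print(filter_and_reward(lambda g: torch.tensor(0.7), branching_loss=1.3, augmented_graph=None))
```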

We provide the pseudo-code for the AdaSolver framework in Algorithm 1.

INITIALIZE: learning-based branching policy $\pi_{\theta}$, adversarial graph augmenter $(\Phi_{\eta},V_{\alpha},D_{\psi})$, CO instances $\mathcal{D}$, original training dataset $\mathcal{D}_{\text{train}}$, number of training epochs $N$, learning rate $\kappa$.
for $N$ epochs do
       Randomly sample $K$ MILPs $\{G_{k}\}_{k=1}^{K}$ from $\mathcal{D}$
       for $k=1,\dots,K$ do
             Perform augmentation on $G_{k}$ with policy $\Phi_{\eta}$ to obtain the augmented instance $\Phi(G_{k})$
             Train $\pi_{\theta}$ using IL or RL and record the average loss $r^{\text{CB}}_{k}$ on $\Phi(G_{k})$
             Record $(G_{k},\hat{A}_{k},\Phi_{\eta}(a_{k}^{\text{CB}}|G_{k}))$ in the buffer
       Train $\pi_{\theta}$ using IL or RL on $\mathcal{D}$
       Compute $L_{\Phi}(\eta)$, $L_{V}(\alpha)$ and $L_{D}(\psi)$ over the replay buffer
       Update the parameters $\eta$, $\alpha$ and $\psi$
OUTPUT: parameters of the branching policy $\theta^{*}$
Algorithm 1 Training of AdaSolver.
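The Python outline below mirrors Algorithm 1; every component interface (augmenter, discriminator, sample collection, optimizer updates) is a hypothetical placeholder, so this is a structural sketch rather than the released implementation.

```python
import random

def train_adasolver(branching_policy, augmenter, discriminator, instances,
                    n_epochs, batch_size, collect_and_train, update_augmenter):
    """Structural sketch of Algorithm 1 with hypothetical component interfaces."""
    buffer = []
    for _ in range(n_epochs):
        batch = random.sample(instances, batch_size)                 # sample K MILPs from D
        for G in batch:
            G_aug, log_prob = augmenter.augment(G)                   # apply masking policy Phi_eta
            if discriminator.keeps(G_aug):                           # drop far-OOD augmentations
                loss = collect_and_train(branching_policy, [G_aug])  # IL/RL step; avg loss is r^CB
                advantage = loss - augmenter.value(G)                # A_hat = r^CB - V_alpha(G)
                buffer.append((G, advantage, log_prob))
        collect_and_train(branching_policy, batch)                   # IL/RL step on original data
        update_augmenter(augmenter, discriminator, buffer)           # minimize L_Phi, L_V and L_D
    return branching_policy
```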

7 Theoretical Analysis for Adversarial Augmentation Framework

In this section, we provide a detailed theoretical analysis of the proposed instance augmentation technique. We analyze the generalization bound for data augmentation in both the IL and RL settings. We suppose that the augmented instances are drawn from a perturbed distribution $\Phi(p)$.

7.1 IL Setting

Since the branching samples can be shuffled and collected by interval sampling, the transition function $\mathcal{T}$ may have little effect on the training dataset. We thus omit the transition $\mathcal{T}$ in the IL setting and view the collected branching samples as drawn independently from the same distribution. For the hypothesis policy class $\Pi$, we use the classical Rademacher complexity $\mathfrak{R}_{m}(\ell\circ\Pi)$,

$$\mathfrak{R}_{m}(\ell\circ\Pi)=\mathbb{E}_{\mathcal{D}}\mathbb{E}_{\sigma}\left[\sup_{\pi\in\Pi}\frac{1}{K}\sum_{k=1}^{K}\sigma_{k}\,\ell(\pi(\mathbf{s}_{k}),\pi_{E}(\mathbf{s}_{k}))\right],$$

and its counterpart $\mathfrak{R}_{m}^{\Phi}(\ell\circ\Pi)$ under instance augmentation,

$$\mathfrak{R}_{m}^{\Phi}(\ell\circ\Pi)=\mathbb{E}_{\mathcal{D}}\mathbb{E}_{\sigma}\left[\sup_{\pi\in\Pi}\frac{1}{K}\sum_{k=1}^{K}\sigma_{k}\,\ell(\pi(\tilde{\mathbf{s}}_{k}),\pi_{E}(\tilde{\mathbf{s}}_{k}))\right],$$

to bound the generalization gap, where $\mathbf{s}\sim D_{\pi_{E},p}$ and $\tilde{\mathbf{s}}\sim D_{\pi_{E},\Phi(p)}$. Furthermore, to simplify the notation, we denote the quantity $\sqrt{\frac{\log(1/\delta)}{2k}}$ by $\Omega(k)$.

Now we derive the IL generalization bound between the empirical risk on the training dataset and the expected risk on the testing distribution. The classical proof of the generalization bound with Rademacher complexity can be found in [45], and the in-distribution generalization bound for data augmentation in [46]. We now derive the OOD version of the generalization bound under data augmentation in the IL setting (9).

Theorem 1.

Suppose that the loss function $\ell(\cdot,\cdot)$ is bounded by 1 and is $\frac{1}{2}$-Lipschitz continuous. With probability at least $1-\delta$ and for any $\pi\in\Pi$, we have the following generalization bounds in the out-of-distribution imitation learning setting.

  • The generalization bound for empirical risk minimization:

$$\mathcal{J}_{p^{\prime}}(\hat{\pi})-\hat{\mathcal{J}}_{p}(\hat{\pi})\leq \mathcal{W}_{\|\cdot\|}(D_{\pi_{E},p}(\mathbf{s}),D_{\pi_{E},p^{\prime}}(\mathbf{s})) +2\mathfrak{R}_{K}(\ell\circ\Pi)+\Omega(K). \qquad (8)$$
  • The generalization bound for data augmentation:

$$\mathcal{J}_{p^{\prime}}(\hat{\pi})-\hat{\mathcal{J}}_{\Phi(p)}(\hat{\pi})\leq \mathcal{W}_{\|\cdot\|}(D_{\pi_{E},\Phi(p)}(\mathbf{s}),D_{\pi_{E},p^{\prime}}(\mathbf{s})) +2\mathfrak{R}_{K}^{\Phi}(\ell\circ\Pi)+\Omega(K). \qquad (9)$$

The notation $D_{\pi,\tilde{p}}(\mathbf{s})$ denotes the distribution over branching samples collected from instances drawn from distribution $\tilde{p}$ using policy $\pi$, and $\mathcal{W}_{\|\cdot\|}$ is the Wasserstein distance between two distributions with respect to the norm $\|\cdot\|$ over the branching sample space. Furthermore, we suppose that instance augmentation brings the training distribution closer to the testing distribution, i.e., for some $\alpha<1$,

$$\mathcal{W}_{\|\cdot\|}(D_{\pi_{E},\Phi(p)},D_{\pi_{E},p^{\prime}})\leq\alpha\,\mathcal{W}_{\|\cdot\|}(D_{\pi_{E},p},D_{\pi_{E},p^{\prime}}).$$

For $\alpha\leq 1/3$, the generalization bound in Equation (9) is lower than that in Equation (8), that is,

$$\mathcal{W}_{\|\cdot\|}(D_{\pi_{E},\Phi(p)}(\mathbf{s}),D_{\pi_{E},p^{\prime}}(\mathbf{s}))+2\mathfrak{R}_{K}^{\Phi}(\ell\circ\Pi) \leq \mathcal{W}_{\|\cdot\|}(D_{\pi_{E},p}(\mathbf{s}),D_{\pi_{E},p^{\prime}}(\mathbf{s}))+2\mathfrak{R}_{K}(\ell\circ\Pi). \qquad (10)$$

Thus, Theorem 1 shows that data augmentation is beneficial for out-of-distribution generalization settings.

7.2 RL Setting

We now explore the theory of instance augmentation in the OOD RL setting. The transition function $\mathcal{T}$ in the RL setting cannot be omitted since we use the parameterized policy to collect branching samples sequentially. We thus need to take this sequential nature into account when analyzing generalization. We denote the Rademacher complexity for the policy class $\Pi$ by

$$\mathfrak{R}_{n}(R\circ\Pi)=\mathbb{E}_{\mathcal{D}}\mathbb{E}_{\sigma}\left[\sup_{\pi\in\Pi}\frac{1}{n}\sum_{j=1}^{n}\sigma_{j}R(\pi,\mathcal{T},G_{j})\right].$$

For instance augmentation,

$$\mathfrak{R}_{n}^{\Phi}(R\circ\Pi)=\mathbb{E}_{\mathcal{D}}\mathbb{E}_{\sigma}\left[\sup_{\pi\in\Pi}\frac{1}{n}\sum_{j=1}^{n}\sigma_{j}R(\pi,\mathcal{T},\Phi(G_{j}))\right].$$

We derive the following generalization bound and generalization error for the RL setting in Theorem 2.

Notice that the sample distribution $D_{\pi,G,\mathcal{T}}$ is closely related to $\pi$ and changes with different $\pi$, posing difficulties in analyzing the generalization gap. Thus, we consider the reparameterizable RL proposed in [47]. Notice that given an instance $G$ and a deterministic policy $\pi$, the solving process is deterministic since the transition $\mathcal{T}$ is deterministic. Reparameterizable RL supposes that we can use a random variable $\xi$ to model the random noise and employ the reparameterization trick such that

$$\mathbb{E}_{G\sim p(G)}R(\pi,\mathcal{T},G)=\mathbb{E}_{\xi\sim p(\xi)}R(\pi,\mathcal{T},\mathcal{I},\xi).$$

The RL process thus samples from a fixed noise distribution $p(\xi)$ in the training and testing environments to produce instances $G=\mathcal{I}(\xi)$, and models the distributional shifts through different initialization functions. This provides significant technical convenience for our analysis.

Theorem 2.

Consider solvers that can be formulated as reparameterizable RL with horizon $T$ and parameter $\xi$. Suppose that the expected accumulated cost function $R$ is normalized and bounded by 1, and that the state reward function $r:\mathcal{S}\to\mathbb{R}$ is $\text{Lip}_{r}$-Lipschitz. With probability at least $1-\delta$, we have

$$\mathcal{J}_{p^{\prime}}(\hat{\pi}_{\Phi})-\mathcal{J}_{p^{\prime}}(\pi^{*})\leq 2\mathfrak{R}_{n}^{\Phi}(R\circ\Pi)+2\Omega(n) +2\sum_{t=0}^{T-1}\text{Lip}_{r}\bigl(\text{Lip}_{\mathcal{T}_{2}}\text{Lip}_{\pi}+\text{Lip}_{\mathcal{T}_{1}}\bigr)^{t}\,\mathcal{W}_{\mathcal{G}}(\Phi(p),p^{\prime}). \qquad (11)$$

Similar to Theorem 1, the bound in (11) is lower than the corresponding bound without instance augmentation.

8 Experiments

In this section, we conduct comparative experiments on multiple well-recognized benchmarks to demonstrate the effectiveness of our proposed AdaSolver framework in terms of solving efficiency and sample efficiency. We also perform ablation studies and extended studies to provide further insight into AdaSolver.

8.1 Experiment Setup

8.1.1 Benchmarks

We evaluate the performance of our proposed AdaSolver on four widely used benchmarks of NP-hard synthetic problem families and a challenging real-world Anonymous dataset. The synthetic datasets include set covering, combinatorial auctions, capacitated facility location, and maximum independent set. We follow the conventional data generation settings in [10], [26] and [27]. For each benchmark, we generate 10000 training instances and 2000 validation instances for AdaSolver. For the final evaluation, we generate an in-distribution testing set $D_{1}$ of 32 instances, and five transfer sets $D_{2}$-$D_{6}$ of 32 instances each with larger scales. We emphasize that the instance scales in the transfer datasets increase in order and the maximum scale is much larger than that in previous works. The Anonymous dataset is a well-recognized real-world dataset consisting of a curated subset of the challenging MILPLIB library. It consists of large-scale instances, with 98 for training, 20 for validation and 20 for testing. The instances in the Anonymous dataset have an average of 49603 constraints and 37881 variables.

We report the instance scales for each benchmark in Table I. Due to limited space, more information about the benchmarks and instance generation for the synthetic datasets can be found in Appendix C.

TABLE I: Statistical information for the four synthetic benchmarks. Notice that the largest distributions are at least five times larger than the training distributions, which is much larger than in previous works.
Set Covering Combinatorial Auctions
$D_{1}$ $D_{2}$ $D_{3}$ $D_{4}$ $D_{5}$ $D_{6}$ $D_{1}$ $D_{2}$ $D_{3}$ $D_{4}$ $D_{5}$ $D_{6}$
Constraint number 500 1000 2000 4000 3000 8000 193.03 386.25 576.75 771.65 963.56 1156.59
Variable number 1000 1000 1000 1000 1000 1000 500 1000 1500 2000 2500 3000
Capacitated Facility Location Maximum Independent Set
$D_{1}$ $D_{2}$ $D_{3}$ $D_{4}$ $D_{5}$ $D_{6}$ $D_{1}$ $D_{2}$ $D_{3}$ $D_{4}$ $D_{5}$ $D_{6}$
Constraint number 10201 20301 40501 60701 80901 161701 1952.34 3946.56 5942.34 7936.40 9934.12 11931.65
Variable number 10100 20100 40100 60100 80100 160100 500 1000 1500 2000 2500 3000

8.1.2 Metric

We report the average solving time (Time, lower is better) and primal-dual integral (PD integral, lower is better) for different branching approaches for the comparative experiments. For more details on the PD integral metric, please see Appendix A.3. Moreover, we provide more results in terms of another two well-recognized metrics, i.e., the average branching nodes (Node) and the average primal-dual gap (PD gap, lower is better) in Appendix E.1.

Following [12], we mainly focus on the Time metric on the distributions where the solver can solve all the instances within the time limit, since the solving time is the most direct metric of solving efficiency. However, the average solving time is almost the same across methods on the distributions where the solvers cannot solve most of the instances within the time limit. As a result, it is hard to tell which method performs best, and the solving time metric is not suitable on these distributions. We therefore focus on another well-recognized metric, the primal-dual integral, on these distributions. Given a time limit, a lower primal-dual integral implies a better quality of the feasible solutions found by the solver. Specifically, we focus on the Time metric on $D_{1}$-$D_{2}$ for the synthetic benchmarks and on $D_{3}$ for the combinatorial auctions and capacitated facility location benchmarks. We focus on the primal-dual integral on the other distributions.
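The exact definition of the PD integral is deferred to Appendix A.3; purely as a rough illustration, one common way to accumulate a primal-dual integral from a solving trace is sketched below, and this step-wise rule is an assumption rather than the metric used in the paper.

```python
def primal_dual_integral(times, primal_bounds, dual_bounds):
    """Accumulate the primal-dual gap over time with a step-wise (left-endpoint) rule.

    `times` holds increasing timestamps, and the bound lists hold the incumbent primal
    and dual bounds recorded at those timestamps.
    """
    integral = 0.0
    for k in range(len(times) - 1):
        gap = abs(primal_bounds[k] - dual_bounds[k])
        integral += gap * (times[k + 1] - times[k])
    return integral

# Toy trace over a 900s limit: the gap shrinks from 10 to 2.
print(primal_dual_integral([0.0, 300.0, 900.0], [10.0, 6.0, 4.0], [0.0, 2.0, 2.0]))  # 5400.0
```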

The Node metric reflects the size of B&B searching trees. When the solver solves out an instance, fewer explored nodes imply a more efficient searching policy.

8.1.3 Baselines

Our baselines include a SCIP branching rule and six learning-based branchers, containing competitive IL-based and RL-based approaches. The SCIP branching rule is the traditional state-of-the-art reliability pseudo-cost (RPB) branching heuristic. We then introduce the learning-based branching baselines. SVMRank [9] uses a support vector machine for learning to select the branching variables. Trees [48] is a branching policy using ExtraTrees. LMart [49] is based on LambdaMART. tMDP+DFS is a reinforcement learning method based on tree MDP [11]. TreeGate is a tree search branching approach [36] trained with imitation learning. GNN is the graph neural network approach for learning to branch [10]. Finally, we build our AdaSolver framework on the IL-based GNN branching policy (GNN) [10] and the RL-based GNN branching policy (tMDP+DFS) [11], resulting in two methods, AdaSolver-IL and AdaSolver-RL. As stated in [11], the RL-based method performs worse than RPB or the IL-based baselines. For the RL-based setting, we thus mainly focus on the improvement of AdaSolver-RL over the underlying RL-based solvers.

8.1.4 Implementation

Throughout all experiments, the backend solver is the state-of-the-art open-source solver SCIP 7.0.3. The solving time limit on a testing instance is 900s. Similar to [10, 26, 27], we deactivate solver restarts and allow the cutting plane generation module only at the root node. We keep all the other SCIP parameters at their defaults so as to better observe the improvements from AdaSolver and to make the comparisons of these approaches as fair as possible.

For model training, we use Adam optimizer [50] to optimize the model parameters. When training AdaSolver, the adversarial augmenter randomly chooses training instances to perform augmentation in each training epoch. We then collect branching samples from the augmented instances. We choose the augmentation operators that perform the best for each benchmark. More details for these baselines and our AdaSolver are available in Appendix D.

8.2 Comparative Experiments on Solving Efficiency

8.2.1 Scale Generalization

TABLE II: The generalization results on multiple datasets. We report the arithmetic mean of solving time and PD integral. The metrics we mainly focus on are marked with \downarrow, and the lower values imply better performance. The best performance is marked in bold.
$D_{1}$ $D_{2}$ $D_{3}$ $D_{4}$ $D_{5}$ $D_{6}$
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
RPB 10.98 114.91 87.62 666.215 831.41 7152.10 900.00 11494.93 900.00 14757.09 900.00 23564.46
SVMRank 11.40 116.21 158.29 989.67 890.66 8546.88 900.03 13444.79 900.05 17018.42 900.05 24535.92
Trees 14.86 131.23 210.26 1238.87 899.87 8690.80 900.01 13226.00 900.02 16772.03 900.02 24599.38
LambdaMart 9.46 106.92 127.34 812.27 875.87 7797.14 900.01 12927.04 900.02 16668.55 900.03 24525.11
tMDP+DFS 20.73 159.59 444.38 2892.12 900.006 11559.79 900.00 15393.24 900.00 18384.06 900.00 24387.4
TreeGate 25.16 187.99 271.82 1790.70 900.00 10391.53 900.00 14389.55 900.00 17747.06 900.02 24118.51
GNN 9.25 116.22 95.57 716.25 844.74 7310.17 900.00 11899.40 900.00 16067.00 900.00 23786.51
AdaSolver-RL 18.87 152.26 433.97 2885.97 900.00 11330.87 900.00 14466.95 900.00 17550.63 900.00 24319.00
AdaSolver-IL 6.28 80.67 73.78 515.92 839.80 7079.13 900.00 10691.05 900.00 14452.71 900.00 21792.69
Set Covering
$D_{1}$ $D_{2}$ $D_{3}$ $D_{4}$ $D_{5}$ $D_{6}$
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
RPB 2.84 32.20 23.20 156.62 224.66 156.62 838.77 1622.46 900.00 2313.35 900.004 2657.51
SVMRank 2.59 26.32 51.12 194.46 427.01 625.01 881.33 1730.40 900.07 2293.06 900.00 2580.31
Trees 2.76 26.44 87.05 221.68 790.01 994.48 900.01 1782.20 900.01 2323.16 900.01 2535.18
LambdaMart 1.90 24.69 28.08 156.10 287.69 446.49 862.31 1563.85 900.02 2136.37 900.02 2455.08
tMDP+DFS 2.18 24.76 27.98 146.15 312.93 518.45 860.76 1871.73 900.00 2664.34 900.00 3103.08
TreeGate 5.75 37.74 84.44 303.53 623.25 1168.37 891.50 2486.65 900.00 3302.70 900.00 3771.89
GNN 2.20 30.48 20.30 167.60 195.68 384.17 819.28 1300.99 900.00 1931.54 900.00 2293.41
AdaSolver-RL 2.04 106.83 17.28 117.28 323.94 517.91 846.32 1685.69 900.00 2379.82 900.00 2980.48
AdaSolver-IL 1.63 22.44 16.34 117.13 178.32 314.32 817.57 1154.79 900.00 1716.55 900.00 2107.19
Combinatorial Auctions
$D_{1}$ $D_{2}$ $D_{3}$ $D_{4}$ $D_{5}$ $D_{6}$
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
RPB 65.67 170.34 287.52 719.26 689.45 3333.72 886.54 10762.83 892.90 20102.78 897.51 27527.24
SVMRank 56.98 274.04 207.16 1011.28 592.15 3913.06 884.87 13898.43 873.08 22201.25 887.00 33649.01
Trees 77.41 285.65 262.54 1038.46 652.15 3895.60 887.00 13649.01 878.81 22433.54 889.54 34909.74
LambdaMart 51.30 281.76 201.06 988.92 591.41 3927.82 884.29 13908.60 872.67 22455.29 886.84 34976.25
tMDP+DFS 73.35 275.38 270.71 1108.57 683.42 4253.40 886.07 12950.94 888.14 23675.80 893.70 34995.77
TreeGate 47.96 180.17 192.29 757.39 661.64 4238.80 873.98 10553.59 886.34 19564.65 897.63 27529.18
GNN 51.63 290.91 235.07 1107.29 660.25 4418.54 854.27 10297.25 868.93 19033.90 890.53 32302.74
AdaSolver-RL 64.10 257.10 233.64 892.77 587.02 3332.38 852.53 7779.28 849.93 14727.70 852.51 27298.25
AdaSolver-IL 34.51 196.56 147.47 738.36 555.66 3324.63 731.81 6644.35 819.31 13818.34 884.48 28231.23
Capacitated Facility Location
$D_{1}$ $D_{2}$ $D_{3}$ $D_{4}$ $D_{5}$ $D_{6}$
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
RPB 16.46 86.57 302.43 1045.88 883.79 3818.32 900.00 4634.32 900.00 5067.01 900.00 5683.46
SVMRank 18.37 76.17 456.19 879.49 883.18 2539.03 900.07 3640.36 900.09 4313.94 900.08 5105.40
Trees 26.04 91.01 701.12 1353.35 900.01 2828.39 900.02 3729.98 900.02 4353.62 900.03 5013.31
LambdaMart 12.80 61.79 445.80 826.64 877.65 2475.11 900.02 3570.20 900.02 4221.15 900.04 5042.00
tMDP+DFS 9.45 54.00 294.04 644.69 781.44 2201.37 897.91 3628.92 900.00 4613.32 900.01 5314.64
TreeGate 67.48 229.32 844.92 2711.83 900.03 3939.96 900.03 4520.56 900.01 4906.46 900.01 5299.72
GNN 11.09 65.73 254.12 558.31 817.04 2284.32 900.00 3532.03 900.01 4387.27 900.01 5168.77
AdaSolver-RL 8.43 51.46 282.54 627.81 803.75 2117.01 879.81 3318.08 900.02 4148.24 900.00 5099.92
AdaSolver-IL 7.39 44.56 210.36 460.80 806.78 2355.11 900.00 3591.38 900.00 4375.48 900.00 5152.28
Maximum Independent Set

We evaluate the generalization ability of AdaSolver and compare it to the other baselines on the four synthetic benchmarks (see Table II). According to Table II, our AdaSolver-IL consistently outperforms the baselines on the four synthetic benchmarks in terms of the Time or PD integral metric. AdaSolver-RL achieves a significant improvement over the underlying RL-based solver. Please refer to Appendix E.1 for results in terms of the Nodes and PD gap metrics.

On the easier distributions $D_{1}$ and $D_{2}$, AdaSolver-IL achieves the shortest solving time and outperforms all the baselines. GNN is a competitive baseline that outperforms the other learning-based baselines on most of the $D_{1}$ and $D_{2}$ distributions of the four benchmarks, while LambdaMart is another promising learning-based baseline with stable performance. Compared to GNN, AdaSolver-IL achieves a 27.66% improvement in solving time on $D_{1}$ and $D_{2}$, leading to a total of 39.06% acceleration over the RPB heuristic. AdaSolver-RL improves the solving time of tMDP+DFS by up to 12.12%. The results demonstrate that AdaSolver is effective in improving performance in in-distribution or mildly out-of-distribution situations.

On the harder distributions $D_{3}$-$D_{6}$, AdaSolver still significantly outperforms the other baselines with the lowest PD integral, leading to an improvement of 11.12% over GNN and 13.15% over tMDP+DFS. This demonstrates that AdaSolver can consistently enhance the generalization ability of the underlying solvers to instances with five to ten times larger scales.

We summarize the improvement of AdaSolver over the underlying learning-based solvers GNN and tMDP+DFS in Table III, demonstrating the effectiveness of AdaSolver.

TABLE III: Improvement of AdaSolver over the underlying solvers on the four synthetic benchmarks.
Benchmark Improvement (%)
AdaSolver-IL AdaSolver-RL
Set Covering 14.44 4.02
Combinatorial Auctions 14.12 10.95
Capacitated Facility Location 26.95 23.35
Maximum Independent Set 7.73 6.87

8.2.2 Perturbation Robustness (Real-World Dataset)

To evaluate the performance in real-world applications, we deploy AdaSolver on the large-scale real-world Anonymous dataset, a widely used subset of the MILPLIB library. The number of constraints in the Anonymous dataset ranges from 3375 to 159823, and the number of variables ranges from 1613 to 92261. The wide range of problem scales as well as the real-world perturbations pose significant challenges for learning-based methods. The results are summarized in Table IV, which demonstrates that AdaSolver is able to improve the performance of the underlying solvers in both the IL and RL settings. AdaSolver-IL reaches the best PD gap and AdaSolver-RL has the best PD integral. The performance of AdaSolver sheds light on generalizable neural solvers for real-world applications.

TABLE IV: The results in the real-world dataset show that AdaSolver significantly outperforms the underlying solvers.
Method Time(s) Nodes PD integral \downarrow PD gap \downarrow
RPB 866.68 43907.95 76018.92 5.50e+19
SVMRank 899.93 14519.66 65522.04 4.83e+19
Trees 900.02 9658.55 76143.65 4.66e+19
LambdaMart 830.45 16229.83 62415.32 4.66e+19
tMDP+DFS 763.81 40933.30 44679.11 5.00e+18
GNN 823.88 37469.58 45483.01 3.50e+18
AdaSolver-RL 754.08 46746.05 42179.25 5.00e+18
AdaSolver-IL 830.50 57156.53 42579.31 3.33e+18

8.3 Improving Sample Efficiency of the Branching Policy

We compare the performance of AdaSolver-IL and the underlying GNN solver trained on only 100 instances, i.e., one percent of the number of training instances used in Section 8.2.1. The results in Table V demonstrate that the underlying learning-based GNN solver suffers significant performance degradation when trained on fewer instances. Specifically, the performance of GNN decreases by 49.70% on the set covering benchmark and 76.45% on the combinatorial auctions benchmark. In this situation, AdaSolver-IL effectively relieves the performance degradation and outperforms GNN, which implies that AdaSolver is much more sample-efficient. In many real-world scenarios, only a few training instances are available, and this limited availability poses severe challenges for training learning-based solvers. Thus, the performance under few training instances is a strong indicator of practical applicability.

TABLE V: Sample efficiency of our proposed AdaSolver-IL. The GNN baseline faces significant performance degradation when the training instances are few. The results demonstrate that AdaSolver-IL significantly improves the sample efficiency.
D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
GNN 13.41 138.13 278.75 1694.85 900.00 10049.36 900.00 14158.38 900.00 16706.91 900.00 24062.04
AdaSolver-IL 7.35 85.06 237.97 1289.08 900.00 8840.34 900.00 12381.27 900.00 15240.56 900.00 22754.06
Set Covering
D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
GNN 3.28 35.51 40.05 185.77 577.08 820.29 900.00 1981.16 900.00 2582.09 900.00 3014.64
AdaSolver-IL 2.62 35.69 24.54 182.54 217.18 428.28 830.51 1349.77 900.00 1934.11 900.00 2326.64
Combinatorial Auctions

8.4 Ablation Studies

We conduct ablation studies on the four synthesis benchmarks to show the contribution of the adversarial augmentation strategy. Specifically, we compare AdaSolver with a random augmentation approach (RA) that removes the adversarial graph augmenter. Please see Table VI for the results in terms of the Time, Nodes and PD integral metrics. An interesting observation is that RA is slightly superior to AdaSolver in terms of solving time on the distribution D1D_{1} of the set covering, capacitated facility location and maximum independent set benchmarks, but AdaSolver consistently performs better on the other distributions and benchmarks. RA may be more effective at improving the in-distribution performance of the branching policy, but it fails to capture and produce the graph structures that are useful for generalization to larger instances. The results demonstrate that the adversarial graph augmenter plays an important role in finding the worst-case instances that benefit the generalization of the branching policy networks.

TABLE VI: Ablation study of adversarial augmentation strategy, where RA denotes the random strategy.
D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
RA 6.13 79.31 85.01 572.71 852.54 7299.03 900.00 15603.95 900.00 18496.45 900.00 24254.16
AdaSolver-IL 6.28 80.67 73.78 515.92 839.80 7079.13 900.00 10691.05 900.00 14452.71 900.00 21792.69
RA 18.75 149.36 436.97 2931.59 900.00 11646.88 900.00 15478.89 900.00 18691.82 900.00 25966.91
AdaSolver-RL 18.87 152.26 433.97 2885.97 900.00 11330.87 900.00 14466.95 900.00 17550.63 900.00 24319.00
Set Covering
D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
RA 1.70 22.93 22.90 127.69 464.55 662.24 829.49 1327.54 900.00 1889.85 900.00 2299.41
AdaSolver-IL 1.63 22.44 16.34 117.13 178.32 314.32 817.57 1154.79 900.00 1716.55 900.00 2107.19
RA 2.18 24.76 18.80 118.94 301.43 489.06 845.88 1751.81 900.00 2548.19 900.00 3247.74
AdaSolver-RL 2.04 106.83 17.28 117.28 323.94 517.91 846.32 1685.69 900.00 2379.82 900.00 2980.48
Combinatorial Auctions
D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
RA 33.24 215.65 140.40 752.18 564.12 3548.71 865.28 11576.80 871.82 18585.50 877.97 32716.55
AdaSolver-IL 34.51 196.56 147.47 738.36 555.66 3324.63 731.81 6644.35 819.31 13818.34 884.48 28231.23
RA 68.96 238.76 243.00 936.84 637.58 3953.49 877.48 8579.21 892.98 21496.21 893.97 39560.20
AdaSolver-RL 64.10 257.10 233.64 892.77 587.02 3332.38 852.53 7779.28 849.93 14727.70 852.51 27298.25
Capacitated Facility Location
D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
RA 7.26 45.85 534.59 996.39 871.08 2999.36 900.00 5684.19 900.00 6331.45 900.00 6833.94
AdaSolver-IL 7.39 44.56 210.36 460.80 806.78 2355.11 900.00 3591.38 900.00 4375.48 900.00 5152.28
RA 8.00 48.94 275.73 623.85 790.42 2153.45 890.97 3686.33 900.00 4687.60 900.00 5625.48
AdaSolver-RL 8.43 51.46 282.54 627.81 803.75 2117.01 879.81 3318.08 900.02 4148.24 900.00 5099.92
Maximum Independent Set

8.5 Extensive Studies

8.5.1 Different Training Methods of the Augmenter

We compare AdaSolver with a variant that uses the REINFORCE algorithm to train the augmenter (AdaSolver-REINFORCE) in Table VII. AdaSolver outperforms AdaSolver-REINFORCE on all distributions of the set covering and combinatorial auctions benchmarks. PPO is known to be much more sample-efficient than REINFORCE, and AdaSolver further improves performance by adopting this sample-efficient training method.
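For clarity, the sketch below contrasts the two gradient estimators in PyTorch under our notation; it is an illustrative snippet rather than our exact training code, and the tensor names are assumptions.

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO [17]; for a contextual bandit the
    advantage is simply the (baseline-subtracted) reward of the sampled mask."""
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def reinforce_loss(log_probs, returns):
    """Plain REINFORCE estimator used by AdaSolver-REINFORCE: one gradient step
    per sampled action, without ratio clipping or sample reuse."""
    return -(log_probs * returns).mean()
```

The clipping allows several gradient steps on the same batch of augmentation rollouts, which is the main source of the sample-efficiency gain discussed above.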

TABLE VII: Performance of AdaSolver-IL and AdaSolver-REINFORCE. The results show that AdaSolver-IL outperforms AdaSolver-REINFORCE in all distributions.
D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
AdaSolver-REINFORCE 1.79 24.66 26.65 151.13 493.92 737.42 882.36 1472.78 900.00 1864.20 900.00 2189.29
AdaSolver-IL 1.63 22.44 16.34 117.13 178.32 314.32 817.57 1154.79 900.00 1716.55 900.00 2107.19
Combinatorial Auctions
D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
AdaSolver-REINFORCE 6.99 91.23 104.30 712.53 872.44 7749.23 900.00 11105.89 900.00 14739.92 900.00 23590.80
AdaSolver-IL 6.28 80.67 73.78 515.92 839.80 7079.13 900.00 10691.05 900.00 14452.71 900.00 21792.69
Set Covering

8.5.2 Sensitivity Analysis on the Masking Proportion

We analyze the sensitivity of AdaSolver's performance to the masking proportion on the set covering and combinatorial auctions benchmarks and report the results in Table VIII. The results show that the performance of AdaSolver is not sensitive to the masking proportion and outperforms the default GNN policy as long as the proportion is at most 4%. We also observe that lower masking proportions tend to yield a better-learned policy, since they promote data diversity without significantly breaking the problem structures, thereby avoiding biased training data.

TABLE VIII: Sensitivity of the masking proportion. The results show that AdaSolver-IL is not sensitive to the masking proportion when the proportion is limited to 4%.
D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
AdaSolver-IL 0.01 1.63 22.44 16.34 117.13 178.32 314.32 817.57 1154.79 900.00 1716.55 900.00 2107.19
AdaSolver-IL 0.02 1.96 27.68 19.21 155.75 194.64 375.73 819.28 1254.26 900.00 1743.97 900.00 2051.01
AdaSolver-IL 0.03 1.92 26.14 17.95 132.11 202.41 360.52 820.05 1240.96 900.00 1720.49 900.00 2054.11
AdaSolver-IL 0.04 2.22 30.55 20.35 156.33 193.72 361.55 820.80 1209.92 900.00 1700.80 900.00 1968.02
AdaSolver-IL 0.05 2.36 32.73 19.78 161.43 196.50 384.36 823.93 1292.45 900.00 1769.48 900.00 2116.08
Combinatorial Auctions
D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
AdaSolver-IL 0.01 6.28 80.67 73.78 515.92 839.80 7079.13 900.00 10691.05 900.00 14452.71 900.00 21792.69
AdaSolver-IL 0.02 7.09 87.37 90.46 671.65 872.50 7392.75 900.00 11415.87 900.00 14788.13 900.00 23364.03
AdaSolver-IL 0.03 7.05 87.98 90.39 672.11 730.26 7396.32 900.00 11282.07 900.00 15107.76 900.00 23151.86
AdaSolver-IL 0.04 9.05 98.56 99.42 737.68 869.89 8088.66 900.00 12128.56 900.00 15376.10 900.00 24009.53
AdaSolver-IL 0.05 9.20 98.94 107.61 794.04 876.45 8432.83 900.00 12479.55 900.00 16414.10 900.00 23640.26
Set Covering

8.5.3 Analysis on the GNN Embeddings in the IL setting

We use T-SNE [51] to visualize the learned embeddings of the GNN branching policy trained on the distribution D1D_{1}. The results on the set covering benchmark are illustrated in Figure 3, where different colors represent samples from different distributions of the set covering benchmark. To investigate the robustness to environmental perturbations, we select ten distributions with different instance scales and different proportions of nonzero entries in the constraint matrices. The instance bipartite graphs across distributions thus have different local graph structures.

We first observe that before training, the GNN embeddings of different distributions are separated from each other with little overlap (Figure 3(a)), indicating that the GNN model does not extract common task-relevant features across distributions of the same problem family. The scores predicted from these embeddings thus become unstable across distributions, and the generalization ability of the GNN model is poor. We observe in Figure 3(b) that after standard training the embeddings no longer scatter far apart but still appear loose. Finally, Figure 3(c) shows that our proposed AdaSolver framework leads to more concentrated and stable embeddings, providing evidence that AdaSolver is less sensitive to distributional shifts and perturbations.

Figure 3: T-SNE visualization of the GNN embeddings on the learn-to-branch task under the IL setting. Different colors represent samples from different distributions. (a) Embeddings before training. (b) Embeddings after standard training. (c) Embeddings of GNN under the AdaSolver framework.
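For reproducibility, a minimal sketch of how such a visualization can be produced is given below; it assumes the pooled GNN embeddings of each distribution have already been extracted into NumPy arrays, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def visualize_embeddings(embeddings_per_dist):
    """embeddings_per_dist: dict mapping a distribution name to an array of
    pooled GNN embeddings, one row per sampled branching state (assumed given)."""
    names = list(embeddings_per_dist)
    X = np.concatenate([embeddings_per_dist[n] for n in names], axis=0)
    labels = np.concatenate([[i] * len(embeddings_per_dist[n])
                             for i, n in enumerate(names)])
    Z = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X)
    plt.scatter(Z[:, 0], Z[:, 1], c=labels, cmap="tab10", s=5)
    plt.show()
```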

8.5.4 Analysis on the Imitation Accuracy

We report the imitation learning accuracy on different distributions of the set covering and combinatorial auctions benchmarks in Table IX. For the in-distribution D1D_{1}, AdaSolver-IL achieves comparable or higher imitation accuracy than GNN. An interesting observation is that on D4D6D_{4}-D_{6} of the set covering benchmark, the imitation accuracy of AdaSolver-IL is slightly lower than that of GNN, yet AdaSolver-IL consistently outperforms GNN on these distributions. Thus, the imitation learning accuracy is not a decisive factor of solving performance, and higher imitation accuracy does not necessarily imply better performance.

TABLE IX: Imitation learning accuracy on the testing distributions of the set covering and combinatorial auctions benchmarks.
Set Covering
D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
GNN 70.02 66.36 54.07 48.03 48.05 16.08
AdaSolver-IL 70.03 66.39 54.02 52.01 44.08 16.00
Combinatorial Auctions
D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
GNN 67.63 72.42 67.86 72.07 76.02 47.03
AdaSolver-IL 68.55 72.57 66.92 62.00 74.02 44.06

9 Conclusion

This paper focuses on the generalization and robustness issues of learning-based MILP solvers under restrictive training distributions. We observe empirically that the learned GNN embeddings are unstable across distributions on the learn-to-branch task, and we attribute this to the risk of overfitting to homogeneous graph structures. To tackle this challenge, we propose a distributionally robust optimization framework (AdaSolver) for neural solvers that promotes the diversity of instance graph structures in the training distribution via adversarial instance augmentation. Specifically, we formulate adversarial instance graph augmentation as a contextual bandit problem and employ a sample-efficient decision model to enable efficient training of the augmenter under insufficient training data. Experiments show that AdaSolver significantly improves the generalization of neural solvers on five well-recognized benchmarks.

Acknowledgments

The authors would like to thank all the anonymous reviewers for their insightful comments. This work was supported by the National Key R&D Program of China under contract 2022ZD0119801, the National Natural Science Foundation of China grants U19B2026, U19B2044, 61836011, 62021001, and 61836006, and the Fundamental Research Funds for the Central Universities grant WK3490000004.

References

  • [1] Z.-L. Chen, “Integrated production and outbound distribution scheduling: review and extensions,” Operations research, vol. 58, no. 1, pp. 130–148, 2010.
  • [2] K. Ma, L. Xiao, J. Zhang, and T. Li, “Accelerating an fpga-based sat solver by software and hardware co-design,” Chinese Journal of Electronics, vol. 28, no. 5, pp. 953–961, 2019.
  • [3] S. Liu, J. M. Pinto, and L. G. Papageorgiou, “A tsp-based milp model for medium-term planning of single-stage continuous multiproduct plants,” Industrial & Engineering Chemistry Research, vol. 47, no. 20, pp. 7733–7743, 2008.
  • [4] K. Bestuzheva, M. Besançon, W.-K. Chen, A. Chmiela, T. Donkiewicz, J. van Doornmalen, L. Eifler, O. Gaul, G. Gamrath, A. Gleixner et al., “The scip optimization suite 8.0,” arXiv preprint arXiv:2112.08872, 2021.
  • [5] L. Gurobi Optimization, “Gurobi optimizer reference manual,” 2021.
  • [6] A. Colorni, M. Dorigo, F. Maffioli, V. Maniezzo, G. Righini, and M. Trubian, “Heuristics from nature for hard combinatorial optimization problems,” International Transactions in Operational Research, vol. 3, no. 1, pp. 1–21, 1996.
  • [7] J. Hromkovič, Algorithmics for hard problems: introduction to combinatorial optimization, randomization, approximation, and heuristics.   Springer Science & Business Media, 2013.
  • [8] H. He, H. Daume III, and J. M. Eisner, “Learning to search in branch and bound algorithms,” Advances in Neural Information Processing Systems, vol. 27, 2014.
  • [9] E. Khalil, P. Le Bodic, L. Song, G. Nemhauser, and B. Dilkina, “Learning to branch in mixed integer programming,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016.
  • [10] M. Gasse, D. Chételat, N. Ferroni, L. Charlin, and A. Lodi, “Exact combinatorial optimization with graph convolutional neural networks,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [11] L. Scavuzzo, F. Chen, D. Chetelat, M. Gasse, A. Lodi, N. Yorke-Smith, and K. Aardal, “Learning to branch with tree mdps,” in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35.   Curran Associates, Inc., 2022, pp. 18 514–18 526. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2022/file/756d74cd58592849c904421e3b2ec7a4-Paper-Conference.pdf
  • [12] Z. Wang, X. Li, J. Wang, Y. Kuang, M. Yuan, J. Zeng, Y. Zhang, and F. Wu, “Learning cut selection for mixed-integer linear programming via hierarchical sequence model,” in The Eleventh International Conference on Learning Representations, 2023.
  • [13] Y. Bengio, A. Lodi, and A. Prouvost, “Machine learning for combinatorial optimization: a methodological tour d’horizon,” European Journal of Operational Research, vol. 290, no. 2, pp. 405–421, 2021.
  • [14] J. Sakuma and S. Kobayashi, “A genetic algorithm for privacy preserving combinatorial optimization,” in Annual Conference on Genetic and Evolutionary Computation, 2007.
  • [15] C. Wang, Z. Yu, S. McAleer, T. Yu, and Y. Yang, “Asp: Learn a universal neural solver!” arXiv preprint arXiv:2303.00466, 2023.
  • [16] S. Geisler, J. Sommer, J. Schuchardt, A. Bojchevski, and S. Günnemann, “Generalization of neural combinatorial solvers through the lens of adversarial robustness,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=vJZ7dPIjip3
  • [17] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347, 2017. [Online]. Available: http://arxiv.org/abs/1707.06347
  • [18] A. Lodi and G. Zarpellon, “On learning and branching: a survey,” Top, vol. 25, pp. 207–236, 2017.
  • [19] O. Vinyals, M. Fortunato, and N. Jaitly, “Pointer networks,” Advances in Neural Information Processing Systems, vol. 28, 2015.
  • [20] I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio, “Neural combinatorial optimization with reinforcement learning,” arXiv preprint arXiv:1611.09940, 2016.
  • [21] W. Kool, H. van Hoof, and M. Welling, “Attention, learn to solve routing problems!” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=ByxBFsRqYm
  • [22] D. Selsam, M. Lamm, B. Bünz, P. Liang, L. de Moura, and D. L. Dill, “Learning a SAT solver from single-bit supervision,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=HJMC_iA5tm
  • [23] E. Yolcu and B. Póczos, “Learning local search heuristics for boolean satisfiability,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [24] M. Nazari, A. Oroojlooy, L. Snyder, and M. Takác, “Reinforcement learning for solving the vehicle routing problem,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [25] T. Barrett, W. Clements, J. Foerster, and A. Lvovsky, “Exploratory combinatorial optimization with reinforcement learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 3243–3250.
  • [26] P. Gupta, M. Gasse, E. Khalil, P. Mudigonda, A. Lodi, and Y. Bengio, “Hybrid models for learning to branch,” Advances in Neural Information Processing Systems, vol. 33, pp. 18 087–18 097, 2020.
  • [27] P. Gupta, E. B. Khalil, D. Chetélat, M. Gasse, Y. Bengio, A. Lodi, and M. P. Kumar, “Lookback for learning to branch,” Transactions of Machine Learning Research (TMLR), 2022. [Online]. Available: https://openreview.net/forum?id=EQpGkw5rvL
  • [28] Y. Tang, S. Agrawal, and Y. Faenza, “Reinforcement learning for integer programming: Learning to cut,” in International Conference on Machine Learning.   PMLR, 2020, pp. 9367–9376.
  • [29] T. Achterberg, “Constraint integer programming,” Ph.D. dissertation, 2007.
  • [30] Z. Zeyang, Z. Ziwei, W. Xin, and Z. Wenwu, “Learning to solve travelling salesman problem with hardness-adaptive curriculum,” in AAAI Conference on Artificial Intelligence, vol. 36, 2022, pp. 9136–9144.
  • [31] J. Yuan, W. Yaoxin, C. Zhiguang, and Z. Jie, “Learning to solve routing problems via distributionally robust optimization,” in AAAI Conference on Artificial Intelligence, vol. 36, 2022, pp. 9786–9794.
  • [32] J. Bi, Y. Ma, J. Wang, Z. Cao, J. Chen, Y. Sun, and Y. M. Chee, “Learning generalizable models for vehicle routing problems via knowledge distillation,” in Advances in Neural Information Processing Systems, A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, Eds., 2022. [Online]. Available: https://openreview.net/forum?id=sOVNpUEgKMp
  • [33] L. Xin, W. Song, Z. Cao, and J. Zhang, “Generative adversarial training for neural combinatorial optimization models,” 2022. [Online]. Available: https://openreview.net/forum?id=9vsRT9mc7U
  • [34] H. Duan, P. Vaezipoor, M. B. Paulus, Y. Ruan, and C. Maddison, “Augment with care: Contrastive learning for combinatorial problems,” in Proceedings of the 39th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, Eds., vol. 162.   PMLR, 17–23 Jul 2022, pp. 5627–5642. [Online]. Available: https://proceedings.mlr.press/v162/duan22b.html
  • [35] H. Lu, Z. Li, R. Wang, Q. Ren, X. Li, M. Yuan, J. Zeng, X. Yang, and J. Yan, “Roco: A general framework for evaluating robustness of combinatorial optimization solvers on graphs,” in The Eleventh International Conference on Learning Representations.
  • [36] G. Zarpellon, J. Jo, A. Lodi, and Y. Bengio, “Parameterizing branch-and-bound search trees to learn branching policies,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 5, pp. 3931–3939, May 2021. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/16512
  • [37] M. Etheve, Z. Alès, C. Bissuel, O. Juan, and S. Kedad-Sidhoum, “Reinforcement learning for variable selection in a branch and bound algorithm,” in Integration of Constraint Programming, Artificial Intelligence, and Operations Research, E. Hebrard and N. Musliu, Eds.   Cham: Springer International Publishing, 2020, pp. 176–185.
  • [38] H. Sun, W. Chen, H. Li, and L. Song, “Improving learning to branch via reinforcement learning,” in Learning Meets Combinatorial Algorithms at NeurIPS2020, 2020. [Online]. Available: https://openreview.net/forum?id=z4D7-PTxTb
  • [39] C. W. F. Parsonson, A. Laterre, and T. D. Barrett, “Reinforcement learning for branch-and-bound optimisation using retrospective trajectories,” in AAAI Conference on Artificial Intelligence, vol. 37 (4), 2023, pp. 4061–4069. [Online]. Available: https://doi.org/10.1609/aaai.v37i4.25521
  • [40] Q. Qu, X. Li, Y. Zhou, J. Zeng, M. Yuan, J. Wang, J. Lv, K. Liu, and K. Mao, “An improved reinforcement learning algorithm for learning to branch,” arXiv preprint arXiv:2201.06213, 2022.
  • [41] D. Applegate, R. Bixby, V. Chvátal, and W. Cook, “Finding cuts in the tsp (a preliminary report),” Citeseer, Tech. Rep., 1995.
  • [42] A. Prouvost, J. Dumouchelle, L. Scavuzzo, M. Gasse, D. Chételat, and A. Lodi, “Ecole: A gym-like library for machine learning in combinatorial optimization solvers,” in Learning Meets Combinatorial Algorithms at NeurIPS2020, 2020. [Online]. Available: https://openreview.net/forum?id=IVc9hqgibyB
  • [43] R. Volpi, H. Namkoong, O. Sener, J. C. Duchi, V. Murino, and S. Savarese, “Generalizing to unseen domains via adversarial data augmentation,” in Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., 2018, pp. 5339–5349. [Online]. Available: https://proceedings.neurips.cc/paper/2018/hash/1d94108e907bb8311d8802b48fd54b4a-Abstract.html
  • [44] Z. Chen, J. Liu, X. Wang, J. Lu, and W. Yin, “On representing mixed-integer linear programs by graph neural networks,” 2022.
  • [45] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning, 2nd ed., ser. Adaptive Computation and Machine Learning.   Cambridge, MA: MIT Press, 2018.
  • [46] S. Chen, E. Dobriban, and J. Lee, “A group-theoretic framework for data augmentation,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33.   Curran Associates, Inc., 2020, pp. 21 321–21 333. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2020/file/f4573fc71c731d5c362f0d7860945b88-Paper.pdf
  • [47] H. Wang, S. Zheng, C. Xiong, and R. Socher, “On the generalization gap in reparameterizable reinforcement learning,” in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97.   PMLR, 09–15 Jun 2019, pp. 6648–6658. [Online]. Available: https://proceedings.mlr.press/v97/wang19o.html
  • [48] C. Hansknecht, I. Joormann, and S. Stiller, “Cuts, primal heuristics, and learning to branch for the time-dependent traveling salesman problem,” 2018.
  • [49] T. Joachims, “Optimizing search engines using clickthrough data,” Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 2002. [Online]. Available: https://api.semanticscholar.org/CorpusID:207605508
  • [50] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2015. [Online]. Available: http://arxiv.org/abs/1412.6980
  • [51] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” Journal of Machine Learning Research, vol. 9, no. 11, 2008.
  • [52] C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” Journal of Big Data, vol. 6, pp. 1–48, 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:195811894
  • [53] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, “Random erasing data augmentation,” ArXiv, vol. abs/1708.04896, 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:2035600
  • [54] L. Perez and J. Wang, “The effectiveness of data augmentation in image classification using deep learning,” ArXiv, vol. abs/1712.04621, 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:12219403
  • [55] D. Yarats, I. Kostrikov, and R. Fergus, “Image augmentation is all you need: Regularizing deep reinforcement learning from pixels,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=GY6-6sTvGaf
  • [56] M. Laskin, K. Lee, A. Stooke, L. Pinto, P. Abbeel, and A. Srinivas, “Reinforcement learning with augmented data,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33.   Curran Associates, Inc., 2020, pp. 19 884–19 895. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2020/file/e615c82aba461681ade82da2da38004a-Paper.pdf
  • [57] M. Laskin, A. Srinivas, and P. Abbeel, “CURL: Contrastive unsupervised representations for reinforcement learning,” in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, H. D. III and A. Singh, Eds., vol. 119.   PMLR, 13–18 Jul 2020, pp. 5639–5650. [Online]. Available: https://proceedings.mlr.press/v119/laskin20a.html
  • [58] A. Sadeghi, M. Ma, B. Li, and G. B. Giannakis, “Distributionally robust semi-supervised learning over graphs,” arXiv preprint arXiv:2110.10582, 2021.
  • [59] H. Dai, H. Li, T. Tian, X. Huang, L. Wang, J. Zhu, and L. Song, “Adversarial attack on graph structured data,” in International Conference on Machine Learning.   PMLR, 2018, pp. 1115–1124.
  • [60] F. Feng, X. He, J. Tang, and T.-S. Chua, “Graph adversarial training: Dynamically regularizing based on graph structure,” IEEE Transactions on Knowledge and Data Engineering, vol. 33, no. 6, pp. 2493–2504, 2019.
  • [61] H. Xue, K. Zhou, T. Chen, K. Guo, X. Hu, Y. Chang, and X. Wang, “Cap: Co-adversarial perturbation on weights and features for improving generalization of graph neural networks,” arXiv preprint arXiv:2110.14855, 2021.
  • [62] T. Dao, A. Gu, A. Ratner, V. Smith, C. De Sa, and C. Re, “A kernel theory of modern data augmentation,” in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97.   PMLR, 09–15 Jun 2019, pp. 1528–1537. [Online]. Available: https://proceedings.mlr.press/v97/dao19b.html
  • [63] S. Wu, H. Zhang, G. Valiant, and C. Re, “On the generalization effects of linear transformations in data augmentation,” in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, H. D. III and A. Singh, Eds., vol. 119.   PMLR, 13–18 Jul 2020, pp. 10 410–10 420. [Online]. Available: https://proceedings.mlr.press/v119/wu20g.html
  • [64] K. Leyton-Brown, M. Pearson, and Y. Shoham, “Towards a universal test suite for combinatorial auction algorithms,” in Proceedings of the 2nd ACM Conference on Electronic Commerce, 2000, pp. 66–76.
  • [65] E. Balas and A. Ho, Set covering algorithms using cutting planes, heuristics, and subgradient optimization: a computational study.   Springer, 1980.
  • [66] G. Cornuéjols, R. Sridharan, and J.-M. Thizy, “A comparison of heuristics and relaxations for the capacitated plant location problem,” European Journal of Operational Research, vol. 50, pp. 280–297, 1991.
  • [67] D. Bergman, A. A. Ciré, W. J. van Hoeve, and J. Hooker, “Decision diagrams for optimization,” Constraints, vol. 20, pp. 494 – 495, 2015.
  • [68] M. Gasse, S. Bowly, Q. Cappart, J. Charfreitag, L. Charlin, D. Chételat, A. Chmiela, J. Dumouchelle, A. Gleixner, A. M. Kazachkov, E. Khalil, P. Lichocki, A. Lodi, M. Lubin, C. J. Maddison, M. Christopher, D. J. Papageorgiou, A. Parjadis, S. Pokutta, A. Prouvost, L. Scavuzzo, G. Zarpellon, L. Yang, S. Lai, A. Wang, X. Luo, X. Zhou, H. Huang, S. Shao, Y. Zhu, D. Zhang, T. Quan, Z. Cao, Y. Xu, Z. Huang, S. Zhou, C. Binbin, H. Minggui, H. Hao, Z. Zhiyu, A. Zhiwu, and M. Kun, “The machine learning for combinatorial optimization competition (ml4co): Results and insights,” in Proceedings of the NeurIPS 2021 Competitions and Demonstrations Track, ser. Proceedings of Machine Learning Research, D. Kiela, M. Ciccone, and B. Caputo, Eds., vol. 176.   PMLR, 06–14 Dec 2022, pp. 220–231. [Online]. Available: https://proceedings.mlr.press/v176/gasse22a.html
[Uncaptioned image] Haoyang Liu received the B.Sc. degree in School of Mathematics from University of Science and Technology of China, Hefei, China, in 2023. He is currently a graduate student in the Department of Electronic Engineering and Information Science at University of Science and Technology of China, Hefei, China. His research interests include reinforcement learning and learning to optimize.
[Uncaptioned image] Yufei Kuang received the B.Sc. degree in Statistics from Nanjing University, Nanjing, China, in 2020. He is currently a Ph.D. candidate in the School of Information Science and Technology at University of Science and Technology of China, Hefei, China. His research interests include reinforcement learning and learning to optimize.
[Uncaptioned image] Jie Wang received the B.Sc. degree in electronic information science and technology from University of Science and Technology of China, Hefei, China, in 2005, and the Ph.D. degree in computational science from the Florida State University, Tallahassee, FL, in 2011. He is currently a professor in the Department of Electronic Engineering and Information Science at University of Science and Technology of China, Hefei, China. His research interests include AI for science, knowledge graph, large-scale optimization, deep learning, etc. He is a senior member of IEEE.
[Uncaptioned image] Xijun Li is a senior researcher at HUAWEI Noah’s Ark Lab. Before that, he received the M.Sc. degree from Shanghai Jiao Tong University, P.R. China, in 2018. He is working towards his Ph.D. degree at the University of Science and Technology of China (HUAWEI-USTC Joint Ph.D. Program) under the supervision of Prof. Jie Wang. He has published 10+ papers at top peer-reviewed conferences and journals (such as ICLR, KDD, ICDE, SIGMOD, DAC, CIKM, ICDCS, TCYB, etc.) and has applied for/published 12 patents (first author on 6 patents) with Noah’s Ark Lab. He won the championship of the student leaderboard in the Dual Track of the NeurIPS’21 ML4CO Competition. His recent research interests focus on Mathematical Programming Solvers, Learning to Optimize (L2O) and Machine Learning for Computer Systems (ML4CS).
[Uncaptioned image] Yongdong Zhang received the Ph.D. degree in electronic engineering from Tianjin University, Tianjin, China, in 2002. He is currently a Professor at the University of Science and Technology of China. He has authored more than 100 refereed journal articles and conference papers. His current research interests include multimedia content analysis and understanding, multimedia content security, video encoding, and streaming media technology. He serves as an Editorial Board Member of Multimedia Systems journal and Neurocomputing. He was the recipient of the Best Paper Award in PCM 2013, ICIMCS 2013, and ICME 2010, and the Best Paper Candidate in ICME 2011. He is a Senior Member of IEEE.
[Uncaptioned image] Feng Wu received the B.S. degree in electrical engineering from Xidian University in 1992, and the M.S. and Ph.D. degrees in computer science from the Harbin Institute of Technology in 1996 and 1999, respectively. He is currently a Professor with the University of Science and Technology of China, where he is also the Dean of the School of Information Science and Technology. Before that, he was a Principal Researcher and the Research Manager with Microsoft Research Asia. His research interests include image and video compression, media communication, and media analysis and synthesis. He has authored or coauthored over 200 high quality articles (including several dozens of IEEE Transaction papers) and top conference papers on MOBICOM, SIGIR, CVPR, and ACM MM. He has 77 granted U.S. patents. His 15 techniques have been adopted into international video coding standards. As a coauthor, he received the Best Paper Award at 2009 IEEE Transactions on Circuits and Systems for Video Technology, PCM 2008, and SPIE VCIP 2007. He also received the Best Associate Editor Award from IEEE Circuits and Systems Society in 2012. He also serves as the TPC Chair for MMSP 2011, VCIP 2010, and PCM 2009, and the Special Sessions Chair for ICME 2010 and ISCAS 2013. He serves as an Associate Editor for IEEE Transactions on Circuits and Systems for Video Technology, IEEE Transactions ON Multimedia, and several other international journals.

Appendix A More Background Information

A.1 Branching Rules

An efficient branching policy significantly influences the size of the B&B tree and thus the overall solving efficiency. Commonly used traditional branching rules include strong branching and pseudocost branching.

The strong branching rule is known to produce the smallest B&B trees among the branching rules. Let P0P_{0} denote the optimal objective of the LP relaxation at the parent node. We call a variable xix_{i} a branching candidate variable if xix_{i} takes a fractional value xix_{i}^{*} in the optimal solution of the parent LP relaxation. For a branching candidate variable xix_{i}, we let Pi,1P_{i,1} and Pi,2P_{i,2} be the relaxed optimal objectives of the two children obtained by adding the upper or lower bound (xixi,xixix_{i}\leq\lfloor x_{i}^{*}\rfloor,x_{i}\geq\lceil x_{i}^{*}\rceil), respectively. Strong branching then chooses the branching variable xSBx_{SB} from the branching candidate set 𝒞\mathcal{C} by the following rule,

x_{SB}=\underset{x_{i}\in\mathcal{C}}{\text{argmax}}\left[\max(P_{i,1}-P_{0},\epsilon)\cdot\max(P_{i,2}-P_{0},\epsilon)\right],

where ϵ>0\epsilon>0 is a small constant. Since strong branching needs to solve the LP relaxations of the two sub-problems for each candidate variable, it requires considerable computation.
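As an illustration, the following sketch evaluates the product rule above for a candidate set; `solve_child_lps` is a hypothetical helper (not part of our implementation) that returns the two child LP optima of a candidate.

```python
def strong_branching_score(parent_obj, child_objs, eps=1e-6):
    """Product rule of strong branching: score a candidate by the objective
    gains of its two child LP relaxations (equation above)."""
    p_down, p_up = child_objs
    return max(p_down - parent_obj, eps) * max(p_up - parent_obj, eps)

def select_strong_branching_variable(parent_obj, candidates, solve_child_lps):
    """candidates: fractional variables; solve_child_lps(x_i) is assumed to return
    the relaxed optima of the children obtained by rounding x_i down and up."""
    scores = {x_i: strong_branching_score(parent_obj, solve_child_lps(x_i))
              for x_i in candidates}
    return max(scores, key=scores.get)
```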

The pseudocost branching rule is simpler and has a much lower computational cost than strong branching. It keeps track of previous successful branchings and scores each candidate branching variable using this historical information, avoiding the expensive child LP solves. However, the pseudocost branching rule requires considerable human experience for manual tuning.
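A minimal sketch of the pseudocost bookkeeping is shown below; the exact update and initialization rules used in SCIP are more involved, so this is only an illustration under simplified assumptions.

```python
from collections import defaultdict

class PseudocostBranching:
    """Track average per-unit objective gains observed after past branchings and
    score candidates without solving child LPs."""
    def __init__(self, eps=1e-6):
        self.gain_sum = defaultdict(lambda: [0.0, 0.0])  # var -> [down, up] gains
        self.count = defaultdict(lambda: [0, 0])
        self.eps = eps

    def update(self, var, direction, obj_gain, frac_change):
        # direction: 0 = down child, 1 = up child
        self.gain_sum[var][direction] += obj_gain / max(frac_change, self.eps)
        self.count[var][direction] += 1

    def score(self, var, frac):
        # frac: fractional part of the variable's LP value at the current node
        down = self.gain_sum[var][0] / max(self.count[var][0], 1) * frac
        up = self.gain_sum[var][1] / max(self.count[var][1], 1) * (1.0 - frac)
        return max(down, self.eps) * max(up, self.eps)
```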

To overcome the disadvantages of strong branching and pseudocost branching, hybrid branching rules have been proposed to combine the two methods. Reliability pseudocost branching is one such hybrid rule with state-of-the-art performance, and it is therefore used as the default branching rule in SCIP.

A.2 GNN Encoder for Branching Samples

In recent studies [10], researchers use a single graph neural network composed of two interleaved half-convolutions to process the instance bipartite graphs or branching samples. The GNN first performs message passing from the variable nodes to the constraint nodes, and then from the constraint nodes to the variable nodes to obtain the variable embeddings, i.e.,

\mathbf{h}_{i}^{W}=\mathrm{MLP}_{2}\left(\left[\mathbf{w}_{i},\sum_{(\mathbf{w}_{i},\mathbf{v}_{j})\in\mathcal{E}}\mathrm{MLP}_{1}\left(\left[\mathbf{w}_{i},\mathbf{v}_{j},e_{ij}\right]\right)\right]\right)
\mathbf{h}_{j}^{V}=\mathrm{MLP}_{4}\left(\left[\mathbf{v}_{j},\sum_{(\mathbf{w}_{k},\mathbf{v}_{j})\in\mathcal{E}}\mathrm{MLP}_{3}\left(\left[\mathbf{w}_{k},\mathbf{v}_{j},e_{kj}\right]\right)\right]\right),

where MLP denotes a multi-layer perceptron and [][\cdot] denotes vector concatenation. The variable embeddings 𝐡jV\mathbf{h}_{j}^{V} can then be used for downstream tasks such as optimal solution prediction and branching variable selection.
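A minimal PyTorch sketch of one such half-convolution pass is given below; it assumes scalar edge features and 64-dimensional embeddings, and it illustrates the equations above rather than reproducing the exact implementation of [10].

```python
import torch
import torch.nn as nn

class HalfConvolution(nn.Module):
    """One message-passing pass over the bipartite graph: aggregate messages from
    src nodes into dst nodes along edges, then update the dst embeddings."""
    def __init__(self, dim=64):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(2 * dim + 1, dim), nn.ReLU())   # MLP_1 / MLP_3
        self.upd = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())       # MLP_2 / MLP_4

    def forward(self, h_src, h_dst, edge_index, edge_attr):
        src, dst = edge_index                       # edges point src -> dst
        m = self.msg(torch.cat([h_src[src], h_dst[dst], edge_attr], dim=-1))
        agg = torch.zeros_like(h_dst).index_add_(0, dst, m)   # sum over neighborhoods
        return self.upd(torch.cat([h_dst, agg], dim=-1))

# variables -> constraints, then constraints -> variables:
# h_cons = HalfConvolution()(h_var, h_cons, (var_idx, cons_idx), e)
# h_var  = HalfConvolution()(h_cons, h_var, (cons_idx, var_idx), e)
```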

Figure 4: GNN inference for branching score prediction.

A.3 Primal-Dual Integral Metric

When employing the B&B algorithm, we keep track of the global primal and dual bounds. The global primal bound at time tt, i.e., 𝐜𝐱t\mathbf{c}^{\top}\mathbf{x}_{t}^{*}, is the best objective value found by time tt. The global dual bound is the minimum dual bound ztz^{*}_{t} over all leaves of the search tree at time tt. By definition, the primal-dual gap 𝐜𝐱t𝐳t\mathbf{c}^{\top}\mathbf{x}_{t}^{*}-\mathbf{z}^{*}_{t} is nonnegative and decreases as the B&B algorithm proceeds, and B&B finds the optimal solution if and only if 𝐜𝐱t𝐳t=0\mathbf{c}^{\top}\mathbf{x}_{t}^{*}-\mathbf{z}^{*}_{t}=0.

With time limit TT, the primal-dual integral (PD integral) is defined as

\int_{0}^{T}(\mathbf{c}^{\top}\mathbf{x}_{t}^{*}-\mathbf{z}_{t}^{*})\,dt.

When solving the same instance, a solver that achieves a smaller PD integral typically finds better feasible solutions within the same solving time. The PD integral is a well-recognized metric of solver performance. Additionally, the PD gap metric is 0 if the instance is solved to optimality within the time limit.
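For concreteness, the following sketch approximates the PD integral from a solver log recorded as time-stamped primal and dual bounds, treating the bounds as piecewise constant between records (an assumption of this illustration, not of our evaluation pipeline).

```python
def primal_dual_integral(log, time_limit):
    """Approximate the integral of (primal bound - dual bound) over [0, T] from a
    time-sorted list of (t, primal_bound, dual_bound) triples."""
    integral, prev_t, prev_gap = 0.0, 0.0, None
    for t, primal, dual in log:
        if prev_gap is not None:
            integral += prev_gap * (t - prev_t)
        prev_t, prev_gap = t, primal - dual
    if prev_gap is not None:
        integral += prev_gap * (time_limit - prev_t)   # extend until the time limit
    return integral
```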

Appendix B Theoretical Analysis

B.1 More Related Work on Theoretical Results for Data Augmentation

Data augmentation is an effective method for introducing prior information and promoting data diversity, and it is widely used in supervised learning [52, 53, 54] and reinforcement learning tasks [55, 56, 57]. Recently, augmentation on graph data has attracted increasing attention [58, 59, 60, 61]. On the theoretical side, previous work [46] proposes a generalization bound for data augmentation from a group-theoretic perspective; [62] analyzes data augmentation from a kernel perspective; and [63] studies the effects of data augmentation on the ridge estimator in an over-parameterized linear regression setting.

B.2 Proof of Theorem 1

The generalization bounds in Equations 8 and 9 can be derived via the classical approach; we briefly present the proof as follows. We start by splitting the term to be bounded into two parts,

\mathcal{J}_{p^{\prime}}(\hat{\pi})-\hat{\mathcal{J}}_{p}(\hat{\pi})=\left[\mathcal{J}_{p^{\prime}}(\hat{\pi})-\mathcal{J}_{p}(\hat{\pi})\right]+\left[\mathcal{J}_{p}(\hat{\pi})-\hat{\mathcal{J}}_{p}(\hat{\pi})\right].

The second part can be bounded via the Rademacher complexity using McDiarmid's inequality,

\mathcal{J}_{p}(\hat{\pi})-\hat{\mathcal{J}}_{p}(\hat{\pi})\leq 2\mathfrak{R}_{K}(\ell\circ\Pi)+\Omega(K).

To bound the first part, we see that

\mathcal{J}_{p^{\prime}}(\hat{\pi})-\mathcal{J}_{p}(\hat{\pi})
=\mathbb{E}_{G^{\prime}}\mathbb{E}_{\mathbf{s}^{\prime}}\left[\ell(\pi(\mathbf{s}^{\prime}),\pi_{E}(\mathbf{s}^{\prime}))\right]-\mathbb{E}_{G}\mathbb{E}_{\mathbf{s}}\left[\ell(\pi(\mathbf{s}),\pi_{E}(\mathbf{s}))\right]
\leq\sup_{\|f\|_{\text{Lip}}\leq 1}\left(\mathbb{E}_{G^{\prime}}\mathbb{E}_{\mathbf{s}^{\prime}}\left[f(\mathbf{s}^{\prime})\right]-\mathbb{E}_{G}\mathbb{E}_{\mathbf{s}}\left[f(\mathbf{s})\right]\right)
=\mathcal{W}_{\|\cdot\|}(D_{\pi_{E},p}(\mathbf{s}),D_{\pi_{E},p^{\prime}}(\mathbf{s})).

We complete the proof of Equation 8 by combining the bounds for these two parts. The proof of the generalization bound under instance augmentation (Equation 9) is analogous.

We next compare the two generalization bounds.

2\mathfrak{R}_{K}^{\Phi}(\ell\circ\Pi)-2\mathfrak{R}_{K}(\ell\circ\Pi)
\leq 2\mathbb{E}_{\mathcal{D}}\left[\sup_{\pi\in\Pi}\frac{1}{K}\sum_{k=1}^{K}\ell(\pi(\tilde{\mathbf{s}}_{k}),\pi_{E}(\tilde{\mathbf{s}}_{k}))\right]-2\mathbb{E}_{\mathcal{D}}\left[\sup_{\pi\in\Pi}\frac{1}{K}\sum_{k=1}^{K}\ell(\pi(\mathbf{s}_{k}),\pi_{E}(\mathbf{s}_{k}))\right]
=2\mathbb{E}_{\tilde{\mathbf{s}}}\left[\sup_{\pi\in\Pi}\frac{1}{K}\sum_{k=1}^{K}\ell(\pi(\tilde{\mathbf{s}}_{k}),\pi_{E}(\tilde{\mathbf{s}}_{k}))\right]-2\mathbb{E}_{\mathbf{s}}\left[\sup_{\pi\in\Pi}\frac{1}{K}\sum_{k=1}^{K}\ell(\pi(\mathbf{s}_{k}),\pi_{E}(\mathbf{s}_{k}))\right]
\leq 2\sup_{f}\left(\mathbb{E}_{\tilde{\mathbf{s}}}[f(\tilde{\mathbf{s}})]-\mathbb{E}_{\mathbf{s}}[f(\mathbf{s})]\right)
=2\mathcal{W}_{\|\cdot\|}(D_{\pi_{E},\Phi(p)}(\mathbf{s}),D_{\pi_{E},p^{\prime}}(\mathbf{s})).

If α1/3\alpha\leq 1/3 and the following inequality holds,

2\mathcal{W}_{\|\cdot\|}(D_{\pi_{E},\Phi(p)}(\mathbf{s}),D_{\pi_{E},p^{\prime}}(\mathbf{s}))
\leq\mathcal{W}_{\|\cdot\|}(D_{\pi_{E},p}(\mathbf{s}),D_{\pi_{E},p^{\prime}}(\mathbf{s}))-\mathcal{W}_{\|\cdot\|}(D_{\pi_{E},\Phi(p)}(\mathbf{s}),D_{\pi_{E},p^{\prime}}(\mathbf{s})).

then Equation 10 follows, i.e., the generalization bound under instance augmentation can be lower:

\mathcal{W}_{\|\cdot\|}(D_{\pi_{E},\Phi(p)}(\mathbf{s}),D_{\pi_{E},p^{\prime}}(\mathbf{s}))+2\mathfrak{R}_{K}^{\Phi}(\ell\circ\Pi)
\leq\mathcal{W}_{\|\cdot\|}(D_{\pi_{E},p}(\mathbf{s}),D_{\pi_{E},p^{\prime}}(\mathbf{s}))+2\mathfrak{R}_{K}(\ell\circ\Pi).

B.3 Proof of Theorem 2

We start by decomposing the generalization error as in the IL setting,

\mathbb{E}_{\mathcal{D}}[\mathcal{J}_{p^{\prime}}(\hat{\pi}_{\Phi})-\mathcal{J}_{p^{\prime}}(\pi^{*})]   (12)
=\mathbb{E}_{\mathcal{D}}[\mathcal{J}_{p^{\prime}}(\hat{\pi}_{\Phi})-\mathcal{J}_{\Phi(p)}(\hat{\pi}_{\Phi})]+\mathbb{E}_{\mathcal{D}}[\mathcal{J}_{\Phi(p)}(\hat{\pi}_{\Phi})-\hat{\mathcal{J}}_{\Phi(p)}(\hat{\pi}_{\Phi})]
\quad+\mathbb{E}_{\mathcal{D}}[\hat{\mathcal{J}}_{\Phi(p)}(\hat{\pi}_{\Phi})-\hat{\mathcal{J}}_{\Phi(p)}(\pi^{*})]
\quad+\mathbb{E}_{\mathcal{D}}[\hat{\mathcal{J}}_{\Phi(p)}(\pi^{*})-\mathcal{J}_{\Phi(p)}(\pi^{*})]+\mathbb{E}_{\mathcal{D}}[\mathcal{J}_{\Phi(p)}(\pi^{*})-\mathcal{J}_{p^{\prime}}(\pi^{*})]
\leq\mathbb{E}_{\mathcal{D}}[\mathcal{J}_{p^{\prime}}(\hat{\pi}_{\Phi})-\mathcal{J}_{\Phi(p)}(\hat{\pi}_{\Phi})]+\mathbb{E}_{\mathcal{D}}[\mathcal{J}_{\Phi(p)}(\hat{\pi}_{\Phi})-\hat{\mathcal{J}}_{\Phi(p)}(\hat{\pi}_{\Phi})]
\quad+\mathbb{E}_{\mathcal{D}}[\mathcal{J}_{\Phi(p)}(\pi^{*})-\mathcal{J}_{p^{\prime}}(\pi^{*})]

The inequality in (12) holds since the training loss 𝒥^Φ(p)\hat{\mathcal{J}}_{\Phi(p)} reaches its minimum at the policy π^Φ\hat{\pi}_{\Phi} by definition, and the expectation 𝔼𝒟[𝒥^Φ(p)(π)𝒥Φ(p)(π)]=0\mathbb{E}_{\mathcal{D}}[\hat{\mathcal{J}}_{\Phi(p)}(\pi^{*})-\mathcal{J}_{\Phi(p)}(\pi^{*})]=0. We then try to control the other three terms in (12).

For reparameterizable RL,

\mathcal{J}_{\Phi(p)}(\pi)=\mathbb{E}_{\xi\sim p(\xi)}[R(\pi,\mathcal{T},\mathcal{I}^{\Phi},\xi)].

For the second term in Equation (12), we view RR as a random variable, and the total returns sampled from instances drawn from the distribution Φ(p)\Phi(p) are i.i.d. We then use the standard symmetrization argument to obtain the following bound, which holds with confidence at least 1δ1-\delta,

\mathcal{J}_{\Phi(p)}(\hat{\pi}_{\Phi})-\hat{\mathcal{J}}_{\Phi(p)}(\hat{\pi}_{\Phi})\leq 2\mathfrak{R}_{n}^{\Phi}(\ell\circ\Pi)+\Omega(n).   (13)

Finally, we bound the first and last terms. Different from the IL setting, the transition function plays an important role in generating the training branching samples; thus, the branching samples cannot be viewed as i.i.d. samples.

For any policy πΠ\pi\in\Pi, we have the following estimations,

\mathcal{J}_{\Phi(p)}(\pi)-\mathcal{J}_{p^{\prime}}(\pi)
\leq\mathbb{E}_{\xi\sim p(\xi)}R(\pi,\mathcal{T},\mathcal{I}^{\Phi},\xi)-\mathbb{E}_{\xi\sim p(\xi)}R(\pi,\mathcal{T},\mathcal{I}^{\prime},\xi)
=\mathbb{E}_{\xi}\sum_{t=0}^{T-1}r(\tilde{\mathbf{s}}_{t})-\mathbb{E}_{\xi}\sum_{t=0}^{T-1}r(\mathbf{s}^{\prime}_{t})
\leq\mathbb{E}_{\xi}\sum_{t=0}^{T-1}\left|r(\tilde{\mathbf{s}}_{t})-r(\mathbf{s}^{\prime}_{t})\right|
\leq\sum_{t=0}^{T-1}\mathbb{E}_{\xi}\,\text{Lip}_{r}\|\tilde{\mathbf{s}}_{t}-\mathbf{s}^{\prime}_{t}\|.

We next recursively bound the per-step discrepancy between the states induced by the augmented instances and those induced by the instances from the target distribution.

\|\tilde{\mathbf{s}}_{t}-\mathbf{s}^{\prime}_{t}\|
\leq\|\mathcal{T}(\tilde{\mathbf{s}}_{t-1},\pi(\tilde{\mathbf{s}}_{t-1}))-\mathcal{T}(\mathbf{s}^{\prime}_{t-1},\pi(\mathbf{s}^{\prime}_{t-1}))\|
\leq\|\mathcal{T}(\tilde{\mathbf{s}}_{t-1},\pi(\tilde{\mathbf{s}}_{t-1}))-\mathcal{T}(\tilde{\mathbf{s}}_{t-1},\pi(\mathbf{s}^{\prime}_{t-1}))\|+\|\mathcal{T}(\tilde{\mathbf{s}}_{t-1},\pi(\mathbf{s}^{\prime}_{t-1}))-\mathcal{T}(\mathbf{s}^{\prime}_{t-1},\pi(\mathbf{s}^{\prime}_{t-1}))\|
\leq\text{Lip}_{\mathcal{T}_{2}}\text{Lip}_{\pi}\|\tilde{\mathbf{s}}_{t-1}-\mathbf{s}^{\prime}_{t-1}\|+\text{Lip}_{\mathcal{T}_{1}}\|\tilde{\mathbf{s}}_{t-1}-\mathbf{s}^{\prime}_{t-1}\|
=(\text{Lip}_{\mathcal{T}_{2}}\text{Lip}_{\pi}+\text{Lip}_{\mathcal{T}_{1}})\|\tilde{\mathbf{s}}_{t-1}-\mathbf{s}^{\prime}_{t-1}\|
\leq(\text{Lip}_{\mathcal{T}_{2}}\text{Lip}_{\pi}+\text{Lip}_{\mathcal{T}_{1}})^{t}\|\tilde{\mathbf{s}}_{0}-\mathbf{s}^{\prime}_{0}\|
=(\text{Lip}_{\mathcal{T}_{2}}\text{Lip}_{\pi}+\text{Lip}_{\mathcal{T}_{1}})^{t}c(\Phi(G),G^{\prime}).

We thus obtain the bound

\mathbb{E}_{\mathcal{D}}[\mathcal{J}_{\Phi(p)}(\pi)-\mathcal{J}_{p^{\prime}}(\pi)]   (14)
\leq\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\mathbb{E}_{\xi}\,\text{Lip}_{r}(\text{Lip}_{\mathcal{T}_{2}}\text{Lip}_{\pi}+\text{Lip}_{\mathcal{T}_{1}})^{t}c(\Phi(G),G^{\prime})\right]
=\sum_{t=0}^{T-1}\text{Lip}_{r}(\text{Lip}_{\mathcal{T}_{2}}\text{Lip}_{\pi}+\text{Lip}_{\mathcal{T}_{1}})^{t}\mathcal{W}_{\mathcal{G}}(\Phi(p),p^{\prime}).

Combining Equations 13 and 14, we complete the proof.

Appendix C Details of the Benchmarks

C.1 Synthesis Benchmarks for Evaluations

We evaluate the performance of our proposed methods on four classical benchmarks of NP-hard problem families, namely combinatorial auctions, minimum set covering, capacitated facility location and maximum independent set. We use these four synthesis benchmarks mainly for the following reasons. First, they are widely-used testbeds for MILP solvers [10, 26, 27]. Second, we can generate a series of transfer distributions with much larger scales to evaluate the solvers' generalization ability. Third, these benchmarks represent a wide collection of MILP problems encountered in practice.

We use the code in https://github.com/ds4dm/learn2branch for data generation. (1) Combinatorial auctions comprises instances generated following the arbitrary relationships procedure of [64]. (2) Set covering comprises instances generated following [65] with the default density 0.05. (3) Capacitated facility location comprises instances generated following [66]. (4) Maximum independent set comprises instances on Erdős-Rényi random graphs, generated following the procedure of [67]. For all benchmarks, we classify six distributions D1D_{1} to D6D_{6} by increasing instance scales or difficulty levels, where D1D_{1} is the easiest with the smallest scales and D6D_{6} is the most difficult with the largest scales, which are five to ten times larger than those in D1D_{1}. The baselines can solve all the instances of D1D_{1} and D2D_{2} within 900s but reach the time limit on D4D6D_{4}-D_{6}. Table X summarizes the hyperparameters of the generation algorithms.
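For illustration, the following sketch generates D1D_{1}-scale instances with the Ecole instance generators [42]; the generator argument names below are best-effort assumptions and should be checked against the generation code linked above.

```python
import ecole

# D1 hyperparameters from Table X; argument names are assumptions based on the Ecole API.
generators = {
    "setcover": ecole.instance.SetCoverGenerator(n_rows=500, n_cols=1000, density=0.05),
    "cauctions": ecole.instance.CombinatorialAuctionGenerator(n_items=100, n_bids=500),
    "facilities": ecole.instance.CapacitatedFacilityLocationGenerator(n_facilities=100, n_customers=100),
    "indset": ecole.instance.IndependentSetGenerator(n_nodes=500, affinity=4),
}

for name, gen in generators.items():
    gen.seed(0)
    for i in range(10):                      # the paper uses 10000 training instances
        instance = next(gen)                 # an ecole.scip.Model
        instance.write_problem(f"{name}_train_{i}.lp")
```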

We generate 10000 D1D_{1} instances for training and 2000 D1D_{1} instances for validation, and we evaluate on 32 instances from each of D1D_{1} to D6D_{6}. For the GNN and AdaSolver-IL methods, we solve the training instances with SCIP 7.0.3 and record the branching samples and branching decisions of a strong branching expert. We collect 100000 samples for training and 20000 for validation in total. For the RL-based solvers, we limit the processing time of each episode to at most 1000s.
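A minimal sketch of collecting such expert samples through the Ecole branching environment is shown below, in the spirit of [10]; the exact observation functions and SCIP parameters we use may differ, so this configuration is an assumption for illustration.

```python
import ecole
import numpy as np

env = ecole.environment.Branching(
    observation_function=(ecole.observation.NodeBipartite(),
                          ecole.observation.StrongBranchingScores()),
    scip_params={"limits/time": 1000},
)

samples = []
for instance in ["setcover_train_0.lp"]:          # instances generated above
    obs, action_set, _, done, _ = env.reset(instance)
    while not done:
        node_obs, sb_scores = obs
        expert_action = action_set[np.argmax(sb_scores[action_set])]
        samples.append((node_obs, action_set, expert_action))
        obs, action_set, _, done, _ = env.step(expert_action)
```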

TABLE X: The hyperparameters for generating the instances.
Distributions D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Row 500 1000 2000 3000 4000 8000
Column 1000 1000 1000 1000 1000 1000
Set Covering
Distributions D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Items 100 200 300 400 500 600
Bids 500 1000 1500 2000 2500 3000
Combinatorial Auctions
Distributions D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Facilities 100 200 400 600 800 1600
Customers 100 100 100 100 100 100
Capacitated Facility Location
Distributions D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Nodes 500 1000 1500 2000 2500 3000
Affinity 4 4 4 4 4 4
Maximum Independent Set

C.2 Real-world Benchmarks for Evaluations

The real-world dataset we use in this paper is the Anonymous dataset from the NeurIPS 2021 ML4CO competition [68]. It consists of 98 training instances and 20 validation instances. Similarly to the synthesis benchmarks, we collect 100000 training samples and 20000 validation samples for GNN and AdaSolver-IL. For the RL-based solvers, we limit the processing time of each episode to at most 1000s.

C.3 Datasets for the Visualization

We generate ten distributions of the set covering benchmark for visualization using combinations of {500,1000}\{500,1000\} rows, 1000 columns and densities {0.01,0.05,0.1,0.2,0.5}\{0.01,0.05,0.1,0.2,0.5\}. The density hyperparameter controls the proportion of nonzero entries in the constraint matrices. We vary the density to obtain graphs with different local structures, simulating the environmental perturbations in real-world applications.

Appendix D Implementation Details

D.1 Hardware Specification and Experiment Settings

We conducted all the experiments on a single machine with NVIDIA GeForce RTX 3090 GPUs and Intel(R) Xeon(R) E5-2667 v4 CPUs @ 3.20GHz. During the evaluation, we run three random seeds {0,1,2}\{0,1,2\} for each instance.

D.2 Details of the Baselines

In this part, we introduce the details of the baselines used in this paper.

(1) GNN. We use the same implementation as the GCNN in [10]. First, we apply two interleaved half-convolutions to the input bipartite graphs. The graph convolution consists of two successive passes with 64 units, one from variables to constraints and one from constraints to variables. We then apply a final 2-layer perceptron to the variable node embeddings, combined with a masked softmax activation, to produce a probability distribution over the branching candidates. We also use the prenorm layers proposed in [10] and pre-train them.

(2) SVMRank and LambdaMART. We implement the SVMRank and LambdaMART branching policies in SCIP as in [10]. The two models take as input the 91-dimensional branching features described in [9] and are trained on pairs of variable state–rank samples for ranking. SVMRank is trained by minimizing a pairwise loss over variables, and LambdaMART is trained by maximizing the normalized discounted cumulative gain. As mentioned in [10], we reduce the dataset to 250000 candidate variables for training and 100000 for validation because of the prohibitive memory and time cost.

(3) Trees. The Trees model is implemented following [10]: we obtain the input variable features by concatenating the bipartite-graph variable features with the minimum, mean and maximum of the edge and constraint features over each variable's neighborhood. The model predicts a branching score for each candidate variable and is trained by minimizing the mean-squared error between the predicted scores and the ground-truth strong branching scores. We also reduce the size of the dataset for Trees.

(4) tMDP+DFS. tMDP+DFS is a reinforcement-learning-based method under the framework of tree Markov Decision Processes (tMDPs), a variant of temporal Markov Decision Processes. We implement the tMDP+DFS model with the two-interleaved-half-convolution GNN encoder described in [11]; the GNN architecture is the same as in [10]. tMDP+DFS is trained by REINFORCE with tree policy gradients and depth-first-search node selection.

(5) TreeGate. TreeGate is an imitation learning framework with input features of 25-dimensional branching candidate states and 60-dimensional search tree states [36]. TreeGate introduces such tree-state parameterization by first embedding the tree-state inputs and then sending them to the proposed gating layers to extract the tree-based signal. Combining the variable representations and tree-state representation, TreeGate predicts the branching scores for each branching candidate variable.

D.3 Representations of Instance Bipartite Graphs and Branching Samples

To avoid confusion, recall that the instance bipartite graphs are the input of the augmenter and do not contain solving statistics (since we augment an instance before solving it), while the branching samples are the input of the branching policy and do contain solving statistics. We use the MILP bipartite representation for instance graphs implemented in the Ecole 0.8 package [42], with nine variable features, one constraint feature and one edge feature, listed in Table XI. The branching samples use the graph representation described in [10], listed in Table XII.
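For concreteness, an instance bipartite graph can be extracted roughly as follows. Function and attribute names follow the Ecole documentation as we recall it and should be verified against the installed Ecole 0.8; the instance path is a placeholder.

import ecole

model = ecole.scip.Model.from_file("instance.lp")  # placeholder path to any MILP file

obs_func = ecole.observation.MilpBipartite()
obs_func.before_reset(model)
obs = obs_func.extract(model, False)

variable_features = obs.variable_features      # one row per variable, 9 features (Table XI)
constraint_features = obs.constraint_features  # one row per constraint, 1 feature (bias)
edge_indices = obs.edge_features.indices       # (2, nnz) index pairs of nonzero coefficients
edge_values = obs.edge_features.values         # the nonzero constraint coefficients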

TABLE XI: The variable features, constraint features and edge features used for instance bipartite graphs.
Index Variable Feature Name Description
0 Objective Objective coefficient
1-4 Variable type Variable type (binary, integer, implicit-integer, continuous)
5 Specified bounds Whether the variable has a lower bound
6 Specified bounds Whether the variable has an upper bound
7 Lower bound Whether the variable reaches its lower bound
8 Upper bound Whether the variable reaches its upper bound
Index Constraint Feature Name Description
0 Bias Normalized right-hand-side of the constraint
Index Edge Feature Name Description
0 Coefficient Constraint coefficient
TABLE XII: The variable features, constraint features and edge feature used for branching samples, following those proposed in [10].
Index Variable Feature Name Description
0-3 Variable type Variable type
4 Normalized coefficient Normalized objective coefficient
5 Specified bounds Whether the variable has a lower bound
6 Specified bounds Whether the variable has an upper bound
7 Lower bound Whether the variable reaches its lower bound
8 Upper bound Whether the variable reaches its upper bound
9 Solution fractionality Fractional part of the variable
10-13 Categorical Variable is at its bound or zero
14 Reduced cost Amount by which the objective coefficient of the variable should decrease so that the variable assumes a positive value in the LP solution
15 Age Number of LP iterations
16 Solution value Value of the variable
17 Incumbent value Value of the variable in the current best solution
18 Average incumbent value Average value in the observed feasible primal solutions
Index Constraint Feature Name Description
0 Cosine similarity Cosine of the angle between objective coefficients and the coefficients of this constraint
1 Bias Normalized right-hand-side of the constraint
2 Age Iterations since the last time the constraint was active
3 Normalized dual value Value of dual variable corresponding to the constraint
4 Bounds Whether the constraint reaches its bound
Index Edge Feature Name Description
0 Coefficient Constraint coefficient

D.4 Implementation Details of AdaSolver

D.4.1 Model Architecture of the Augmenter

The augmenter comprises three parts: an augmentation policy network, a state-value function network and a discriminator network. Each part has a GNN graph encoder and MLP decoders for the downstream tasks. The GNN graph encoder first encodes the node features of the instance bipartite graphs into 10-dimensional embeddings and then applies an interleaved half-convolution layer to extract 64-dimensional representations for the variables $\{\mathbf{h}_{i}^{V}\}_{i=1}^{n}$ and constraints $\{\mathbf{h}_{j}^{W}\}_{j=1}^{m}$, respectively. The augmentation policy $\Phi_{\eta}$ and the state-value function $V_{\alpha}$ share the same instance bipartite graph encoder. For the augmentation policy, we use three single-layer networks, $\text{MLP}_{1}(\mathbf{h}_{i}^{V})$, $\text{MLP}_{2}(\mathbf{h}_{j}^{W})$ and $\text{MLP}_{3}([\mathbf{h}_{i}^{V},\mathbf{h}_{j}^{W}])$, to obtain the masking probabilities for the $i$-th variable, the $j$-th constraint and the edge $\mathbf{e}_{i,j}$, respectively. We then choose the nodes and edges to mask according to the predicted probabilities with a predefined masking proportion. The state-value function uses one single-layer network to predict the current state value. The discriminator uses one single-layer network to predict the probability that an augmented instance comes from the original distribution.
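A minimal PyTorch sketch of these decoder heads is given below; for brevity it shows all heads operating on one shared set of embeddings, whereas in AdaSolver the discriminator has its own encoder, and the exact pooling and selection rules are simplified.

import torch
import torch.nn as nn

class AugmenterHeads(nn.Module):
    """Single-layer heads on top of the 64-dimensional variable/constraint
    embeddings: masking logits for variables, constraints and edges, a
    state-value estimate, and a discriminator score (simplified sketch)."""
    def __init__(self, dim=64):
        super().__init__()
        self.var_head = nn.Linear(dim, 1)        # MLP_1
        self.cons_head = nn.Linear(dim, 1)       # MLP_2
        self.edge_head = nn.Linear(2 * dim, 1)   # MLP_3 on [h_i^V, h_j^W]
        self.value_head = nn.Linear(2 * dim, 1)
        self.disc_head = nn.Linear(2 * dim, 1)

    def forward(self, h_vars, h_cons, edge_index):
        # edge_index[0]: variable indices, edge_index[1]: constraint indices.
        var_prob = torch.sigmoid(self.var_head(h_vars)).squeeze(-1)
        cons_prob = torch.sigmoid(self.cons_head(h_cons)).squeeze(-1)
        edge_repr = torch.cat([h_vars[edge_index[0]], h_cons[edge_index[1]]], dim=-1)
        edge_prob = torch.sigmoid(self.edge_head(edge_repr)).squeeze(-1)
        pooled = torch.cat([h_vars.mean(dim=0), h_cons.mean(dim=0)], dim=-1)
        value = self.value_head(pooled)
        real_prob = torch.sigmoid(self.disc_head(pooled))
        return var_prob, cons_prob, edge_prob, value, real_prob

def select_mask(probs, proportion):
    """One way to pick elements to mask under a fixed proportion: take the
    top-k by predicted probability (sampling would also fit the description)."""
    k = max(1, int(proportion * probs.numel()))
    return torch.topk(probs, k).indices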

D.4.2 Training and Optimization

We use the Adam optimizer [50] to train the networks in all experiments.

In the IL setting, we set an initial learning rate of $1\times 10^{-3}$ and a batch size of 8 for GNN, and a learning rate of $1\times 10^{-3}$ and a batch size of 4 for the augmenter. We divide the learning rate for GNN by 5 when the GNN validation loss does not improve for 10 epochs, and stop training if it does not improve for 20 epochs. The instance graph augmentation begins at Epoch 10. In each epoch after Epoch 10, we choose $k_{1}$ instances to augment and extract $k_{2}$ branching samples from the augmented instances. Here $k_{1}=50$, $k_{2}=1000$ in the combinatorial auctions benchmark; $k_{1}=100$, $k_{2}=5000$ in the other three synthetic benchmarks; and $k_{1}=10$, $k_{2}=1000$ in the real-world Anonymous dataset. Then we randomly choose 10000 branching samples from the original instances and 2000 samples from the augmented instances to train the GNN.
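The learning-rate schedule and early stopping described above can be sketched as follows; `run_epoch` is a hypothetical callable that trains for one epoch and returns the validation loss, so this illustrates the schedule rather than our exact training loop.

import torch

def fit_branching_gnn(gnn, run_epoch, max_epochs=1000):
    """Divide the learning rate by 5 when the validation loss stagnates for
    10 epochs and stop training after 20 stagnant epochs (sketch)."""
    optimizer = torch.optim.Adam(gnn.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.2, patience=10)  # factor 0.2 == divide by 5
    best, stagnant = float("inf"), 0
    for epoch in range(max_epochs):
        val_loss = run_epoch(gnn, optimizer, epoch)  # hypothetical helper
        scheduler.step(val_loss)
        if val_loss < best:
            best, stagnant = val_loss, 0
        else:
            stagnant += 1
            if stagnant >= 20:  # early stopping
                break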

In the RL setting, we set an initial learning rate of $1\times 10^{-6}$ and a batch size of 8 for tMDP+DFS, and a learning rate of $1\times 10^{-6}$ and a batch size of 4 for the augmenter. In each epoch, we choose 10 instances and run the branching agent on them. Every ten epochs, we augment 10 instances and run the agent on them as well. The maximum exploration time is set to no more than 1000s on each instance.

D.4.3 Hyperparameters

We list the common hyperparameters in Table XIII. For the augmentation operators and masking proportions, we mask 3% of the constraints and 1% of the edges on the set covering benchmark, 1% of the variables on the combinatorial auctions benchmark, 1% of the constraints and 1% of the variables on the capacitated facility location benchmark, 1% of the edges and 5% of the constraints on the maximum independent set benchmark, and 2% of the constraints and 1% of the variables on the Anonymous dataset.

TABLE XIII: Common hyperparameters for the augmenter in AdaSolver.
Hyperparameter Value
Half-convolution layers 1 for variables and 1 for constraints
Graph encoder hidden size 64
Batch size 4
PPO learning rate 1e-3
PPO entropy coefficient 1e-2
PPO epochs at each iteration 3
PPO buffer size 300
PPO clip ratio 0.2

D.4.4 Variants of AdaSolver

Random augmentation (RA). The branching policy network is the same as that of GNN. Unlike AdaSolver, RA does not have an augmentation network; instead, it randomly masks constraints, variables and edges under the same operators and proportions as AdaSolver.
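A minimal sketch of RA's masking step, assuming the per-benchmark proportions listed in Appendix D.4.3 (the function and argument names are illustrative):

import numpy as np

def random_mask(n_vars, n_cons, n_edges, var_prop=0.0, cons_prop=0.03, edge_prop=0.01, seed=None):
    """Uniformly sample the indices of variables, constraints and edges to mask.
    The default proportions correspond to the set covering benchmark."""
    rng = np.random.default_rng(seed)
    def pick(n, p):
        k = int(round(p * n))
        return rng.choice(n, size=k, replace=False) if k > 0 else np.array([], dtype=int)
    return pick(n_vars, var_prop), pick(n_cons, cons_prop), pick(n_edges, edge_prop)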

AdaSolver-REINFORCE. The branching policy network is the same as that of GNN. Unlike AdaSolver, AdaSolver-REINFORCE employs the REINFORCE algorithm to train the augmentation network; all other hyperparameters remain the same. REINFORCE is known to be less sample-efficient than PPO, which explains why AdaSolver-REINFORCE performs worse than AdaSolver.
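For reference, the two estimators differ as follows (standard textbook forms written with the augmentation policy $\Phi_\eta$; $R_t$ denotes the return, $\hat{A}_t$ an advantage estimate, and $\epsilon$ the clip ratio from Table XIII). REINFORCE updates with the plain score-function gradient
$$\nabla_{\eta} J(\eta) \approx \sum_{t} \nabla_{\eta} \log \Phi_{\eta}(a_t \mid s_t)\, R_t,$$
using each collected trajectory only once, whereas PPO reuses the collected samples for several epochs under the clipped surrogate objective
$$L^{\mathrm{CLIP}}(\eta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\eta)\hat{A}_t,\ \mathrm{clip}\big(r_t(\eta),\, 1-\epsilon,\, 1+\epsilon\big)\hat{A}_t\right)\right], \quad r_t(\eta) = \frac{\Phi_{\eta}(a_t \mid s_t)}{\Phi_{\eta_{\mathrm{old}}}(a_t \mid s_t)}.$$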

We will release all the code for training and end-to-end evaluation once the paper is accepted.

Appendix E More Experimental Results

In this part, we provide the results of the experiments in the main text in terms of the number of explored nodes (Nodes) and the primal-dual gap (PD gap). We first explain these metrics.

Nodes. The number of explored nodes reflects the size of the B&B search tree. However, Nodes alone may not be an appropriate measure of solving efficiency. First, fewer nodes do not necessarily mean a shorter solving time, since the solver may spend more time on each node, resulting in a longer total solving time. Second, on distributions where the solvers cannot find the optimal solutions within the time limit, Nodes does not reliably reflect solving performance: we cannot know how many nodes would be explored if the instance were solved to optimality, and, within a fixed time limit, fewer nodes may simply indicate that the solver spends its budget on harder nodes whose LP relaxations require more computation.

PD gap. (1) On distributions where the solvers find the optimal solutions within the time limit, the PD gap reaches zero and fewer nodes imply better solving performance. (2) On distributions where the solvers cannot find the optimal solutions within the time limit, the PD gap is a suitable measure of solving efficiency: given the time limit, a lower PD gap implies that the solver has found a better feasible solution so far.
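For concreteness, a common form of the relative primal-dual gap is
$$\text{PD gap} = \frac{|z_{\text{primal}} - z_{\text{dual}}|}{\max\{|z_{\text{primal}}|,\, |z_{\text{dual}}|,\, \varepsilon\}},$$
where $z_{\text{primal}}$ and $z_{\text{dual}}$ are the best primal and dual bounds found within the time limit and $\varepsilon$ is a small constant preventing division by zero; the exact convention used by SCIP may differ slightly, for example in how sign changes of the bounds are handled.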

E.1 More Results of Main Evaluation

We present more results of the comparative experiments. Table XIV reports the performance of AdaSolver and the baselines in terms of the total number of nodes and the PD gap. The results show that, on the easier distributions, AdaSolver-IL consistently explores significantly fewer nodes than tMDP+DFS and TreeGate, and slightly more nodes than GNN. Nevertheless, AdaSolver-IL solves faster, which means that it spends less time on each node and thus learns a more effective branching strategy. On the distributions where the solvers reach the time limit, AdaSolver-IL outperforms all the baselines in terms of the PD gap. We also observe that AdaSolver-IL explores more nodes than the other baselines within the limited time, which suggests that it learns a policy that selects easier nodes to explore.

TABLE XIV: More results on main evaluation in terms of Nodes and PD gap.
$D_1$ $D_2$ $D_3$ $D_4$ $D_5$ $D_6$
Method Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap
RPB 247.59 0.00 5866.93 0.00 52814.36 0.05 25630.00 0.10 18279.17 0.15 3910.22 0.26
SVMRank 332.60 0.00 5478.59 0.00 20326.22 0.03 5152.89 0.13 2382.83 0.19 798.75 0.28
Trees 396.46 0.00 5960.50 0.00 17705.65 0.07 6323.89 0.13 3552.26 0.18 1021.86 0.28
LambdaMart 327.42 0.00 5928.39 0.00 28178.68 0.03 7501.61 0.13 3472.22 0.18 893.93 0.28
tMDP+DFS 2324.27 0.00 35037.37 0.02 34062.76 0.04 11145.84 0.17 5365.54 0.22 1519.11 0.31
TreeGate 1242.64 0.00 12826.16 0.004 19133.68 0.03 6901.62 0.15 3144.67 0.20 568.44 0.30
GNN 259.61 0.00 3913.11 0.00 28926.49 0.05 14676.73 0.11 7162.61 0.16 1120.95 0.27
AdaSolver-RL 2227.30 0.00 35936.70 0.02 32528.50 0.12 15368.45 0.17 8476.60 0.21 1653.05 0.32
AdaSolver-IL 270.69 0.00 4361.15 0.00 40711.45 0.05 30649.53 0.10 19573.54 0.14 2367.05 0.27
Set Covering
$D_1$ $D_2$ $D_3$ $D_4$ $D_5$ $D_6$
Method Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap
RPB 42.00 0.00 1616.12 0.00 14154.73 0.00 24072.15 0.01 15178.22 0.02 9185.16 0.02
SVMRank 93.02 0.00 1604.26 0.00 11150.03 0.004 9843.47 0.01 7564.52 0.02 6019.93 0.02
Trees 102.49 0.00 3487.29 0.00 29083.64 0.005 16295.54 0.01 12398.58 0.02 10584.14 0.02
LambdaMart 87.58 0.00 1587.87 0.00 14109.08 0.003 15478.90 0.01 11426.58 0.02 8965.19 0.02
tMDP+DFS 141.69 0.00 3290.90 0.00 27826.58 0.003 47398.77 0.01 28752.17 0.02 16827.58 0.02
TreeGate 527.47 0.00 4473.12 0.00 26772.53 0.003 21841.75 0.02 15610.67 0.02 6948.79 0.03
GNN 84.99 0.00 1306.10 0.00 11710.30 0.00 32724.97 0.009 21995.83 0.01 15017.98 0.02
AdaSolver-RL 191.85 0.00 2308.95 0.00 31815.45 0.00 58011.39 0.01 38012.95 0.02 22258.45 0.02
AdaSolver-IL 102.71 0.00 1953.57 0.00 39556.67 0.00 36135.36 0.008 27599.51 0.01 17599.67 0.01
Combinatorial Auctions
$D_1$ $D_2$ $D_3$ $D_4$ $D_5$ $D_6$
Method Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap
RPB 387.31 0.00 413.93 0.00 145.20 0.002 11.78 0.002 2.38 0.02 1.20 0.07
SVMRank 491.71 0.00 533.21 0.00 337.59 0.002 34.63 0.004 10.46 0.02 2.46 0.09
Trees 508.60 0.00 588.19 0.00 317.75 0.002 31.00 0.004 8.22 0.02 2.78 0.09
LambdaMart 445.69 0.00 520.89 0.00 338.38 0.002 34.79 0.004 10.51 0.02 2.52 0.09
tMDP+DFS 723.95 0.00 788.96 0.00 361.68 0.002 42.21 0.002 7.68 0.02 2.89 0.09
TreeGate 584.64 0.00 631.03 0.00 299.50 0.003 46.27 0.004 2.93 0.007 1.44 0.05
GNN 458.11 0.00 592.82 0.00 290.25 0.002 73.76 0.003 16.22 0.005 3.17 0.07
AdaSolver-RL 1024.40 0.00 839.95 0.00 392.70 0.002 116.00 0.004 36.08 0.004 8.13 0.04
AdaSolver-IL 529.61 0.00 659.79 0.00 487.49 0.002 190.46 0.002 59.84 0.003 3.72 0.08
Capacitated Facility Location
$D_1$ $D_2$ $D_3$ $D_4$ $D_5$ $D_6$
Method Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap
RPB 594.46 0.00 16097.93 0.00 17730.83 0.03 900.00 5387.45 900.00 1874.07 900.01 854.44
SVMRank 289.60 0.00 5757.21 0.01 5097.74 0.02 1536.43 0.03 909.91 0.04 507.74 0.05
Trees 653.03 0.00 13349.63 0.004 7205.15 0.02 2037.52 0.03 1444.09 0.04 912.21 0.05
LambdaMart 425.62 0.00 12268.72 0.005 8108.65 0.02 1862.61 0.03 1084.89 0.04 605.03 0.05
tMDP+DFS 459.08 0.00 21183.04 0.003 23298.80 0.02 7350.35 0.03 3615.30 0.04 1647.60 0.05
TreeGate 2136.85 0.003 8585.86 0.01 3516.26 0.01 1198.28 0.04 925.65 0.05 999.87 0.05
GNN 285.58 0.00 10623.60 0.001 16520.12 0.02 3809.20 0.03 1392.95 0.04 679.15 0.04
AdaSolver-RL 351.50 0.00 22448.85 0.00 30533.75 0.01 12887.25 0.03 7027.05 0.04 4016.80 0.05
AdaSolver-IL 241.12 0.00 10664.04 0.004 20861.74 0.01 19596.78 0.03 12479.65 0.04 6548.56 0.05
Maximum Independent Set

E.2 More Results of Ablation Studies

We compare the Nodes and PD gap metrics in the ablation studies. The results in Table XV show that AdaSolver outperforms RA in terms of the PD gap (solving-time comparisons are reported in the main text), demonstrating the effectiveness of our proposed adversarial instance augmenter.

TABLE XV: More results on the ablation studies in terms of Nodes and PD gap.
$D_1$ $D_2$ $D_3$ $D_4$ $D_5$ $D_6$
Method Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow
RA 269.83 0.00 5676.33 0.00 44928.63 0.06 14929.49 0.18 8362.64 0.23 3301.91 0.31
AdaSolver-IL 270.69 0.00 4361.15 0.00 40711.45 0.05 30649.53 0.10 19573.54 0.14 2367.05 0.27
RA 2227.10 0.00 36310.10 0.02 31287.20 0.12 10865.95 0.18 5084.90 0.22 815.05 0.33
AdaSolver-RL 2227.30 0.00 35936.70 0.02 32528.50 0.12 15368.45 0.17 8476.60 0.21 1653.05 0.32
Set Covering
$D_1$ $D_2$ $D_3$ $D_4$ $D_5$ $D_6$
Method Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes\downarrow PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow
RA 101.24 0.00 2491.01 0.00 43006.75 0.00 29486.44 0.009 21203.95 0.01 14441.14 0.02
AdaSolver-IL 102.71 0.00 1953.57 0.00 39556.67 0.00 36135.36 0.008 27599.51 0.01 17599.67 0.01
RA 199.05 0.00 2237.20 0.00 30545.60 0.00 58141.90 0.01 31930.00 0.02 15079.60 0.03
AdaSolver-RL 191.85 0.00 2308.95 0.00 31815.45 0.00 58011.39 0.01 38012.95 0.02 22258.45 0.02
Combinatorial Auctions
$D_1$ $D_2$ $D_3$ $D_4$ $D_5$ $D_6$
Method Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow
RA 473.79 0.00 608.75 0.00 438.14 0.002 89.51 0.004 29.80 0.005 9.52 0.05
AdaSolver-IL 529.61 0.00 659.79 0.00 487.49 0.002 190.46 0.002 59.84 0.003 3.72 0.08
RA 1018.60 0.00 854.90 0.00 468.90 0.002 98.50 0.003 7.90 0.008 2.15 0.11
AdaSolver-RL 1024.40 0.00 839.95 0.00 392.70 0.002 116.00 0.004 36.08 0.004 8.13 0.04
Capacitated Facility Location
$D_1$ $D_2$ $D_3$ $D_4$ $D_5$ $D_6$
Method Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow
RA 226.79 0.00 40493.42 0.00 31616.33 0.01 14390.99 0.03 7323.71 0.04 4721.40 0.05
AdaSolver-IL 241.12 0.00 10664.04 0.004 20861.74 0.01 19596.78 0.03 12479.65 0.04 6548.56 0.05
RA 323.60 0.00 23020.05 0.007 27906.65 0.01 9073.45 0.03 2974.65 0.04 1434.90 0.05
AdaSolver-RL 351.50 0.00 22448.85 0.00 30533.75 0.01 12887.25 0.03 7027.05 0.04 4016.80 0.05
Maximum Independent Set

E.3 More Results of Longer Time Limit

We evaluate our proposed AdaSolver with a much longer time limit of two hours on the Anonymous dataset to investigate whether AdaSolver still performs well under a time limit far beyond that used during training. We train the models with a time limit of 900s. The results in Table XVI demonstrate that AdaSolver significantly improves the performance of the underlying solvers.

TABLE XVI: Results on the real-world dataset with a longer time limit, showing that AdaSolver significantly outperforms the underlying solvers.
Method Time(s) Nodes PD integral \downarrow PD gap \downarrow
RPB 6703.81 300709.45 435985.03 1.50e+19
tMDP+DFS 5647.83 338191.85 241413.64 1.02
GNN 5670.69 308766.63 206413.12 1.13
AdaSolver-RL 5514.66 358002.10 211590.82 0.86
AdaSolver-IL 5640.18 295459.21 203917.75 1.04

E.4 More Results on the Sample Efficiency

We compare the Nodes and PD gap metrics in the experiments of Section 8.3. The results in Table XVII show that AdaSolver significantly outperforms the underlying baselines in terms of the PD gap when trained on limited instances, demonstrating a much higher sample efficiency. Notice that on the distributions where the solvers can solve the instances within the time limit, AdaSolver-IL also explores significantly fewer nodes.

TABLE XVII: Sample efficiency of our proposed AdaSolver-IL. The results demonstrate that AdaSolver-IL significantly improves the sample efficiency when the training instances are few.
$D_1$ $D_2$ $D_3$ $D_4$ $D_5$ $D_6$
Method Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow
GNN 769.60 0.00 15754.15 0.00 37577.55 0.10 26173.75 0.15 18570.70 0.20 4854.60 0.30
AdaSolver-IL 445.25 0.00 10010.15 0.00 49802.45 0.08 38256.80 0.13 28539.60 0.17 7270.45 0.28
Set Covering
$D_1$ $D_2$ $D_3$ $D_4$ $D_5$ $D_6$
Method Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow
GNN 217.50 0.00 3532.60 0.00 34269.20 0.004 33319.85 0.01 23282.75 0.02 16594.65 0.03
AdaSolver-IL 107.93 0.00 1714.28 0.00 14221.71 0.00 31613.71 0.01 23753.37 0.01 16616.03 0.02
Combinatorial Auctions