
Promoting Generalization for Exact Solvers via Adversarial Instance Augmentation

Haoyang Liu, Yufei Kuang, Jie Wang, Xijun Li, Yongdong Zhang, and Feng Wu H. Liu, Y. Kuang, J. Wang, X. Li, Y. Zhang, and F. Wu are with: a) CAS Key Laboratory of Technology in GIPAS, University of Science and Technology of China, Hefei 230027, China; b) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei 230091, China. E-mail: [email protected], [email protected], [email protected], [email protected], [email protected]. X. Li is with Huawei Noah’s Ark Lab. E-mail: [email protected]. Manuscript received August 2023.
Abstract

Machine learning has been successfully applied to improve the efficiency of Mixed-Integer Linear Programming (MILP) solvers. However, learning-based solvers often suffer from severe performance degradation on unseen MILP instances—especially on large-scale instances from a perturbed environment—due to the limited diversity of training distributions. To tackle this problem, we propose Adversarial Instance Augmentation for Solver (AdaSolver), a novel approach that promotes data diversity for learning-based branching modules in branch-and-bound (B&B) solvers and does not require knowledge of the problem type to generate new instances. We use bipartite graph representations of MILP instances and obtain various perturbed instances to regularize the solver by augmenting the graph structures with a learned augmentation policy. The major technical contribution of AdaSolver is that we formulate the non-differentiable instance augmentation as a contextual bandit problem and adversarially train the learning-based solver and the augmentation policy, enabling efficient gradient-based training of the augmentation policy. To the best of our knowledge, AdaSolver is the first general and effective framework for understanding and improving the generalization of both imitation-learning-based (IL-based) and reinforcement-learning-based (RL-based) B&B solvers. Extensive experiments demonstrate that by producing various augmented instances, AdaSolver leads to remarkable efficiency improvements across various distributions.

Index Terms:
Mixed-Integer Linear Programming, Branching, Graph Neural Network, Generalization and Robustness.

1 Introduction

Combinatorial optimization (CO) problems are among the most fundamental optimization problems, arising widely in scheduling [1], chip design [2], route planning [3], and other industrial scenarios. Generally, solving CO problems is time-consuming and computationally expensive due to their NP-hard nature. In practice, a large family of CO problems can be formulated as Mixed-Integer Linear Programs (MILPs). Exact MILP solvers, such as the B&B solvers SCIP [4] and Gurobi [5], usually rely on hand-crafted heuristics that require considerable manual tuning and complex working flows [6, 7]. To speed up the solvers, recent attempts have leveraged fast-inference learning-based models to approximate some complex heuristics [8, 9, 10] or explore more effective heuristics in solvers [11, 12], and have achieved promising in-distribution efficiency improvements.

However, the generalization of learning-based solving models to out-of-distribution (OOD) scenarios remains a significant challenge for their real-world applications. First, the solvers are often trained on a given dataset and need to process instances that vary widely in scale in real-world scenarios. The learning-based solvers may thus process instances with much larger scales than the training instances in practice. Second, environment changes often perturb the instance structures. For example, in staff scheduling problems, emergencies may impose extra restrictions on the jobs. We then need to consider extra kinds of constraints representing restrictions that are not included in the previous MILP problems, leading to slight differences in the problem structures. Third, acquiring or generating more data from other distributions or with larger scales may not be feasible due to information security/privacy concerns or high data acquisition costs [13, 14]. For example, in a company’s daily job-shop scheduling problems, confidential business information, such as production capacity, can be estimated from the MILP instances. Thus, the company is likely to provide limited and anonymous training data to avoid privacy disclosure.

Existing work attempts to improve the generalization of learning-based solvers by generating instances from different distributions and then employing domain-randomization-type approaches [15, 16]. These approaches have severe limitations when it comes to anonymous datasets, since no prior knowledge of the problem type is available to run data generation algorithms. Therefore, this paper focuses on the restrictive practical setting where training data only comes from limited training domains, and acquiring a significant number of new instances within a short time is intractable. Experiments in such a setting have demonstrated that the solving time can be much longer than that of traditional heuristics under distributional shifts, especially on larger-scale MILP instances, which hinders the applicability of neural solvers to real-world scenarios.

In such a practical setting, we represent MILP instances as bipartite graphs [10] and propose to use graph data augmentation—a data manipulation technique—to promote the diversity of graph structures in the training distributions. Unlike domain randomization, data augmentation perturbs the graph structures and does not require additional training instances or information on problem types. Moreover, we theoretically justify that data augmentation can reduce the generalization bound to unseen distributions in Section 7. We employ a learnable function as an augmentation policy network to explore effective augmentation strategies. However, two challenges in the augmentation procedure remain to be addressed. First, it is intractable to train the augmentation policy with gradient-based methods since modifying the graph structures is generally a non-differentiable operation, posing challenges for backward propagation. Second, the augmentation policy network receives significantly less training data than the learning-based solver, since the augmentation policy takes MILP instances as training data whereas each augmented instance generates a whole sequence of branching samples as training data for the learning-based solver. The training of the augmentation policy thus lags behind that of the solver. These challenges of non-differentiability and unbalanced data require a practicable and sample-efficient training framework.

To address the aforementioned challenges, we propose Adversarial Instance Augmentation for Solver (AdaSolver), a robust optimization framework for learning-based B&B solvers. To the best of our knowledge, AdaSolver is the first general framework for understanding and improving the robustness and generalization ability of both IL-based and RL-based B&B solvers under real-world restrictions. Specifically, we focus on the competitive graph neural network (GNN) approach for learning branching heuristics, resulting in two variants, AdaSolver-IL and AdaSolver-RL, for the IL and RL settings. AdaSolver comprises two components: (1) a GNN branching policy for branching variable selection and (2) an adversarial augmentation network (augmenter) that simulates real-world perturbations by modifying the structures of MILPs’ bipartite graph representations and promoting structural diversity in the newly augmented instances. To train the augmenter, we formulate the adversarial graph augmentation as a contextual bandit problem, which offers two key advantages. First, the contextual bandit formulation enables gradient-based training of the augmenter that controls the non-differentiable graph structure augmentation. Second, under this formulation, we can employ a sample-efficient decision model adapted from [17] that learns fast with a limited amount of data. Furthermore, to make sure that the augmented instances do not deviate severely from the training distributions, we leverage a discriminator network to filter out bad instances and avoid disrupting the model training.

We conduct experiments on five benchmarks and compare our approach with the baselines on in-distribution and out-of-distribution testing instances. Our experiments demonstrate that AdaSolver significantly enhances solvers’ robustness to environmental changes and consistently improves the solving efficiency across various distributions.

2 Related Work

2.1 Machine Learning for Combinatorial Optimization

Learning to optimize CO problems has attracted increasing interest in the operations research and machine learning communities [13, 18]. Existing methods can be roughly classified into two categories. One focuses on solving specific CO problems in an end-to-end style, such as the Traveling Salesperson Problem (TSP) [19, 20, 21], the Satisfiability Problem (SAT) [22, 23], the Vehicle Routing Problem (VRP) [21, 24] and the Maximum Cut Problem (MCP) [25]. The other category employs learning-based methods to approximate heuristics in exact B&B solvers, such as branching [9, 10, 26, 27], node selection [8], and cut selection [28, 12]. Built on exact solvers, these methods enjoy optimality guarantees and can be applied to solve general MILP instances. This paper concentrates on the branching module in exact solvers, since the computational efficiency and the quality of the variables selected during branching have a great impact on the overall B&B solving efficiency [29].

Recently, researchers have paid more attention to the generalization and robustness of end-to-end neural solvers. For example, existing works improve the generalization of end-to-end neural solvers by curriculum learning [15, 30], group distributionally robust optimization [31], knowledge distillation [32], generative adversarial training [33, 16], and so on. These methods require knowledge of the problem type for instance generation, and most of them focus on routing problems since instance generation for these problems is not costly. [34] also proposes label-preserving augmentations tailored to SAT problems. However, due to the restrictive practical setting mentioned in the introduction, these approaches are not applicable to exact B&B solvers when the instances are anonymous or hard to generate. For popular exact CO solvers, [35] proposes an adversarial attack approach to evaluate solvers’ robustness. [36] aims to promote generalization across heterogeneous MILPs in the IL setting. Nonetheless, designing robust learning-based exact solvers in both the IL and RL settings remains largely unexplored.

2.2 Machine Learning for Branching

Existing works on learning to branch include imitation-learning-based (IL-based) and reinforcement-learning-based (RL-based) algorithms. IL-based methods assume a branching expert and leverage a machine learning model to mimic the branching strategy of this expert. For example, [9] uses a support vector machine (SVM) to imitate a strong branching expert; [10, 27] represent each branching state as a bipartite graph and employ GNNs to improve the branching performance; [36] mimics given optimal solutions as oracles with tree search. RL-based methods do not require a branching expert and explore the branch-and-bound trees to improve the heuristics automatically. For example, [37] first uses reinforcement learning for variable selection; other works use evolution strategies [38], retrospective branching trajectories [39], a superior Q-network trained with mixture data [40], and tree Markov decision processes [11] to further improve the performance of RL-based branching policies.

3 Background

3.1 Branch-and-Bound Algorithm

In practice, most CO problems can be formulated as Mixed-Integer Linear Programs (MILPs) of the form

$$\min_{\mathbf{x}\in\mathbb{R}^{n}}\ \mathbf{c}^{\top}\mathbf{x},\quad \text{ s.t. }\ \mathbf{A}\mathbf{x}\circ\mathbf{b},\ \ \mathbf{l}\leq\mathbf{x}\leq\mathbf{u},\ \ \mathbf{x}\in\mathbb{Z}^{p}\times\mathbb{R}^{n-p}, \qquad (1)$$

where $\mathbf{c}\in\mathbb{R}^{n}$, $\mathbf{A}\in\mathbb{R}^{m\times n}$, $\mathbf{b}\in\mathbb{R}^{m}$, $\circ\in\{\leq,=,\geq\}^{m}$, and $\mathbf{l}\in(\mathbb{R}\cup\{-\infty\})^{n}$ and $\mathbf{u}\in(\mathbb{R}\cup\{+\infty\})^{n}$ denote the lower and upper variable bounds, respectively. The branch-and-bound (B&B) algorithm solves MILP instances with a divide-and-conquer approach. B&B recursively performs the following steps. First, it solves the linear relaxation of a selected subproblem, starting with the original MILP. It identifies an integer variable $x_{i}$ with fractional value $x_{i}^{*}$ in the current LP solution and generates two subproblems by adding the constraints $x_{i}\leq\lfloor x_{i}^{*}\rfloor$ and $x_{i}\geq\lceil x_{i}^{*}\rceil$ to the current subproblem, respectively. B&B then chooses a new subproblem to explore next. As the algorithm proceeds, the subproblems form a binary search tree, where the root node represents the original MILP and the left and right children of a subproblem are the two new subproblems created in the previous step. If a subproblem is infeasible or does not contain optimal solutions, the corresponding subtree is pruned.
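For concreteness, the following minimal sketch implements the recursion above for a pure integer program, using SciPy's LP solver for the relaxations. The depth-first node selection and the most-fractional branching rule are illustrative assumptions only; production solvers such as SCIP use far more sophisticated heuristics.

```python
import math
import numpy as np
from scipy.optimize import linprog

def branch_and_bound(c, A_ub, b_ub, bounds, tol=1e-6):
    """Minimal depth-first B&B for min c^T x s.t. A_ub x <= b_ub, all variables integer."""
    best_obj, best_x = math.inf, None
    stack = [bounds]                                   # each node: a list of (lower, upper) bounds
    while stack:
        node_bounds = stack.pop()
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=node_bounds, method="highs")
        if not res.success or res.fun >= best_obj:
            continue                                   # prune infeasible or dominated subproblems
        frac = [i for i, v in enumerate(res.x) if abs(v - round(v)) > tol]
        if not frac:                                   # integral LP solution: new incumbent
            best_obj, best_x = res.fun, res.x
            continue
        i = max(frac, key=lambda j: abs(res.x[j] - round(res.x[j])))  # most-fractional rule
        lo, hi = node_bounds[i]
        left, right = list(node_bounds), list(node_bounds)
        left[i] = (lo, math.floor(res.x[i]))           # x_i <= floor(x_i*)
        right[i] = (math.ceil(res.x[i]), hi)           # x_i >= ceil(x_i*)
        stack += [left, right]
    return best_obj, best_x

# Tiny example: max x0 + x1 (written as a minimization) under two knapsack-style rows.
obj, x = branch_and_bound(np.array([-1.0, -1.0]),
                          np.array([[2.0, 1.0], [1.0, 3.0]]),
                          np.array([4.0, 5.0]),
                          [(0, None), (0, None)])
print(obj, x)
```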

A key factor influencing the efficiency of the B&B algorithm is how to select a fractional variable to branch on. So far, the strong branching policy [41] has been found to produce the smallest B&B trees among the existing heuristics. Strong branching selects a branching variable by evaluating the expected bound improvement for each candidate variable.

3.2 Instance Bipartite Graph and Branching Samples

We represent each MILP instance (1) as a weighted bipartite graph $G=(W\cup V,E)$ as follows. The vertex sets $W$ and $V$ represent the MILP’s constraint set and variable set, respectively. Each node in $W$ or $V$ has its own features $\mathbf{w}$ or $\mathbf{v}$ containing the information of the corresponding constraint or variable. The adjacency matrix $E$ is the constraint matrix $\mathbf{A}$. We let $\mathcal{G}$ denote the set of all such bipartite graphs, called instance bipartite graphs.

Given a MILP instance, the B&B solver recursively solves a sequence of subproblems, which are also MILPs. We can thus represent each subproblem as a bipartite graph as we did for the MILP instances. Furthermore, we would like to include some solver statistics in the features of these bipartite graphs, such as the reduced costs and incumbent values, to fully observe the solving states [10]. We call these observed bipartite graphs with solver statistics branching samples, denoted by $\mathbf{s}$.

Figure 1: A simple example of a MILP instance bipartite graph. The $i^{th}$ constraint node features $\mathbf{w}_{i}$ are of the form $(b_{i},\circ_{i})$. The $j^{th}$ variable node features $\mathbf{v}_{j}$ are $(c_{j},l_{j},u_{j},\tau_{j})$, where $\tau_{j}$ is an integer indicator such that $\tau_{j}=1$ if $x_{j}$ is an integer variable and $\tau_{j}=0$ otherwise. In practice, we prefer the more complex instance bipartite graph representations with hand-crafted features (please refer to the MILP Bipartite class in [42] for more details).
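As a minimal sketch of this representation (using only the simple features of Figure 1, not the richer hand-crafted features of [42]), the bipartite graph of instance (1) can be assembled from the raw problem data as follows; the numeric encoding of the constraint senses is an illustrative choice.

```python
import numpy as np

def milp_to_bipartite(c, A, b, senses, lb, ub, is_int):
    """Build the simple bipartite representation of Figure 1.

    Constraint node i carries (b_i, sense_i); variable node j carries (c_j, l_j, u_j, tau_j);
    the nonzeros of A define the (weighted) edges between the two node sets.
    """
    sense_code = {"<=": -1.0, "=": 0.0, ">=": 1.0}      # illustrative numeric encoding of o_i
    constraint_feats = np.array([[b_i, sense_code[s]] for b_i, s in zip(b, senses)])
    variable_feats = np.stack([c, lb, ub, is_int.astype(float)], axis=1)
    edge_index = np.argwhere(A != 0)                    # (constraint i, variable j) pairs
    edge_weight = A[A != 0]
    return constraint_feats, variable_feats, edge_index, edge_weight

# Toy instance: min x0 + 2*x1  s.t.  x0 + x1 >= 1,  x0 <= 3,  both variables integer.
W, V, E, w = milp_to_bipartite(
    c=np.array([1.0, 2.0]), A=np.array([[1.0, 1.0], [1.0, 0.0]]),
    b=np.array([1.0, 3.0]), senses=[">=", "<="],
    lb=np.zeros(2), ub=np.array([3.0, np.inf]), is_int=np.array([True, True]))
print(W.shape, V.shape, E.shape)  # (2, 2) (2, 4) (3, 2)
```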

4 IL- and RL-based Branching Formulation

The sequential branching decisions made during B&B can be formulated as a Markov decision process (MDP) [8], where the solver and branching module are the environment and agent, respectively.

We consider the solving process as a finite-horizon MDP, called the branching MDP, with maximal time $T$, specified by the tuple $(\mathcal{S},\mathcal{A},\mathcal{I},\mathcal{T},r)$. Here $\mathcal{S}$ represents the state space of branching samples, $\mathcal{A}$ is the action space consisting of candidate branching variables, and $r:\mathcal{S}\to\mathbb{R}$ is the reward function. The initialization function $\mathcal{I}:\mathbb{R}\to\mathcal{G}$ takes in a random noise $\xi$ and samples from $p(G)$ to produce the instance $G=\mathcal{I}(\xi)$, where we view the instance graph $G$ as the initial state $\mathbf{s}_{0}$. The transition function $\mathcal{T}:\mathcal{S}\times\mathcal{A}\to\mathcal{S}$ is deterministic since the branching process rigorously follows universal mathematical rules. A policy $\pi\in\Pi:\mathcal{S}\to\mathcal{A}$ maps an observed state to an action using the corresponding branching strategy.

At time $t$, the branching module observes a branching sample $\mathbf{s}_{t}$ corresponding to the current subproblem. The branching module then selects a variable $a_{t}$ from the branching candidates $\mathcal{A}$ according to the branching policy $\pi$. Finally, the solver splits the current subproblem using $a_{t}$ to generate two new subproblems and decides the next subproblem $\mathbf{s}_{t+1}=\mathcal{T}(\mathbf{s}_{t},a_{t})$ on which to perform branching.
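This interaction loop matches the interface exposed by the Ecole library; the snippet below only illustrates the branching MDP with a trivial placeholder policy, and its use of Ecole is our assumption rather than a description of the released implementation.

```python
import ecole

# Branching environment: states are NodeBipartite branching samples, actions are candidate variables.
env = ecole.environment.Branching(observation_function=ecole.observation.NodeBipartite())
instances = ecole.instance.SetCoverGenerator(n_rows=500, n_cols=1000)

obs, action_set, reward, done, info = env.reset(next(instances))  # s_0 built from the instance graph
while not done:
    action = action_set[0]                  # placeholder policy: take the first branching candidate
    obs, action_set, reward, done, info = env.step(action)        # deterministic transition T
```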

In practice, we only have access to a finite set of training instances $\{G_{i}\}_{i=1}^{n}$, and the testing instances may be drawn from an unknown environment $p^{\prime}(G)$ different from the training distribution $p(G)$. The distributional shifts influence the initialization function in the testing environment, which we denote by $\mathcal{I}^{\prime}$.

To describe the relationship between training and testing distributions, we first define a cost function $c:\mathcal{G}\times\mathcal{G}\to\mathbb{R}$ to measure the discrepancy between two instance graphs. We make the following assumptions. (A1) The transition functions in the training and testing environments remain the same since the solver follows rigorous mathematical rules and a fixed node selection policy. (A2) The transition function $\mathcal{T}$ is $\text{Lip}_{\mathcal{T}_{1}}$-Lipschitz in the first variable and $\text{Lip}_{\mathcal{T}_{2}}$-Lipschitz in the second variable, that is,

$$\|\mathcal{T}(\mathbf{s},a)-\mathcal{T}(\mathbf{s}^{\prime},a)\| \leq \text{Lip}_{\mathcal{T}_{1}}\|\mathbf{s}-\mathbf{s}^{\prime}\|,\quad \forall\,\mathbf{s},\mathbf{s}^{\prime}\text{ and }a;$$
$$\|\mathcal{T}(\mathbf{s},a)-\mathcal{T}(\mathbf{s},a^{\prime})\| \leq \text{Lip}_{\mathcal{T}_{2}}\|a-a^{\prime}\|,\quad \forall\,a,a^{\prime}\text{ and }\mathbf{s},$$

where the norm $\|\cdot\|$ is a general and simplified notation that can represent the norms in the state space or the action space. (A3) Each policy $\pi\in\Pi$ is Lipschitz continuous with Lipschitz constant $\text{Lip}_{\pi}$.

4.1 IL-based branching formulation

To solve the branching MDP mentioned above, we can employ imitation learning or reinforcement learning to train the branching agent. IL-based branching does not require the reward signal; instead, it assumes a strong branching expert with policy $\pi_{E}$ that is considered to be near optimal. A predefined dataset of branching samples and branching variables $\{(\mathbf{s}_{k},a_{E,k})\}_{k=1}^{K}$ is obtained by running the expert on the training instances $\{G_{i}\}_{i=1}^{n}$. Given the dataset, IL clones the behavior of the expert by enforcing the learned agent to make the same decision as the expert under the same observation,

$$\min_{\pi}\ \mathbb{E}_{G\sim p(G)}\,\mathbb{E}_{(\mathbf{s},a_{E})\sim D_{\pi_{E},G,\mathcal{T}}}\left[\ell(\pi(\mathbf{s}),a_{E})\right], \qquad (2)$$

where $\ell(\cdot,\cdot)$ is the loss function and $D_{\pi_{E},G,\mathcal{T}}$ is the joint distribution of the sequence of states and actions when running the expert $\pi_{E}$ on instance $G$, i.e.,

$$D_{\pi_{E},G,\mathcal{T}}=p(G)\,\pi_{E}(a_{0}|G)\prod_{t=1}^{T-1}p(\mathbf{s}_{t+1}|\mathbf{s}_{t},a_{t})\,\pi_{E}(a_{t}|\mathbf{s}_{t}),$$

which is kept fixed during training. To simplify the notation, we let $L(\pi,G,\mathcal{T}):=\mathbb{E}_{(\mathbf{s},a_{E})\sim D_{\pi_{E},G,\mathcal{T}}}\left[\ell(\pi(\mathbf{s}),a_{E})\right]$.
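A minimal behavior-cloning step for objective (2) is sketched below in PyTorch; the linear scorer standing in for the GNN policy and the data layout are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def imitation_step(policy_net, optimizer, states, expert_actions):
    """One gradient step of (2): cross-entropy between pi(s) and the expert's branching choice."""
    logits = policy_net(states)                      # (batch, n_candidates) scores over variables
    loss = F.cross_entropy(logits, expert_actions)   # l(pi(s), a_E)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: a linear "policy" scoring 10 candidate variables from 16-dimensional states.
policy = torch.nn.Linear(16, 10)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
states = torch.randn(32, 16)
expert_actions = torch.randint(0, 10, (32,))
print(imitation_step(policy, opt, states, expert_actions))
```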

4.2 RL-based branching formulation

In the RL setting, the agent explores the B&B tree using the current policy $\pi$ and improves the policy by minimizing the negative accumulated reward,

$$\min_{\pi}\ \mathbb{E}_{G\sim p(G)}\,\mathbb{E}_{\mathbf{s}\sim D_{\pi,G,\mathcal{T}}}\left[\sum_{t=0}^{T-1}-\gamma^{t}r(\mathbf{s}_{t})\right], \qquad (3)$$

where we note that the state distribution $D_{\pi,G,\mathcal{T}}$ depends on the policy $\pi$, the instance $G$ and the transition function $\mathcal{T}$. We let $R(\pi,G,\mathcal{T}):=\mathbb{E}_{\mathbf{s}\sim D_{\pi,G,\mathcal{T}}}\left[\sum_{t=0}^{T-1}-\gamma^{t}r(\mathbf{s}_{t})\right]$.
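As a rough illustration of objective (3) only, the REINFORCE-style surrogate below weights the log-probabilities of the taken branching actions by discounted returns; the paper's RL solver instead builds on the tree MDP formulation of [11], so this simplification is an assumption.

```python
import torch

def policy_gradient_loss(log_probs, rewards, gamma=0.99):
    """REINFORCE surrogate for (3): weight each log pi(a_t|s_t) by the discounted return-to-go."""
    returns, g = [], 0.0
    for r in reversed(rewards):              # accumulate discounted returns backwards in time
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    return -(torch.stack(log_probs) * returns).sum()   # minimizing this ascends the expected return

# Toy usage: three branching steps with fake log-probabilities and rewards.
log_probs = [torch.tensor(-0.5, requires_grad=True) for _ in range(3)]
loss = policy_gradient_loss(log_probs, rewards=[-1.0, -1.0, -1.0])
loss.backward()
print(loss.item())
```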

5 From Robust Optimization to Instance Augmentation

Given a branching policy $\pi$ and a training instance distribution $p(G)$, we formulate distributionally robust training as the following optimization problems:

$$\min_{\pi}\sup_{\tilde{p}}\ \mathbb{E}_{\tilde{G}\sim\tilde{p}(G)}L(\pi,\tilde{G},\mathcal{T}),\quad \text{s.t. }\ \mathcal{W}_{\mathcal{G}}(p,\tilde{p})\leq\rho; \qquad \text{(IL)}$$
$$\min_{\pi}\sup_{\tilde{p}}\ \mathbb{E}_{\tilde{G}\sim\tilde{p}(G)}R(\pi,\tilde{G},\mathcal{T}),\quad \text{s.t. }\ \mathcal{W}_{\mathcal{G}}(p,\tilde{p})\leq\rho, \qquad \text{(RL)}$$

where $\mathcal{W}_{\mathcal{G}}(p,\tilde{p})$ is the Wasserstein distance that measures the distance between the distributions $p(G)$ and $\tilde{p}(G)$. Specifically, we define $\mathcal{W}_{\mathcal{G}}(p,\tilde{p}):=\inf_{\mu}\mathbb{E}_{\mu}\left[c(G,\tilde{G})\right]$, where $\mu$ is a coupling of $p(G)$ and $\tilde{p}(G)$. We let $F(\pi,G,\mathcal{T})$ represent $L$ or $R$ to simplify the notation. However, computing the supremum over the environment variable is intractable, so we use a Lagrangian relaxation with a penalty $\beta$ instead:

$$\min_{\pi}\sup_{\tilde{p}}\ \mathbb{E}_{\tilde{G}\sim\tilde{p}(G)}F(\pi,\tilde{G},\mathcal{T})-\beta\,\mathcal{W}_{\mathcal{G}}(p,\tilde{p}).$$

[43] shows that we can define a surrogate loss function

$$\varphi(\pi,G,\mathcal{T}):=\sup_{\tilde{G}\in\mathcal{G}}\left[F(\pi,\tilde{G},\mathcal{T})-\beta\,c(G,\tilde{G})\right]$$

such that the gradient of the surrogate function satisfies

$$\nabla_{\theta}\varphi(\pi_{\theta},G,\mathcal{T})=\nabla_{\theta}F(\pi_{\theta},G^{*},\mathcal{T}),$$

where the augmented instance $G^{*}$ is

$$G^{*}=\operatorname*{arg\,max}_{\tilde{G}\in\mathcal{G}}\left[F(\pi_{\theta},\tilde{G},\mathcal{T})-\beta\,c(G,\tilde{G})\right].$$

This implies that to compute the gradient of $\varphi$, we only need to compute the gradient of the objective on the augmented instances $G^{*}$. Thus, we turn to finding an effective augmentation strategy.
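For illustration, if one already has a small pool of candidate augmentations, $G^{*}$ can be approximated by scoring each candidate with the penalized objective; both helper functions below are hypothetical placeholders, not part of AdaSolver's interface.

```python
def select_augmented_instance(policy, instance, candidates, loss_fn, cost_fn, beta=0.01):
    """Approximate G* = argmax_{G~} [F(pi, G~, T) - beta * c(G, G~)] over a finite candidate pool.

    `loss_fn(policy, graph)` estimates F from samples collected on the augmented instance and
    `cost_fn(g, g_tilde)` is the graph discrepancy c; both are assumed, hypothetical helpers.
    """
    scores = [loss_fn(policy, g) - beta * cost_fn(instance, g) for g in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores[best]
```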

We first specify the optimal policy $\pi^{*}$ under the testing distribution, which is the best-generalizing policy in the hypothesis policy class $\Pi$, i.e.,

$$\pi^{*}=\operatorname*{arg\,min}_{\pi\in\Pi}\ \mathcal{J}_{p^{\prime}}(\pi),\qquad \mathcal{J}_{p^{\prime}}(\pi)=\mathbb{E}_{G^{\prime}\sim p^{\prime}}\left[F(\pi,G^{\prime},\mathcal{T})\right]. \qquad (4)$$

In practice, we can only access a finite set of instances $\mathcal{D}=\{G_{i}\}_{i=1}^{n}$ sampled from the distribution to estimate the objective, and we obtain the best policy $\hat{\pi}$ on the dataset,

$$\hat{\pi}=\operatorname*{arg\,min}_{\pi\in\Pi}\ \hat{\mathcal{J}}_{p}(\pi), \qquad (5)$$

where the empirical risks are

$$\hat{\mathcal{J}}_{p}(\pi)=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{T}\sum_{t=0}^{T-1}\ell(\pi(\mathbf{s}^{i}_{t}),a^{i}_{E,t}); \qquad \text{(IL)}$$
$$\hat{\mathcal{J}}_{p}(\pi)=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{T}\sum_{t=0}^{T-1}-\gamma^{t}r(\mathbf{s}^{i}_{t}). \qquad \text{(RL)}$$

We then derive the objective using data augmentation. We denote the distribution of augmented instances by $\Phi(p)$ and the best policy on the augmented dataset by $\hat{\pi}_{\Phi}$,

$$\hat{\pi}_{\Phi}=\operatorname*{arg\,min}_{\pi\in\Pi}\ \hat{\mathcal{J}}_{\Phi(p)}(\pi), \qquad (6)$$

in which $\hat{\mathcal{J}}_{\Phi(p)}(\pi)$ is the empirical risk computed on the augmented instances.

6 AdaSolver

In this section, we present the robust optimization framework for learning-based exact MILP solvers using adversarial instance augmentation, namely Adversarial Instance Augmentation for the exact MILP Solver (AdaSolver).

Figure 2: Overview of AdaSolver framework. AdaSolver jointly optimizes a learning-based branching policy and an adversarial graph augmenter. In each training epoch, we transform a subset of MILP instances into instance bipartite graphs and send them to the augmenter. Then the augmenter decides which nodes and edges to mask. After that, we transform the augmented graphs back to instances. We also leverage a discriminator for data selection. To train the branching policy, we collect branching samples from the original and selected augmented MILP instances. The branching policy records the average loss over samples generated from each augmented instance and returns the loss to compute reward signals for the augmenter. For training efficiency, we use a replay buffer to perform experience replay for the augmenter.

6.1 Adversarial Augmentation

The overview of the proposed framework on the learn-to-branch task is depicted in Figure 2, which consists of two components: a learning-based branching policy network and an adversarial augmenter. The adversarial augmenter employs data augmentation to promote the diversity of graph structures in training data, while the learning-based branching policy network tries to minimize the training loss on samples from both the original and augmented instances to extract task-relevant features across distributions.

6.1.1 Instance Graph Augmentation Operators

Inspired by the fact that the MILP problem structures are encoded in the instance graph structures, we propose to augment the instances by masking variables, constraints, and edges in the instance bipartite graphs to promote the diversity of graph structures. These operators are also reasonable from a real-world perspective. Notice that many MILP problems arise from real-world scheduling scenarios. These problems often feature variables that represent available workers, machines, vehicles, or other material resources, and they feature constraints that formulate the restrictive scheduling relationships among these resources. Therefore, these simple and reliable augmentation operators can also simulate perturbations of real-world resources, i.e., certain materials become inaccessible or some restrictions are lifted. Moreover, recent studies have shown that GNNs struggle to process certain “foldable” MILPs, as stated in [44]. In real-world scenarios, CO instances are highly structured and symmetric, resulting in numerous such structures in the training data that GNNs are unable to distinguish. This issue exacerbates the challenges posed by the limited diversity of training data. By augmenting the graph structures, AdaSolver effectively breaks the symmetric structures in training instances and promotes the diversity of instance graph structures. Thus, instance augmentation can improve not only the generalization ability of the branching policy but also the sample efficiency, since we can use fewer instances to attain comparable performance, as we show in Section 8.3.

In practice, we control the usage of the augmentation operators and the masking proportion so that the augmented instances preserve a similar problem structure and computational hardness to the original instances, preventing harmful effects on training.
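A minimal NumPy sketch of the three masking operators on the bipartite representation is given below; the uniform sampling and the fixed masking proportions are only for illustration (AdaSolver learns which elements to mask), and the index remapping of surviving nodes is omitted for brevity.

```python
import numpy as np

def mask_instance_graph(constraint_feats, variable_feats, edge_index, edge_weight,
                        p_con=0.05, p_var=0.05, p_edge=0.05, seed=0):
    """Randomly mask a small fraction of constraints, variables, and edges (index remap omitted)."""
    rng = np.random.default_rng(seed)
    keep_con = rng.random(len(constraint_feats)) >= p_con
    keep_var = rng.random(len(variable_feats)) >= p_var
    keep_edge = rng.random(len(edge_index)) >= p_edge
    # An edge also disappears if either of its endpoints was masked.
    keep_edge &= keep_con[edge_index[:, 0]] & keep_var[edge_index[:, 1]]
    return (constraint_feats[keep_con], variable_feats[keep_var],
            edge_index[keep_edge], edge_weight[keep_edge])
```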

6.1.2 Contextual Bandit for Augmentation

The instance augmenter is a function $\Phi:\mathcal{G}\to\mathcal{G}$ mapping an instance bipartite graph $G$ to the augmented instance graph $\Phi(G)$ by applying augmentation operators to $G$. However, two challenges for instance augmentation remain to be addressed. First, applying structural augmentation to a bipartite graph is generally a non-differentiable process. It is thus intractable to train the adversarial augmenter, since the non-differentiable mapping $\Phi$ poses challenges for backward propagation and gradient-based updates fail. Second, the training efficiencies of the augmenter and the branching policy network differ. The training of the augmenter is much slower than that of the branching policy network since the augmenter receives much less training data. The adversarial augmenter operates on MILP instance graphs $\{G_{k}\}_{k}$ to obtain augmented instances $\{\Phi(G_{k})\}_{k}$. The expert policy then runs on each augmented instance to generate a sequence of branching samples containing solving states and actions $\{(\mathbf{s},\pi_{E}(\mathbf{s}))\}$ in the IL setting, or the agent runs on the augmented instances to collect samples $\{(\mathbf{s},a,r(\mathbf{s},a))\}$ in the RL setting to calibrate the branching policy. Consequently, the adversarial augmenter receives significantly less training data $\{\Phi(G_{k})\}_{k}$ than the branching policy. The issues of non-differentiability and unsatisfactory training efficiency require a sample-efficient training strategy under sparse supervision signals. Therefore, we resort to a contextual bandit version of the proximal policy optimization (PPO) algorithm [17].

To formulate graph augmentation as a contextual bandit problem, we view the solver as the environment and the adversarial augmenter as the agent. A contextual bandit problem can be represented by the tuple $(\mathcal{G},\mathcal{A}^{\text{CB}},r^{\text{CB}})$. To avoid confusion with the branching MDP, we use the superscript CB for the contextual bandit formulation. We specify the state space $\mathcal{G}$, the action space $\mathcal{A}^{\text{CB}}$ and the reward function $r^{\text{CB}}:\mathcal{G}\times\mathcal{A}^{\text{CB}}\to\mathbb{R}$ for a MILP instance as follows.

  • The state space $\mathcal{G}$, also known as the context space, is the set of graph representations of MILP instances.

  • An action $a^{\text{CB}}=(V^{\text{CB}},E^{\text{CB}},W^{\text{CB}})\in\mathcal{A}^{\text{CB}}$, where $V^{\text{CB}}$, $E^{\text{CB}}$ and $W^{\text{CB}}$ are the subsets of variables, edges, and constraints to mask, respectively. The agent applies action $a^{\text{CB}}$ to an instance graph $G$ to obtain an augmented instance graph $\Phi(G)$. After that, we collect branching samples on it.

  • The reward $r^{\text{CB}}(G,a^{\text{CB}})$ is defined as the average loss of the branching policy on the samples collected from the graph $\Phi(G)$ augmented from $G$.

We describe the training process for the augmenter policy network $\Phi_{\eta}$ with parameters $\eta$ under the contextual bandit formulation as follows. In practice, the augmenter policy network takes as input an instance graph and outputs the masking probability of each node and edge. We select an action and record its predicted probability, i.e., $a^{\text{CB}}=\operatorname*{arg\,max}_{a}\Phi_{\eta}(a|G)$. We leverage a learnable state-value function $V_{\alpha}:\mathcal{G}\to\mathbb{R}$ with parameters $\alpha$ to calculate the variance-reduced estimate of the advantage function $\hat{A}=r^{\text{CB}}-V_{\alpha}(G)$. To improve the training efficiency of the augmenter, we use a replay buffer $\mathcal{B}$ consisting of tuples $(G,\hat{A},\Phi_{\eta}(a^{\text{CB}}|G))$ under the historical augmentation policies to perform experience replay. We use $\lambda(\eta)=\frac{\Phi_{\eta_{\text{current}}}(a^{\text{CB}}|G)}{\Phi_{\eta_{\text{old}}}(a^{\text{CB}}|G)}$ to denote the probability ratio between the current policy and the old policies sampled from the buffer. Finally, we train the adversarial augmenter, which includes the policy network and the state-value function, by minimizing

$$L_{V}(\alpha)=\mathbb{E}_{\mathcal{B}}\left[\hat{A}^{2}\right] \quad\text{ and } \qquad (7)$$
$$L_{\Phi}(\eta)=\mathbb{E}_{\mathcal{B}}\left[\min\bigl(\lambda(\eta),\operatorname{clip}(\lambda(\eta),1-\epsilon,1+\epsilon)\bigr)\hat{A}\right],$$

where $\epsilon$ is the clip ratio. We show in Table VII that the sample-efficient bandit algorithm plays a vital role in the overall training process.
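A hedged sketch of the clipped update in (7), written in the spirit of PPO [17], is given below; the tensor layout of the replay-buffer tuples and the network interfaces are assumptions.

```python
import torch

def augmenter_update(new_log_probs, old_log_probs, rewards, values, clip_eps=0.2):
    """One clipped policy/value update from buffer tuples (G, A_hat, Phi_eta(a^CB|G))."""
    advantages = rewards - values                         # A_hat = r^CB - V_alpha(G)
    ratio = torch.exp(new_log_probs - old_log_probs)      # lambda(eta), from stored log-probs
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * advantages.detach(),
                             clipped * advantages.detach()).mean()   # clipped surrogate for L_Phi
    value_loss = (advantages ** 2).mean()                 # L_V(alpha) in (7)
    return policy_loss, value_loss

# Toy usage on a buffer of four augmented instances.
pl, vl = augmenter_update(torch.randn(4), torch.randn(4),
                          torch.rand(4), torch.rand(4, requires_grad=True))
print(pl.item(), vl.item())
```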

6.1.3 Discriminator for Data Augmentation

The adversarial augmenter tends to produce instances quite different from the original distribution, leading to some extremely easy or hard instances. This is because the training objective is to maximize the loss of the GNN on the augmented data, but the augmenter does not receive any restrictions on the augmentation. Thus, the augmenter may ignore the distribution information and just produce some hard OOD instances, which might break the MILP problem structures. To address this problem, we leverage a discriminator network $D_{\psi}$ to identify and remove those instances that severely deviate from the original distribution.

We instantiate the discriminator using a graph convolutional encoder and an MLP decoder. The graph encoder extracts the structural information of the MILP instance and the decoder predicts the probability that the instance comes from the original distribution. To train the discriminator, we minimize the objective

$$\min_{\psi}\ L_{D}(\psi)=\mathbb{E}_{G}\left[D_{\psi}(\Phi(G))\right]+\mathbb{E}_{G}\left[1-D_{\psi}(G)\right].$$

At inference time, the discriminator takes as input the augmented graph; the instance is used for training the GNN when $D(G)>0.5$ and discarded otherwise. Moreover, we use an additional regularized reward for the augmenter, i.e.,

$$r^{\text{CB}}=F(\pi_{\theta},\Phi(G),\mathcal{T})-\beta D(\Phi(G))$$

to control the augmentation, where $\beta=0.01$ is a constant.
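The filtering rule and the regularized bandit reward can be sketched as follows; the discriminator is passed in as an opaque callable, and this wrapper is an illustrative assumption rather than the paper's implementation.

```python
import torch

def filter_and_reward(discriminator, branching_loss, augmented_graph, beta=0.01):
    """Keep an augmented instance only if D(G) > 0.5, and regularize the bandit reward."""
    score = discriminator(augmented_graph)          # predicted probability of being in-distribution
    keep = score.item() > 0.5
    reward = branching_loss - beta * score.item()   # r^CB = F(pi_theta, Phi(G), T) - beta * D(Phi(G))
    return keep, reward

# Toy usage with a stub discriminator that always outputs 0.7.
print(filter_and_reward(lambda g: torch.tensor(0.7), branching_loss=1.3, augmented_graph=None))
```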

We provide the pseudo-code for the AdaSolver framework in Algorithm 1.

INITIALIZE: learning-based branching policy $\pi_{\theta}$, adversarial graph augmenter $(\Phi_{\eta},V_{\alpha},D_{\psi})$, CO instances $\mathcal{D}$, original training dataset $\mathcal{D}_{\text{train}}$, number of training epochs $N$, learning rate $\kappa$.
for $N$ epochs do
       Randomly sample $K$ MILPs $\{G_{k}\}_{k=1}^{K}$ from $\mathcal{D}$
       for $k=1,\dots,K$ do
             Perform augmentation on $G_{k}$ with policy $\Phi_{\eta}$ to obtain the augmented instance $\Phi(G_{k})$
             Train $\pi_{\theta}$ using IL or RL and record the average loss $r^{\text{CB}}_{k}$ on $\Phi(G_{k})$
             Record $(G_{k},\hat{A}_{k},\Phi_{\eta}(a_{k}^{\text{CB}}|G_{k}))$ in the buffer
       Train $\pi_{\theta}$ using IL or RL on $\mathcal{D}$
       Compute $L_{\Phi}(\eta)$, $L_{V}(\alpha)$ and $L_{D}(\psi)$ over the replay buffer
       Update the parameters $\eta$, $\alpha$ and $\psi$
OUTPUT: parameters of the branching policy $\theta^{*}$
Algorithm 1 Training of AdaSolver.
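The Python outline below mirrors Algorithm 1; every component interface (augmenter, discriminator, sample collection, optimizer updates) is a hypothetical placeholder, so this is a structural sketch rather than the released implementation.

```python
import random

def train_adasolver(branching_policy, augmenter, discriminator, instances,
                    n_epochs, batch_size, collect_and_train, update_augmenter):
    """Structural sketch of Algorithm 1 with hypothetical component interfaces."""
    buffer = []
    for _ in range(n_epochs):
        batch = random.sample(instances, batch_size)                 # sample K MILPs from D
        for G in batch:
            G_aug, log_prob = augmenter.augment(G)                   # apply masking policy Phi_eta
            if discriminator.keeps(G_aug):                           # drop far-OOD augmentations
                loss = collect_and_train(branching_policy, [G_aug])  # IL/RL step; avg loss is r^CB
                advantage = loss - augmenter.value(G)                # A_hat = r^CB - V_alpha(G)
                buffer.append((G, advantage, log_prob))
        collect_and_train(branching_policy, batch)                   # IL/RL step on original data
        update_augmenter(augmenter, discriminator, buffer)           # minimize L_Phi, L_V and L_D
    return branching_policy
```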

7 Theoretical Analysis for Adversarial Augmentation Framework

In this section, we provide a detailed theoretical analysis of the proposed instance augmentation technique. We analyze the generalization bound for data augmentation in both the IL and RL settings. We suppose that the augmented instances are drawn from a perturbed distribution $\Phi(p)$.

7.1 IL Setting

Since the branching samples can be shuffled and collected by interval sampling, the transition function $\mathcal{T}$ may have little effect on the training dataset. We thus omit the transition $\mathcal{T}$ in the IL setting and view the collected branching samples as drawn independently from the same distribution. For the hypothesis policy class $\Pi$, we use the classical Rademacher complexity $\mathfrak{R}_{m}(\ell\circ\Pi)$,

$$\mathfrak{R}_{m}(\ell\circ\Pi)=\mathbb{E}_{\mathcal{D}}\mathbb{E}_{\sigma}\left[\sup_{\pi\in\Pi}\frac{1}{K}\sum_{k=1}^{K}\sigma_{k}\,\ell(\pi(\mathbf{s}_{k}),\pi_{E}(\mathbf{s}_{k}))\right],$$

and its counterpart $\mathfrak{R}_{m}^{\Phi}(\ell\circ\Pi)$ under instance augmentation,

$$\mathfrak{R}_{m}^{\Phi}(\ell\circ\Pi)=\mathbb{E}_{\mathcal{D}}\mathbb{E}_{\sigma}\left[\sup_{\pi\in\Pi}\frac{1}{K}\sum_{k=1}^{K}\sigma_{k}\,\ell(\pi(\tilde{\mathbf{s}}_{k}),\pi_{E}(\tilde{\mathbf{s}}_{k}))\right],$$

to bound the generalization gap, where $\mathbf{s}\sim D_{\pi_{E},p}$ and $\tilde{\mathbf{s}}\sim D_{\pi_{E},\Phi(p)}$. Furthermore, to simplify the notation, we denote the quantity $\sqrt{\frac{\log(1/\delta)}{2k}}$ by $\Omega(k)$.

Now we derive the IL generalization bound between the empirical risk on the training dataset and the expected risk on the testing distribution. The classical proof of the generalization bound with Rademacher complexity can be found in [45], and the in-distribution generalization bound for data augmentation in [46]. We now derive the OOD version of the generalization bound under data augmentation in the IL setting (9).

Theorem 1.

Suppose that the loss function $\ell(\cdot,\cdot)$ is bounded by 1 and is $\frac{1}{2}$-Lipschitz continuous. With probability at least $1-\delta$ and for any $\pi\in\Pi$, we have the following generalization bounds in the out-of-distribution imitation learning setting.

  • The generalization bound for empirical risk minimization:

$$\mathcal{J}_{p^{\prime}}(\hat{\pi})-\hat{\mathcal{J}}_{p}(\hat{\pi})\leq \mathcal{W}_{\|\cdot\|}(D_{\pi_{E},p}(\mathbf{s}),D_{\pi_{E},p^{\prime}}(\mathbf{s})) +2\mathfrak{R}_{K}(\ell\circ\Pi)+\Omega(K). \qquad (8)$$
  • The generalization bound for data augmentation:

$$\mathcal{J}_{p^{\prime}}(\hat{\pi})-\hat{\mathcal{J}}_{\Phi(p)}(\hat{\pi})\leq \mathcal{W}_{\|\cdot\|}(D_{\pi_{E},\Phi(p)}(\mathbf{s}),D_{\pi_{E},p^{\prime}}(\mathbf{s})) +2\mathfrak{R}_{K}^{\Phi}(\ell\circ\Pi)+\Omega(K). \qquad (9)$$

The notation $D_{\pi,\tilde{p}}(\mathbf{s})$ denotes the distribution over branching samples collected from instances drawn from distribution $\tilde{p}$ using policy $\pi$, and $\mathcal{W}_{\|\cdot\|}$ is the Wasserstein distance between two distributions with respect to the norm $\|\cdot\|$ over the branching sample space. Furthermore, we suppose that instance augmentation brings the training distribution closer to the testing distribution, i.e., for some $\alpha<1$,

$$\mathcal{W}_{\|\cdot\|}(D_{\pi_{E},\Phi(p)},D_{\pi_{E},p^{\prime}})\leq\alpha\,\mathcal{W}_{\|\cdot\|}(D_{\pi_{E},p},D_{\pi_{E},p^{\prime}}).$$

For $\alpha\leq 1/3$, the generalization bound in Equation (9) is lower than that in Equation (8), that is,

$$\mathcal{W}_{\|\cdot\|}(D_{\pi_{E},\Phi(p)}(\mathbf{s}),D_{\pi_{E},p^{\prime}}(\mathbf{s}))+2\mathfrak{R}_{K}^{\Phi}(\ell\circ\Pi) \leq \mathcal{W}_{\|\cdot\|}(D_{\pi_{E},p}(\mathbf{s}),D_{\pi_{E},p^{\prime}}(\mathbf{s}))+2\mathfrak{R}_{K}(\ell\circ\Pi). \qquad (10)$$

Thus, Theorem 1 shows that data augmentation is beneficial for out-of-distribution generalization settings.

7.2 RL Setting

We now explore the theory of instance augmentation in the OOD RL setting. The transition function $\mathcal{T}$ in the RL setting cannot be omitted since we use the parameterized policy to collect branching samples sequentially. We thus need to take this sequential nature into account when analyzing generalization. We denote the Rademacher complexity for the policy class $\Pi$ by

$$\mathfrak{R}_{n}(R\circ\Pi)=\mathbb{E}_{\mathcal{D}}\mathbb{E}_{\sigma}\left[\sup_{\pi\in\Pi}\frac{1}{n}\sum_{j=1}^{n}\sigma_{j}R(\pi,\mathcal{T},G_{j})\right].$$

For instance augmentation,

$$\mathfrak{R}_{n}^{\Phi}(R\circ\Pi)=\mathbb{E}_{\mathcal{D}}\mathbb{E}_{\sigma}\left[\sup_{\pi\in\Pi}\frac{1}{n}\sum_{j=1}^{n}\sigma_{j}R(\pi,\mathcal{T},\Phi(G_{j}))\right].$$

We derive the following generalization bound and generalization error for the RL setting in Theorem 2.

Notice that the sample distribution $D_{\pi,G,\mathcal{T}}$ is closely related to $\pi$ and changes with different $\pi$, posing difficulties in analyzing the generalization gap. Thus, we consider the reparameterizable RL proposed in [47]. Notice that given an instance $G$ and a deterministic policy $\pi$, the solving process is deterministic since the transition $\mathcal{T}$ is deterministic. Reparameterizable RL supposes that we can use a random variable $\xi$ to model the random noise and employ the reparameterization trick such that

$$\mathbb{E}_{G\sim p(G)}R(\pi,\mathcal{T},G)=\mathbb{E}_{\xi\sim p(\xi)}R(\pi,\mathcal{T},\mathcal{I},\xi).$$

The RL process thus samples from a fixed noise distribution $p(\xi)$ in the training and testing environments to produce instances $G=\mathcal{I}(\xi)$, and models the distributional shifts through different initialization functions. This provides significant technical convenience for our analysis.

Theorem 2.

Consider solvers that can be formulated as reparameterizable RL with horizon $T$ and parameter $\xi$. Suppose that the expected accumulated cost function $R$ is normalized and bounded by 1, and that the state reward function $r:\mathcal{S}\to\mathbb{R}$ is $\text{Lip}_{r}$-Lipschitz. With probability at least $1-\delta$, we have

$$\mathcal{J}_{p^{\prime}}(\hat{\pi}_{\Phi})-\mathcal{J}_{p^{\prime}}(\pi^{*})\leq 2\mathfrak{R}_{n}^{\Phi}(R\circ\Pi)+2\Omega(n) +2\sum_{t=0}^{T-1}\text{Lip}_{r}\bigl(\text{Lip}_{\mathcal{T}_{2}}\text{Lip}_{\pi}+\text{Lip}_{\mathcal{T}_{1}}\bigr)^{t}\,\mathcal{W}_{\mathcal{G}}(\Phi(p),p^{\prime}). \qquad (11)$$

Similar to Theorem 1, the bound in (11) is lower than the corresponding bound without instance augmentation.

8 Experiments

In this section, we conduct comparative experiments on multiple well-recognized benchmarks to demonstrate the effectiveness of our proposed AdaSolver framework in terms of solving efficiency and sample efficiency. We also perform ablation studies and extended studies to provide further insight into AdaSolver.

8.1 Experiment Setup

8.1.1 Benchmarks

We evaluate the performance of our proposed AdaSolver on four widely used benchmarks of NP-hard synthetic problem families and a challenging real-world Anonymous dataset. The synthetic datasets include set covering, combinatorial auctions, capacitated facility location, and maximum independent set. We follow the conventional data generation settings in [10], [26] and [27]. For each benchmark, we generate 10000 training instances and 2000 validation instances for AdaSolver. For the final evaluation, we generate an in-distribution testing set $D_{1}$ of 32 instances, and five transfer sets $D_{2}$-$D_{6}$ of 32 instances each with larger scales. We emphasize that the instance scales in the transfer datasets increase in order and the maximum scale is much larger than that in previous works. The Anonymous dataset is a well-recognized real-world dataset consisting of a curated subset of the challenging MILPLIB library. It consists of large-scale instances, with 98 for training, 20 for validation and 20 for testing. The instances in the Anonymous dataset have an average of 49603 constraints and 37881 variables.

We report the instance scales for each benchmark in Table I. Due to limited space, more information about the benchmarks and instance generation for the synthetic datasets can be found in Appendix C.

TABLE I: Statistical information for the four synthetic benchmarks. Notice that the largest distributions are at least five times larger than the training distributions, which is much larger than in previous works.
Set Covering Combinatorial Auctions
$D_{1}$ $D_{2}$ $D_{3}$ $D_{4}$ $D_{5}$ $D_{6}$ $D_{1}$ $D_{2}$ $D_{3}$ $D_{4}$ $D_{5}$ $D_{6}$
Constraint number 500 1000 2000 4000 3000 8000 193.03 386.25 576.75 771.65 963.56 1156.59
Variable number 1000 1000 1000 1000 1000 1000 500 1000 1500 2000 2500 3000
Capacitated Facility Location Maximum Independent Set
$D_{1}$ $D_{2}$ $D_{3}$ $D_{4}$ $D_{5}$ $D_{6}$ $D_{1}$ $D_{2}$ $D_{3}$ $D_{4}$ $D_{5}$ $D_{6}$
Constraint number 10201 20301 40501 60701 80901 161701 1952.34 3946.56 5942.34 7936.40 9934.12 11931.65
Variable number 10100 20100 40100 60100 80100 160100 500 1000 1500 2000 2500 3000

8.1.2 Metric

We report the average solving time (Time, lower is better) and primal-dual integral (PD integral, lower is better) for different branching approaches for the comparative experiments. For more details on the PD integral metric, please see Appendix A.3. Moreover, we provide more results in terms of another two well-recognized metrics, i.e., the average branching nodes (Node) and the average primal-dual gap (PD gap, lower is better) in Appendix E.1.

Following [12], we mainly focus on the Time metric on the distributions where the solver can solve all the instances within the time limit, since the solving time is the most direct metric of solving efficiency. However, the average solving time is almost the same across methods on the distributions where the solvers cannot solve most of the instances within the time limit. As a result, it is hard to tell which method performs best, and the solving time metric is not suitable on these distributions. We therefore focus on another well-recognized metric, the primal-dual integral, on these distributions. Given a time limit, a lower primal-dual integral implies a better quality of the feasible solutions found by the solver. Specifically, we focus on the Time metric on $D_{1}$-$D_{2}$ for the synthetic benchmarks and on $D_{3}$ for the combinatorial auctions and capacitated facility location benchmarks. We focus on the primal-dual integral on the other distributions.
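The exact definition of the PD integral is deferred to Appendix A.3; purely as a rough illustration, one common way to accumulate a primal-dual integral from a solving trace is sketched below, and this step-wise rule is an assumption rather than the metric used in the paper.

```python
def primal_dual_integral(times, primal_bounds, dual_bounds):
    """Accumulate the primal-dual gap over time with a step-wise (left-endpoint) rule.

    `times` holds increasing timestamps, and the bound lists hold the incumbent primal
    and dual bounds recorded at those timestamps.
    """
    integral = 0.0
    for k in range(len(times) - 1):
        gap = abs(primal_bounds[k] - dual_bounds[k])
        integral += gap * (times[k + 1] - times[k])
    return integral

# Toy trace over a 900s limit: the gap shrinks from 10 to 2.
print(primal_dual_integral([0.0, 300.0, 900.0], [10.0, 6.0, 4.0], [0.0, 2.0, 2.0]))  # 5400.0
```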

The Node metric reflects the size of B&B searching trees. When the solver solves out an instance, fewer explored nodes imply a more efficient searching policy.

8.1.3 Baselines

Our baselines include a SCIP branching rule and six learning-based branchers, containing competitive IL-based and RL-based approaches. The SCIP branching rule is the traditional state-of-the-art reliability pseudo-cost (RPB) branching heuristic. We then introduce the learning-based branching baselines. SVMRank [9] uses a support vector machine for learning to select the branching variables. Trees [48] is a branching policy using ExtraTrees. LMart [49] is based on LambdaMART. tMDP+DFS is a reinforcement learning method based on tree MDP [11]. TreeGate is a tree search branching approach [36] trained with imitation learning. GNN is the graph neural network approach for learning to branch [10]. Finally, we build our AdaSolver framework on the IL-based GNN branching policy (GNN) [10] and the RL-based GNN branching policy (tMDP+DFS) [11], resulting in two methods, AdaSolver-IL and AdaSolver-RL. As stated in [11], the RL-based method performs worse than RPB or the IL-based baselines. For the RL-based setting, we thus mainly focus on the improvement of AdaSolver-RL over the underlying RL-based solvers.

8.1.4 Implementation

Throughout all experiments, the backend solver is the state-of-the-art open-source solver SCIP 7.0.3. The solving time limit on a testing instance is 900s. Similar to [10, 26, 27], we deactivate solver restarts and allow the cutting plane generation module only at the root node. We keep all the other SCIP parameters at their defaults so as to better observe the improvements from AdaSolver and to make the comparisons of these approaches as fair as possible.

For model training, we use Adam optimizer [50] to optimize the model parameters. When training AdaSolver, the adversarial augmenter randomly chooses training instances to perform augmentation in each training epoch. We then collect branching samples from the augmented instances. We choose the augmentation operators that perform the best for each benchmark. More details for these baselines and our AdaSolver are available in Appendix D.

8.2 Comparative Experiments on Solving Efficiency

8.2.1 Scale Generalization

TABLE II: The generalization results on multiple datasets. We report the arithmetic mean of solving time and PD integral. The metrics we mainly focus on are marked with \downarrow, and the lower values imply better performance. The best performance is marked in bold.
$D_{1}$ $D_{2}$ $D_{3}$ $D_{4}$ $D_{5}$ $D_{6}$
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
RPB 10.98 114.91 87.62 666.215 831.41 7152.10 900.00 11494.93 900.00 14757.09 900.00 23564.46
SVMRank 11.40 116.21 158.29 989.67 890.66 8546.88 900.03 13444.79 900.05 17018.42 900.05 24535.92
Trees 14.86 131.23 210.26 1238.87 899.87 8690.80 900.01 13226.00 900.02 16772.03 900.02 24599.38
LambdaMart 9.46 106.92 127.34 812.27 875.87 7797.14 900.01 12927.04 900.02 16668.55 900.03 24525.11
tMDP+DFS 20.73 159.59 444.38 2892.12 900.006 11559.79 900.00 15393.24 900.00 18384.06 900.00 24387.4
TreeGate 25.16 187.99 271.82 1790.70 900.00 10391.53 900.00 14389.55 900.00 17747.06 900.02 24118.51
GNN 9.25 116.22 95.57 716.25 844.74 7310.17 900.00 11899.40 900.00 16067.00 900.00 23786.51
AdaSolver-RL 18.87 152.26 433.97 2885.97 900.00 11330.87 900.00 14466.95 900.00 17550.63 900.00 24319.00
AdaSolver-IL 6.28 80.67 73.78 515.92 839.80 7079.13 900.00 10691.05 900.00 14452.71 900.00 21792.69
Set Covering
$D_{1}$ $D_{2}$ $D_{3}$ $D_{4}$ $D_{5}$ $D_{6}$
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
RPB 2.84 32.20 23.20 156.62 224.66 156.62 838.77 1622.46 900.00 2313.35 900.004 2657.51
SVMRank 2.59 26.32 51.12 194.46 427.01 625.01 881.33 1730.40 900.07 2293.06 900.00 2580.31
Trees 2.76 26.44 87.05 221.68 790.01 994.48 900.01 1782.20 900.01 2323.16 900.01 2535.18
LambdaMart 1.90 24.69 28.08 156.10 287.69 446.49 862.31 1563.85 900.02 2136.37 900.02 2455.08
tMDP+DFS 2.18 24.76 27.98 146.15 312.93 518.45 860.76 1871.73 900.00 2664.34 900.00 3103.08
TreeGate 5.75 37.74 84.44 303.53 623.25 1168.37 891.50 2486.65 900.00 3302.70 900.00 3771.89
GNN 2.20 30.48 20.30 167.60 195.68 384.17 819.28 1300.99 900.00 1931.54 900.00 2293.41
AdaSolver-RL 2.04 106.83 17.28 117.28 323.94 517.91 846.32 1685.69 900.00 2379.82 900.00 2980.48
AdaSolver-IL 1.63 22.44 16.34 117.13 178.32 314.32 817.57 1154.79 900.00 1716.55 900.00 2107.19
Combinatorial Auctions
$D_{1}$ $D_{2}$ $D_{3}$ $D_{4}$ $D_{5}$ $D_{6}$
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
RPB 65.67 170.34 287.52 719.26 689.45 3333.72 886.54 10762.83 892.90 20102.78 897.51 27527.24
SVMRank 56.98 274.04 207.16 1011.28 592.15 3913.06 884.87 13898.43 873.08 22201.25 887.00 33649.01
Trees 77.41 285.65 262.54 1038.46 652.15 3895.60 887.00 13649.01 878.81 22433.54 889.54 34909.74
LambdaMart 51.30 281.76 201.06 988.92 591.41 3927.82 884.29 13908.60 872.67 22455.29 886.84 34976.25
tMDP+DFS 73.35 275.38 270.71 1108.57 683.42 4253.40 886.07 12950.94 888.14 23675.80 893.70 34995.77
TreeGate 47.96 180.17 192.29 757.39 661.64 4238.80 873.98 10553.59 886.34 19564.65 897.63 27529.18
GNN 51.63 290.91 235.07 1107.29 660.25 4418.54 854.27 10297.25 868.93 19033.90 890.53 32302.74
AdaSolver-RL 64.10 257.10 233.64 892.77 587.02 3332.38 852.53 7779.28 849.93 14727.70 852.51 27298.25
AdaSolver-IL 34.51 196.56 147.47 738.36 555.66 3324.63 731.81 6644.35 819.31 13818.34 884.48 28231.23
Capacitated Facility Location
$D_{1}$ $D_{2}$ $D_{3}$ $D_{4}$ $D_{5}$ $D_{6}$
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
RPB 16.46 86.57 302.43 1045.88 883.79 3818.32 900.00 4634.32 900.00 5067.01 900.00 5683.46
SVMRank 18.37 76.17 456.19 879.49 883.18 2539.03 900.07 3640.36 900.09 4313.94 900.08 5105.40
Trees 26.04 91.01 701.12 1353.35 900.01 2828.39 900.02 3729.98 900.02 4353.62 900.03 5013.31
LambdaMart 12.80 61.79 445.80 826.64 877.65 2475.11 900.02 3570.20 900.02 4221.15 900.04 5042.00
tMDP+DFS 9.45 54.00 294.04 644.69 781.44 2201.37 897.91 3628.92 900.00 4613.32 900.01 5314.64
TreeGate 67.48 229.32 844.92 2711.83 900.03 3939.96 900.03 4520.56 900.01 4906.46 900.01 5299.72
GNN 11.09 65.73 254.12 558.31 817.04 2284.32 900.00 3532.03 900.01 4387.27 900.01 5168.77
AdaSolver-RL 8.43 51.46 282.54 627.81 803.75 2117.01 879.81 3318.08 900.02 4148.24 900.00 5099.92
AdaSolver-IL 7.39 44.56 210.36 460.80 806.78 2355.11 900.00 3591.38 900.00 4375.48 900.00 5152.28
Maximum Independent Set

We evaluate the generalization ability of AdaSolver and compare it to the other baselines on the four synthetic benchmarks (see Table II). According to Table II, our AdaSolver-IL consistently outperforms the baselines on the four synthetic benchmarks in terms of the Time or PD integral metric. AdaSolver-RL achieves a significant improvement over the underlying RL-based solver. Please refer to Appendix E.1 for results in terms of the Nodes and PD gap metrics.

On the easier distributions $D_{1}$ and $D_{2}$, AdaSolver-IL achieves the shortest solving time and outperforms all the baselines. GNN is a competitive baseline that outperforms the other learning-based baselines on most of the $D_{1}$ and $D_{2}$ distributions of the four benchmarks, while LambdaMart is another promising learning-based baseline with stable performance. Compared to GNN, AdaSolver-IL achieves a 27.66% improvement in solving time on $D_{1}$ and $D_{2}$, leading to a total of 39.06% acceleration over the RPB heuristic. AdaSolver-RL improves the solving time of tMDP+DFS by up to 12.12%. The results demonstrate that AdaSolver is effective in improving performance in in-distribution or mildly out-of-distribution situations.

On the harder distributions $D_{3}$-$D_{6}$, AdaSolver still significantly outperforms the other baselines with the lowest PD integral, leading to an improvement of 11.12% over GNN and 13.15% over tMDP+DFS. This demonstrates that AdaSolver can consistently enhance the generalization ability of the underlying solvers to instances with five to ten times larger scales.

We summarize the improvement of AdaSolver over the underlying learning-based solvers GNN and tMDP+DFS in Table III, demonstrating the effectiveness of AdaSolver.

TABLE III: Improvement of AdaSolver over the underlying solvers on the four synthetic benchmarks.
Benchmark Improvement (%)
AdaSolver-IL AdaSolver-RL
Set Covering 14.44 4.02
Combinatorial Auctions 14.12 10.95
Capacitated Facility Location 26.95 23.35
Maximum Independent Set 7.73 6.87

8.2.2 Perturbation Robustness (Real-World Dataset)

To evaluate the performance in real-world applications, we deploy AdaSolver on the large-scale real-world Anonymous dataset, a widely used subset of the MILPLIB library. The number of constraints in the Anonymous dataset ranges from 3375 to 159823, and the number of variables ranges from 1613 to 92261. The wide range of problem scales as well as the real-world perturbations pose significant challenges for learning-based methods. The results are summarized in Table IV, which demonstrates that AdaSolver is able to improve the performance of the underlying solvers in both the IL and RL settings. AdaSolver-IL reaches the best PD gap and AdaSolver-RL has the best PD integral. The performance of AdaSolver sheds light on generalizable neural solvers for real-world applications.

TABLE IV: The results in the real-world dataset show that AdaSolver significantly outperforms the underlying solvers.
Method Time(s) Nodes PD integral \downarrow PD gap \downarrow
RPB 866.68 43907.95 76018.92 5.50e+19
SVMRank 899.93 14519.66 65522.04 4.83e+19
Trees 900.02 9658.55 76143.65 4.66e+19
LambdaMart 830.45 16229.83 62415.32 4.66e+19
tMDP+DFS 763.81 40933.30 44679.11 5.00e+18
GNN 823.88 37469.58 45483.01 3.50e+18
AdaSolver-RL 754.08 46746.05 42179.25 5.00e+18
AdaSolver-IL 830.50 57156.53 42579.31 3.33e+18

8.3 Improving Sample Efficiency of the Branching Policy

We compare the performance of AdaSolver-IL and the underlying GNN solver trained on only 100 instances, i.e., one percent of the number of training instances used in Section 8.2.1. The results in Table V demonstrate that the underlying learning-based GNN solver suffers significant performance degradation when trained on fewer instances. Specifically, the performance of GNN decreases by 49.70% on the set covering benchmark and 76.45% on the combinatorial auctions benchmark. In this situation, AdaSolver-IL effectively relieves the performance degradation and outperforms GNN, which implies that AdaSolver is much more sample-efficient. In many real-world scenarios, only a few training instances are available, and this limited availability poses severe challenges for training learning-based solvers. Thus, the performance under few training instances is a strong indicator of practical applicability.

TABLE V: Sample efficiency of our proposed AdaSolver-IL. The GNN baseline faces significant performance degradation when the training instances are few. The results demonstrate that AdaSolver-IL significantly improves the sample efficiency.
D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
GNN 13.41 138.13 278.75 1694.85 900.00 10049.36 900.00 14158.38 900.00 16706.91 900.00 24062.04
AdaSolver-IL 7.35 85.06 237.97 1289.08 900.00 8840.34 900.00 12381.27 900.00 15240.56 900.00 22754.06
Set Covering
D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
GNN 3.28 35.51 40.05 185.77 577.08 820.29 900.00 1981.16 900.00 2582.09 900.00 3014.64
AdaSolver-IL 2.62 35.69 24.54 182.54 217.18 428.28 830.51 1349.77 900.00 1934.11 900.00 2326.64
Combinatorial Auctions

8.4 Ablation Studies

We conduct ablation studies on the four synthesis benchmarks to show the contribution of the adversarial augmentation strategy. Specifically, we compare AdaSolver with a random augmentation approach (RA) that removes the adversarial graph augmenter. Please see Table VI for the results in terms of the Time, Nodes and PD integral metrics. An interesting observation is that RA is slightly superior to AdaSolver in terms of solving time on the distribution D1D_{1} of the set covering, capacitated facility location and maximum independent set benchmarks, but AdaSolver consistently performs better on the other distributions and benchmarks. RA may be more effective at improving the in-distribution performance of the branching policy, but it fails to capture and produce the graph structures that are useful for generalization to larger instances. The results demonstrate that the adversarial graph augmenter plays an important role in finding the worst-case instances that benefit the generalization of the branching policy networks.

TABLE VI: Ablation study of adversarial augmentation strategy, where RA denotes the random strategy.
D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
RA 6.13 79.31 85.01 572.71 852.54 7299.03 900.00 15603.95 900.00 18496.45 900.00 24254.16
AdaSolver-IL 6.28 80.67 73.78 515.92 839.80 7079.13 900.00 10691.05 900.00 14452.71 900.00 21792.69
RA 18.75 149.36 436.97 2931.59 900.00 11646.88 900.00 15478.89 900.00 18691.82 900.00 25966.91
AdaSolver-RL 18.87 152.26 433.97 2885.97 900.00 11330.87 900.00 14466.95 900.00 17550.63 900.00 24319.00
Set Covering
D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
RA 1.70 22.93 22.90 127.69 464.55 662.24 829.49 1327.54 900.00 1889.85 900.00 2299.41
AdaSolver-IL 1.63 22.44 16.34 117.13 178.32 314.32 817.57 1154.79 900.00 1716.55 900.00 2107.19
RA 2.18 24.76 18.80 118.94 301.43 489.06 845.88 1751.81 900.00 2548.19 900.00 3247.74
AdaSolver-RL 2.04 106.83 17.28 117.28 323.94 517.91 846.32 1685.69 900.00 2379.82 900.00 2980.48
Combinatorial Auctions
D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
RA 33.24 215.65 140.40 752.18 564.12 3548.71 865.28 11576.80 871.82 18585.50 877.97 32716.55
AdaSolver-IL 34.51 196.56 147.47 738.36 555.66 3324.63 731.81 6644.35 819.31 13818.34 884.48 28231.23
RA 68.96 238.76 243.00 936.84 637.58 3953.49 877.48 8579.21 892.98 21496.21 893.97 39560.20
AdaSolver-RL 64.10 257.10 233.64 892.77 587.02 3332.38 852.53 7779.28 849.93 14727.70 852.51 27298.25
Capacitated Facility Location
D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
RA 7.26 45.85 534.59 996.39 871.08 2999.36 900.00 5684.19 900.00 6331.45 900.00 6833.94
AdaSolver-IL 7.39 44.56 210.36 460.80 806.78 2355.11 900.00 3591.38 900.00 4375.48 900.00 5152.28
RA 8.00 48.94 275.73 623.85 790.42 2153.45 890.97 3686.33 900.00 4687.60 900.00 5625.48
AdaSolver-RL 8.43 51.46 282.54 627.81 803.75 2117.01 879.81 3318.08 900.02 4148.24 900.00 5099.92
Maximum Independent Set

8.5 Extensive Studies

8.5.1 Different Training Methods of the Augmenter

We compare AdaSolver with a variant that uses the REINFORCE algorithm to train the augmenter (AdaSolver-REINFORCE) in Table VII. AdaSolver outperforms AdaSolver-REINFORCE on all distributions of the set covering and combinatorial auctions benchmarks. PPO is known to be much more sample-efficient than REINFORCE, and AdaSolver further improves performance by adopting this sample-efficient training method.
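For clarity, the sketch below contrasts the two gradient estimators in PyTorch under our notation; it is an illustrative snippet rather than our exact training code, and the tensor names are assumptions.

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO [17]; for a contextual bandit the
    advantage is simply the (baseline-subtracted) reward of the sampled mask."""
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def reinforce_loss(log_probs, returns):
    """Plain REINFORCE estimator used by AdaSolver-REINFORCE: one gradient step
    per sampled action, without ratio clipping or sample reuse."""
    return -(log_probs * returns).mean()
```

The clipping allows several gradient steps on the same batch of augmentation rollouts, which is the main source of the sample-efficiency gain discussed above.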

TABLE VII: Performance of AdaSolver-IL and AdaSolver-REINFORCE. The results show that AdaSolver-IL outperforms AdaSolver-REINFORCE in all distributions.
D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
AdaSolver-REINFORCE 1.79 24.66 26.65 151.13 493.92 737.42 882.36 1472.78 900.00 1864.20 900.00 2189.29
AdaSolver-IL 1.63 22.44 16.34 117.13 178.32 314.32 817.57 1154.79 900.00 1716.55 900.00 2107.19
Combinatorial Auctions
D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
AdaSolver-REINFORCE 6.99 91.23 104.30 712.53 872.44 7749.23 900.00 11105.89 900.00 14739.92 900.00 23590.80
AdaSolver-IL 6.28 80.67 73.78 515.92 839.80 7079.13 900.00 10691.05 900.00 14452.71 900.00 21792.69
Set Covering

8.5.2 Sensitivity Analysis on the Masking Proportion

We analyze the sensitivity of AdaSolver's performance to the masking proportion on the set covering and combinatorial auctions benchmarks and report the results in Table VIII. The results show that the performance of AdaSolver is not sensitive to the masking proportion and outperforms the default GNN policy as long as the proportion is at most 4%. We also observe that lower masking proportions tend to yield a better-learned policy, since they promote data diversity without significantly breaking the problem structures, thereby avoiding biased training data.

TABLE VIII: Sensitivity of the masking proportion. The results show that AdaSolver-IL is not sensitive to the masking proportion when the proportion is limited to 4%.
D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
AdaSolver-IL 0.01 1.63 22.44 16.34 117.13 178.32 314.32 817.57 1154.79 900.00 1716.55 900.00 2107.19
AdaSolver-IL 0.02 1.96 27.68 19.21 155.75 194.64 375.73 819.28 1254.26 900.00 1743.97 900.00 2051.01
AdaSolver-IL 0.03 1.92 26.14 17.95 132.11 202.41 360.52 820.05 1240.96 900.00 1720.49 900.00 2054.11
AdaSolver-IL 0.04 2.22 30.55 20.35 156.33 193.72 361.55 820.80 1209.92 900.00 1700.80 900.00 1968.02
AdaSolver-IL 0.05 2.36 32.73 19.78 161.43 196.50 384.36 823.93 1292.45 900.00 1769.48 900.00 2116.08
Combinatorial Auctions
D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Method Time(s) \downarrow PD integral Time(s) \downarrow PD integral Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow Time(s) PD integral \downarrow
AdaSolver-IL 0.01 6.28 80.67 73.78 515.92 839.80 7079.13 900.00 10691.05 900.00 14452.71 900.00 21792.69
AdaSolver-IL 0.02 7.09 87.37 90.46 671.65 872.50 7392.75 900.00 11415.87 900.00 14788.13 900.00 23364.03
AdaSolver-IL 0.03 7.05 87.98 90.39 672.11 730.26 7396.32 900.00 11282.07 900.00 15107.76 900.00 23151.86
AdaSolver-IL 0.04 9.05 98.56 99.42 737.68 869.89 8088.66 900.00 12128.56 900.00 15376.10 900.00 24009.53
AdaSolver-IL 0.05 9.20 98.94 107.61 794.04 876.45 8432.83 900.00 12479.55 900.00 16414.10 900.00 23640.26
Set Covering

8.5.3 Analysis on the GNN Embeddings in the IL setting

We use T-SNE [51] to visualize the learned embeddings of the GNN branching policy trained on the distribution D1D_{1}. The results on the set covering benchmark are illustrated in Figure 3, where different colors represent samples from different distributions of the set covering benchmark. To investigate the robustness to environmental perturbations, we select ten distributions with different instance scales and different proportions of nonzero entries in the constraint matrices. The instance bipartite graphs across distributions thus have different local graph structures.

We first observe that before training, the GNN embeddings of different distributions are separated from each other with little overlap (Figure 3(a)), indicating that the GNN model does not extract common task-relevant features across distributions of the same problem family. The scores predicted from these embeddings thus become unstable across distributions, and the generalization ability of the GNN model is poor. We observe in Figure 3(b) that after standard training the embeddings no longer scatter far apart but still appear loose. Finally, Figure 3(c) shows that our proposed AdaSolver framework leads to more concentrated and stable embeddings, providing evidence that AdaSolver is less sensitive to distributional shifts and perturbations.

Figure 3: T-SNE visualization of the GNN embeddings on the learn-to-branch task under the IL setting. Different colors represent samples from different distributions. (a) Embeddings before training. (b) Embeddings after standard training. (c) Embeddings of GNN under the AdaSolver framework.
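For reproducibility, a minimal sketch of how such a visualization can be produced is given below; it assumes the pooled GNN embeddings of each distribution have already been extracted into NumPy arrays, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def visualize_embeddings(embeddings_per_dist):
    """embeddings_per_dist: dict mapping a distribution name to an array of
    pooled GNN embeddings, one row per sampled branching state (assumed given)."""
    names = list(embeddings_per_dist)
    X = np.concatenate([embeddings_per_dist[n] for n in names], axis=0)
    labels = np.concatenate([[i] * len(embeddings_per_dist[n])
                             for i, n in enumerate(names)])
    Z = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X)
    plt.scatter(Z[:, 0], Z[:, 1], c=labels, cmap="tab10", s=5)
    plt.show()
```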

8.5.4 Analysis on the Imitation Accuracy

We report the imitation learning accuracy on different distributions of the set covering and combinatorial auctions benchmarks in Table IX. For the in-distribution D1D_{1}, AdaSolver-IL achieves comparable or higher imitation accuracy than GNN. An interesting observation is that on D4D6D_{4}-D_{6} of the set covering benchmark, the imitation accuracy of AdaSolver-IL is slightly lower than that of GNN, yet AdaSolver-IL consistently outperforms GNN on these distributions. Thus, the imitation learning accuracy is not a decisive factor of solving performance, and higher imitation accuracy does not necessarily imply better performance.

TABLE IX: Imitation learning accuracy on the testing distributions of the set covering and combinatorial auctions benchmarks.
Set Covering
D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
GNN 70.02 66.36 54.07 48.03 48.05 16.08
AdaSolver-IL 70.03 66.39 54.02 52.01 44.08 16.00
Combinatorial Auctions
D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
GNN 67.63 72.42 67.86 72.07 76.02 47.03
AdaSolver-IL 68.55 72.57 66.92 62.00 74.02 44.06

9 Conclusion

This paper focuses on the generalization and robustness issues of learning-based MILP solvers under restrictive training distributions. We observe empirically that the learned GNN embeddings are unstable across distributions on the learn-to-branch task, and we attribute this to the risk of overfitting to homogeneous graph structures. To tackle this challenge, we propose a distributionally robust optimization framework (AdaSolver) for neural solvers that promotes the diversity of instance graph structures in the training distribution via adversarial instance augmentation. Specifically, we formulate adversarial instance graph augmentation as a contextual bandit problem and employ a sample-efficient decision model to enable efficient training of the augmenter under insufficient training data. Experiments show that AdaSolver significantly improves the generalization of neural solvers on five well-recognized benchmarks.

Acknowledgments

The authors would like to thank all the anonymous reviewers for their insightful comments. This work was supported by the National Key R&D Program of China under contract 2022ZD0119801, the National Natural Science Foundation of China grants U19B2026, U19B2044, 61836011, 62021001, and 61836006, and the Fundamental Research Funds for the Central Universities grant WK3490000004.

References

  • [1] Z.-L. Chen, “Integrated production and outbound distribution scheduling: review and extensions,” Operations research, vol. 58, no. 1, pp. 130–148, 2010.
  • [2] K. Ma, L. Xiao, J. Zhang, and T. Li, “Accelerating an fpga-based sat solver by software and hardware co-design,” Chinese Journal of Electronics, vol. 28, no. 5, pp. 953–961, 2019.
  • [3] S. Liu, J. M. Pinto, and L. G. Papageorgiou, “A tsp-based milp model for medium-term planning of single-stage continuous multiproduct plants,” Industrial & Engineering Chemistry Research, vol. 47, no. 20, pp. 7733–7743, 2008.
  • [4] K. Bestuzheva, M. Besançon, W.-K. Chen, A. Chmiela, T. Donkiewicz, J. van Doornmalen, L. Eifler, O. Gaul, G. Gamrath, A. Gleixner et al., “The scip optimization suite 8.0,” arXiv preprint arXiv:2112.08872, 2021.
  • [5] L. Gurobi Optimization, “Gurobi optimizer reference manual,” 2021.
  • [6] A. Colorni, M. Dorigo, F. Maffioli, V. Maniezzo, G. Righini, and M. Trubian, “Heuristics from nature for hard combinatorial optimization problems,” International Transactions in Operational Research, vol. 3, no. 1, pp. 1–21, 1996.
  • [7] J. Hromkovič, Algorithmics for hard problems: introduction to combinatorial optimization, randomization, approximation, and heuristics.   Springer Science & Business Media, 2013.
  • [8] H. He, H. Daume III, and J. M. Eisner, “Learning to search in branch and bound algorithms,” Advances in Neural Information Processing Systems, vol. 27, 2014.
  • [9] E. Khalil, P. Le Bodic, L. Song, G. Nemhauser, and B. Dilkina, “Learning to branch in mixed integer programming,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016.
  • [10] M. Gasse, D. Chételat, N. Ferroni, L. Charlin, and A. Lodi, “Exact combinatorial optimization with graph convolutional neural networks,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [11] L. Scavuzzo, F. Chen, D. Chetelat, M. Gasse, A. Lodi, N. Yorke-Smith, and K. Aardal, “Learning to branch with tree mdps,” in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35.   Curran Associates, Inc., 2022, pp. 18 514–18 526. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2022/file/756d74cd58592849c904421e3b2ec7a4-Paper-Conference.pdf
  • [12] Z. Wang, X. Li, J. Wang, Y. Kuang, M. Yuan, J. Zeng, Y. Zhang, and F. Wu, “Learning cut selection for mixed-integer linear programming via hierarchical sequence model,” in The Eleventh International Conference on Learning Representations, 2023.
  • [13] Y. Bengio, A. Lodi, and A. Prouvost, “Machine learning for combinatorial optimization: a methodological tour d’horizon,” European Journal of Operational Research, vol. 290, no. 2, pp. 405–421, 2021.
  • [14] J. Sakuma and S. Kobayashi, “A genetic algorithm for privacy preserving combinatorial optimization,” in Annual Conference on Genetic and Evolutionary Computation, 2007.
  • [15] C. Wang, Z. Yu, S. McAleer, T. Yu, and Y. Yang, “Asp: Learn a universal neural solver!” arXiv preprint arXiv:2303.00466, 2023.
  • [16] S. Geisler, J. Sommer, J. Schuchardt, A. Bojchevski, and S. Günnemann, “Generalization of neural combinatorial solvers through the lens of adversarial robustness,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=vJZ7dPIjip3
  • [17] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347, 2017. [Online]. Available: http://arxiv.org/abs/1707.06347
  • [18] A. Lodi and G. Zarpellon, “On learning and branching: a survey,” Top, vol. 25, pp. 207–236, 2017.
  • [19] O. Vinyals, M. Fortunato, and N. Jaitly, “Pointer networks,” Advances in Neural Information Processing Systems, vol. 28, 2015.
  • [20] I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio, “Neural combinatorial optimization with reinforcement learning,” arXiv preprint arXiv:1611.09940, 2016.
  • [21] W. Kool, H. van Hoof, and M. Welling, “Attention, learn to solve routing problems!” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=ByxBFsRqYm
  • [22] D. Selsam, M. Lamm, B. Bünz, P. Liang, L. de Moura, and D. L. Dill, “Learning a SAT solver from single-bit supervision,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=HJMC_iA5tm
  • [23] E. Yolcu and B. Póczos, “Learning local search heuristics for boolean satisfiability,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [24] M. Nazari, A. Oroojlooy, L. Snyder, and M. Takác, “Reinforcement learning for solving the vehicle routing problem,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [25] T. Barrett, W. Clements, J. Foerster, and A. Lvovsky, “Exploratory combinatorial optimization with reinforcement learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 3243–3250.
  • [26] P. Gupta, M. Gasse, E. Khalil, P. Mudigonda, A. Lodi, and Y. Bengio, “Hybrid models for learning to branch,” Advances in Neural Information Processing Systems, vol. 33, pp. 18 087–18 097, 2020.
  • [27] P. Gupta, E. B. Khalil, D. Chetélat, M. Gasse, Y. Bengio, A. Lodi, and M. P. Kumar, “Lookback for learning to branch,” Transactions of Machine Learning Research (TMLR), 2022. [Online]. Available: https://openreview.net/forum?id=EQpGkw5rvL
  • [28] Y. Tang, S. Agrawal, and Y. Faenza, “Reinforcement learning for integer programming: Learning to cut,” in International Conference on Machine Learning.   PMLR, 2020, pp. 9367–9376.
  • [29] T. Achterberg, “Constraint integer programming,” Ph.D. dissertation, 2007.
  • [30] Z. Zeyang, Z. Ziwei, W. Xin, and Z. Wenwu, “Learning to solve travelling salesman problem with hardness-adaptive curriculum,” in AAAI Conference on Artificial Intelligence, vol. 36, 2022, pp. 9136–9144.
  • [31] J. Yuan, W. Yaoxin, C. Zhiguang, and Z. Jie, “Learning to solve routing problems via distributionally robust optimization,” in AAAI Conference on Artificial Intelligence, vol. 36, 2022, pp. 9786–9794.
  • [32] J. Bi, Y. Ma, J. Wang, Z. Cao, J. Chen, Y. Sun, and Y. M. Chee, “Learning generalizable models for vehicle routing problems via knowledge distillation,” in Advances in Neural Information Processing Systems, A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, Eds., 2022. [Online]. Available: https://openreview.net/forum?id=sOVNpUEgKMp
  • [33] L. Xin, W. Song, Z. Cao, and J. Zhang, “Generative adversarial training for neural combinatorial optimization models,” 2022. [Online]. Available: https://openreview.net/forum?id=9vsRT9mc7U
  • [34] H. Duan, P. Vaezipoor, M. B. Paulus, Y. Ruan, and C. Maddison, “Augment with care: Contrastive learning for combinatorial problems,” in Proceedings of the 39th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, Eds., vol. 162.   PMLR, 17–23 Jul 2022, pp. 5627–5642. [Online]. Available: https://proceedings.mlr.press/v162/duan22b.html
  • [35] H. Lu, Z. Li, R. Wang, Q. Ren, X. Li, M. Yuan, J. Zeng, X. Yang, and J. Yan, “Roco: A general framework for evaluating robustness of combinatorial optimization solvers on graphs,” in The Eleventh International Conference on Learning Representations.
  • [36] G. Zarpellon, J. Jo, A. Lodi, and Y. Bengio, “Parameterizing branch-and-bound search trees to learn branching policies,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 5, pp. 3931–3939, May 2021. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/16512
  • [37] M. Etheve, Z. Alès, C. Bissuel, O. Juan, and S. Kedad-Sidhoum, “Reinforcement learning for variable selection in a branch and bound algorithm,” in Integration of Constraint Programming, Artificial Intelligence, and Operations Research, E. Hebrard and N. Musliu, Eds.   Cham: Springer International Publishing, 2020, pp. 176–185.
  • [38] H. Sun, W. Chen, H. Li, and L. Song, “Improving learning to branch via reinforcement learning,” in Learning Meets Combinatorial Algorithms at NeurIPS2020, 2020. [Online]. Available: https://openreview.net/forum?id=z4D7-PTxTb
  • [39] C. W. F. Parsonson, A. Laterre, and T. D. Barrett, “Reinforcement learning for branch-and-bound optimisation using retrospective trajectories,” in AAAI Conference on Artificial Intelligence, vol. 37 (4), 2023, pp. 4061–4069. [Online]. Available: https://doi.org/10.1609/aaai.v37i4.25521
  • [40] Q. Qu, X. Li, Y. Zhou, J. Zeng, M. Yuan, J. Wang, J. Lv, K. Liu, and K. Mao, “An improved reinforcement learning algorithm for learning to branch,” arXiv preprint arXiv:2201.06213, 2022.
  • [41] D. Applegate, R. Bixby, V. Chvátal, and W. Cook, “Finding cuts in the tsp (a preliminary report),” Citeseer, Tech. Rep., 1995.
  • [42] A. Prouvost, J. Dumouchelle, L. Scavuzzo, M. Gasse, D. Chételat, and A. Lodi, “Ecole: A gym-like library for machine learning in combinatorial optimization solvers,” in Learning Meets Combinatorial Algorithms at NeurIPS2020, 2020. [Online]. Available: https://openreview.net/forum?id=IVc9hqgibyB
  • [43] R. Volpi, H. Namkoong, O. Sener, J. C. Duchi, V. Murino, and S. Savarese, “Generalizing to unseen domains via adversarial data augmentation,” in Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., 2018, pp. 5339–5349. [Online]. Available: https://proceedings.neurips.cc/paper/2018/hash/1d94108e907bb8311d8802b48fd54b4a-Abstract.html
  • [44] Z. Chen, J. Liu, X. Wang, J. Lu, and W. Yin, “On representing mixed-integer linear programs by graph neural networks,” 2022.
  • [45] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning, 2nd ed., ser. Adaptive Computation and Machine Learning.   Cambridge, MA: MIT Press, 2018.
  • [46] S. Chen, E. Dobriban, and J. Lee, “A group-theoretic framework for data augmentation,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33.   Curran Associates, Inc., 2020, pp. 21 321–21 333. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2020/file/f4573fc71c731d5c362f0d7860945b88-Paper.pdf
  • [47] H. Wang, S. Zheng, C. Xiong, and R. Socher, “On the generalization gap in reparameterizable reinforcement learning,” in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97.   PMLR, 09–15 Jun 2019, pp. 6648–6658. [Online]. Available: https://proceedings.mlr.press/v97/wang19o.html
  • [48] C. Hansknecht, I. Joormann, and S. Stiller, “Cuts, primal heuristics, and learning to branch for the time-dependent traveling salesman problem,” 2018.
  • [49] T. Joachims, “Optimizing search engines using clickthrough data,” Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 2002. [Online]. Available: https://api.semanticscholar.org/CorpusID:207605508
  • [50] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2015. [Online]. Available: http://arxiv.org/abs/1412.6980
  • [51] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” Journal of Machine Learning Research, vol. 9, no. 11, 2008.
  • [52] C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” Journal of Big Data, vol. 6, pp. 1–48, 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:195811894
  • [53] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, “Random erasing data augmentation,” ArXiv, vol. abs/1708.04896, 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:2035600
  • [54] L. Perez and J. Wang, “The effectiveness of data augmentation in image classification using deep learning,” ArXiv, vol. abs/1712.04621, 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:12219403
  • [55] D. Yarats, I. Kostrikov, and R. Fergus, “Image augmentation is all you need: Regularizing deep reinforcement learning from pixels,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=GY6-6sTvGaf
  • [56] M. Laskin, K. Lee, A. Stooke, L. Pinto, P. Abbeel, and A. Srinivas, “Reinforcement learning with augmented data,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33.   Curran Associates, Inc., 2020, pp. 19 884–19 895. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2020/file/e615c82aba461681ade82da2da38004a-Paper.pdf
  • [57] M. Laskin, A. Srinivas, and P. Abbeel, “CURL: Contrastive unsupervised representations for reinforcement learning,” in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, H. D. III and A. Singh, Eds., vol. 119.   PMLR, 13–18 Jul 2020, pp. 5639–5650. [Online]. Available: https://proceedings.mlr.press/v119/laskin20a.html
  • [58] A. Sadeghi, M. Ma, B. Li, and G. B. Giannakis, “Distributionally robust semi-supervised learning over graphs,” arXiv preprint arXiv:2110.10582, 2021.
  • [59] H. Dai, H. Li, T. Tian, X. Huang, L. Wang, J. Zhu, and L. Song, “Adversarial attack on graph structured data,” in International Conference on Machine Learning.   PMLR, 2018, pp. 1115–1124.
  • [60] F. Feng, X. He, J. Tang, and T.-S. Chua, “Graph adversarial training: Dynamically regularizing based on graph structure,” IEEE Transactions on Knowledge and Data Engineering, vol. 33, no. 6, pp. 2493–2504, 2019.
  • [61] H. Xue, K. Zhou, T. Chen, K. Guo, X. Hu, Y. Chang, and X. Wang, “Cap: Co-adversarial perturbation on weights and features for improving generalization of graph neural networks,” arXiv preprint arXiv:2110.14855, 2021.
  • [62] T. Dao, A. Gu, A. Ratner, V. Smith, C. De Sa, and C. Re, “A kernel theory of modern data augmentation,” in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97.   PMLR, 09–15 Jun 2019, pp. 1528–1537. [Online]. Available: https://proceedings.mlr.press/v97/dao19b.html
  • [63] S. Wu, H. Zhang, G. Valiant, and C. Re, “On the generalization effects of linear transformations in data augmentation,” in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, H. D. III and A. Singh, Eds., vol. 119.   PMLR, 13–18 Jul 2020, pp. 10 410–10 420. [Online]. Available: https://proceedings.mlr.press/v119/wu20g.html
  • [64] K. Leyton-Brown, M. Pearson, and Y. Shoham, “Towards a universal test suite for combinatorial auction algorithms,” in Proceedings of the 2nd ACM Conference on Electronic Commerce, 2000, pp. 66–76.
  • [65] E. Balas and A. Ho, Set covering algorithms using cutting planes, heuristics, and subgradient optimization: a computational study.   Springer, 1980.
  • [66] G. Cornuéjols, R. Sridharan, and J.-M. Thizy, “A comparison of heuristics and relaxations for the capacitated plant location problem,” European Journal of Operational Research, vol. 50, pp. 280–297, 1991.
  • [67] D. Bergman, A. A. Ciré, W. J. van Hoeve, and J. Hooker, “Decision diagrams for optimization,” Constraints, vol. 20, pp. 494 – 495, 2015.
  • [68] M. Gasse, S. Bowly, Q. Cappart, J. Charfreitag, L. Charlin, D. Chételat, A. Chmiela, J. Dumouchelle, A. Gleixner, A. M. Kazachkov, E. Khalil, P. Lichocki, A. Lodi, M. Lubin, C. J. Maddison, M. Christopher, D. J. Papageorgiou, A. Parjadis, S. Pokutta, A. Prouvost, L. Scavuzzo, G. Zarpellon, L. Yang, S. Lai, A. Wang, X. Luo, X. Zhou, H. Huang, S. Shao, Y. Zhu, D. Zhang, T. Quan, Z. Cao, Y. Xu, Z. Huang, S. Zhou, C. Binbin, H. Minggui, H. Hao, Z. Zhiyu, A. Zhiwu, and M. Kun, “The machine learning for combinatorial optimization competition (ml4co): Results and insights,” in Proceedings of the NeurIPS 2021 Competitions and Demonstrations Track, ser. Proceedings of Machine Learning Research, D. Kiela, M. Ciccone, and B. Caputo, Eds., vol. 176.   PMLR, 06–14 Dec 2022, pp. 220–231. [Online]. Available: https://proceedings.mlr.press/v176/gasse22a.html
[Uncaptioned image] Haoyang Liu received the B.Sc. degree in School of Mathematics from University of Science and Technology of China, Hefei, China, in 2023. He is currently a graduate student in the Department of Electronic Engineering and Information Science at University of Science and Technology of China, Hefei, China. His research interests include reinforcement learning and learning to optimize.
[Uncaptioned image] Yufei Kuang received the B.Sc. degree in Statistics from Nanjing University, Nanjing, China, in 2020. He is currently a Ph.D. candidate in the School of Information Science and Technology at University of Science and Technology of China, Hefei, China. His research interests include reinforcement learning and learning to optimize.
[Uncaptioned image] Jie Wang received the B.Sc. degree in electronic information science and technology from University of Science and Technology of China, Hefei, China, in 2005, and the Ph.D. degree in computational science from the Florida State University, Tallahassee, FL, in 2011. He is currently a professor in the Department of Electronic Engineering and Information Science at University of Science and Technology of China, Hefei, China. His research interests include AI for science, knowledge graph, large-scale optimization, deep learning, etc. He is a senior member of IEEE.
[Uncaptioned image] Xijun Li is a senior researcher at HUAWEI Noah’s Ark Lab. Before that, he received the M.Sc. degree from Shanghai Jiao Tong University, P.R. China, in 2018. He is working towards his Ph.D. degree at the University of Science and Technology of China (HUAWEI-USTC Joint Ph.D. Program) under the supervision of Prof. Jie Wang. He has published 10+ papers at top peer-reviewed conferences and journals (such as ICLR, KDD, ICDE, SIGMOD, DAC, CIKM, ICDCS, TCYB, etc.) and has applied for/published 12 patents (first author on 6 patents) with Noah’s Ark Lab. He won the championship of the student leaderboard in the Dual Track of the NeurIPS’21 ML4CO Competition. His recent research interests focus on Mathematical Programming Solvers, Learning to Optimize (L2O) and Machine Learning for Computer Systems (ML4CS).
[Uncaptioned image] Yongdong Zhang received the Ph.D. degree in electronic engineering from Tianjin University, Tianjin, China, in 2002. He is currently a Professor at the University of Science and Technology of China. He has authored more than 100 refereed journal articles and conference papers. His current research interests include multimedia content analysis and understanding, multimedia content security, video encoding, and streaming media technology. He serves as an Editorial Board Member of Multimedia Systems journal and Neurocomputing. He was the recipient of the Best Paper Award in PCM 2013, ICIMCS 2013, and ICME 2010, and the Best Paper Candidate in ICME 2011. He is a Senior Member of IEEE.
[Uncaptioned image] Feng Wu received the B.S. degree in electrical engineering from Xidian University in 1992, and the M.S. and Ph.D. degrees in computer science from the Harbin Institute of Technology in 1996 and 1999, respectively. He is currently a Professor with the University of Science and Technology of China, where he is also the Dean of the School of Information Science and Technology. Before that, he was a Principal Researcher and the Research Manager with Microsoft Research Asia. His research interests include image and video compression, media communication, and media analysis and synthesis. He has authored or coauthored over 200 high quality articles (including several dozens of IEEE Transaction papers) and top conference papers on MOBICOM, SIGIR, CVPR, and ACM MM. He has 77 granted U.S. patents. His 15 techniques have been adopted into international video coding standards. As a coauthor, he received the Best Paper Award at 2009 IEEE Transactions on Circuits and Systems for Video Technology, PCM 2008, and SPIE VCIP 2007. He also received the Best Associate Editor Award from IEEE Circuits and Systems Society in 2012. He also serves as the TPC Chair for MMSP 2011, VCIP 2010, and PCM 2009, and the Special Sessions Chair for ICME 2010 and ISCAS 2013. He serves as an Associate Editor for IEEE Transactions on Circuits and Systems for Video Technology, IEEE Transactions ON Multimedia, and several other international journals.

Appendix A More Background Information

A.1 Branching Rules

An efficient branching policy significantly influences the size of the B&B tree and thus the overall solving efficiency. Commonly used traditional branching rules include strong branching and pseudocost branching.

The strong branching rule is known to produce the smallest B&B trees among the branching rules. Let P0P_{0} denote the optimal objective of the LP relaxation at the parent node. We call a variable xix_{i} a branching candidate variable if xix_{i} takes a fractional value xix_{i}^{*} in the optimal solution of the parent LP relaxation. For a branching candidate variable xix_{i}, we let Pi,1P_{i,1} and Pi,2P_{i,2} be the relaxed optimal objectives of the two children obtained by adding the upper or lower bound (xixi,xixix_{i}\leq\lfloor x_{i}^{*}\rfloor,x_{i}\geq\lceil x_{i}^{*}\rceil), respectively. Strong branching then chooses the branching variable xSBx_{SB} from the branching candidate set 𝒞\mathcal{C} by the following rule,

x_{SB}=\underset{x_{i}\in\mathcal{C}}{\text{argmax}}\left[\max(P_{i,1}-P_{0},\epsilon)\cdot\max(P_{i,2}-P_{0},\epsilon)\right],

where ϵ>0\epsilon>0 is a small constant. Since strong branching needs to solve the LP relaxations of the two sub-problems for each candidate variable, it requires considerable computation.
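As an illustration, the following sketch evaluates the product rule above for a candidate set; `solve_child_lps` is a hypothetical helper (not part of our implementation) that returns the two child LP optima of a candidate.

```python
def strong_branching_score(parent_obj, child_objs, eps=1e-6):
    """Product rule of strong branching: score a candidate by the objective
    gains of its two child LP relaxations (equation above)."""
    p_down, p_up = child_objs
    return max(p_down - parent_obj, eps) * max(p_up - parent_obj, eps)

def select_strong_branching_variable(parent_obj, candidates, solve_child_lps):
    """candidates: fractional variables; solve_child_lps(x_i) is assumed to return
    the relaxed optima of the children obtained by rounding x_i down and up."""
    scores = {x_i: strong_branching_score(parent_obj, solve_child_lps(x_i))
              for x_i in candidates}
    return max(scores, key=scores.get)
```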

The pseudocost branching rule is simpler and has a much lower computational cost than strong branching. It keeps track of previous successful branchings and scores each candidate branching variable using this historical information, avoiding the expensive child LP solves. However, the pseudocost branching rule requires considerable human experience for manual tuning.
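A minimal sketch of the pseudocost bookkeeping is shown below; the exact update and initialization rules used in SCIP are more involved, so this is only an illustration under simplified assumptions.

```python
from collections import defaultdict

class PseudocostBranching:
    """Track average per-unit objective gains observed after past branchings and
    score candidates without solving child LPs."""
    def __init__(self, eps=1e-6):
        self.gain_sum = defaultdict(lambda: [0.0, 0.0])  # var -> [down, up] gains
        self.count = defaultdict(lambda: [0, 0])
        self.eps = eps

    def update(self, var, direction, obj_gain, frac_change):
        # direction: 0 = down child, 1 = up child
        self.gain_sum[var][direction] += obj_gain / max(frac_change, self.eps)
        self.count[var][direction] += 1

    def score(self, var, frac):
        # frac: fractional part of the variable's LP value at the current node
        down = self.gain_sum[var][0] / max(self.count[var][0], 1) * frac
        up = self.gain_sum[var][1] / max(self.count[var][1], 1) * (1.0 - frac)
        return max(down, self.eps) * max(up, self.eps)
```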

To overcome the disadvantages of strong branching and pseudocost branching, hybrid branching rules have been proposed to combine the two methods. Reliability pseudocost branching is one such hybrid rule with state-of-the-art performance, and it is therefore used as the default branching rule in SCIP.

A.2 GNN Encoder for Branching Samples

In recent studies [10], researchers use a single graph neural network composed of two interleaved half-convolutions to process the instance bipartite graphs or branching samples. The GNN first performs message passing from the variable nodes to the constraint nodes, and then from the constraint nodes to the variable nodes to obtain the variable embeddings, i.e.,

\mathbf{h}_{i}^{W}=\mathrm{MLP}_{2}\left(\left[\mathbf{w}_{i},\sum_{(\mathbf{w}_{i},\mathbf{v}_{j})\in\mathcal{E}}\mathrm{MLP}_{1}\left(\left[\mathbf{w}_{i},\mathbf{v}_{j},e_{ij}\right]\right)\right]\right)
\mathbf{h}_{j}^{V}=\mathrm{MLP}_{4}\left(\left[\mathbf{v}_{j},\sum_{(\mathbf{w}_{k},\mathbf{v}_{j})\in\mathcal{E}}\mathrm{MLP}_{3}\left(\left[\mathbf{w}_{k},\mathbf{v}_{j},e_{kj}\right]\right)\right]\right),

where MLP denotes a multi-layer perceptron and [][\cdot] denotes vector concatenation. The variable embeddings 𝐡jV\mathbf{h}_{j}^{V} can then be used for downstream tasks such as optimal solution prediction and branching variable selection.
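A minimal PyTorch sketch of one such half-convolution pass is given below; it assumes scalar edge features and 64-dimensional embeddings, and it illustrates the equations above rather than reproducing the exact implementation of [10].

```python
import torch
import torch.nn as nn

class HalfConvolution(nn.Module):
    """One message-passing pass over the bipartite graph: aggregate messages from
    src nodes into dst nodes along edges, then update the dst embeddings."""
    def __init__(self, dim=64):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(2 * dim + 1, dim), nn.ReLU())   # MLP_1 / MLP_3
        self.upd = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())       # MLP_2 / MLP_4

    def forward(self, h_src, h_dst, edge_index, edge_attr):
        src, dst = edge_index                       # edges point src -> dst
        m = self.msg(torch.cat([h_src[src], h_dst[dst], edge_attr], dim=-1))
        agg = torch.zeros_like(h_dst).index_add_(0, dst, m)   # sum over neighborhoods
        return self.upd(torch.cat([h_dst, agg], dim=-1))

# variables -> constraints, then constraints -> variables:
# h_cons = HalfConvolution()(h_var, h_cons, (var_idx, cons_idx), e)
# h_var  = HalfConvolution()(h_cons, h_var, (cons_idx, var_idx), e)
```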

Figure 4: GNN inference for branching score prediction.

A.3 Primal-Dual Integral Metric

When employing the B&B algorithm, we keep track of the global primal and dual bounds. The global primal bound at time tt, i.e., 𝐜𝐱t\mathbf{c}^{\top}\mathbf{x}_{t}^{*}, is the best objective value found by time tt. The global dual bound is the minimum dual bound ztz^{*}_{t} over all leaves of the search tree at time tt. By definition, the primal-dual gap 𝐜𝐱t𝐳t\mathbf{c}^{\top}\mathbf{x}_{t}^{*}-\mathbf{z}^{*}_{t} is nonnegative and decreases as the B&B algorithm proceeds, and B&B finds the optimal solution if and only if 𝐜𝐱t𝐳t=0\mathbf{c}^{\top}\mathbf{x}_{t}^{*}-\mathbf{z}^{*}_{t}=0.

With time limit TT, the primal-dual integral (PD integral) is defined as

\int_{0}^{T}(\mathbf{c}^{\top}\mathbf{x}_{t}^{*}-\mathbf{z}_{t}^{*})\,dt.

When solving the same instance, a solver that achieves a smaller PD integral typically finds better feasible solutions within the same solving time. The PD integral is a well-recognized metric of solver performance. Additionally, the PD gap metric is 0 if the instance is solved to optimality within the time limit.
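For concreteness, the following sketch approximates the PD integral from a solver log recorded as time-stamped primal and dual bounds, treating the bounds as piecewise constant between records (an assumption of this illustration, not of our evaluation pipeline).

```python
def primal_dual_integral(log, time_limit):
    """Approximate the integral of (primal bound - dual bound) over [0, T] from a
    time-sorted list of (t, primal_bound, dual_bound) triples."""
    integral, prev_t, prev_gap = 0.0, 0.0, None
    for t, primal, dual in log:
        if prev_gap is not None:
            integral += prev_gap * (t - prev_t)
        prev_t, prev_gap = t, primal - dual
    if prev_gap is not None:
        integral += prev_gap * (time_limit - prev_t)   # extend until the time limit
    return integral
```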

Appendix B Theoretical Analysis

B.1 More Related Work on Theoretical Results for Data Augmentation

Data augmentation is an effective method for introducing prior information and promoting data diversity, and it is widely used in supervised learning [52, 53, 54] and reinforcement learning tasks [55, 56, 57]. Recently, augmentation on graph data has attracted increasing attention [58, 59, 60, 61]. On the theoretical side, previous work [46] proposes a generalization bound for data augmentation from a group-theoretic perspective; [62] analyzes data augmentation from a kernel perspective; and [63] studies the effects of data augmentation on the ridge estimator in an over-parameterized linear regression setting.

B.2 Proof of Theorem 1

The generalization bounds in Equations 8 and 9 can be derived via the classical approach; we briefly present the proof as follows. We start by splitting the term to be bounded into two parts,

\mathcal{J}_{p^{\prime}}(\hat{\pi})-\hat{\mathcal{J}}_{p}(\hat{\pi})=\left[\mathcal{J}_{p^{\prime}}(\hat{\pi})-\mathcal{J}_{p}(\hat{\pi})\right]+\left[\mathcal{J}_{p}(\hat{\pi})-\hat{\mathcal{J}}_{p}(\hat{\pi})\right].

The second part can be bounded via the Rademacher complexity using McDiarmid's inequality,

\mathcal{J}_{p}(\hat{\pi})-\hat{\mathcal{J}}_{p}(\hat{\pi})\leq 2\mathfrak{R}_{K}(\ell\circ\Pi)+\Omega(K).

To bound the first part, we see that

\mathcal{J}_{p^{\prime}}(\hat{\pi})-\mathcal{J}_{p}(\hat{\pi})
=\mathbb{E}_{G^{\prime}}\mathbb{E}_{\mathbf{s}^{\prime}}\left[\ell(\pi(\mathbf{s}^{\prime}),\pi_{E}(\mathbf{s}^{\prime}))\right]-\mathbb{E}_{G}\mathbb{E}_{\mathbf{s}}\left[\ell(\pi(\mathbf{s}),\pi_{E}(\mathbf{s}))\right]
\leq\sup_{\|f\|_{\text{Lip}}\leq 1}\left(\mathbb{E}_{G^{\prime}}\mathbb{E}_{\mathbf{s}^{\prime}}\left[f(\mathbf{s}^{\prime})\right]-\mathbb{E}_{G}\mathbb{E}_{\mathbf{s}}\left[f(\mathbf{s})\right]\right)
=\mathcal{W}_{\|\cdot\|}(D_{\pi_{E},p}(\mathbf{s}),D_{\pi_{E},p^{\prime}}(\mathbf{s})).

We complete the proof of Equation 8 by combining the bounds for these two parts. The proof of the generalization bound under instance augmentation (Equation 9) is analogous.

We next compare the two generalization bounds.

2\mathfrak{R}_{K}^{\Phi}(\ell\circ\Pi)-2\mathfrak{R}_{K}(\ell\circ\Pi)
\leq 2\mathbb{E}_{\mathcal{D}}\left[\sup_{\pi\in\Pi}\frac{1}{K}\sum_{k=1}^{K}\ell(\pi(\tilde{\mathbf{s}}_{k}),\pi_{E}(\tilde{\mathbf{s}}_{k}))\right]-2\mathbb{E}_{\mathcal{D}}\left[\sup_{\pi\in\Pi}\frac{1}{K}\sum_{k=1}^{K}\ell(\pi(\mathbf{s}_{k}),\pi_{E}(\mathbf{s}_{k}))\right]
=2\mathbb{E}_{\tilde{\mathbf{s}}}\left[\sup_{\pi\in\Pi}\frac{1}{K}\sum_{k=1}^{K}\ell(\pi(\tilde{\mathbf{s}}_{k}),\pi_{E}(\tilde{\mathbf{s}}_{k}))\right]-2\mathbb{E}_{\mathbf{s}}\left[\sup_{\pi\in\Pi}\frac{1}{K}\sum_{k=1}^{K}\ell(\pi(\mathbf{s}_{k}),\pi_{E}(\mathbf{s}_{k}))\right]
\leq 2\sup_{f}\left(\mathbb{E}_{\tilde{\mathbf{s}}}[f(\tilde{\mathbf{s}})]-\mathbb{E}_{\mathbf{s}}[f(\mathbf{s})]\right)
=2\mathcal{W}_{\|\cdot\|}(D_{\pi_{E},\Phi(p)}(\mathbf{s}),D_{\pi_{E},p^{\prime}}(\mathbf{s})).

If α1/3\alpha\leq 1/3 and the following inequality holds,

2\mathcal{W}_{\|\cdot\|}(D_{\pi_{E},\Phi(p)}(\mathbf{s}),D_{\pi_{E},p^{\prime}}(\mathbf{s}))
\leq\mathcal{W}_{\|\cdot\|}(D_{\pi_{E},p}(\mathbf{s}),D_{\pi_{E},p^{\prime}}(\mathbf{s}))-\mathcal{W}_{\|\cdot\|}(D_{\pi_{E},\Phi(p)}(\mathbf{s}),D_{\pi_{E},p^{\prime}}(\mathbf{s})).

then Equation 10 follows, i.e., the generalization bound under instance augmentation can be lower:

\mathcal{W}_{\|\cdot\|}(D_{\pi_{E},\Phi(p)}(\mathbf{s}),D_{\pi_{E},p^{\prime}}(\mathbf{s}))+2\mathfrak{R}_{K}^{\Phi}(\ell\circ\Pi)
\leq\mathcal{W}_{\|\cdot\|}(D_{\pi_{E},p}(\mathbf{s}),D_{\pi_{E},p^{\prime}}(\mathbf{s}))+2\mathfrak{R}_{K}(\ell\circ\Pi).

B.3 Proof of Theorem 2

We start by decomposing the generalization error as in the IL setting,

\mathbb{E}_{\mathcal{D}}[\mathcal{J}_{p^{\prime}}(\hat{\pi}_{\Phi})-\mathcal{J}_{p^{\prime}}(\pi^{*})]   (12)
=\mathbb{E}_{\mathcal{D}}[\mathcal{J}_{p^{\prime}}(\hat{\pi}_{\Phi})-\mathcal{J}_{\Phi(p)}(\hat{\pi}_{\Phi})]+\mathbb{E}_{\mathcal{D}}[\mathcal{J}_{\Phi(p)}(\hat{\pi}_{\Phi})-\hat{\mathcal{J}}_{\Phi(p)}(\hat{\pi}_{\Phi})]
\quad+\mathbb{E}_{\mathcal{D}}[\hat{\mathcal{J}}_{\Phi(p)}(\hat{\pi}_{\Phi})-\hat{\mathcal{J}}_{\Phi(p)}(\pi^{*})]
\quad+\mathbb{E}_{\mathcal{D}}[\hat{\mathcal{J}}_{\Phi(p)}(\pi^{*})-\mathcal{J}_{\Phi(p)}(\pi^{*})]+\mathbb{E}_{\mathcal{D}}[\mathcal{J}_{\Phi(p)}(\pi^{*})-\mathcal{J}_{p^{\prime}}(\pi^{*})]
\leq\mathbb{E}_{\mathcal{D}}[\mathcal{J}_{p^{\prime}}(\hat{\pi}_{\Phi})-\mathcal{J}_{\Phi(p)}(\hat{\pi}_{\Phi})]+\mathbb{E}_{\mathcal{D}}[\mathcal{J}_{\Phi(p)}(\hat{\pi}_{\Phi})-\hat{\mathcal{J}}_{\Phi(p)}(\hat{\pi}_{\Phi})]
\quad+\mathbb{E}_{\mathcal{D}}[\mathcal{J}_{\Phi(p)}(\pi^{*})-\mathcal{J}_{p^{\prime}}(\pi^{*})]

The inequality in (12) holds since the training loss 𝒥^Φ(p)\hat{\mathcal{J}}_{\Phi(p)} reaches its minimum at the policy π^Φ\hat{\pi}_{\Phi} by definition, and the expectation 𝔼𝒟[𝒥^Φ(p)(π)𝒥Φ(p)(π)]=0\mathbb{E}_{\mathcal{D}}[\hat{\mathcal{J}}_{\Phi(p)}(\pi^{*})-\mathcal{J}_{\Phi(p)}(\pi^{*})]=0. We then try to control the other three terms in (12).

For reparameterizable RL,

\mathcal{J}_{\Phi(p)}(\pi)=\mathbb{E}_{\xi\sim p(\xi)}[R(\pi,\mathcal{T},\mathcal{I}^{\Phi},\xi)].

For the second term in Equation (12), we view RR as a random variable, and the total returns sampled from instances drawn from the distribution Φ(p)\Phi(p) are i.i.d. We then use the standard symmetrization argument to obtain the following bound, which holds with confidence at least 1δ1-\delta,

\mathcal{J}_{\Phi(p)}(\hat{\pi}_{\Phi})-\hat{\mathcal{J}}_{\Phi(p)}(\hat{\pi}_{\Phi})\leq 2\mathfrak{R}_{n}^{\Phi}(\ell\circ\Pi)+\Omega(n).   (13)

Finally, we bound the first and last terms. Different from the IL setting, the transition function plays an important role in generating the training branching samples; thus, the branching samples cannot be viewed as i.i.d. samples.

For any policy πΠ\pi\in\Pi, we have the following estimations,

\mathcal{J}_{\Phi(p)}(\pi)-\mathcal{J}_{p^{\prime}}(\pi)
\leq\mathbb{E}_{\xi\sim p(\xi)}R(\pi,\mathcal{T},\mathcal{I}^{\Phi},\xi)-\mathbb{E}_{\xi\sim p(\xi)}R(\pi,\mathcal{T},\mathcal{I}^{\prime},\xi)
=\mathbb{E}_{\xi}\sum_{t=0}^{T-1}r(\tilde{\mathbf{s}}_{t})-\mathbb{E}_{\xi}\sum_{t=0}^{T-1}r(\mathbf{s}^{\prime}_{t})
\leq\mathbb{E}_{\xi}\sum_{t=0}^{T-1}\left|r(\tilde{\mathbf{s}}_{t})-r(\mathbf{s}^{\prime}_{t})\right|
\leq\sum_{t=0}^{T-1}\mathbb{E}_{\xi}\,\text{Lip}_{r}\|\tilde{\mathbf{s}}_{t}-\mathbf{s}^{\prime}_{t}\|.

We next recursively bound the per-step discrepancy between the states induced by the augmented instances and those induced by the instances from the target distribution.

\|\tilde{\mathbf{s}}_{t}-\mathbf{s}^{\prime}_{t}\|
\leq\|\mathcal{T}(\tilde{\mathbf{s}}_{t-1},\pi(\tilde{\mathbf{s}}_{t-1}))-\mathcal{T}(\mathbf{s}^{\prime}_{t-1},\pi(\mathbf{s}^{\prime}_{t-1}))\|
\leq\|\mathcal{T}(\tilde{\mathbf{s}}_{t-1},\pi(\tilde{\mathbf{s}}_{t-1}))-\mathcal{T}(\tilde{\mathbf{s}}_{t-1},\pi(\mathbf{s}^{\prime}_{t-1}))\|+\|\mathcal{T}(\tilde{\mathbf{s}}_{t-1},\pi(\mathbf{s}^{\prime}_{t-1}))-\mathcal{T}(\mathbf{s}^{\prime}_{t-1},\pi(\mathbf{s}^{\prime}_{t-1}))\|
\leq\text{Lip}_{\mathcal{T}_{2}}\text{Lip}_{\pi}\|\tilde{\mathbf{s}}_{t-1}-\mathbf{s}^{\prime}_{t-1}\|+\text{Lip}_{\mathcal{T}_{1}}\|\tilde{\mathbf{s}}_{t-1}-\mathbf{s}^{\prime}_{t-1}\|
=(\text{Lip}_{\mathcal{T}_{2}}\text{Lip}_{\pi}+\text{Lip}_{\mathcal{T}_{1}})\|\tilde{\mathbf{s}}_{t-1}-\mathbf{s}^{\prime}_{t-1}\|
\leq(\text{Lip}_{\mathcal{T}_{2}}\text{Lip}_{\pi}+\text{Lip}_{\mathcal{T}_{1}})^{t}\|\tilde{\mathbf{s}}_{0}-\mathbf{s}^{\prime}_{0}\|
=(\text{Lip}_{\mathcal{T}_{2}}\text{Lip}_{\pi}+\text{Lip}_{\mathcal{T}_{1}})^{t}c(\Phi(G),G^{\prime}).

We thus obtain the bound

\mathbb{E}_{\mathcal{D}}[\mathcal{J}_{\Phi(p)}(\pi)-\mathcal{J}_{p^{\prime}}(\pi)]   (14)
\leq\mathbb{E}_{\mathcal{D}}\left[\sum_{t=0}^{T-1}\mathbb{E}_{\xi}\,\text{Lip}_{r}(\text{Lip}_{\mathcal{T}_{2}}\text{Lip}_{\pi}+\text{Lip}_{\mathcal{T}_{1}})^{t}c(\Phi(G),G^{\prime})\right]
=\sum_{t=0}^{T-1}\text{Lip}_{r}(\text{Lip}_{\mathcal{T}_{2}}\text{Lip}_{\pi}+\text{Lip}_{\mathcal{T}_{1}})^{t}\mathcal{W}_{\mathcal{G}}(\Phi(p),p^{\prime}).

Combining Equations 13 and 14, we complete the proof.

Appendix C Details of the Benchmarks

C.1 Synthesis Benchmarks for Evaluations

We evaluate the performance of our proposed methods on four classical benchmarks of NP-hard problem families, namely combinatorial auctions, minimum set covering, capacitated facility location and maximum independent set. We use these four synthesis benchmarks mainly for the following reasons. First, they are widely-used testbeds for MILP solvers [10, 26, 27]. Second, we can generate a series of transfer distributions with much larger scales to evaluate the solvers' generalization ability. Third, these benchmarks represent a wide collection of MILP problems encountered in practice.

We use the code in https://github.com/ds4dm/learn2branch for data generation. (1) Combinatorial auctions comprises instances generated following the arbitrary relationships procedure of [64]. (2) Set covering comprises instances generated following [65] with the default density 0.05. (3) Capacitated facility location comprises instances generated following [66]. (4) Maximum independent set comprises instances on Erdős-Rényi random graphs, generated following the procedure of [67]. For all benchmarks, we classify six distributions D1D_{1} to D6D_{6} by increasing instance scales or difficulty levels, where D1D_{1} is the easiest with the smallest scales and D6D_{6} is the most difficult with the largest scales, which are five to ten times larger than those in D1D_{1}. The baselines can solve all the instances of D1D_{1} and D2D_{2} within 900s but reach the time limit on D4D6D_{4}-D_{6}. Table X summarizes the hyperparameters of the generation algorithms.
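For illustration, the following sketch generates D1D_{1}-scale instances with the Ecole instance generators [42]; the generator argument names below are best-effort assumptions and should be checked against the generation code linked above.

```python
import ecole

# D1 hyperparameters from Table X; argument names are assumptions based on the Ecole API.
generators = {
    "setcover": ecole.instance.SetCoverGenerator(n_rows=500, n_cols=1000, density=0.05),
    "cauctions": ecole.instance.CombinatorialAuctionGenerator(n_items=100, n_bids=500),
    "facilities": ecole.instance.CapacitatedFacilityLocationGenerator(n_facilities=100, n_customers=100),
    "indset": ecole.instance.IndependentSetGenerator(n_nodes=500, affinity=4),
}

for name, gen in generators.items():
    gen.seed(0)
    for i in range(10):                      # the paper uses 10000 training instances
        instance = next(gen)                 # an ecole.scip.Model
        instance.write_problem(f"{name}_train_{i}.lp")
```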

We generate 10000 D1D_{1} instances for training and 2000 D1D_{1} instances for validation, and we evaluate on 32 instances from each of D1D_{1} to D6D_{6}. For the GNN and AdaSolver-IL methods, we solve the training instances with SCIP 7.0.3 and record the branching samples and branching decisions of a strong branching expert. We collect 100000 samples for training and 20000 for validation in total. For the RL-based solvers, we limit the processing time of each episode to at most 1000s.
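A minimal sketch of collecting such expert samples through the Ecole branching environment is shown below, in the spirit of [10]; the exact observation functions and SCIP parameters we use may differ, so this configuration is an assumption for illustration.

```python
import ecole
import numpy as np

env = ecole.environment.Branching(
    observation_function=(ecole.observation.NodeBipartite(),
                          ecole.observation.StrongBranchingScores()),
    scip_params={"limits/time": 1000},
)

samples = []
for instance in ["setcover_train_0.lp"]:          # instances generated above
    obs, action_set, _, done, _ = env.reset(instance)
    while not done:
        node_obs, sb_scores = obs
        expert_action = action_set[np.argmax(sb_scores[action_set])]
        samples.append((node_obs, action_set, expert_action))
        obs, action_set, _, done, _ = env.step(expert_action)
```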

TABLE X: The hyperparameters for generating the instances.
Distributions D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Row 500 1000 2000 3000 4000 8000
Column 1000 1000 1000 1000 1000 1000
Set Covering
Distributions D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Items 100 200 300 400 500 600
Bids 500 1000 1500 2000 2500 3000
Combinatorial Auctions
Distributions D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Facilities 100 200 400 600 800 1600
Customers 100 100 100 100 100 100
Capacitated Facility Location
Distributions D1D_{1} D2D_{2} D3D_{3} D4D_{4} D5D_{5} D6D_{6}
Nodes 500 1000 1500 2000 2500 3000
Affinity 4 4 4 4 4 4
Maximum Independent Set

C.2 Real-world Benchmarks for Evaluations

The real-world dataset we use in this paper is the Anonymous dataset from the NeurIPS 2021 ML4CO competition [68]. It consists of 98 training instances and 20 validation instances. Similarly to the synthesis benchmarks, we collect 100000 training samples and 20000 validation samples for GNN and AdaSolver-IL. For the RL-based solvers, we limit the processing time of each episode to at most 1000s.

C.3 Datasets for the Visualization

We generate ten distributions of the set covering benchmark for visualization using combinations of {500,1000}\{500,1000\} rows, 1000 columns and densities {0.01,0.05,0.1,0.2,0.5}\{0.01,0.05,0.1,0.2,0.5\}. The density hyperparameter controls the proportion of nonzero entries in the constraint matrices. We vary the density to obtain graphs with different local structures, simulating the environmental perturbations in real-world applications.

Appendix D Implementation Details

D.1 Hardware Specification and Experiment Settings

We conducted all the experiments on a single machine with NVIDIA GeForce RTX 3090 GPUs and Intel(R) Xeon(R) E5-2667 v4 CPUs @ 3.20GHz. During the evaluation, we run three random seeds {0,1,2}\{0,1,2\} for each instance.

D.2 Details of the Baselines

In this part, we introduce the details of the baselines used in this paper.

(1) GNN. We use the same implementation as the GCNN in [10]. First, we apply two interleaved half-convolutions to the input bipartite graphs. The graph convolution consists of two successive passes with 64 units, one from variables to constraints and one from constraints to variables. We then apply a final 2-layer perceptron to the variable node embeddings, combined with a masked softmax activation, to produce a probability distribution over the branching candidates. We also use the prenorm layers proposed in [10] and pre-train them.

(2) SVMRank and LambdaMART. We implement the SVMRank and LambdaMART branching policies in SCIP as in [10]. The two models take as input the 91-dimensional branching features described in [9] and are trained on pairs of variable state–rank samples for ranking. SVMRank is trained by minimizing a pairwise loss over variables, and LambdaMART is trained by maximizing the normalized discounted cumulative gain. As mentioned in [10], we reduce the dataset to 250000 candidate variables for training and 100000 for validation because of the prohibitive memory and time cost.

(3) Trees. The Trees model is implemented following [10]: we obtain the input variable features by concatenating the bipartite-graph variable features with the minimum, mean and maximum of the edge and constraint features over each variable's neighborhood. The model predicts a branching score for each candidate variable and is trained by minimizing the mean-squared error between the predicted scores and the ground-truth strong branching scores. We also reduce the size of the dataset for Trees.

(4) tMDP+DFS. tMDP+DFS is a reinforcement-learning-based method under the framework of tree Markov Decision Processes (tMDPs), a variant of temporal Markov Decision Processes. We implement the tMDP+DFS model with the two-interleaved-half-convolution GNN encoder described in [11]; the GNN architecture is the same as in [10]. tMDP+DFS is trained by REINFORCE with tree policy gradients and depth-first-search node selection.

(5) TreeGate. TreeGate is an imitation learning framework with input features of 25-dimensional branching candidate states and 60-dimensional search tree states [36]. TreeGate introduces such tree-state parameterization by first embedding the tree-state inputs and then sending them to the proposed gating layers to extract the tree-based signal. Combining the variable representations and tree-state representation, TreeGate predicts the branching scores for each branching candidate variable.

D.3 Representations of Instance Bipartite Graphs and Branching Samples

To avoid confusion, recall that the instance bipartite graphs are the input of the augmenter and do not contain solving statistics (since we augment an instance before solving it), while the branching samples are the input of the branching policy and do contain solving statistics. We use the MILP bipartite representation for instance graphs implemented in the Ecole 0.8 package [42], with nine variable features, one constraint feature and one edge feature, listed in Table XI. The branching samples use the graph representation described in [10], listed in Table XII.
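For concreteness, an instance bipartite graph can be extracted roughly as follows. Function and attribute names follow the Ecole documentation as we recall it and should be verified against the installed Ecole 0.8; the instance path is a placeholder.

import ecole

model = ecole.scip.Model.from_file("instance.lp")  # placeholder path to any MILP file

obs_func = ecole.observation.MilpBipartite()
obs_func.before_reset(model)
obs = obs_func.extract(model, False)

variable_features = obs.variable_features      # one row per variable, 9 features (Table XI)
constraint_features = obs.constraint_features  # one row per constraint, 1 feature (bias)
edge_indices = obs.edge_features.indices       # (2, nnz) index pairs of nonzero coefficients
edge_values = obs.edge_features.values         # the nonzero constraint coefficients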

TABLE XI: The variable features, constraint features and edge features used for instance bipartite graphs.
Index Variable Feature Name Description
0 Objective Objective coefficient
1-4 Variable type Variable type (binary, integer, implicit-integer, continuous)
5 Specified bounds Whether the variable has a lower bound
6 Specified bounds Whether the variable has an upper bound
7 Lower bound Whether the variable reaches its lower bound
8 Upper bound Whether the variable reaches its upper bound
Index Constraint Feature Name Description
0 Bias Normalized right-hand-side of the constraint
Index Edge Feature Name Description
0 Coefficient Constraint coefficient
TABLE XII: The variable features, constraint features and edge feature used for branching samples, following those proposed in [10].
Index Variable Feature Name Description
0-3 Variable type Variable type
4 Normalized coefficient Normalized objective coefficient
5 Specified bounds Whether the variable has a lower bound
6 Specified bounds Whether the variable has an upper bound
7 Lower bound Whether the variable reaches its lower bound
8 Upper bound Whether the variable reaches its upper bound
9 Solution fractionality Fractional part of the variable
10-13 Categorical Variable is at its bound or zero
14 Reduced cost Amount by which the objective coefficient of the variable should decrease so that the variable assumes a positive value in the LP solution
15 Age Number of LP iterations
16 Solution value Value of the variable
17 Incumbent value Value of the variable in the current best solution
18 Average incumbent value Average value in the observed feasible primal solutions
Index Constraint Feature Name Description
0 Cosine similarity Cosine of the angle between objective coefficients and the coefficients of this constraint
1 Bias Normalized right-hand-side of the constraint
2 Age Iterations since the last time the constraint was active
3 Normalized dual value Value of dual variable corresponding to the constraint
4 Bounds Whether the constraint reaches its bound
Index Edge Feature Name Description
0 Coefficient Constraint coefficient

D.4 Implementation Details of AdaSolver

D.4.1 Model Architecture of the Augmenter

The augmenter comprises three parts: an augmentation policy network, a state-value function network and a discriminator network. Each part has a GNN graph encoder and MLP decoders for the downstream tasks. The GNN graph encoder first encodes the node features of the instance bipartite graphs into 10-dimensional embeddings and then applies an interleaved half-convolution layer to extract 64-dimensional representations for the variables $\{\mathbf{h}_{i}^{V}\}_{i=1}^{n}$ and constraints $\{\mathbf{h}_{j}^{W}\}_{j=1}^{m}$, respectively. The augmentation policy $\Phi_{\eta}$ and the state-value function $V_{\alpha}$ share the same instance bipartite graph encoder. For the augmentation policy, we use three single-layer networks, $\text{MLP}_{1}(\mathbf{h}_{i}^{V})$, $\text{MLP}_{2}(\mathbf{h}_{j}^{W})$ and $\text{MLP}_{3}([\mathbf{h}_{i}^{V},\mathbf{h}_{j}^{W}])$, to obtain the masking probabilities for the $i$-th variable, the $j$-th constraint and the edge $\mathbf{e}_{i,j}$, respectively. We then choose the nodes and edges to mask according to the predicted probabilities with a predefined masking proportion. The state-value function uses one single-layer network to predict the current state value. The discriminator uses one single-layer network to predict the probability that an augmented instance comes from the original distribution.
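A minimal PyTorch sketch of these decoder heads is given below; for brevity it shows all heads operating on one shared set of embeddings, whereas in AdaSolver the discriminator has its own encoder, and the exact pooling and selection rules are simplified.

import torch
import torch.nn as nn

class AugmenterHeads(nn.Module):
    """Single-layer heads on top of the 64-dimensional variable/constraint
    embeddings: masking logits for variables, constraints and edges, a
    state-value estimate, and a discriminator score (simplified sketch)."""
    def __init__(self, dim=64):
        super().__init__()
        self.var_head = nn.Linear(dim, 1)        # MLP_1
        self.cons_head = nn.Linear(dim, 1)       # MLP_2
        self.edge_head = nn.Linear(2 * dim, 1)   # MLP_3 on [h_i^V, h_j^W]
        self.value_head = nn.Linear(2 * dim, 1)
        self.disc_head = nn.Linear(2 * dim, 1)

    def forward(self, h_vars, h_cons, edge_index):
        # edge_index[0]: variable indices, edge_index[1]: constraint indices.
        var_prob = torch.sigmoid(self.var_head(h_vars)).squeeze(-1)
        cons_prob = torch.sigmoid(self.cons_head(h_cons)).squeeze(-1)
        edge_repr = torch.cat([h_vars[edge_index[0]], h_cons[edge_index[1]]], dim=-1)
        edge_prob = torch.sigmoid(self.edge_head(edge_repr)).squeeze(-1)
        pooled = torch.cat([h_vars.mean(dim=0), h_cons.mean(dim=0)], dim=-1)
        value = self.value_head(pooled)
        real_prob = torch.sigmoid(self.disc_head(pooled))
        return var_prob, cons_prob, edge_prob, value, real_prob

def select_mask(probs, proportion):
    """One way to pick elements to mask under a fixed proportion: take the
    top-k by predicted probability (sampling would also fit the description)."""
    k = max(1, int(proportion * probs.numel()))
    return torch.topk(probs, k).indices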

D.4.2 Training and Optimization

We use the Adam optimizer [50] to train the networks in all experiments.

In the IL setting, we set an initial learning rate of $1\times 10^{-3}$ and a batch size of 8 for GNN, and a learning rate of $1\times 10^{-3}$ and a batch size of 4 for the augmenter. We divide the learning rate for GNN by 5 when the GNN validation loss does not improve for 10 epochs, and stop training if it does not improve for 20 epochs. The instance graph augmentation begins at Epoch 10. In each epoch after Epoch 10, we choose $k_{1}$ instances to augment and extract $k_{2}$ branching samples from the augmented instances. Here $k_{1}=50$, $k_{2}=1000$ in the combinatorial auctions benchmark; $k_{1}=100$, $k_{2}=5000$ in the other three synthetic benchmarks; and $k_{1}=10$, $k_{2}=1000$ in the real-world Anonymous dataset. Then we randomly choose 10000 branching samples from the original instances and 2000 samples from the augmented instances to train the GNN.
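The learning-rate schedule and early stopping described above can be sketched as follows; `run_epoch` is a hypothetical callable that trains for one epoch and returns the validation loss, so this illustrates the schedule rather than our exact training loop.

import torch

def fit_branching_gnn(gnn, run_epoch, max_epochs=1000):
    """Divide the learning rate by 5 when the validation loss stagnates for
    10 epochs and stop training after 20 stagnant epochs (sketch)."""
    optimizer = torch.optim.Adam(gnn.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.2, patience=10)  # factor 0.2 == divide by 5
    best, stagnant = float("inf"), 0
    for epoch in range(max_epochs):
        val_loss = run_epoch(gnn, optimizer, epoch)  # hypothetical helper
        scheduler.step(val_loss)
        if val_loss < best:
            best, stagnant = val_loss, 0
        else:
            stagnant += 1
            if stagnant >= 20:  # early stopping
                break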

In the RL setting, we set an initial learning rate of $1\times 10^{-6}$ and a batch size of 8 for tMDP+DFS, and a learning rate of $1\times 10^{-6}$ and a batch size of 4 for the augmenter. In each epoch, we choose 10 instances and run the branching agent on them. Every ten epochs, we augment 10 instances and run the agent on them as well. The maximum exploration time is set to no more than 1000s on each instance.

D.4.3 Hyperparameters

We list the common hyperparameters in Table XIII. For the augmentation operators and masking proportions, we mask 3% of the constraints and 1% of the edges on the set covering benchmark, 1% of the variables on the combinatorial auctions benchmark, 1% of the constraints and 1% of the variables on the capacitated facility location benchmark, 1% of the edges and 5% of the constraints on the maximum independent set benchmark, and 2% of the constraints and 1% of the variables on the Anonymous dataset.

TABLE XIII: Common hyperparameters for the augmenter in AdaSolver.
Hyperparameter Value
Half-convolution layers 1 for variables and 1 for constraints
Graph encoder hidden size 64
Batch size 4
PPO learning rate 1e-3
PPO entropy coefficient 1e-2
PPO epochs at each iteration 3
PPO buffer size 300
PPO clip ratio 0.2

D.4.4 Variants of AdaSolver

Random augmentation (RA). The branching policy network is the same as that of GNN. Unlike AdaSolver, RA does not have an augmentation network; instead, it randomly masks constraints, variables and edges under the same operators and proportions as AdaSolver.
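A minimal sketch of RA's masking step, assuming the per-benchmark proportions listed in Appendix D.4.3 (the function and argument names are illustrative):

import numpy as np

def random_mask(n_vars, n_cons, n_edges, var_prop=0.0, cons_prop=0.03, edge_prop=0.01, seed=None):
    """Uniformly sample the indices of variables, constraints and edges to mask.
    The default proportions correspond to the set covering benchmark."""
    rng = np.random.default_rng(seed)
    def pick(n, p):
        k = int(round(p * n))
        return rng.choice(n, size=k, replace=False) if k > 0 else np.array([], dtype=int)
    return pick(n_vars, var_prop), pick(n_cons, cons_prop), pick(n_edges, edge_prop)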

AdaSolver-REINFORCE. The branching policy network is the same as that of GNN. Unlike AdaSolver, AdaSolver-REINFORCE employs the REINFORCE algorithm to train the augmentation network; all other hyperparameters remain the same. REINFORCE is known to be less sample-efficient than PPO, which explains why AdaSolver-REINFORCE performs worse than AdaSolver.
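For reference, the two estimators differ as follows (standard textbook forms written with the augmentation policy $\Phi_\eta$; $R_t$ denotes the return, $\hat{A}_t$ an advantage estimate, and $\epsilon$ the clip ratio from Table XIII). REINFORCE updates with the plain score-function gradient
$$\nabla_{\eta} J(\eta) \approx \sum_{t} \nabla_{\eta} \log \Phi_{\eta}(a_t \mid s_t)\, R_t,$$
using each collected trajectory only once, whereas PPO reuses the collected samples for several epochs under the clipped surrogate objective
$$L^{\mathrm{CLIP}}(\eta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\eta)\hat{A}_t,\ \mathrm{clip}\big(r_t(\eta),\, 1-\epsilon,\, 1+\epsilon\big)\hat{A}_t\right)\right], \quad r_t(\eta) = \frac{\Phi_{\eta}(a_t \mid s_t)}{\Phi_{\eta_{\mathrm{old}}}(a_t \mid s_t)}.$$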

We will release all the code for training and end-to-end evaluation once the paper is accepted.

Appendix E More Experimental Results

In this part, we provide the results of the experiments in the main text in terms of the number of explored nodes (Nodes) and the primal-dual gap (PD gap). We first explain these metrics.

Nodes. The number of explored nodes reflects the size of the B&B search tree. However, Nodes alone may not be an appropriate measure of solving efficiency. First, fewer nodes do not necessarily mean a shorter solving time, since the solver may spend more time on each node, resulting in a longer total solving time. Second, on distributions where the solvers cannot find the optimal solutions within the time limit, Nodes does not reliably reflect solving performance: we cannot know how many nodes would be explored if the instance were solved to optimality, and, within a fixed time limit, fewer nodes may simply indicate that the solver spends its budget on harder nodes whose LP relaxations require more computation.

PD gap. (1) On distributions where the solvers find the optimal solutions within the time limit, the PD gap reaches zero and fewer nodes imply better solving performance. (2) On distributions where the solvers cannot find the optimal solutions within the time limit, the PD gap is a suitable measure of solving efficiency: given the time limit, a lower PD gap implies that the solver has found a better feasible solution so far.
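For concreteness, a common form of the relative primal-dual gap is
$$\text{PD gap} = \frac{|z_{\text{primal}} - z_{\text{dual}}|}{\max\{|z_{\text{primal}}|,\, |z_{\text{dual}}|,\, \varepsilon\}},$$
where $z_{\text{primal}}$ and $z_{\text{dual}}$ are the best primal and dual bounds found within the time limit and $\varepsilon$ is a small constant preventing division by zero; the exact convention used by SCIP may differ slightly, for example in how sign changes of the bounds are handled.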

E.1 More Results of Main Evaluation

We present more results of the comparative experiments. Table XIV reports the performance of AdaSolver and the baselines in terms of the total number of nodes and the PD gap. The results show that, on the easier distributions, AdaSolver-IL consistently explores significantly fewer nodes than tMDP+DFS and TreeGate, and slightly more nodes than GNN. Nevertheless, AdaSolver-IL solves faster, which means that it spends less time on each node and thus learns a more effective branching strategy. On the distributions where the solvers reach the time limit, AdaSolver-IL outperforms all the baselines in terms of the PD gap. We also observe that AdaSolver-IL explores more nodes than the other baselines within the limited time, which suggests that it learns a policy that selects easier nodes to explore.

TABLE XIV: More results on main evaluation in terms of Nodes and PD gap.
$D_1$ $D_2$ $D_3$ $D_4$ $D_5$ $D_6$
Method Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap
RPB 247.59 0.00 5866.93 0.00 52814.36 0.05 25630.00 0.10 18279.17 0.15 3910.22 0.26
SVMRank 332.60 0.00 5478.59 0.00 20326.22 0.03 5152.89 0.13 2382.83 0.19 798.75 0.28
Trees 396.46 0.00 5960.50 0.00 17705.65 0.07 6323.89 0.13 3552.26 0.18 1021.86 0.28
LambdaMart 327.42 0.00 5928.39 0.00 28178.68 0.03 7501.61 0.13 3472.22 0.18 893.93 0.28
tMDP+DFS 2324.27 0.00 35037.37 0.02 34062.76 0.04 11145.84 0.17 5365.54 0.22 1519.11 0.31
TreeGate 1242.64 0.00 12826.16 0.004 19133.68 0.03 6901.62 0.15 3144.67 0.20 568.44 0.30
GNN 259.61 0.00 3913.11 0.00 28926.49 0.05 14676.73 0.11 7162.61 0.16 1120.95 0.27
AdaSolver-RL 2227.30 0.00 35936.70 0.02 32528.50 0.12 15368.45 0.17 8476.60 0.21 1653.05 0.32
AdaSolver-IL 270.69 0.00 4361.15 0.00 40711.45 0.05 30649.53 0.10 19573.54 0.14 2367.05 0.27
Set Covering
$D_1$ $D_2$ $D_3$ $D_4$ $D_5$ $D_6$
Method Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap
RPB 42.00 0.00 1616.12 0.00 14154.73 0.00 24072.15 0.01 15178.22 0.02 9185.16 0.02
SVMRank 93.02 0.00 1604.26 0.00 11150.03 0.004 9843.47 0.01 7564.52 0.02 6019.93 0.02
Trees 102.49 0.00 3487.29 0.00 29083.64 0.005 16295.54 0.01 12398.58 0.02 10584.14 0.02
LambdaMart 87.58 0.00 1587.87 0.00 14109.08 0.003 15478.90 0.01 11426.58 0.02 8965.19 0.02
tMDP+DFS 141.69 0.00 3290.90 0.00 27826.58 0.003 47398.77 0.01 28752.17 0.02 16827.58 0.02
TreeGate 527.47 0.00 4473.12 0.00 26772.53 0.003 21841.75 0.02 15610.67 0.02 6948.79 0.03
GNN 84.99 0.00 1306.10 0.00 11710.30 0.00 32724.97 0.009 21995.83 0.01 15017.98 0.02
AdaSolver-RL 191.85 0.00 2308.95 0.00 31815.45 0.00 58011.39 0.01 38012.95 0.02 22258.45 0.02
AdaSolver-IL 102.71 0.00 1953.57 0.00 39556.67 0.00 36135.36 0.008 27599.51 0.01 17599.67 0.01
Combinatorial Auctions
$D_1$ $D_2$ $D_3$ $D_4$ $D_5$ $D_6$
Method Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap
RPB 387.31 0.00 413.93 0.00 145.20 0.002 11.78 0.002 2.38 0.02 1.20 0.07
SVMRank 491.71 0.00 533.21 0.00 337.59 0.002 34.63 0.004 10.46 0.02 2.46 0.09
Trees 508.60 0.00 588.19 0.00 317.75 0.002 31.00 0.004 8.22 0.02 2.78 0.09
LambdaMart 445.69 0.00 520.89 0.00 338.38 0.002 34.79 0.004 10.51 0.02 2.52 0.09
tMDP+DFS 723.95 0.00 788.96 0.00 361.68 0.002 42.21 0.002 7.68 0.02 2.89 0.09
TreeGate 584.64 0.00 631.03 0.00 299.50 0.003 46.27 0.004 2.93 0.007 1.44 0.05
GNN 458.11 0.00 592.82 0.00 290.25 0.002 73.76 0.003 16.22 0.005 3.17 0.07
AdaSolver-RL 1024.40 0.00 839.95 0.00 392.70 0.002 116.00 0.004 36.08 0.004 8.13 0.04
AdaSolver-IL 529.61 0.00 659.79 0.00 487.49 0.002 190.46 0.002 59.84 0.003 3.72 0.08
Capacitated Facility Location
$D_1$ $D_2$ $D_3$ $D_4$ $D_5$ $D_6$
Method Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap
RPB 594.46 0.00 16097.93 0.00 17730.83 0.03 900.00 5387.45 900.00 1874.07 900.01 854.44
SVMRank 289.60 0.00 5757.21 0.01 5097.74 0.02 1536.43 0.03 909.91 0.04 507.74 0.05
Trees 653.03 0.00 13349.63 0.004 7205.15 0.02 2037.52 0.03 1444.09 0.04 912.21 0.05
LambdaMart 425.62 0.00 12268.72 0.005 8108.65 0.02 1862.61 0.03 1084.89 0.04 605.03 0.05
tMDP+DFS 459.08 0.00 21183.04 0.003 23298.80 0.02 7350.35 0.03 3615.30 0.04 1647.60 0.05
TreeGate 2136.85 0.003 8585.86 0.01 3516.26 0.01 1198.28 0.04 925.65 0.05 999.87 0.05
GNN 285.58 0.00 10623.60 0.001 16520.12 0.02 3809.20 0.03 1392.95 0.04 679.15 0.04
AdaSolver-RL 351.50 0.00 22448.85 0.00 30533.75 0.01 12887.25 0.03 7027.05 0.04 4016.80 0.05
AdaSolver-IL 241.12 0.00 10664.04 0.004 20861.74 0.01 19596.78 0.03 12479.65 0.04 6548.56 0.05
Maximum Independent Set

E.2 More Results of Ablation Studies

We compare the Nodes and PD gap metrics in the ablation studies. The results in Table XV show that AdaSolver outperforms RA in terms of the PD gap (solving-time comparisons are reported in the main text), demonstrating the effectiveness of our proposed adversarial instance augmenter.

TABLE XV: More results on the ablation studies in terms of Nodes and PD gap.
$D_1$ $D_2$ $D_3$ $D_4$ $D_5$ $D_6$
Method Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow
RA 269.83 0.00 5676.33 0.00 44928.63 0.06 14929.49 0.18 8362.64 0.23 3301.91 0.31
AdaSolver-IL 270.69 0.00 4361.15 0.00 40711.45 0.05 30649.53 0.10 19573.54 0.14 2367.05 0.27
RA 2227.10 0.00 36310.10 0.02 31287.20 0.12 10865.95 0.18 5084.90 0.22 815.05 0.33
AdaSolver-RL 2227.30 0.00 35936.70 0.02 32528.50 0.12 15368.45 0.17 8476.60 0.21 1653.05 0.32
Set Covering
$D_1$ $D_2$ $D_3$ $D_4$ $D_5$ $D_6$
Method Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes\downarrow PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow
RA 101.24 0.00 2491.01 0.00 43006.75 0.00 29486.44 0.009 21203.95 0.01 14441.14 0.02
AdaSolver-IL 102.71 0.00 1953.57 0.00 39556.67 0.00 36135.36 0.008 27599.51 0.01 17599.67 0.01
RA 199.05 0.00 2237.20 0.00 30545.60 0.00 58141.90 0.01 31930.00 0.02 15079.60 0.03
AdaSolver-RL 191.85 0.00 2308.95 0.00 31815.45 0.00 58011.39 0.01 38012.95 0.02 22258.45 0.02
Combinatorial Auctions
$D_1$ $D_2$ $D_3$ $D_4$ $D_5$ $D_6$
Method Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow
RA 473.79 0.00 608.75 0.00 438.14 0.002 89.51 0.004 29.80 0.005 9.52 0.05
AdaSolver-IL 529.61 0.00 659.79 0.00 487.49 0.002 190.46 0.002 59.84 0.003 3.72 0.08
RA 1018.60 0.00 854.90 0.00 468.90 0.002 98.50 0.003 7.90 0.008 2.15 0.11
AdaSolver-RL 1024.40 0.00 839.95 0.00 392.70 0.002 116.00 0.004 36.08 0.004 8.13 0.04
Capacitated Facility Location
$D_1$ $D_2$ $D_3$ $D_4$ $D_5$ $D_6$
Method Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow
RA 226.79 0.00 40493.42 0.00 31616.33 0.01 14390.99 0.03 7323.71 0.04 4721.40 0.05
AdaSolver-IL 241.12 0.00 10664.04 0.004 20861.74 0.01 19596.78 0.03 12479.65 0.04 6548.56 0.05
RA 323.60 0.00 23020.05 0.007 27906.65 0.01 9073.45 0.03 2974.65 0.04 1434.90 0.05
AdaSolver-RL 351.50 0.00 22448.85 0.00 30533.75 0.01 12887.25 0.03 7027.05 0.04 4016.80 0.05
Maximum Independent Set

E.3 More Results of Longer Time Limit

We evaluate our proposed AdaSolver with a much longer time limit of two hours on the Anonymous dataset to investigate whether AdaSolver still performs well under a time limit far beyond that used during training. We train the models with a time limit of 900s. The results in Table XVI demonstrate that AdaSolver significantly improves the performance of the underlying solvers.

TABLE XVI: Results on the real-world dataset with a longer time limit, showing that AdaSolver significantly outperforms the underlying solvers.
Method Time(s) Nodes PD integral \downarrow PD gap \downarrow
RPB 6703.81 300709.45 435985.03 1.50e+19
tMDP+DFS 5647.83 338191.85 241413.64 1.02
GNN 5670.69 308766.63 206413.12 1.13
AdaSolver-RL 5514.66 358002.10 211590.82 0.86
AdaSolver-IL 5640.18 295459.21 203917.75 1.04

E.4 More Results on the Sample Efficiency

We compare the Nodes and PD gap metrics in the experiments of Section 8.3. The results in Table XVII show that AdaSolver significantly outperforms the underlying baselines in terms of the PD gap when trained on limited instances, demonstrating a much higher sample efficiency. Notice that on the distributions where the solvers can solve the instances within the time limit, AdaSolver-IL also explores significantly fewer nodes.

TABLE XVII: Sample efficiency of our proposed AdaSolver-IL. The results demonstrate that AdaSolver-IL significantly improves the sample efficiency when the training instances are few.
$D_1$ $D_2$ $D_3$ $D_4$ $D_5$ $D_6$
Method Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow
GNN 769.60 0.00 15754.15 0.00 37577.55 0.10 26173.75 0.15 18570.70 0.20 4854.60 0.30
AdaSolver-IL 445.25 0.00 10010.15 0.00 49802.45 0.08 38256.80 0.13 28539.60 0.17 7270.45 0.28
Set Covering
$D_1$ $D_2$ $D_3$ $D_4$ $D_5$ $D_6$
Method Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow Nodes PD gap \downarrow
GNN 217.50 0.00 3532.60 0.00 34269.20 0.004 33319.85 0.01 23282.75 0.02 16594.65 0.03
AdaSolver-IL 107.93 0.00 1714.28 0.00 14221.71 0.00 31613.71 0.01 23753.37 0.01 16616.03 0.02
Combinatorial Auctions