Learning Backdoors for Mixed Integer Linear Programs with Contrastive Learning

Junyang Cai Correspondence to: Junyang Cai <[email protected]> Taoan Huang Bistra Dilkina University of Southern California

Abstract

Many real-world problems can be efficiently modeled as Mixed Integer Linear Programs (MILPs) and solved with the Branch-and-Bound method. Prior work has shown the existence of MILP backdoors, small sets of variables such that prioritizing branching on them when possible leads to faster running times. However, finding high-quality backdoors that improve running times remains an open question. Previous work learns to estimate the relative solver speed of randomly sampled backdoors through ranking and then decide whether to use the highest-ranked backdoor candidate. In this paper, we utilize the Monte-Carlo tree search method to collect backdoors for training, rather than relying on random sampling, and adapt a contrastive learning framework to train a Graph Attention Network model to predict backdoors. Our method, evaluated on several common MILP problem domains, demonstrates performance improvements over both Gurobi and previous models.

\paperid

1535

1 Introduction

Many real-world problems, such as path planning [40], scheduling [12], and network design [6, 21] problems, are combinatorial optimization (CO) problems and generally NP-hard to solve. Many CO problems can be formulated as Mixed Integer Linear Programs (MILPs), which aim to optimize a linear objective function over both continuous and integer variables in the feasible region defined by linear constraints. MILPs can be solved via Branch-and-Bound [32], which repeatedly solves a linear program (LP) relaxation of the problem and selects one of the integer variables that is fractional in the LP relaxation solution to branch on and creates new subproblems until all the integrality constraints are satisfied. Commercial MILP solvers, such as Gurobi [16], heavily rely on Branch-and-Bound algorithms and are enhanced by various powerful heuristics.

Backdoors, initially introduced for Constraint Satisfaction Problems [47], are generalized to MILPs [7], where MILP "strong backdoors" are defined as subsets of integer variables such that branching on only them yields an optimal integral solution. [11] report speedup in MILP solving times by prioritizing backdoor variables in branching. Instead of node selection, variable selection happens during the branching; we assign branching priority for backdoor variables before the start of the tree search. Previously, [7] proposed a sampling method for finding backdoors by randomly selecting from a subset of variables based on the fractionality of their LP relaxation solutions. More recently, [28] developed a Monte-Carlo tree search (MCTS) approach for finding backdoors. However, the performance of backdoors in a problem domain or even a specific instance varies and it is hard to identify backdoors that improve solving time efficiently.

Because of the advantage of the machine learning (ML) model to learn from complex historical distribution, learning-based approaches have been successfully used in combinatorial optimization, specifically speeding up solving MILPs. There has been an increased interest in data-driven heuristic designs for MILP for various decision-making in Branch-and Bound [49] such as node selection and branch variables selection. Those works treat MILP solvers as a white box; they repeatedly make decisions at each node or every branching decision during solving time, which requires special implementations of open-source solvers, e.g., SCIP. While our backdoor prediction treats the solvers as a black box, we only make one branching priority decision ahead of the solving, which gives us the advantage of being easily applied to all solvers. [10] is the first to use the learning-based approach to find backdoors and propose a two-step, learning-based model. In the first step, a scorer model is trained with a pairwise ranking loss to rank candidate backdoors from random sampling by solver runtime. In the second step, a classifier model is trained with cross-entropy loss to determine whether to use the best-scoring backdoors or the default MILP solver. Recently, contrastive learning has emerged as an effective approach for other tasks in CO, such as solving satisfiability problems (SAT) [8] and improving large neighborhood search for MILPs [22], based on learning representations by contrasting positive samples (e.g., high-quality solutions or heuristic decisions) against negative ones.

In this paper, we introduce a novel learning framework for predicting effective backdoors. Instead of using sampling methods [10], we employ the Monte-Carlo tree search (MCTS) [28] for data collection that allows us to find higher-quality backdoors compared to previous methods [10]. High-quality backdoors found by the MCTS method are used as positive samples for contrastive learning. We also collect backdoor samples that cause performance drops as negative samples. We represent MILPs as bipartite graphs with variables and constraints being the two sets of nodes in the graphs [13] and parameterize the ML model using a graph attention network [46]. We use a contrastive loss that encourages the model to predict backdoors similar to the positive samples but dissimilar to the negative ones [37] and use a greedy selection process to predict the most beneficial backdoors at test time. Compared to the previous model, contrastive learning is more efficient in training and more deterministic in evaluation. We conduct empirical experiments on several common problem domains, including Generalized Independent Set Problem (GISP) [20], Set Cover Problem (SC) [1], Combinatorial Auction (CA) [33], Maximal Independent Set (MIS) [45], Facility Location(FC) [5], and Neural Network Verification(NN) [36]. Our method achieves faster solve times than Gurobi and the state-of-the-art scorer+classifier model proposed in [10].

2 Background

In this section, we first define MILPs and then introduce backdoor for MILP solving.

2.1 Mixed Integer Linear Programming

A Mixed Integer Linear Programming (MILP) is defined as

\displaystyle\min\{c^{T}x\mid Ax\leq b,x\in\mathbb{R}^{n},x_{j}\in\{0,1\}\forall j\in I\}

where $A\in\mathbb{R}^{m\times n},b\in\mathbb{R}^{m},c\in\mathbb{R}^{n}$ , and the non-empty set $I\subseteq{1,...,n}$ indexes the binary variables. A solution to the MILP is a feasible assignment of values to the variables. In this paper, we focus on the formulation above that consists of continuous and binary variables due to MCTS data collection limited to this setting, but our contrastive learning framework can still be applied to MILP with non-binary integer variables.

2.2 Backdoors for MILP

Given a MILP instance defined by the tuple $(A,b,c,I)$ , a “strong backdoor" of size $K\ll|I|$ is a set of integer variables $B$ , $|B|=K$ , $B\subset I$ such that branching exclusively on variables from $B$ results in a provably optimal solution or a proof of infeasibility. The order in which decision variables are considered for branching can affect the performance of the solver’s primal heuristics, which in turn affects pruning in Branch-and-Bound. We refer to [7, 11] for a detailed discussion of why order matters in practice in MILP solvers. To distinguish from “strong backdoors”, we introduce term “backdoor” to denote a set of integer variables where prioritizing branching on them reduces solving time. Throughout the paper, we use “candidate backdoors” to describe a set of integer variables currently under evaluation, which may or may not yield improvements in solving time.

3 Related Work

In this section, we summarize related work on backdoors for CO problems, learning to solve MILPs, and contrastive learning for CO problems.

3.1 Backdoors For Combinatorial Problems

Backdoors are first introduced by [47] for SAT solving and, over the years, many approaches are proposed to find backdoors to improve SAT solving [38, 30]. Observing the connection between SAT and MILP, [7] generalize the concept of backdoors and propose two forms of random sampling to find backdoors: uniform sampling and biased sampling to favor ones that are more fractional in the LP relaxation solutions, while [11] propose another strategy that formulates finding backdoors as a Set Covering Problem. Recently, [28] contributed to a third strategy for finding backdoors using Monte-Carlo tree search, which balances exploration and exploitation by design. Utilizing a reward function based on the tree weight in the Branch-and-Bound tree, MCTS approximates the backdoor’s strength without the need to solve time-consuming MILP instances. However, the quality of the backdoors it finds varies and it is also time-consuming (5+ hours) for MCTS to find backdoors, which opens the floor for learning-based techniques to find backdoors [10].

3.2 Machine Learning for MILPs

Several studies have applied ML to improve heuristic decisions in Branch-and-Bound, such as which node to select next to expand [18, 41, 31], which variable to branch on next [26, 13, 15], which cut to select next [44, 39], which primal heuristics to run next [27, 4]. Another line of work focuses on improving speed to find primal solutions. [42, 43, 22] use ML to iteratively decide which subsets of variables to re-optimize for a given solution to a MILP. [17, 36] learn to predict high-quality solutions to MILPs with supervised learning.

3.3 Contrastive Learning for CO

Contrastive Learning is a discriminative approach that aims at grouping similar samples closer and dissimilar samples far from each other [25]. It has been successfully applied in both computer vision area [19] and natural language processing area [34]. The work of contrastive learning on graph representation has also been extensively studied [48], but it has not been explored much for CO problem domains. [35] derive a contrastive loss for decision-focused learning to solve CO problems with uncertain inputs. [8] use contrastive pre-training to learn good representations for the boolean satisfiability problem. [22] use contrastive learning to learn efficient and effective destroy heuristics in large neighborhood search for MILPs. [23] use contrastive learning to predict optimal solutions to MILPs in the predict-and-search algorithm.

Refer to caption — Figure 1: This figure illustrates the comprehensive training and evaluation phases for learning backdoors for MILPs with contrastive learning. In the training phase, we employ MCTS to gather backdoor samples on training instances and collect positive and negative samples. Simultaneously, a feature extractor transforms the MILP instances into a bipartite graph and collects features of the MILPs. The policy is represented by a graph attention network (GAT) and trained with a contrastive loss. In the evaluation phase, the MILP instance is converted into a bipartite graph as input for the GAT, generating a score vector as output. We greedily select variables with the highest score to obtain backdoors, which are solved using a MILP solver.

4 Contrastive Learning for Backdoor Prediction

Our goal is to learn a policy that takes as input a MILP instance and outputs a backdoor to speed up solving time. In this section, we introduce how we prepare data for contrastive learning, the policy network, and the contrastive loss used in training. Finally, we introduce how the learned policy is used in predicting backdoors. Figure 1 shows the full training and evaluation pipeline.

4.1 Data Collection

Diverging from previous approaches for finding or predicting backdoors [10, 11, 7], we incorporate contrastive learning to predict backdoors. The key challenge lies in determining high-quality positive samples and similar negative samples for effective training. a MILP instance can be represented as a tuple $P=(A,b,c,I)$ , where $A,b,c$ are the coefficients of the constraints or objectives and $I$ is the set of integer variables. Our objective is to collect a training dataset $D$ . Specifically, for every training instance $P$ , we collect positive samples $S_{p}$ and negative samples $S_{n}$ specific to $P$ .

We employ the Monte-Carlo Tree Search Algorithm proposed in [28] to collect candidate backdoors. MCTS will return a set of candidate backdoors with their associate tree weights in the Branch-and-Bound tree, which are approximations of the strength of backdoors. We select the top $k$ candidate backdoors $B_{1},B_{2},\dots,B_{k}$ with the highest tree weights, and solve the MILP $P$ with these candidate backdoors to obtain their runtimes $t_{1},t_{2},\dots,t_{k}$ . When we solve a MILP with a backdoor, we set the branching priority for variables in the backdoor higher than other variables, i.e., we always branch on variables from the backdoor whenever there is one that needs to be branched on. Positive samples $S_{p}$ are chosen as $p$ shortest-runtime backdoors better than Gurobi and negative samples $S_{n}$ are chosen as $q$ longest-runtime backdoors worse than Gurobi. The approach ensures that the positive samples are backdoors with high MILP runtime performance. In contrast, the negative samples are similar to positive samples but have worse runtime performance, enabling the ML model to learn to discriminate between backdoors and non-backdoors effectively. In experiments, we set $k=50$ and $p=q=5$ . Details of data collection design and sensitive analysis are provided in Appendix [3].

4.2 Data representation and Policy network

We represent the MILP $P=(A,b,c,I)$ as a bipartite graph as in [13]. The featured bipartite graph $P^{\prime}=(G,V,C,E)$ , with graph $G$ containing variable nodes and constraint nodes. Additionally, there is an edge $(i,j)$ in edge matrix $\epsilon$ between a variable node $i$ and a constraint node $j$ if variable $i$ appears in constraint $j$ with nonzero coefficient, i.e., $A_{ji}\neq 0$ . Constraint, variable, and edge features are represented as matrices $V\in\mathbb{R}^{n\times d},C\in\mathbb{R}^{m\times c^{\prime}},E\in\mathbb{R}^{m\times n\times e}$ . The bipartite graph representation ensures the MILP encoding is invariant to variable and constraint permutation. Additionally, we can use a variety of predictive models designed for variable-sized graphs, enabling deployment on problems with varying numbers of variables and constraints. We use features proposed in [13], which include 15 variable features (e.g., variable types, coefficients, upper and lower bound, root LP related features), 4 constraint features (e.g., constant, sense), and 1 edge feature for coefficients.

We learn a policy $\pi$ represented by a Graph Attention Network (GAT) [2], which takes the bipartite graph $P^{\prime}$ as input and outputs a score vector with one score per decision variable. To increase the modeling capacity and to manipulate node interactions proposed by our architecture, we use embedding layers in the network to first adjust the size of feature embeddings to $V^{1}\in\mathbb{R}^{n\times L},C^{1}\in\mathbb{R}^{m\times L},E^{1}\in\mathbb{R}^{|\epsilon|\times L}$ accordingly. Then, GAT performs two rounds of message passing. In the first round, each constraint node in $C^{1}$ attends to its neighbors using an attention structure with $H$ attention heads to get update constraint embeddings $C^{2}$ . Similarly, each variable node in $V^{1}$ attends to its neighbors to get updated variable embeddings $V^{2}$ with another set of attention weights in the second round. Finally, the network applies a multi-layer perceptron with a sigmoid function to obtain a score between 0 and 1 for each variable. In experiments, layer size $L$ and number of attention heads $H$ are set to 64 and 8. Full details of the network architecture are provided in Appendix [3].

Table 1: Average numbers of variables, constraints, and average Gurobi solving time of 100 test instances from each problem domain. For MILP domains, #variables is in the format of # binary variables + # continuous variables.

	Small ILP Domains				Large ILP Domains				MILP Domains
Name	GISP-S	SC-S	CA-S	MIS-S	GISP-L	SC-L	CA-L	MIS-L	FC	NN
#Variables	988	1,000	750	1,250	1,316	1,000	1,000	1,500	100 + 20,000	167 + 6968
#Constraints	3,253	1,200	282	3,946	4,571	1,500	377	5,941	40,000	6520
Gurobi runtime (s)	633	171	177	218	3,363	1,195	1,213	759	46.7	9.24

4.3 Training with Contrastive Loss and Applying Learned Policy

The model is trained with a contrastive loss to score each decision variable by learning to emulate superior backdoors and avoid inferior ones. Given a set of MILP instances for training, let $D=(P^{\prime},S_{p},S_{n})$ be the set of one training data where $P^{\prime}$ is the bipartite representation of MILP instance $P$ , $S_{p}$ is the set of positive samples and $S_{n}$ is the set of negative samples associated with $P$ . With similarity measured by dot products, we use the InfoNCE [37] contrastive loss $L(P^{\prime},S_{p},S_{n})=$

{\sum_{(P^{\prime},S_{p},S_{n})\in D}\frac{-1}{|S_{p}|}\sum_{a\in S_{p}}\log\frac{\exp(a^{T}\pi(P^{\prime})/\tau)}{\sum_{a^{\prime}\in S_{n}\cup\{a\}\exp(a^{\prime T}\pi(P^{\prime})/\tau)}}}

to train the policy $\pi$ , where $\tau$ is a temperature hyperparameter set to $0.07$ in experiment following [22]. By choosing faster solving time backdoors as positive samples and those similar to the positive ones but with lower quality as negative samples, the model learns to generate embeddings closer to positive samples and further away from negative ones.

During testing, given a MILP instance, we use the same data representation introduced in section 4.2 to convert it to a bipartite graph and use the graph as the input to GAT. Then, GAT outputs a score vector with one score for each variable. We greedily select the binary variables with the highest scores as the predicted backdoors by user-defined backdoor size. For the scorer model in [10], predicting the backdoor differs from our contrastive learning model. They first sample 50 candidate backdoors with the predefined backdoor size using the random biased sampling method [7] and input each backdoor to the scorer model to get an output score. They select the candidate backdoor with the highest score as the predicted backdoor. The backdoor predicted by the scorer model contains strong randomness within the 50 candidate backdoors it samples, thus causing the predicted one to have highly variate performance results. Our contrastive learning model is deterministic based on greedy selection and can also output a backdoor with any size requested by the users. Finally, we assign the branching priority for variables in the backdoor before the start of the tree search in the MILP solver and then collect statistics of solving the problem to optimality for later evaluation.

5 Empirical Evaluation

In this section, we introduce problem domains and instance sizes for evaluation. We explain the experiment setup, including data collection, baselines, metrics, and hyperparameters, and then present the results. Our code is available at [3].

5.1 Problem Domains and Instance Generation

We evaluate on several ILP problem domains, Generalized Independent Set Problem (GISP), Set Cover (SC), Combinatorial Auction (CA), and Maximum Independent Set (MIS) that are widely used in existing studies [42, 10, 22]. For each problem domain, we generate Small (S) and Large (L) instances for it. Following [10], GISP-S and GISP-L instances are generated with 150 nodes and 175 nodes, respectively, with the node reward set to 100 and edge removal cost set to 1; In order to keep a similar evaluation procedure (e.g., Gurobi being able to solve to optimality within certain runtime) and diverse hardness of the problems, we choose the size of the three other problem domains as follows: Following [22], SC-S instances have 1,000 variables and 1,200 constraints and SC-L instances have 1,000 variables and 1,500 constraints, with density set to 0.05 for both; Following [33], CA-S instances are generated with 150 items and 750 bids and CA-L instances are generated with 200 items and 1,000 bids according to the arbitrary relations; Following [22], MIS-S and MIS-L instances are generated according to the Erdos-Renyi random graph model [9], with an average degree of 4 and 1,250 and 1,500 nodes, respectively.

To show our method can work on general MILP cases, we also evaluate on Facility Location (FC) and Neural Network Verification (NN). Following [13], FC instances are generated with 100 facilities supplying 200 customers. Following [36], NN instances are a convolutional neural network is verified on each image in the MNIST dataset, giving rise to a corresponding dataset of MILPs. Table 1 shows average numbers of variables, constraints, and average Gurobi solving time of 100 test instances from each problem domain. More details of instance generation are included in the Appendix [3].

We refrained from evaluating our methodology on heterogeneous MILP datasets like MIPLIB [14] due to the challenges in training an effective model using backdoors collected from a variety of MILP distributions. Instead, we target homogeneous settings from Distributional MIPLIB [24] because many real-world scenarios require solving a homogeneous family of problems, where instances share similar structures but differ slightly in problem size or numerical coefficients. Since MCTS backdoor collection in [28] only works with mixed-binary instances due to the ease of implementation of the tree weight for binary problems, we have limited our problem domains to MILP with only binary variables. In practice, we believe contrastive learning-based training still applies when using LP-based sampling to collect data.

5.2 Experiment Setup

Table 2: Runtime (secs) of Gurobi (grb), the scorer model (scorer), the scorer with the classifier (scorer+cls), the contrastive learning model (cl) and the contrastive learning model with the classifier (cl + cls) on GISP-S, SC-S, CA-S and MIS-S. Scorer and cl are trained on 200 instances and the classifiers are trained on another 200 instances. We report the mean, the standard deviation, 25th, 50th, and 75th percentiles, and win/tie/loss (W/T/L) counts vs. Gurobi. We report the speed improvement as a percentage inside the parentheses for the mean and median. We bold the best-performing entries.

Dataset (#vars, #cons)	Solver	Mean	Std Dev	25 pct	Median	75 pct	W/T/L vs grb
GISP-S (988, 3253)	grb	633	154	513	615	731	-
	scorer	544 (14.1%)	117	453	543 (11.7%)	611	74/0/26
	scorer + cls	578 (8.7%)	128	502	586 (4.7%)	637	43/32/25
	cl	533 (15.8%)	120	441	524 (14.8%)	612	84/0/16
	cl + cls	560 (11.5%)	135	462	555 (9.8%)	653	59/27/14
SC-S (1000, 1200)	grb	171	189	50.8	101	216	-
	scorer	244 (-42.7%)	219	67.2	148 (-46.5%)	295	27/0/73
	scorer + cls	203 (-18.7%)	195	62.1	142 (-40.6%)	256	15/57/28
	cl	149 (12.9%)	163	42.5	79.8 (20.1%)	195	77/0/23
	cl + cls	153 (10.5%)	165	49.4	79.4 (21.4%)	197	52/27/21
CA-S (750,282)	grb	177	94.6	110	151	229	-
	scorer	224 (-26.6%)	120	125	217 (-43.7%)	293	17/0/83
	scorer + cls	194 (-9.6%)	112	119	191 (-26.5%)	243	14/52/34
	cl	156 (11.9%)	83.0	88.5	146 (3.3%)	195	68/0/32
	cl + cls	170 (4.0%)	96.2	103	139 (8.0%)	227	46/28/26
MIS-S (1250, 3946)	grb	218	231	79.7	147	267	-
	scorer	236 (-8.3%)	258	89.9	156 (-6.1%)	269	36/0/64
	scorer + cls	219 (-0.5%)	237	82.3	154 (-4.8%)	264	24/46/30
	cl	184 (15.6%)	179	76.8	127 (13.6%)	214	69/0/31
	cl + cls	190 (12.8%)	197	80.2	136 (7.5%)	227	44/35/21

Table 3: Runtime (secs) of Gurobi (grb), the scorer with the classifier (scorer+cls), the contrastive learning model (cl) on FC and NN. We report the mean, the standard deviation, 25th, 50th, and 75th percentiles, and win/tie/loss (W/T/L) counts vs. Gurobi. We report the speed improvement as a percentage inside the parentheses for the mean and median. We bold the best-performing entries.

Dataset (#binvars, #cons)	Solver	Mean	Std Dev	25 pct	Median	75 pct	W/T/L vs grb
FC (100, 40000)	grb	46.7	32.9	20.6	38.0	59.7	-
	scorer+cls	44.3 (5.1%)	30.2	25.1	35.5 (6.6%)	53.1	28/65/7
	cl	39.6 (15.2%)	20.2	25.2	33.4 (12.1%)	46.9	60/0/40
NN (167, 6520)	grb	9.24	20.4	3.46	4.83	7.94	-
	scorer+cls	8.14 (11.9%)	19.5	3.16	4.26 (11.8%)	6.95	62/21/17
	cl	7.81 (15.5%)	15.6	3.12	4.13 (14.5%)	6.75	80/0/20

Data Collection We generate 200 small instances for each problem domain. We use the MCTS framework to collect backdoors of size 8 for each instance over 5 hours with 10 parallel workers. We follow [28] and choose 8, which shows promising results on finding high-quality backdoors on 164 problem domains From MILPLIB2017. We select the top 50 backdoors measured by tree weight function, solve the problems using those backdoors with MILP solver, and record the runtime. For contrastive loss, we select the 5 fastest candidate backdoors as positive samples and 5 slowest candidate backdoors as negative samples.

Baselines We compare our contrastive learning model (cl) with several baselines: Gurobi (grb), the scorer model (scorer) and the scorer with the classifier (scorer+cls). scorer is the previous method [10] that predicts the score for candidate backdoors using supervised learning and scorer+cls is scorer with cls to predict whether to use the backdoor output by the score or not. The scorer and cl models are trained with 200 instances, with 160 and 40 instances for training and validation, respectively. For cls, we generate another 200 instances, with 160 and 40 instances for training and validation, respectively. To train the classifier for scorer+cls, we use random sampling based on the LP relaxation [7] to generate random 50 candidate backdoors and use scorer to select the best one to collect runtime statistics. Then, we assign labels for each instance based on the runtime of the selected backdoor over the Gurobi runtime and use a cross-entropy loss to train the classifier following [10]. To train the classifier for cl+cls, we collect the runtime statistics from the backdoor output directly from cl model and the rest is the same as scorer+cls.

We did not compare other learning for Branch-and-Bound [41, 13] baselines because we believe their works are conjunctive of ours. Our model just predicts a small set of variables before the solving, and during the solving; it is still applicable to use other decision-making approaches. Also, their approaches required the specific implementation of solvers, which limited them to open-source solver SCIP, while our approach can be applied to the current state-of-art solver Gurobi, which makes our baseline more competitive than theirs.

Metrics To compare the performance of the methods, we test on 100 instances for each problem domain and report summary statistics of runtime to solve the problem to optimality, including mean, standard deviation, 25th percentile, median, and 75th percentile. We also report the wins, ties, and losses over Gurobi. We consider the runtime of the full pipeline, including feature extractor, inference for the GAT model, sampling or greedy selection, and solving the MILP. When cls suggests using Gurobi, we record a “tie” to give insight into model predictions even though there is a slight loss in runtime due to the ML overhead.

Table 4: Evaluating generalization to larger problem size: Runtime (secs) of standard Gurobi (grb) and the contrastive learning model (cl) on GISP-L, SC-L, CA-L, MIS-L. The model is trained on 200 small instances for each problem domain. We report the mean, the standard deviation, 25th, 50th, and 75th percentiles, and win/loss (W/L) counts vs. Gurobi. We report the speed improvement as a percentage inside the parentheses for the mean and median. We bold the best-performing entries.

Dataset (#vars, #cons)	Solver	Mean	Std Dev	25 pct	Median	75 pct	W/L vs grb
GISP-L (1316, 4571)	grb	3,363	989	2,611	3,303	4,049	-
GISP-L (1316, 4571)	cl	2,879 (14.4%)	785	2,288	2,910 (11.9%)	3,331	77/23
SC-L (1000, 1500)	grb	1,195	1,692	161	519	1,597	-
SC-L (1000, 1500)	cl	1,003 (16.1%)	1,433	143	435 (16.0%)	1,309	78/22
CA-L (1000, 377)	grb	1,213	625	881	1,093	1,353	-
CA-L (1000, 377)	cl	1,096 (9.6%)	487	767	994 (9.1%)	1,159	69/31
MIS-L (1500, 5941)	grb	759	1,356	128	263	1,023	-
MIS-L (1500, 5941)	cl	550 (27.5%)	987	127	256 (2.7%)	501	75/25

Hyperparameters We solve MILPs with single-threaded Gurobi 10.0 on four 64-core machines with Intel 2.1GHz CPUs and 264GB of memory. Training is done on an NVIDIA Tesla V100 GPU with 112GB memory. For cl model, we use the Adam optimizer [29] with a learning rate of $5\times 10^{-4}$ and a weight decay of 0.01. We use a batch size of 32 and train for 100 epochs. For scorer and cls, the learning rate is set to $5\times 10^{-3}$ and a batch size of 128 and train for 100 epochs. Due to the fact of pairwise ranking loss in the scorer, the training for scorer is extremely expensive, which takes around 10-15 hours, while the training time for cl model is less than 30 minutes. The more efficient use of training data significantly improves our cl model.

Experiments We conduct the following sets of experiments:

•

We compare cl and cl+cls against grb, scorer, scorer+cls on small ILP domains;
•

We compare cl against grb, scorer+cls on MILP domains.
•

We test the generalizability of the cl model (trained on the small instances) on large ILP domains and compare it against grb.
•

Ablation study: For GISP-S, SC-S, we test how contrastive learning and MCTS data collection separately affect the performance of backdoor predicting.

Table 5: Evaluating the role of the backdoor collection strategy: Runtime (secs) of standard Gurobi (grb), the scorer model (scorer), and the contrastive learning model (cl) with different backdoor collection methods, MCTS, and sampling. We report the mean, the standard deviation, 25th, 50th, and 75th percentiles, and win/loss (W/L) counts vs. Gurobi. We report the speed improvement as a percentage inside the parentheses for the mean and median. We bold the best-performing entries.

Dataset (#vars, #cons)	Solver	Mean	Std Dev	25 pct	Median	75 pct	W/L vs grb
GISP-S (988, 3253)	grb	633	154	513	615	731	-
	scorer + Sampling	601 (5.0%)	152	518	580 (5.7%)	681	49/51
	scorer + MCTS	544 (14.1%)	117	453	543 (11.7%)	611	74/26
	cl + Sampling	565 (10.7%)	132	455	556 (9.6%)	634	67/33
	cl + MCTS	533 (15.8%)	120	441	524 (14.8%)	612	84/16
SC-S (1000, 1200)	grb	171	189	50.8	101	216	-
	scorer + Sampling	287 (-67.8%)	358	85.4	167 (-65.3%)	359	3/97
	scorer + MCTS	244 (-42.7%)	219	67.2	148 (-46.5%)	295	27/73
	cl + Sampling	186 (-8.8%)	196	48.5	107 (-5.9%)	254	43/57
	cl + MCTS	149 (12.9%)	163	42.5	79.8 (20.1%)	195	77/23

5.3 Results

Backdoor Collection Quality For each problem domain, we gathered solve times from 50 MCTS selected candidate backdoors across 200 instances. This resulted in a dataset comprising solve times for 10,000 candidate backdoors. Figure 4 displays histograms of the normalized runtimes, which we calculated by dividing the solve time with a candidate backdoor by the original solve time for each of the 10,000 candidate backdoors in every problem domain. We observe significant improvement in MCTS data collection over the previous sampling method. Specifically, we noted that over half of the candidate backdoors outperformed the baseline solve time in the GISP-S, FC, and NN problem domains. In stark contrast, the sampling method employed in [10] failed to identify any backdoors that improved the original solve times for SC-S, CA-S, and MIS-S domains. However, our approach demonstrated a considerable portion of candidate backdoors with better performance. The improvement of candidate backdoors directly leads to a higher quality of training data, which helps train more accurate models for backdoor prediction.

ILP domains performance We present results on 100 small test instances in Table 2. cl consistently performs better than scorer and has more than 10% improvement on the average runtime over grb in all problem domains, even in those where the scorer does not work. The scorer performs well on GISP-S with 14.1% speed improvement but performs worse than grb on SC-S, CA-S, MIS-S with 67.8%, 26.6%, 8.3% increase on the average runtime. When the problem domains become more varied and less structured, the scorer stopped functioning due to issues inherent in the sampling process. However, in cl, we benefit from greedy selection, and as the experiment results show, the backdoor output by cl has high quality, leading to 77, 68, 69 wins over Gurobi on SC-S, CA-S, and MIS-S, respectively. We can see cls indeed helps the scorer in most domains to improve the average performance, but cls does not fit well with our cl, dropping the performance range from 2.4% to 7.9%. Consider a real-world problem domain with a limited number of instances, using all the data to train cl would be a better and more elegant method.

MILP domains performance Following [10], we present results on 100 test instances from FC and NN in Table 3 to show our method can work in general MILP settings. cl still performs better than grb and scorer+cls in both domains, with around 15% improvement over grb and 10% improvement over scorer+cls on average runtime. Although scorer+cls have fewer losses compared to grb by transforming them into ties, cl have great improvement on win rate over scorer+cls, from 28 to 60 in FC and from 62 to 80 in NN.

Generalization performance We test cl trained on 100 small instances on 100 large instances of the same problem domain. This experiment shows that when the instances are large and MCTS becomes too inefficient to collect data, we can still use cl trained on smaller instances. As the results shown in Table 4, cl still consistently outperforms grb. On GISP-L and CA-L, cl achieves only a slight percentage drop on mean over grb compared to the small instances; On SC-L and MIS-L even achieve better performance. The results show that cl also performs well on larger and harder instances, as illustrated by Figure 3. The figure compares the runtime of grb and cl in terms of speed improvement and finish rate. On the left part, the dots under the red line are mostly easy instances, and our model almost gains 100% win for the hard instances. On the right side, the orange line is consistently above the blue line, except for the starting line, which is an easy instance. The results clearly show cl has great generalizability to the problem with the same domain and opens the potential for learning backdoors on much larger real-world problem domains.

Ablation Study We evaluate how contrastive learning and the MCTS backdoor collection contribute to the model performance. We collect another set of backdoor data on the same instances using LP-based sampling, the backdoor collection method used by [10]. We compare the performance of scorer + Sampling, scorer + MCTS, cl +Sampling, and cl + MCTS on GISP-S and SC-S and Table 5 reports the runtime statistics. Overall, cl + MCTS performs best on GISP-S and SC-S. The results show performance improvement for both scorer and cl when using MCTS, demonstrating the benefit of replacing random sampling with MCTS for data collection. The results also show that the cl consistently performs better than the scorer with the same data collection method, showing the benefit of contrastive learning over the previous learning-to-rank method. Additionally, the performance improvement in cl + Sampling shows the potential of our contrastive learning when trained with LP-based sampling backdoors to be applied to MILP domains with non-binary integers.

6 Conclusion

This paper introduces a contrastive learning framework to predict effective backdoors for MILPs. We utilize a Monte-Carlo tree search method to collect training data of better quality than previous methods. Empirical results on several common problem domains indicated that the contrastive learning model significantly outperforms existing baselines. It also achieved good generalization performance on larger-size instances. Beyond finding better backdoors, our model requires less time and resources for training and reduces randomness compared to the previous scorer model. The limitation is that the model requires offline data collection and training for each problem distribution to reach good accuracy. For future work, we want to develop some theory related to backdoors to understand the theoretical properties of backdoors. Moreover, given the recent successes of contrastive learning, it can be applied to other MILP learning problems, potentially becoming a foundational method for ML in combinatorial optimization.

Acnowledgement

The research was supported by the National Science Foundation (NSF) under grant number 2112533.

References

Balas and Ho [1980] E. Balas and A. Ho. Set covering algorithms using cutting planes, heuristics, and subgradient optimization: a computational study. In Combinatorial Optimization, pages 37–60. Springer, 1980.
Brody et al. [2021] S. Brody, U. Alon, and E. Yahav. How attentive are graph attention networks? arXiv preprint arXiv:2105.14491, 2021.
CAI et al. [2024] J. CAI, T. HUANG, and B. DILKINA. Learning Backdoors for Mixed Integer Linear Programs with Contrastive Learning, July 2024. URL https://github.com/caidog1129/Backdoor_cl.
Chmiela et al. [2021] A. Chmiela, E. Khalil, A. Gleixner, A. Lodi, and S. Pokutta. Learning to schedule heuristics in branch and bound. Advances in Neural Information Processing Systems, 34:24235–24246, 2021.
Cornuéjols et al. [1991] G. Cornuéjols, R. Sridharan, and J.-M. Thizy. A comparison of heuristics and relaxations for the capacitated plant location problem. European journal of operational research, 50(3):280–297, 1991.
Dilkina and Gomes [2010] B. Dilkina and C. P. Gomes. Solving connected subgraph problems in wildlife conservation. In CPAIOR, volume 6140, pages 102–116. Springer, 2010.
Dilkina et al. [2009] B. Dilkina, C. P. Gomes, Y. Malitsky, A. Sabharwal, and M. Sellmann. Backdoors to combinatorial optimization: Feasibility and optimality. In Integration of AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems: 6th International Conference, CPAIOR 2009 Pittsburgh, PA, USA, May 27-31, 2009 Proceedings 6, pages 56–70. Springer, 2009.
Duan et al. [2022] H. Duan, P. Vaezipoor, M. B. Paulus, Y. Ruan, and C. Maddison. Augment with care: Contrastive learning for combinatorial problems. In International Conference on Machine Learning, pages 5627–5642. PMLR, 2022.
Erdős et al. [1960] P. Erdős, A. Rényi, et al. On the evolution of random graphs. Publ. math. inst. hung. acad. sci, 5(1):17–60, 1960.
Ferber et al. [2021] A. Ferber, J. Song, B. Dilkina, and Y. Yue. Learning pseudo-backdoors for mixed integer programs, 2021.
Fischetti and Monaci [2011] M. Fischetti and M. Monaci. Backdoor branching. In International Conference on Integer Programming and Combinatorial Optimization, pages 183–191. Springer, 2011.
Floudas and Lin [2005] C. A. Floudas and X. Lin. Mixed integer linear programming in process scheduling: Modeling, algorithms, and applications. Annals of Operations Research, 139:131–162, 2005.
Gasse et al. [2019] M. Gasse, D. Chételat, N. Ferroni, L. Charlin, and A. Lodi. Exact combinatorial optimization with graph convolutional neural networks. Advances in neural information processing systems, 32, 2019.
Gleixner et al. [2021] A. Gleixner, G. Hendel, G. Gamrath, T. Achterberg, M. Bastubbe, T. Berthold, P. Christophel, K. Jarck, T. Koch, J. Linderoth, et al. Miplib 2017: data-driven compilation of the 6th mixed-integer programming library. Mathematical Programming Computation, 13(3):443–490, 2021.
Gupta et al. [2020] P. Gupta, M. Gasse, E. Khalil, P. Mudigonda, A. Lodi, and Y. Bengio. Hybrid models for learning to branch. Advances in neural information processing systems, 33:18087–18097, 2020.
Gurobi Optimization, LLC [2023] Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2023. URL https://www.gurobi.com.
Han et al. [2022] Q. Han, L. Yang, Q. Chen, et al. A gnn-guided predict-and-search framework for mixed-integer linear programming. In ICLR, 2022.
He et al. [2014] H. He, H. Daume III, and J. M. Eisner. Learning to search in branch and bound algorithms. Advances in neural information processing systems, 27, 2014.
Hjelm et al. [2018] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
Hochbaum and Pathria [1997] D. S. Hochbaum and A. Pathria. Forest harvesting and minimum cuts: a new approach to handling spatial constraints. Forest Science, 43(4):544–554, 1997.
Huang and Dilkina [2020] T. Huang and B. Dilkina. Enhancing seismic resilience of water pipe networks. In Proceedings of the 3rd ACM SIGCAS Conference on Computing and Sustainable Societies, pages 44–52, 2020.
Huang et al. [2023] T. Huang, A. M. Ferber, Y. Tian, B. Dilkina, and B. Steiner. Searching large neighborhoods for integer linear programs with contrastive learning. In International Conference on Machine Learning, pages 13869–13890. PMLR, 2023.
Huang et al. [2024a] T. Huang, A. M. Ferber, A. Zharmagambetov, Y. Tian, and B. Dilkina. Contrastive predict-and-search for mixed integer linear programs. In International Conference on Machine Learning. PMLR, 2024a.
Huang et al. [2024b] W. Huang, T. Huang, A. M. Ferber, and B. Dilkina. Distributional MIPLIB: a multi-domain library for advancing ml-guided milp methods. arXiv preprint arXiv:2406.06954, 2024b.
Jaiswal et al. [2020] A. Jaiswal, A. R. Babu, M. Z. Zadeh, D. Banerjee, and F. Makedon. A survey on contrastive self-supervised learning. Technologies, 9(1):2, 2020.
Khalil et al. [2016] E. Khalil, P. Le Bodic, L. Song, G. Nemhauser, and B. Dilkina. Learning to branch in mixed integer programming. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
Khalil et al. [2017] E. B. Khalil, B. Dilkina, G. L. Nemhauser, S. Ahmed, and Y. Shao. Learning to run heuristics in tree search. In Ijcai, pages 659–666, 2017.
Khalil et al. [2022] E. B. Khalil, P. Vaezipoor, and B. Dilkina. Finding backdoors to integer programs: A monte carlo tree search framework. Proceedings of the AAAI Conference on Artificial Intelligence, 36(4):3786–3795, June 2022. ISSN 2159-5399. 10.1609/aaai.v36i4.20293. URL http://dx.doi.org/10.1609/aaai.v36i4.20293.
Kingma and Ba [2014] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Kottler et al. [2008] S. Kottler, M. Kaufmann, and C. Sinz. Computation of renameable horn backdoors. In Theory and Applications of Satisfiability Testing–SAT 2008: 11th International Conference, SAT 2008, Guangzhou, China, May 12-15, 2008. Proceedings 11, pages 154–160. Springer, 2008.
Labassi et al. [2022] A. G. Labassi, D. Chételat, and A. Lodi. Learning to compare nodes in branch and bound with graph neural networks. Advances in Neural Information Processing Systems, 35:32000–32010, 2022.
Land and Doig [2010] A. H. Land and A. G. Doig. An automatic method for solving discrete programming problems. Springer, 2010.
Leyton-Brown et al. [2000] K. Leyton-Brown, M. Pearson, and Y. Shoham. Towards a universal test suite for combinatorial auction algorithms. In Proceedings of the 2nd ACM conference on Electronic commerce, pages 66–76, 2000.
Logeswaran and Lee [2018] L. Logeswaran and H. Lee. An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893, 2018.
Mulamba et al. [2020] M. Mulamba, J. Mandi, M. Diligenti, M. Lombardi, V. Bucarey, and T. Guns. Contrastive losses and solution caching for predict-and-optimize. arXiv preprint arXiv:2011.05354, 2020.
Nair et al. [2020] V. Nair, S. Bartunov, F. Gimeno, et al. Solving mixed integer programs using neural networks. arXiv preprint arXiv:2012.13349, 2020.
Oord et al. [2018] A. v. d. Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
Paris et al. [2006] L. Paris, R. Ostrowski, P. Siegel, and L. Sais. Computing horn strong backdoor sets thanks to local search. In 2006 18th IEEE International Conference on Tools with Artificial Intelligence (ICTAI’06), pages 139–143. IEEE, 2006.
Paulus et al. [2022] M. B. Paulus, G. Zarpellon, A. Krause, L. Charlin, and C. Maddison. Learning to cut by looking ahead: Cutting plane selection via imitation learning. In International conference on machine learning, pages 17584–17600. PMLR, 2022.
Pohl [1970] I. Pohl. Heuristic search viewed as path finding in a graph. Artificial intelligence, 1(3-4):193–204, 1970.
Song et al. [2018] J. Song, R. Lanka, A. Zhao, A. Bhatnagar, Y. Yue, and M. Ono. Learning to search via retrospective imitation. arXiv preprint arXiv:1804.00846, 2018.
Song et al. [2020] J. Song, Y. Yue, B. Dilkina, et al. A general large neighborhood search framework for solving integer linear programs. Advances in Neural Information Processing Systems, 33:20012–20023, 2020.
Sonnerat et al. [2021] N. Sonnerat, P. Wang, I. Ktena, S. Bartunov, and V. Nair. Learning a large neighborhood search algorithm for mixed integer programs. arXiv preprint arXiv:2107.10201, 2021.
Tang et al. [2020] Y. Tang, S. Agrawal, and Y. Faenza. Reinforcement learning for integer programming: Learning to cut. In International conference on machine learning, pages 9367–9376. PMLR, 2020.
Tarjan and Trojanowski [1977] R. E. Tarjan and A. E. Trojanowski. Finding a maximum independent set. SIAM Journal on Computing, 6(3):537–546, 1977.
Veličković et al. [2017] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
Williams et al. [2003] R. Williams, C. P. Gomes, and B. Selman. Backdoors to typical case complexity. In IJCAI, volume 3, pages 1173–1178, 2003.
You et al. [2020] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen. Graph contrastive learning with augmentations. Advances in neural information processing systems, 33:5812–5823, 2020.
Zhang et al. [2023] J. Zhang, C. Liu, X. Li, H.-L. Zhen, M. Yuan, Y. Li, and J. Yan. A survey for solving mixed integer programming via machine learning. Neurocomputing, 519:205–217, 2023.

Appendix A Network Architecture

We give full details of the GAT architecture described in Section 3.2. The policy takes the bipartite graph $P^{\prime}=(G,V,C,E)$ as input and output a score vector with one score per decision variable. We use 2-layer MLPs with 64 hidden units per layer and ReLU as the activation function to map each node feature and edge feature to $\mathbb{R}^{L}$ where $L=64$ .

Let $\boldsymbol{v}_{j},\boldsymbol{c}_{i},\boldsymbol{e}_{i,j}\in\mathbb{R}^{L}$ be the embeddings of the $i$ -th variable, $j$ -th constraint and the edge connecting them output by the embedding layers $V^{1},C^{1},E^{1}$ . We perform two rounds of message passing through the GAT. In the first round, each constraint node $\boldsymbol{c}_{i}$ attends to its neighbors $\mathcal{N}_{i}$ using an attention structure with $H=8$ attention heads:

\displaystyle\boldsymbol{c}^{\prime}_{i}=\frac{1}{H}\sum_{i=1}^{H}\left(\alpha^{(h)}_{ii,1}\boldsymbol{\theta}^{(h)}_{c,1}\boldsymbol{c}_{i}+\sum_{j\in\mathcal{N}_{i}}\alpha^{(h)}_{ij,1}\boldsymbol{\theta}^{(h)}_{v,1}\boldsymbol{v}_{j}\right)

where $\boldsymbol{\theta}^{(h)}_{c,1}\in\mathbb{R}^{L\times L}$ and $\boldsymbol{\theta}^{(h)}_{v,1}\in\mathbb{R}^{L\times L}$ are learnable weights. The updated constraints embeddings $\boldsymbol{c}^{\prime}_{i}$ in updated embedding layer $C^{2}$ are averaged across $H$ attention heads using attention weights

\displaystyle\alpha^{(h)}_{ij,1}=\frac{\exp(\boldsymbol{w}^{T}_{1}\rho([\boldsymbol{\theta}^{(h)}_{c,1}\boldsymbol{c}_{i},\boldsymbol{\theta}^{(h)}_{v,1}\boldsymbol{v}_{j},\boldsymbol{\theta}^{(h)}_{e,1}\boldsymbol{e}_{i,j}]))}{\sum_{k\in\mathcal{N}_{i}}\exp(\boldsymbol{w}^{T}_{1}\rho([\boldsymbol{\theta}^{(h)}_{c,1}\boldsymbol{c}_{i},\boldsymbol{\theta}^{(h)}_{v,1}\boldsymbol{v}_{k},\boldsymbol{\theta}^{(h)}_{e,1}\boldsymbol{e}_{i,k}]))}

where the attention coefficients $\boldsymbol{w}_{1}\in\mathbb{R}^{3L}$ and $\boldsymbol{\theta}^{(h)}_{e,1}\in\mathbb{R}^{L\times L}$ are both learnable weights and $\rho(\cdot)$ refers to the LeakyReLU activation function with negative slope $0.2$ . In the second round, similarly, each variable node attends to its neighbors to get updated variable node embeddings

\displaystyle\boldsymbol{v}^{\prime}_{j}=\frac{1}{H}\sum_{i=1}^{H}\left(\alpha^{(h)}_{jj,2}\boldsymbol{\theta}^{(h)}_{c,2}\boldsymbol{c}^{\prime}_{i}+\sum_{j\in\mathcal{N}_{i}}\alpha^{(h)}_{ji,2}\boldsymbol{\theta}^{(h)}_{v,2}\boldsymbol{v}_{j}\right)

with attention weights

\displaystyle\alpha^{(h)}_{ji,2}=\frac{\exp(\boldsymbol{w}^{T}_{2}\rho([\boldsymbol{\theta}^{(h)}_{c,2}\boldsymbol{c}^{\prime}_{i},\boldsymbol{\theta}^{(h)}_{v,2}\boldsymbol{v}_{j},\boldsymbol{\theta}^{(h)}_{e,2}\boldsymbol{e}_{i,j}]))}{\sum_{k\in\mathcal{N}_{i}}\exp(\boldsymbol{w}^{T}_{2}\rho([\boldsymbol{\theta}^{(h)}_{c,2}\boldsymbol{c}^{\prime}_{i},\boldsymbol{\theta}^{(h)}_{v,2}\boldsymbol{v}_{j},\boldsymbol{\theta}^{(h)}_{e,2}\boldsymbol{e}_{i,k}]))}

where $\boldsymbol{w}_{2}\in\mathbb{R}^{3L}$ and $\boldsymbol{\theta}^{(h)}_{c,2},\boldsymbol{\theta}^{(h)}_{v,2},\boldsymbol{\theta}^{(h)}_{e,2}\in\mathbb{R}^{L\times L}$ are learnable weights. After the two rounds of message passing, the final representations of variables $\boldsymbol{v}^{\prime}$ in updated embedding layer $V^{2}$ are passed through a 2-layer MLP with 64 hidden units per layer to obtain a scalar value for each variable. Finally, we apply the sigmoid function to get a score between 0 and 1.

Appendix B Additional Details of Instance Generation

We present the MILP formulations for the Generalized Independent Set Problem (GISP), Set Cover Problem (SC), Combinatorial Auction (CA), Maximal Independent Set (MIS), Facility Location (FC), and Neural Network Verification (NN).

B.1 GISP

In a GISP instance, we are given a graph $G$ with vertex set $V$ and edge set where $E_{1}$ is a set of non-removable edges and $E_{2}$ is a set of removable edges. A revenue $w_{i}>0$ is associated with each vertex $i\in V$ , and a cost $c_{ij}>0$ is associated with each removable edge $(i,j)\in E_{2}$ . The goal is to find an independent set, i.e., a set of vertices such that no two vertices in the set are adjacent, that maximizes the difference between the total revenue associated with the vertices in the set and the total cost associated with the removal of edges with both endpoints in the set:

	$\displaystyle\max\sum_{i\in V}w_{i}x_{i}-\sum_{(i,j)\in E_{2}}c_{ij}y_{ij}$
	$\displaystyle\text{s.t. }x_{i}+x_{j}\leq 1,\forall(i,j)\in E_{1},$
	$\displaystyle x_{i}+x_{j}-y_{ij}\leq 1,\forall(i,j)\in E_{2},$
	$\displaystyle x_{i}\in\{0,1\},\forall i\in V,$
	$\displaystyle y_{ij}\in\{0,1\},\forall(i,j)\in E_{2}.$

B.2 SC

In a SC instance, we are given $m$ elements and a collection $S$ of $n$ sets whose union is the set of all elements. The goal is to select a minimum number of sets from $S$ such that the union of the selected set is still the set of all elements:

	$\displaystyle\max-\sum_{s\in S}x_{s}$
	$\displaystyle\text{s.t.}\sum_{s\in S:i\in s}x_{s}\geq 1,\forall i\in[m],$
	$\displaystyle x_{s}\in\{0,1\},\forall s\in S.$

B.3 CA

In a CA instance, we are given $n$ bids $\{(B_{i},p_{i}):i\in[n]\}$ for $m$ items, where $B_{i}$ is a subset of items and $p_{i}$ is its associated bidding price. The objective is to allocate items to bids such that the total revenue is maximized:

	$\displaystyle\max\sum_{i\in[n]}p_{i}x_{i}$
	$\displaystyle\text{s.t.}\sum_{i:j\in B_{i}}x_{i}\leq 1,\forall j\in[m],$
	$\displaystyle x_{i}\in\{0,1\},\forall i\in[n].$

B.4 MIS

In a MIS instance, we are given an undirected graph $G=(V,E)$ . The goal is to select the largest subset of nodes such that no two nodes in the subsets are connected by an edge in $G$ :

	$\displaystyle\max\sum_{v\in V}x_{v}$
	$\displaystyle\text{s.t.}x_{u}+x_{v}\leq 1,\forall(u,v)\in E,$
	$\displaystyle x_{v}\in\{0,1\},\forall v\in V.$

B.5 FC

In a FC instance, we are given a set of potential facility locations $I$ and a set of customers $J$ . Each facility $i\in I$ can serve multiple customers and has a fixed cost $f_{i}$ associated with opening it. Each customer $j\in J$ must be served by exactly one open facility, and the cost of serving customer $j$ from facility $i$ is denoted as $c_{ij}$ . The goal is to minimize the total cost of opening facilities and serving all customers:

	$\displaystyle\min\sum_{i\in I}f_{i}y_{i}+\sum_{i\in I}\sum_{j\in J}c_{ij}x_{ij}$
	$\displaystyle\text{s.t.}\sum_{i\in I}x_{ij}=1,\quad\forall j\in J,$
	$\displaystyle x_{ij}\leq y_{i},\quad\forall i\in I,j\in J,$
	$\displaystyle\sum_{j\in J}D_{j}x_{ij}\leq S_{i}y_{i},\quad\forall i\in I,$
	$\displaystyle y_{i}\in\{0,1\},\quad\forall i\in I,$
	$\displaystyle x_{ij}\in\{0,1\},\quad\forall i\in I,j\in J.$

B.6 NN

In a NN instance, we are given a neural network represented by a series of layers with nodes interconnected by weights. The goal is to verify whether, for all possible inputs within a specified range, the network’s output adheres to certain safety or correctness criteria. The problem is typically formulated as determining if there exists any input that leads to an undesired or unsafe output, which can be expressed as:

	$\displaystyle\max\{z_{k}\}-c_{k}\quad$
	$\displaystyle\text{s.t.}\quad x_{i+1}=f(W_{i}x_{i}+b_{i}),\quad\forall i,$
	$\displaystyle a\leq x_{0}\leq b,$
	$\displaystyle z_{k}\text{ must satisfy safety criteria},$
	$\displaystyle x_{i},z_{k}\in\mathbb{R}^{n}.$

Appendix C Additional Details on Hyperparameter Tuning

For scorer and scorer+cls, we use all the hyperparameters provided in Ferber et.al.’s code in our experiments.

We limit the number of backdoor samples $k$ to 50 for each instance during data collection due to the inefficiency of MCTS, thus giving us a restricted range of $p$ and $q$ . We ran several experiments with minor, reasonable variations of $p$ and $q$ $(p=5,q=15)(p=10,q=10)$ during training, and we did not observe much difference in the performance of the resulting models. For $L$ and $H$ , we are following previous work to avoid complicate ablation studies.

In our experiments, we greedily taking the $q$ worst candidate backdoors as negative samples. Other ways of determining negative samples have been experimented with, such as choosing the candidate backdoors with highest number of common variables but have worse performance or randomly sampling 5 backdoors from the 15 worst samples. The current strategy outperforms the others.

In Table 6, we summarize all the hyperparameters with their notations and values used in our experiments.

Table 6: Hyperparameters with their notations and values used.

Hyperparameter	Notation	Value
Number of candidate backdoors for each instance	$k$	50
Number of positive samples	$p$	5
Number of negative samples	$q$	5
Feature embedding dimension	$L$	64
Number of attention heads in the GAT	$H$	8
Temperature parameter in the contrastive loss	$\tau$	0.07
Learning rate		$5\times 10^{-4}$
Weight decay		0.01
Batch size		32
Number of training epochs		100

Appendix D Additional Figures

Figure 1 presents a runtime comparison between the collected backdoors and the standard Gurobi runtime for GISP-S, SC-S, CA-S, MIS-S, FC, and NN. The histograms provide a visual representation, highlighting the instances where the best backdoor among the 50 collected outperforms the standard Gurobi runtime. Notably, GISP-S and FC exhibits the highest backdoor quality, followed by SC-S and NN, where the backdoor quality is considered good. However, for CA-S and MIS-S, the backdoor quality is less favorable, with a majority of backdoors unable to surpass the standard Gurobi runtime.

Figure 2 compares the runtime of grb and cl in terms of speed improvement and finish rate on small instances. On the left part, the dots under the red line are mostly easy instances, and our model almost gains most win for the hard instances. On the right side, the majority of the orange line is above the blue line. The results clearly show cl is better than grb in terms of running time.