
Parallelized Conflict Graph Cut Generation

Yongzheng Dai and Chen Chen
Abstract

A conflict graph represents logical relations between binary variables, and effective use of the graph can significantly accelerate branch-and-cut solvers for mixed-integer programming (MIP). In this paper we develop efficient parallel conflict graph management: conflict detection; maximal clique generation; clique extension; and clique merging. We leverage parallel computing in order to intensify computational effort on the conflict graph, thereby generating a much larger pool of cutting planes than what can be practically achieved in serial. Computational experiments demonstrate that the expanded pool of cuts enabled by parallel computing leads to substantial reductions in total MIP solve time, especially for more challenging cases.

1 Introduction

Various progress and benchmarking reports from the literature [9, 7, 3, 22] indicate the increasing importance of parallelization as well as preprocessing subroutines in solvers for mixed-integer programming (MIP). We build upon this progress by parallelizing conflict graph (CG) management, a key subroutine in branch-and-cut solvers [19, 5, 1, 10] that begins at the preprocessing stage and is subsequently deployed throughout the search tree (see, e.g., [34]). This work focuses on modest levels of parallelism, say fewer than 100 cores with shared memory, typical of personal computing setups now and in the near future; in contrast, work on massively parallel MIP (e.g. [29, 13, 31, 23, 32, 28]) addresses certain issues associated with distributed computing, such as higher communication costs. Moreover, our work deals with general-purpose MIP; for instance, certain stochastic optimization problems have special problem structures amenable to specially tailored decomposition-based distributed schemes (see e.g. [27, 30, 25]).

Perhaps the most closely related work on parallel MIP is by Gleixner, Gottwald, and Hoen [14], in which a host of preprocessing techniques are parallelized. Using 32 threads, their solver PaPILO is reported to reduce presolve time by almost 50% in shifted geometric mean, with 5x speedup on preprocessing-intensive instances. Our method attains over 80% reduction using 64 threads (see Figure 2); we emphasize, however, that a direct comparison is misleading, as the two works are complementary: they parallelize different procedures. Likewise, our work complements other efforts such as parallel LP solvers (e.g. [21, 20, 4, 18]) and concurrent solves (see e.g. [22]). Our approach is further differentiated from the literature in that we do not solely accelerate existing serial algorithms; indeed, the small fraction of total runtime spent in typical CG management suggests rather modest potential from such acceleration, due to Amdahl's law [17]. Instead, we modify the serial CG procedures of Brito and Santos [10] to generate more cuts. We observe empirically that this more intensive cut management scheme is ineffective in a serial implementation, but attains substantial overall time speedups when executed in parallel.

2 Parallel Conflict Graph Management

The serial algorithm development in this section follows predominantly the recent work of Brito and Santos [10] on CG management, which has been implemented in the COIN-OR branch-and-cut solver. In the parallel computing discussions here, we assume $k$ threads with shared memory among cores. Moreover, the theoretical results herein hold irrespective of the particular PRAM configuration (EREW, CRCW, etc.) due to the lack of conflicting transactions in our algorithms.

Consider a mixed integer program (MIP) in the following generic form:

\[
\begin{aligned}
\min\quad & c^T x && (1)\\
\text{s.t.}\quad & Ax \circ b\\
& \ell \leq x \leq u\\
& x_j \in \mathbb{Z} \ \text{ for all } j \in \mathcal{I}
\end{aligned}
\]

with parameters $A\in\mathbb{R}^{m\times n}$, $c\in\mathbb{R}^{n}$, $b\in\mathbb{R}^{m}$, $\ell\in(\mathbb{R}\cup\{-\infty\})^{n}$, and $u\in(\mathbb{R}\cup\{\infty\})^{n}$; variables $x\in\mathbb{R}^{n}$ with $x_j\in\mathbb{Z}$ for $j\in\mathcal{I}\subseteq\mathcal{N}=\{1,\dots,n\}$; and constraints with relations $\circ_i\in\{=,\leq,\geq\}$ for each row $i\in\mathcal{M}=\{1,\dots,m\}$. Furthermore, let $\mathcal{B}:=\{j\in\mathcal{I}\ |\ 0\leq x_j\leq 1\}$ be the set of indices for all binary variables.

A conflict graph over $\mathcal{B}$ has one node for each binary variable $x_j$, representing the assignment $x_j=1$, and another node for the complement variable $\bar{x}_j := 1-x_j$, representing $\bar{x}_j=1$. An edge between two nodes represents a conflict, i.e., an infeasible assignment; for instance, one can automatically place an edge between each variable and its complement. Conflicts must otherwise be inferred or detected from the problem formulation or during the branch-and-bound procedure. In this paper we consider two types of constraints from which conflicts can be extracted: set packing constraints and conflicting knapsack constraints. A set packing constraint is defined as

\[
\sum_{j\in\mathcal{S}} x_j \leq 1, \qquad (2)
\]

for some $\mathcal{S}\subseteq\mathcal{B}$. Since each variable in $\mathcal{S}$ has a conflict with all others, $\mathcal{S}$ forms a clique in the conflict graph.

A knapsack constraint is defined as

\[
\sum_{j\in\mathcal{B}} a_j x_j \leq b \qquad (3)
\]

with $a_j \geq 0$ for $j\in\mathcal{B}$. Suppose WLOG that $a_{(1)}, a_{(2)}$ are the two largest elements among the coefficients $(a_j)_{j\in\mathcal{B}}$. If $a_{(1)} + a_{(2)} > b$, we call this a conflicting knapsack constraint, as a CG clique can be generated from $x_{(1)}, x_{(2)}$ (and possibly other variables); in the absence of this condition, no conflicts can be inferred from the knapsack.
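As a small illustration, the following Julia sketch (our own helper, not from [10]) tests whether a knapsack constraint is conflicting by examining its two largest coefficients:

```julia
# Illustration (hypothetical helper): a knapsack constraint
# sum(a[j] * x[j]) <= b over binary x is "conflicting" precisely when its
# two largest coefficients sum to more than b.
function is_conflicting_knapsack(a::Vector{Float64}, b::Float64)
    length(a) < 2 && return false
    s = sort(a)                       # nondecreasing coefficients
    return s[end-1] + s[end] > b      # a_(1) + a_(2) > b ?
end

is_conflicting_knapsack([3.0, 2.0, 1.0], 4.0)   # true, since 3 + 2 > 4
```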

Set packing and knapsack constraints can, in turn, be inferred from the general MIP formulation. For a given mixed-integer constraint $A_{i\cdot}x \circ_i b_i$, if $\circ_i$ is $\leq$, we extract a pure binary constraint (PBC), denoted $\mathrm{PBC}(i)$:

\[
\sum_{j\in\mathcal{B}:A_{ij}>0} A_{ij}x_j - \sum_{j\in\mathcal{B}:A_{ij}<0} A_{ij}\bar{x}_j \;\leq\; b_i - \inf\Big\{\sum_{j\notin\mathcal{B}} A_{ij}x_j\Big\} - \sum_{j\in\mathcal{B}:A_{ij}<0} A_{ij}. \qquad (4)
\]

Note that if $\circ_i$ is $\geq$, we can rewrite the constraint as $-A_{i\cdot}x \leq -b_i$, and if $\circ_i$ is $=$, we can split the constraint into the two constraints $A_{i\cdot}x \leq b_i$ and $-A_{i\cdot}x \leq -b_i$. So we consider only constraints of the form $A_{i\cdot}x \leq b_i$ in the remainder of this section.
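As a small worked illustration (our own example, with hypothetical data): consider the constraint $3x_1 - 2x_2 + y \leq 2$ with binary $x_1, x_2$ and continuous $y\in[0,1]$. Here $\inf\{y\} = 0$ and $\sum_{j\in\mathcal{B}:A_{ij}<0}A_{ij} = -2$, so (4) yields the PBC $3x_1 + 2\bar{x}_2 \leq 2 - 0 - (-2) = 4$. Since the two largest coefficients satisfy $3 + 2 = 5 > 4$, this PBC is a conflicting knapsack constraint, giving a conflict edge between $x_1$ and $\bar{x}_2$ (equivalently, $x_1 = 1$ forces $x_2 = 1$).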

The remainder of this section describes both serial and parallel algorithms for certain aspects of CG management, presented in their order of execution in branch-and-bound code. The algorithms are analyzed in terms of worst-case and average-case complexity.

2.1 Serial Pure Binary Constraint Extraction

At the preprocessing stage, given a $\mathrm{MIP}=(\mathcal{M},\mathcal{N},\mathcal{I},A,b,c,\circ,\ell,u)$, we perform a one-round simple presolve to detect set packing constraints and conflicting knapsack constraints for subsequent conflict graph cut generation, as described in Alg 1. This applies well-studied techniques from the CG literature (see e.g. Achterberg et al. [2]).

All techniques mentioned in line 2 can be found in [2, Sec 3.1 and 3.2]. For $A_{i\cdot}x \geq b_i$, we rewrite it as $-A_{i\cdot}x \leq -b_i$. For $A_{i\cdot}x = b_i$, we treat it as the two inequalities $A_{i\cdot}x \leq b_i$ and $A_{i\cdot}x \geq b_i$.

In line 4 of Alg 1, if $A_{i\cdot}x \leq b_i$ is a set packing constraint, we call it an original set packing (OSP) constraint and collect all such constraints in the set $\mathcal{S}_{osp}$ (line 6).

In line 10, for $x_j$ with $j$ such that $A_{ij}\neq 0$, the lower or upper bound is tightened as follows: if $A_{ij} > 0$, then $u_j := \min\{u_j, b_i/A_{ij}\}$; otherwise, $\ell_j := \max\{\ell_j, b_i/A_{ij}\}$.

In line 11 of Alg 1, if $\mathrm{PBC}(i)$ is a set packing constraint, we call it an inferred set packing (ISP) constraint and collect all such constraints in the set $\mathcal{S}_{isp}$.

In line 14 we identify all (i.e., both original and PBC-inferred) conflicting knapsack (CK) constraints, collecting them in the constraint set $\mathcal{S}_{ck}$ (line 15).

Input: $\mathrm{MIP}=(\mathcal{M},\mathcal{N},\mathcal{I},A,b,c,\circ,\ell,u)$
Output: MIP after applying Simple Presolve, $\mathcal{S}_{osp}$, $\mathcal{S}_{isp}$, $\mathcal{S}_{ck}$
1 Set $\mathcal{S}_{osp}$, $\mathcal{S}_{isp}$, $\mathcal{S}_{ck} := \emptyset$;
2 Remove empty constraints and singletons from MIP and conduct one round of single-row bound strengthening on MIP;
3 for $i\in\mathcal{M}$ do
4   if $A_{i\cdot}x\leq b_i$ is a set packing constraint then
5     Remove $i$ from $\mathcal{M}$;
6     Set $\mathcal{S}_{osp} := \mathcal{S}_{osp}\cup\{A_{i\cdot}x\leq b_i\}$;
7   else
8     Rewrite $A_{i\cdot}x\leq b_i$ as $\mathrm{PBC}(i)$;
9     if $\mathrm{PBC}(i)$ is a singleton then
10      Update $u_j$ and $\ell_j$ for $j$ such that $A_{ij}\neq 0$ and $j\in\mathcal{B}$;
11    else if $\mathrm{PBC}(i)$ is a set packing constraint then
12      Set $\mathcal{S}_{isp} := \mathcal{S}_{isp}\cup\{\mathrm{PBC}(i)\}$;
13    else
14      if $\mathrm{PBC}(i)$ is a conflicting knapsack constraint then
15        Set $\mathcal{S}_{ck} := \mathcal{S}_{ck}\cup\{\mathrm{PBC}(i)\}$;
16      end if
17    end if
18  end if
19 end for
return MIP after applying Simple Presolve, $\mathcal{S}_{osp}$, $\mathcal{S}_{isp}$, $\mathcal{S}_{ck}$.
Algorithm 1 One-round Simple Presolve

The complexity of Alg 1 is $O(\mathrm{NNZ})$, where $\mathrm{NNZ}$ is the number of nonzero elements in $A$. We implement this procedure solely in serial, since it executes very quickly in practice.

2.2 Parallel Maximal Clique Detection from Knapsack Constraint

Following PBC generation, we proceed to detect maximal cliques from conflicting knapsack constraints with the Clique Detection method of Brito and Santos [10, Algorithm 1]. We modify the method, described as (serial) Alg 2: instead of returning all detected maximal cliques in a single output set, we separately return the first detected maximal clique (see line 7 of Alg 2) in $\mathcal{S}_{org}$ (the original maximal clique) and all other cliques in $\mathcal{S}_{other}$ (other maximal cliques). This separation enables more intensive clique extension and merging applied specifically to the original cliques, described in Section 2.4. Alg 2 is, in turn, called on $\mathcal{S}_{ck}$ in parallel via Alg 3.

Input: linear constraint $\sum_{j\in\mathcal{B}} a_j x_j \leq b$
Output: $\mathcal{S}_{org}$, $\mathcal{S}_{other}$
1 Sort the index set $\mathcal{B} = \{j_1,\dots,j_n\}$ by nondecreasing coefficient value $a_{j_1} \leq \dots \leq a_{j_n}$;
2 if $a_{j_{n-1}} + a_{j_n} \leq b$ then
3   return $\emptyset, \emptyset$
4 end if
5 Set $\mathcal{S}_{org}$, $\mathcal{S}_{other} := \emptyset$;
6 Find the smallest $\phi$ such that $a_{j_\phi} + a_{j_{\phi+1}} > b$;
7 Set $\mathcal{S}_{org} := \{x_{j_\phi}, \dots, x_{j_n}\}$;
8 for $i = \phi-1$ down to $1$ do
9   Find the smallest $\sigma$ such that $a_{j_i} + a_{j_\sigma} > b$;
10  if $\sigma$ exists then
11    Set $\mathcal{S}_{other} := \mathcal{S}_{other} \cup \{\{x_{j_i}, x_{j_\sigma}, \dots, x_{j_n}\}\}$;
12  else
13    Break;
14  end if
15 end for
return $\mathcal{S}_{org}$, $\mathcal{S}_{other}$.
Algorithm 2 Clique Detection

Furthermore, in line 2 of Alg 3, we randomly shuffle and then partition $\mathcal{S}_{ck}$ (obtained from Alg 1) into subsets of equal cardinality (modulo the last processor) as a simple and fast load-balancing heuristic; the shuffle itself can be efficiently parallelized [6, 33]. The same trick is used in Alg 5 and Alg 7; a sketch follows this paragraph. Note that the general problem of optimal load balancing, dividing up tasks with known computational load as evenly as possible across $k$ cores, is an NP-hard partitioning problem [11, 12]. The shuffling heuristic is justified both by average-case analysis and by computational experiments indicating high parallel efficiency on hard instances.
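A minimal Julia sketch of this shuffle-and-partition heuristic (the function name is ours):

```julia
using Random

# Sketch of the shuffle-and-partition load-balancing heuristic shared by
# Algs 3, 5, and 7: shuffle the work items, then split them into k chunks
# of equal cardinality (modulo the last chunk).
function shuffle_partition(items::AbstractVector, k::Int)
    shuffled = shuffle(items)
    m = length(shuffled)
    c = cld(m, k)                               # ceil(m / k) items per chunk
    return [shuffled[lo:min(lo + c - 1, m)] for lo in 1:c:m]
end

shuffle_partition(collect(1:10), 4)   # four chunks of sizes 3, 3, 3, 1
```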

Input: $\mathcal{S}_{ck}$, $k$ threads
Output: $\mathcal{C}_{org}$, $\mathcal{C}_{other}$
1 Set $\mathcal{C}_{org}$, $\mathcal{C}_{other} := \emptyset$;
2 Randomly shuffle $\mathcal{S}_{ck}$ and partition evenly by cardinality into subsets $\mathcal{S}_{ck}^1, \dots, \mathcal{S}_{ck}^k$;
3 for $i \in \{1,\dots,k\}$ do parallel
4   Set $\mathcal{C}_{org}^i$, $\mathcal{C}_{other}^i := \emptyset$;
5   for $c \in \mathcal{S}_{ck}^i$ do
6     Set $\mathcal{S}_{org}$, $\mathcal{S}_{other}$ from Clique Detection (Alg 2) on constraint $c$;
7     Set $\mathcal{C}_{org}^i := \mathcal{C}_{org}^i \cup \{\mathcal{S}_{org}\}$, $\mathcal{C}_{other}^i := \mathcal{C}_{other}^i \cup \{\mathcal{S}_{other}\}$;
8   end for
9   Set $\mathcal{C}_{org} := \mathcal{C}_{org} \cup \mathcal{C}_{org}^i$, $\mathcal{C}_{other} := \mathcal{C}_{other} \cup \mathcal{C}_{other}^i$;
10 end for
return $\mathcal{C}_{org}$, $\mathcal{C}_{other}$.
Algorithm 3 Parallel Clique Detection

The complexity of Algorithm 2 is $O(n\log n)$ on average and $O(n^2)$ in the worst case (see [10, Page 4, Paragraphs 2-3]), where $n$ is the number of elements in the constraint. The average-case complexity (over random shuffles) of Algorithm 3 is $O(mn\log n/k)$, where $m$ is the number of constraints, and the worst case is $O(mn^2/k)$ (see Proposition A.1 in Appendix A).
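For concreteness, a Julia transcription of Alg 2 might look as follows (a sketch; a linear scan stands in for the binary searches behind the average-case bound):

```julia
# Sketch of Alg 2 (Clique Detection) for a knapsack constraint
# sum(a[j] * x[j]) <= b with nonnegative coefficients a. Returns the
# original maximal clique and the other maximal cliques, each as a vector
# of variable indices.
function clique_detection(a::Vector{Float64}, b::Float64)
    p = sortperm(a)                         # line 1: sort by coefficient
    s = a[p]
    n = length(s)
    if n < 2 || s[n-1] + s[n] <= b          # lines 2-4: not conflicting
        return Int[], Vector{Int}[]
    end
    phi = findfirst(i -> s[i] + s[i+1] > b, 1:n-1)  # line 6; exists by guard
    org = p[phi:n]                          # line 7: original maximal clique
    others = Vector{Int}[]
    for i in phi-1:-1:1                     # lines 8-15
        off = findfirst(j -> s[j] > b - s[i], i+1:n)
        off === nothing && break            # no sigma exists
        push!(others, vcat(p[i], p[i+off:n]))
    end
    return org, others
end
```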

2.3 Parallel Conflict Graph Construction

Following clique detection from PBC-derived conflicting knapsack constraints, we proceed to CG construction. Suppose there are $n_{\mathcal{B}}$ binary variables in $\mathcal{B}$, and hence $n_{\mathcal{B}}$ complementary variables. We represent the CG with a sparse matrix $G\in\{0,1\}^{2n_{\mathcal{B}}\times 2n_{\mathcal{B}}}$, with the first set of rows $j\in\{1,\dots,n_{\mathcal{B}}\}$ representing the original variables $x_j$ and the next set $j\in\{n_{\mathcal{B}}+1,\dots,2n_{\mathcal{B}}\}$ representing the complements $\bar{x}_{j-n_{\mathcal{B}}}$. We build $G$ in parallel via Alg 5.

We choose a sparse adjacency matrix here instead of a clique table in order to enable faster clique extension (see Section 2.4) by avoiding redundant computation. A clique table, however, is substantially more memory-efficient, so as a workaround we set limits on the number of nonzeros considered in the sparse data structure (see Section 2.6). Moreover, for cliques that we choose not to extend (namely $\mathcal{C}_{other}$), we adopt the more memory-efficient data structure described in Section 2.3 of [10].

The initialization in line 1, together with the input clique set $\mathcal{C}$, accounts for: the trivial pairwise conflicts between variables and their complements (i.e., $G_{j,j+n_{\mathcal{B}}}=1$ for all $j=1,\dots,n_{\mathcal{B}}$); the cliques $\mathcal{S}$ from the set packing constraints obtained by Alg 1; and the cliques extracted by Alg 3.

As in Alg 3, random shuffling and partitioning are applied in line 2 as a load-balancing heuristic.

In line 5, each $G(i)$ is updated in parallel from the clique subset $\mathcal{C}^i$ via Alg 4.

In lines 7-12, we reduce the individual $G(i)$ into the global $G$ via binary combination (see Fig 1). In line 9, the logical OR merge is applied elementwise, i.e., $G(i)_{j_1,j_2} := \max\{G(i-2^{d-1})_{j_1,j_2},\ G(i)_{j_1,j_2}\}$.

Input: clique set $\mathcal{C}$
Output: $G$
1 Initialize $G\in\{0,1\}^{2n_{\mathcal{B}}\times 2n_{\mathcal{B}}}$;
2 for $Q\in\mathcal{C}$ do
3   for all $(x_i, x_j)$ s.t. $x_i, x_j \in Q$ do
4     Set $G_{ij} := 1$;
5   end for
6 end for
return $G$.
Algorithm 4 Conflict Graph Construction
Input: clique set $\mathcal{C}$, $k$ threads, dimension $n_{\mathcal{B}}$
Output: $G$
1 Initialize $G\in\{0,1\}^{2n_{\mathcal{B}}\times 2n_{\mathcal{B}}}$;
2 Randomly shuffle $\mathcal{C}$ and partition evenly by cardinality into subsets $\mathcal{C}^1,\dots,\mathcal{C}^k$;
3 for $i\in\{1,\dots,k\}$ do parallel
4   Set $G(i)\in\{0,1\}^{2n_{\mathcal{B}}\times 2n_{\mathcal{B}}}$ as an all-zeroes sparse matrix;
5   Update $G(i)$ by the cliques from $\mathcal{C}^i$ (via Alg 4);
6 end for
7 for $d\in\{1,\dots,\lceil\log(k)\rceil\}$ do
8   for $i\in\{2^d,\ 2\cdot 2^d,\ 3\cdot 2^d,\ \dots\}\cap\{1,\dots,k\}$ do parallel
9     Set $G(i) := G(i-2^{d-1})\ \texttt{OR}\ G(i)$;
10  end for
11 end for
12 Set $G := G\ \texttt{OR}\ G(k)$;
return $G$.
Algorithm 5 Parallel Conflict Graph Construction
Figure 1: Parallel structure for Algorithm 5: each thread updates its local CG, then the local graphs are combined pairwise over $\lceil\log k\rceil$ rounds (for $k=8$: threads 2, 4, 6, 8; then 4, 8; then 8).

For a clique set $\mathcal{C}$, suppose that for each clique $Q\in\mathcal{C}$, each variable $x_j$ belongs to $Q$ independently with probability $p$ (a Bernoulli trial). Then the average runtime complexity of Alg 5 is $O(mn_{\mathcal{B}}^2p^2/k + \log k\cdot n_{\mathcal{B}}^2)$ (see Proposition A.4 in Appendix A). Furthermore, the worst-case runtime complexity of Alg 5 is $O(mn_{\mathcal{B}}^2/k + \log k\cdot n_{\mathcal{B}}^2)$ (see Corollary A.5 in Appendix A).
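The binary combination of Alg 5 can be sketched with Julia threads and SparseArrays as follows (a simplified sketch that assumes the number of thread-local graphs $k$ is a power of two; the paper's implementation may differ):

```julia
using SparseArrays, Base.Threads

# Sketch of the binary-combination step of Alg 5 (lines 7-12): reduce k
# thread-local Boolean adjacency matrices G[1..k] into G[k] over
# ceil(log2(k)) rounds. Assumes k is a power of two for simplicity.
function combine_graphs!(G::Vector{SparseMatrixCSC{Bool,Int}})
    k = length(G)
    for d in 1:ceil(Int, log2(k))
        stride = 2^d
        @threads for i in stride:stride:k        # i = 2^d, 2*2^d, ...
            G[i] = G[i - stride ÷ 2] .| G[i]     # elementwise logical OR
        end
    end
    return G[k]                                  # fully merged conflict graph
end
```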

2.4 Parallel Clique Extension and Merging

After generating the conflict graph, we apply clique strengthening [1, 2, 10]. In particular, we modify the Clique Extension of Brito and Santos [10], a greedy algorithm that generates one strengthened clique from the original clique and the CG (see [10, Algorithm 2]). Our modification (see Alg 6) produces (potentially) multiple strengthened cliques, and the increased computational load is made practical by parallelization.

In line 1 of Alg 6, we generate a list $L$ of all variables $u$ that do not belong to the clique but conflict with every variable in the clique. In lines 19-20, we isolate a longest extended clique $\mathrm{Clq}'$ from the clique set $\mathcal{C}$. We distinguish this longest extended clique from all other cliques in order to perform cut management, described in Section 2.5.

Input: clique Clq, CG
Output: longest clique $\mathrm{Clq}'$, extended clique set $\mathcal{C}$
1 Set $L := \{u \notin \mathrm{Clq} : \mathrm{CG}_{uv} = 1\ \forall v \in \mathrm{Clq}\}$;
2 if $L = \emptyset$ then
3   return Clq, $\emptyset$.
4 end if
5 Set $\mathcal{C} := \emptyset$;
6 for $u \in L$ do
7   if $\mathcal{C} = \emptyset$ then
8     Set $\mathcal{C} := \{\{u\}\}$;
9   else
10    for $Q \in \mathcal{C}$ do
11      if $\mathrm{CG}_{uw} = 1, \forall w \in Q$ then
12        Set $Q := Q \cup \{u\}$;
13      else
14        Set $\mathcal{C} := \mathcal{C} \cup \{\{u\}\}$;
15      end if
16    end for
17  end if
18 end for
19 Find a longest $Q' \in \mathcal{C}$ and remove it from $\mathcal{C}$;
20 Set $\mathrm{Clq}' := \mathrm{Clq} \cup Q'$;
21 for $Q \in \mathcal{C}$ do
22   Set $Q := \mathrm{Clq} \cup Q$;
23 end for
return $\mathrm{Clq}', \mathcal{C}$.
Algorithm 6 Clique Extension

Strengthening each clique is an independent procedure, so we apply strengthening in parallel via Alg 7.

Input: clique set $\mathcal{C}$, CG, $k$ threads
Output: longest extended clique set $\mathcal{C}^{long}$, other extended clique set $\mathcal{C}^{other}$
1 Set $\mathcal{C}^{long}, \mathcal{C}^{other} := \emptyset$;
2 Randomly shuffle $\mathcal{C}$ and partition evenly by cardinality into subsets $\mathcal{C}_1,\dots,\mathcal{C}_k$;
3 for $i\in\{1,\dots,k\}$ do parallel
4   Set $\mathcal{C}^{long}_i, \mathcal{C}^{other}_i := \emptyset$;
5   for $Q\in\mathcal{C}_i$ do
6     Set $\mathrm{Clq}', \mathcal{C}' := \texttt{Clique Extension}(Q, \mathrm{CG})$;
7     Set $\mathcal{C}^{long}_i := \mathcal{C}^{long}_i\cup\{\mathrm{Clq}'\}$, $\mathcal{C}^{other}_i := \mathcal{C}^{other}_i\cup\mathcal{C}'$;
8   end for
9   Set $\mathcal{C}^{long} := \mathcal{C}^{long}\cup\mathcal{C}^{long}_i$, $\mathcal{C}^{other} := \mathcal{C}^{other}\cup\mathcal{C}^{other}_i$;
10 end for
return $\mathcal{C}^{long}, \mathcal{C}^{other}$.
Algorithm 7 Parallel Clique Extension

The complexity of Alg 6 is $O(n_{\mathcal{B}}^2)$ per clique (see Lemma A.6 in Appendix A), hence $O(mn_{\mathcal{B}}^2)$ for $m$ cliques. Alg 7 has complexity $O(mn_{\mathcal{B}}^2/k)$ (see Proposition A.7 in Appendix A).
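A compact Julia rendering of Alg 6, under our reading of the greedy partition in lines 6-18 (a sketch, not the paper's implementation):

```julia
# Sketch of Alg 6 (Clique Extension). G is a Bool adjacency matrix of the
# CG, clq a vector of node indices. Returns the longest extended clique
# and the other extended cliques.
function clique_extension(G::AbstractMatrix{Bool}, clq::Vector{Int})
    n = size(G, 1)
    # line 1: nodes outside clq that conflict with every node of clq
    L = [u for u in 1:n if !(u in clq) && all(G[u, v] for v in clq)]
    isempty(L) && return clq, Vector{Int}[]
    exts = Vector{Int}[]                       # the set C of extension sets
    for u in L                                 # lines 6-18 (greedy partition)
        placed = false
        for Q in exts                          # extend a compatible set
            if all(G[u, w] for w in Q)
                push!(Q, u); placed = true; break
            end
        end
        placed || push!(exts, [u])             # otherwise open a new set
    end
    _, idx = findmax(length.(exts))            # lines 19-20: longest extension
    longest = vcat(clq, exts[idx])
    deleteat!(exts, idx)
    return longest, [vcat(clq, Q) for Q in exts]  # lines 21-23
end
```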

After obtaining extended cliques from Alg 7, we check for domination between extended cliques and remove dominated cliques. For a clique stated as a set packing constraint (Equation (2)), we store only the index set $\mathcal{S}$; thus, to check domination between two cliques $\mathcal{S}_1$ and $\mathcal{S}_2$, we only need to check whether $\mathcal{S}_1\subseteq\mathcal{S}_2$ or $\mathcal{S}_2\subseteq\mathcal{S}_1$, which can be performed in $O(n_{\mathcal{B}})$ thanks to sparse data structures.

Given $m$ cliques, the domination checking process is in $O(m^2 n_{\mathcal{B}})$. Its parallel version, Alg 8, has complexity $O(m^2 n_{\mathcal{B}}/k)$.

Input: clique set $\mathcal{C}$, $k$ threads
Output: modified $\mathcal{C}$
1 for all pairs $(Q_i, Q_j)$ of distinct cliques in $\mathcal{C}$ do parallel
2   if $Q_i$ dominates $Q_j$ then
3     Remove $Q_j$ from $\mathcal{C}$;
4   else if $Q_j$ dominates $Q_i$ then
5     Remove $Q_i$ from $\mathcal{C}$;
6   end if
7 end for
return modified $\mathcal{C}$.
Algorithm 8 Parallel Clique Merging
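With cliques stored as sorted index vectors, the $O(n_{\mathcal{B}})$ subset test behind Alg 8 can be done with a single merge-style pass (a sketch):

```julia
# Sketch of the domination test of Alg 8: with cliques stored as sorted
# index vectors, s1 dominates s2 exactly when s2 is a subset of s1.
function dominates(s1::Vector{Int}, s2::Vector{Int})
    length(s2) > length(s1) && return false
    i = 1
    for v in s2
        while i <= length(s1) && s1[i] < v
            i += 1
        end
        (i <= length(s1) && s1[i] == v) || return false
        i += 1
    end
    return true
end

dominates([1, 3, 5, 8], [3, 8])   # true: {3, 8} ⊆ {1, 3, 5, 8}
```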

2.5 Cut Management

Cut management applies the results from previous subsections (which are conducted in presolve) throughout the branch-and-cut tree.

In Alg 1, three constraint sets are generated: $\mathcal{S}_{osp}$, $\mathcal{S}_{isp}$, $\mathcal{S}_{ck}$. Subsequently, cliques are extracted from $\mathcal{S}_{ck}$ using Alg 3, yielding $\mathcal{C}_{org}$ and $\mathcal{C}_{other}$. Applying Alg 5 to the four sets $\mathcal{S}_{osp}$, $\mathcal{S}_{isp}$, $\mathcal{C}_{org}$, $\mathcal{C}_{other}$, the CG is constructed. Cliques are then strengthened/extended from the three sets $\mathcal{S}_{osp}$, $\mathcal{S}_{isp}$, $\mathcal{C}_{org}$ with the parallel Alg 7. As a result, this process yields 7 clique sets: $\mathcal{C}^{long}_{osp}$, $\mathcal{C}^{other}_{osp}$ (from $\mathcal{S}_{osp}$); $\mathcal{C}^{long}_{isp}$, $\mathcal{C}^{other}_{isp}$ (from $\mathcal{S}_{isp}$); $\mathcal{C}^{long}_{org}$, $\mathcal{C}^{other}_{org}$ (from $\mathcal{C}_{org}$); and $\mathcal{C}_{other}$. However, the number of cliques in these sets can be quite large, so judicious management is critical for practical application. For instance, incorporating all corresponding inequalities at the root node is impractical, both due to resource limitations on linear programming solves and due to potential numerical issues from excessive cuts.

Our cut management procedure adds only a few key inequalities from the longest cliques at the root node and applies the remaining cuts throughout the search tree:

First, we replace $\mathcal{S}_{osp}$ in the MIP formulation (deleted in line 5 of Alg 1) with the strengthened constraints from $\mathcal{C}^{long}_{osp}$.

Second, all cliques from $\mathcal{C}^{other}_{osp}$, $\mathcal{C}^{other}_{isp}$, $\mathcal{C}^{other}_{org}$, and $\mathcal{C}_{other}$ are incorporated as user cuts, which are placed in the cut pool and may be added to the model at any node of the branch-and-cut search tree to cut off relaxation solutions (see e.g. [16, Page 769, Lazy Attribute]); a sketch of this mechanism appears at the end of this subsection.

Third, for cliques from $\mathcal{C}^{long}_{isp}$ and $\mathcal{C}^{long}_{org}$, we either add them at the root node or place them as user cuts depending on the following rules:

  1. If the size of $\mathcal{S}_{org}$ is excessively large (greater than 20000) or small (less than 100), we add cliques from $\mathcal{C}^{long}_{isp}$.

  2. If $2000 < |\mathcal{S}_{org}| \leq 20000$ and $|\mathcal{C}^{long}_{org}|/|\mathcal{S}_{osp}| < 0.02$, add cliques from $\mathcal{C}^{long}_{org}$ to the root node; otherwise, place them as user cuts.

  3. If $100 < |\mathcal{S}_{org}| \leq 2000$ and $|\mathcal{C}^{long}_{org}|/|\mathcal{S}_{osp}| < 1$, add cliques from $\mathcal{C}^{long}_{org}$ to the root node; otherwise, place them as user cuts.

  4. If the size of $\mathcal{S}_{org}$ is excessively large (greater than 12000) or small (less than 1500), add cliques from $\mathcal{C}^{long}_{isp}$ to the root node; otherwise, place them as user cuts.

This heuristic prioritizes some inequalities, placing them directly in the formulation at the root node, and leaves the remaining inequalities for Gurobi to manage in the search tree as user cuts.
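As an illustration of the user-cut mechanism referenced above, a minimal sketch using JuMP's solver-independent callbacks (the helper name, `model`, binary variables `x`, and `clique_pool` are assumptions of this illustration, not objects from the paper's code):

```julia
using JuMP

# Sketch: register clique inequalities sum_{j in S} x[j] <= 1 as user cuts,
# so they may be separated at any node of the search tree. `model` is
# assumed to be built with, e.g., Gurobi.Optimizer.
function register_clique_cuts(model::Model, x::Vector{VariableRef},
                              clique_pool::Vector{Vector{Int}})
    MOI.set(model, MOI.UserCutCallback(), cb_data -> begin
        xval = callback_value.(Ref(cb_data), x)    # current relaxation point
        for S in clique_pool
            if sum(xval[j] for j in S) > 1 + 1e-6  # clique inequality violated?
                con = @build_constraint(sum(x[j] for j in S) <= 1)
                MOI.submit(model, MOI.UserCut(cb_data), con)
            end
        end
    end)
end
```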

2.6 Computation Limits

CG management is time-consuming (see e.g. [2, Page 491 Paragraph 2]) and memory-consuming. For instance, in CG construction the sparse adjacency matrix costs $O(n^2)$ memory for a clique of length $n$. However, we are not obliged to consider every possible variable, nor to identify every clique; to avoid excessive memory and time costs, we set the following limits.

For CG construction in Alg 4, if the length of a clique $Q$ is greater than 3000, we randomly select 3000 elements from $Q$ to construct the CG. For Clique Extension in Alg 7, we let each thread process at most $10^6$ nonzero terms before returning its generated extended cliques to the main thread. For Clique Merging in Alg 8, if the number of cliques exceeds $10^5$, we skip Alg 8.
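The first of these limits amounts to a simple subsampling step (a sketch; the function name is ours, and the cap of 3000 is the value from the text):

```julia
using Random

# Sketch of the clique-size cap used in CG construction (Section 2.6):
# cliques longer than the cap are randomly subsampled before their edges
# are written into the adjacency matrix.
truncate_clique(Q::Vector{Int}, cap::Int = 3000) =
    length(Q) <= cap ? Q : shuffle(Q)[1:cap]
```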

3 Numerical Experiments

All code and data can be found in our repository: https://github.com/foreverdyz/ParallelCliqueMerge

3.1 Experimental Setup

3.1.1 Test Set

Experiments are conducted on the benchmark set of the MIPLIB 2017 Collection [15], consisting of 240 instances. Because our parallel presolve method focuses on set packing constraints and conflicting knapsack constraints, we remove 67 instances for which the simple presolve procedure yields 2 or fewer such constraints. The results below are for the remaining 173 cases.

3.1.2 Software and Hardware

All algorithms are implemented in Julia 1.10.2 [8] and run on a desktop with 64-bit Windows 11 Pro, an AMD Ryzen Threadripper PRO 5975WX 32-core CPU, and 64 GB of RAM. This CPU has 32 physical cores and 64 logical processors. We solve both the original MIPs and the MIPs after applying conflict graph management with Gurobi 11.0.0 [16] and an alpha (prototype) version of JuMP 1.28 [24].

3.1.3 Configurations

Experiments are run on 1 (serial), 2, 4, 8, 16, 32, and 64 threads. For benchmark experiments (MIP without our CG) we run Gurobi with 10 threads active; this is the empirically observed best setting over the benchmark, as performance degradation is observed with more threads.

3.1.4 Time Measurement

Measured presolve time excludes overhead from reading the input file as well as the runtime of Alg 1, because Alg 1 is already implemented in Gurobi [2] and the experiments focus on the effects of our novel procedures. Moreover, for cases where $\mathcal{S}_{isp}$ and $\mathcal{S}_{ck}$ from Alg 1 are empty, we do not deploy CG management, since Gurobi already implements CG management for cliques from $\mathcal{S}_{osp}$; such cases are nevertheless included in our performance comparisons. Times are always given in seconds and represent wall-clock measurements. For all aggregations we report the geometric mean, applying a 1-second shift to runtimes, as is typical in the MIP literature (e.g. [15]).
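For reference, a minimal sketch of this aggregation (the helper name is ours):

```julia
# Shifted geometric mean with a 1-second shift, as used for all runtime
# aggregations here: the shift is added before taking logs and subtracted
# from the result, damping the influence of very short runtimes.
shifted_geomean(times::AbstractVector{<:Real}; shift::Real = 1.0) =
    exp(sum(log, times .+ shift) / length(times)) - shift

shifted_geomean([0.5, 20.0, 300.0])   # ≈ 20.2
```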

3.2 Parallel CG Performance

In the following experiments we analyze the parallel efficiency of our parallel CG management algorithms (excluding branch-and-cut solver time). We select the 163/173 cases for which the CG procedure can be run on a single thread within 600 seconds. We furthermore exclude from this subsection 55 cases that can be trivially handled in less than 0.1 seconds on a single thread. For the remaining 108/173 cases, parallel speed-ups (serial runtime divided by parallel runtime) are presented in Fig. 2. Parallel efficiency, naturally, depends on the total work available. When CG takes more than 50 seconds we observe substantial speedups up to 64 threads, while on more modest instances there is a tail-off at 32 threads.

Figure 2: Speed-up vs. number of threads (1 to 64); the lines group instances by serial CG runtime ($\geq 10^{-1}$, $\geq 1$, $\geq 10$, and $\geq 50$ seconds).

3.3 Gurobi Performance

In this section we consider the impact of our CG procedure on total solver time, using Gurobi as our benchmark. As aforementioned, we focus on 173 cases from MIPLIB that have constraints applicable to our CG procedure. We exclude from comparison in this subsection an additional 74 cases for which our method cannot derive additional inequalities beyond standard CG techniques, namely those instances for which $\mathcal{S}_{isp}\cup\mathcal{S}_{ck}=\emptyset$. For these 74 excluded cases our procedure simply reduces to a standard (albeit parallelized) CG procedure that is already incorporated into most solvers' presolve routines (including Gurobi). This leaves us with 99 cases; among these, 3 more are excluded from the aggregated results because excessive clique sizes caused issues in either JuMP (1 instance) or our procedure (2 instances). Note that a simple maximum clique-size limit would allow skipping the CG procedure with minimal effect on runtime in these 3 instances. Thus 96/173 cases are considered for comparison purposes to analyze the impact of our CG routine on Gurobi.

We present results in two parts: one with the 96 cases excluding CG management/overhead time, and one with the 96 cases including CG management time while varying the number of threads used. Since our CG implementation duplicates certain efforts of Gurobi's own CG routines, we expect a fully integrated implementation to have less overhead than what is reported; hence, the two sets of times provide optimistic (without overhead) and conservative (with overhead) bounds on potential impact, respectively.

3.3.1 Gurobi excluding CG management time

Table 1 compares Gurobi with and without our CG procedure: Runtime Comp. is the total solve time of Gurobi+CG (ignoring CG time) divided by that of Gurobi alone; Nodes Comp. is the same ratio in terms of the number of search tree nodes; Faster and Slower count the solves for which the CG procedure led to at least 5% faster or slower (respectively) runtimes. Excluded from this table are 8 cases that Gurobi cannot solve within 3600 seconds; including our CG procedure solves 1 of them. The CG procedure appears especially effective on harder instances, seemingly driven by greater reductions in search tree nodes.

Table 1: Performance comparison between default Gurobi and Gurobi + CG management.
Bracket Number of Cases Runtime Comp. Nodes Comp. Faster Slower
All 88 0.88 0.60 42 24
$\geq 1$ sec 80 0.83 0.57 42 22
$\geq 10$ sec 62 0.76 0.46 37 16
$\geq 100$ sec 35 0.69 0.47 25 8
$\geq 1000$ sec 7 0.60 0.35 7 0

3.3.2 Gurobi including CG management time

Table 2 compares Gurobi with and without our parallel CG procedure: Runtime Comp. is the total solve time of Gurobi+CG (including CG time on different numbers of threads) divided by that of Gurobi alone; Faster and Slower count the solves for which the CG procedure led to at least 5% faster or slower (respectively) runtimes. Again, excluded from this table are 8 cases unsolved by Gurobi within 3600 seconds, 1 of which can be solved after including our CG procedure.

Table 3 breaks out the 64-thread results in terms of problem difficulty. The benefit of our parallel CG management procedure increases as problem difficulty increases, with a 40% overall runtime reduction on instances that take default Gurobi over 1000 seconds. On such problems our CG procedure overhead is relatively negligible, as roughly the same reduction is reported in Table 1. This suggests that further gains in average speedup could be had if problem difficulty could be reasonably estimated a priori, for instance by only activating parallel CG on sufficiently large instances.

Considering these total time results, our CG procedure offers at best modest gains in serial (possibly even a slight deceleration), but demonstrates a significant advantage in parallel. The gains from parallelization tail off substantially around 16 threads but remain positive or break even up to 64 threads. This suggests promise for personal computers, but more limited gains with massively parallel compute.

Table 2: Performance comparison between default Gurobi and Gurobi + CG management on different threads.
Threads Runtime Comp. Faster Slower
1 Thread 1.10 36 41
2 Threads 1.04 38 40
4 Threads 0.96 38 39
8 Threads 0.93 38 36
16 Threads 0.91 39 35
32 Threads 0.90 39 34
64 Threads 0.90 39 34
Table 3: Performance comparison between default Gurobi and Gurobi + CG management on 64 threads.
Bracket Number of Cases Runtime Comp. Nodes Comp. Faster Slower
All 88 0.90 0.60 39 34
$\geq 1$ sec 80 0.87 0.57 39 29
$\geq 10$ sec 62 0.79 0.46 38 20
$\geq 100$ sec 35 0.70 0.47 26 8
$\geq 1000$ sec 7 0.60 0.35 7 0

4 Conclusion

We develop efficient parallel CG management that enables the effective use of a larger number of valid inequalities than could otherwise be practically found in serial. Computational experiments demonstrate that this approach yields substantial overall speedups to Gurobi, primarily due to the reduction in search tree nodes from the larger pool of generated cuts and valid inequalities. Rather than accelerating an existing solver subroutine, we leverage parallelism to intensify effort, modifying the serial subroutine to conduct more computation within the same time budget. This parallelization approach enables substantial parallel acceleration on personal computing setups, and could yield benefits for other subroutines used in branch-and-cut solvers.

Acknowledgements

This work was funded by the Office of Naval Research under grant N00014-23-1-2632.

References

  • [1] Achterberg, T.: Conflict analysis in mixed integer programming. Discrete Optimization 4(1), 4–20 (2007)
  • [2] Achterberg, T., Bixby, R.E., Gu, Z., Rothberg, E., Weninger, D.: Presolve reductions in mixed integer programming. INFORMS Journal on Computing 32(2), 473–506 (2020)
  • [3] Achterberg, T., Wunderling, R.: Mixed integer programming: Analyzing 12 years of progress. In: Facets of Combinatorial Optimization: Festschrift for Martin Grötschel, pp. 449–481. Springer (2013)
  • [4] Andersen, E.D., Gondzio, J., Mészáros, C., Xu, X., et al.: Implementation of interior point methods for large scale linear programming. HEC/Université de Geneve (1996)
  • [5] Atamtürk, A., Nemhauser, G.L., Savelsbergh, M.W.: Conflict graphs in solving integer programming problems. European Journal of Operational Research 121(1), 40–55 (2000)
  • [6] Bacher, A., Bodini, O., Hollender, A., Lumbroso, J.: Mergeshuffle: A very fast, parallel random permutation algorithm (2015)
  • [7] Berthold, T., Gamrath, G., Gleixner, A., Heinz, S., Koch, T., Shinano, Y.: Solving mixed integer linear and nonlinear problems using the SCIP Optimization Suite (2012)
  • [8] Bezanson, J., Edelman, A., Karpinski, S., Shah, V.B.: Julia: A fresh approach to numerical computing. SIAM review 59(1), 65–98 (2017)
  • [9] Bixby, R., Rothberg, E.: Progress in computational mixed integer programming–a look back from the other side of the tipping point. Annals of Operations Research 149(1),  37 (2007)
  • [10] Brito, S.S., Santos, H.G.: Preprocessing and cutting planes with conflict graphs. Computers & Operations Research 128, 105176 (2021)
  • [11] Chopra, S., Rao, M.R.: The partition problem. Mathematical programming 59(1-3), 87–115 (1993)
  • [12] Devine, K.D., Boman, E.G., Karypis, G.: Partitioning and load balancing for emerging parallel applications and architectures. In: Parallel processing for scientific computing, pp. 99–126. SIAM (2006)
  • [13] Eckstein, J., Hart, W.E., Phillips, C.A.: PEBBL: an object-oriented framework for scalable parallel branch and bound. Mathematical Programming Computation 7, 429–469 (2015)
  • [14] Gleixner, A., Gottwald, L., Hoen, A.: PaPILO: A parallel presolving library for integer and linear optimization with multiprecision support. INFORMS Journal on Computing (2023)
  • [15] Gleixner, A., Hendel, G., Gamrath, G., Achterberg, T., Bastubbe, M., Berthold, T., Christophel, P.M., Jarck, K., Koch, T., Linderoth, J., Lübbecke, M., Mittelmann, H.D., Ozyurt, D., Ralphs, T.K., Salvagnin, D., Shinano, Y.: MIPLIB 2017: Data-Driven Compilation of the 6th Mixed-Integer Programming Library. Mathematical Programming Computation (2021)
  • [16] Gurobi Optimization, LLC: Gurobi Optimizer Reference Manual (2023), https://www.gurobi.com
  • [17] Gustafson, J.L.: Reevaluating amdahl’s law. Communications of the ACM 31(5), 532–533 (1988)
  • [18] Hall, J.: Towards a practical parallelisation of the simplex method. Computational Management Science 7, 139–170 (2010)
  • [19] Hoffman, K.L., Padberg, M.: Solving airline crew scheduling problems by branch-and-cut. Management science 39(6), 657–682 (1993)
  • [20] Huangfu, Q., Hall, J.J.: Parallelizing the dual revised simplex method. Mathematical Programming Computation 10(1), 119–142 (2018)
  • [21] Klabjan, D., Johnson, E.L., Nemhauser, G.L.: A parallel primal–dual simplex algorithm. Operations Research Letters 27(2), 47–55 (2000)
  • [22] Koch, T., Berthold, T., Pedersen, J., Vanaret, C.: Progress in mathematical programming solvers from 2001 to 2020. EURO Journal on Computational Optimization 10, 10–31 (2022)
  • [23] Koch, T., Ralphs, T., Shinano, Y.: Could we use a million cores to solve an integer program? Mathematical Methods of Operations Research 76, 67–93 (2012)
  • [24] Lubin, M., Dowson, O., Garcia, J.D., Huchette, J., Legat, B., Vielma, J.P.: JuMP 1.0: recent improvements to a modeling language for mathematical optimization. Mathematical Programming Computation 15, 581–589 (2023)
  • [25] Munguía, L.M., Oxberry, G., Rajan, D., Shinano, Y.: Parallel PIPS-SBB: multi-level parallelism for stochastic mixed-integer programs. Computational Optimization and Applications 73, 575–601 (2019)
  • [26] Nguyen, D.: A probabilistic approach to the moments of binomial random variables and application. The American Statistician 75(1), 101–103 (2021)
  • [27] Papavasiliou, A., Oren, S.S., Rountree, B.: Applying high performance computing to transmission-constrained stochastic unit commitment for renewable energy integration. IEEE Transactions on Power Systems 30(3), 1109–1120 (2014)
  • [28] Perumalla, K., Alam, M.: Design considerations for GPU-based mixed integer programming on parallel computing platforms. In: 50th International Conference on Parallel Processing Workshop. pp. 1–7 (2021)
  • [29] Phillips, C.A., Eckstein, J., Hart, W.: Massively parallel mixed-integer programming: Algorithms and applications. In: Parallel Processing for Scientific Computing, pp. 323–340. SIAM (2006)
  • [30] Ryan, K., Rajan, D., Ahmed, S.: Scenario decomposition for 0-1 stochastic programs: Improvements and asynchronous implementation. In: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). pp. 722–729. IEEE (2016)
  • [31] Shinano, Y., Achterberg, T., Berthold, T., Heinz, S., Koch, T., Winkler, M.: Solving open MIP instances with ParaSCIP on supercomputers using up to 80,000 cores. In: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS). pp. 770–779. IEEE (2016)
  • [32] Shinano, Y., Heinz, S., Vigerske, S., Winkler, M.: FiberSCIP—a shared memory parallelization of SCIP. INFORMS Journal on Computing 30(1), 11–30 (2018)
  • [33] Shun, J., Gu, Y., Blelloch, G.E., Fineman, J.T., Gibbons, P.B.: Sequential random permutation, list contraction and tree contraction are highly parallel. In: Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4-6, 2015. pp. 431–448. SIAM (2015)
  • [34] Witzig, J., Berthold, T., Heinz, S.: Experiments with conflict analysis in mixed integer programming. In: Integration of AI and OR Techniques in Constraint Programming: 14th International Conference, CPAIOR 2017, Padua, Italy, June 5-8, 2017, Proceedings 14. pp. 211–220. Springer (2017)

Appendix A Complexity Results

Proposition A.1.

Alg 3 has an average runtime of $O(mn\log n/k)$ and a worst-case runtime of $O(mn^2/k)$, where $n$ is the number of variables, $m$ is the number of constraints, and $k$ is the number of threads.

Proof.

Randomly shuffling and partitioning in line 2 is $O(m/k)$ [6, 33].

In lines 3-10, there is a parallel for loop over $k$ threads. Each thread handles at most $\lceil\frac{m}{k}\rceil$ constraints, and we perform Alg 2 on each constraint. Therefore, the complexity of lines 3-10 is $O(\frac{m}{k}n\log n)$ (on average) or $O(\frac{m}{k}n^2)$ (worst case), due to the underlying serial complexity of Alg 2.

Consequently, Alg 3 has an average (over random shuffles) runtime of $O(mn\log n/k)$ and a worst-case runtime of $O(mn^2/k)$. ∎

Lemma A.2.

For a clique set $\mathcal{C}$, suppose that for each clique $Q\in\mathcal{C}$, each variable $x_j$ belongs to $Q$ independently with probability $p$ (a Bernoulli trial). Then Alg 4 has an average runtime of $O(mn_{\mathcal{B}}^2p^2)$, where $n_{\mathcal{B}}$ is the total number of binary variables and $m$ is the number of cliques in $\mathcal{C}$.

Proof.

In line 2, there are $m$ cliques.

For $Q\in\mathcal{C}$, let $n_Q$ be the number of binary variables in $Q$. Then $n_Q$ follows a binomial distribution $B(n_{\mathcal{B}}, p)$; thus $\mathbb{E}[n_Q] = n_{\mathcal{B}}p$ and $\mathbb{E}[n_Q^2] = n_{\mathcal{B}}p(1-p) + n_{\mathcal{B}}^2p^2$ [26]. In line 3, there are $n_Q(n_Q-1)/2$ pairs of variables $(x_i, x_j)$. We have

\[
\mathbb{E}[n_Q(n_Q-1)] = \mathbb{E}[n_Q^2] - \mathbb{E}[n_Q] = n_{\mathcal{B}}p(1-p) + n_{\mathcal{B}}^2p^2 - n_{\mathcal{B}}p = (n_{\mathcal{B}}^2 - n_{\mathcal{B}})p^2 \in O(n_{\mathcal{B}}^2p^2).
\]

Therefore, $\mathbb{E}[m\,n_Q(n_Q-1)/2] = m\,\mathbb{E}[n_Q(n_Q-1)/2] \in O(mn_{\mathcal{B}}^2p^2)$. Since line 4 can be completed in $O(1)$, Alg 4 has an average runtime in $O(mn_{\mathcal{B}}^2p^2)$. ∎

Corollary A.3.

Alg 4 has a worst-case runtime of $O(mn_{\mathcal{B}}^2)$, where $n_{\mathcal{B}}$ is the total number of binary variables and $m$ is the number of cliques in $\mathcal{C}$.

Proof.

In line 2, there are $m$ cliques.

For each clique, there are at most $n_{\mathcal{B}}$ variables. Therefore, in line 3, there are at most $n_{\mathcal{B}}(n_{\mathcal{B}}-1)/2$ pairs of variables $(x_i, x_j)$, so there are at most $mn_{\mathcal{B}}(n_{\mathcal{B}}-1)/2 \in O(mn_{\mathcal{B}}^2)$ iterations. Since line 4 can be completed in $O(1)$, Alg 4 has a worst-case runtime in $O(mn_{\mathcal{B}}^2)$. ∎

Proposition A.4.

For a clique set $\mathcal{C}$, suppose that for each clique $Q\in\mathcal{C}$, each variable $x_j$ belongs to $Q$ independently with probability $p$. Then Alg 5 has an average runtime of $O(mn_{\mathcal{B}}^2p^2/k + \log k\cdot n_{\mathcal{B}}^2)$, where $n_{\mathcal{B}}$ is the total number of binary variables, $m$ is the number of cliques, and $k$ is the number of threads.

Proof.

Randomly shuffling and partitioning in line 2 can be done in parallel in $O(m/k)$ [6, 33].

In lines 3-6, there is a parallel for loop involving $k$ threads. Each thread applies Alg 4 to at most $\lceil\frac{m}{k}\rceil$ cliques. By Lemma A.2, the average complexity of lines 3-6 is $O(mn_{\mathcal{B}}^2p^2/k)$.

In lines 7-11, there is a binary combination with $\lceil\log k\rceil$ iterations. In the $i$-th iteration, we perform $\lfloor k/2^i\rfloor$ OR operations in parallel. The runtime of each OR is $O(n_{\mathcal{B}}^2)$; thus the complexity of lines 7-11 (the binary combination) is $O(\log k\cdot n_{\mathcal{B}}^2)$.

Line 12 is another OR operation, with complexity $O(n_{\mathcal{B}}^2)$.

Consequently, Alg 5 has an average runtime of $O(mn_{\mathcal{B}}^2p^2/k + \log k\cdot n_{\mathcal{B}}^2)$. ∎

Corollary A.5.

Alg 5 has a worst-case runtime of $O(mn_{\mathcal{B}}^2/k + \log k\cdot n_{\mathcal{B}}^2)$, where $n_{\mathcal{B}}$ is the number of binary variables, $m$ is the number of cliques, and $k$ is the number of threads.

Proof.

The proof is similar to that of Proposition A.4; the only difference is in lines 3-6. In the worst case, each thread applies Alg 4 to at most $\lceil\frac{m}{k}\rceil$ cliques, so by Corollary A.3 the complexity of lines 3-6 is $O(mn_{\mathcal{B}}^2/k)$. Therefore, Alg 5 has a worst-case runtime of $O(mn_{\mathcal{B}}^2/k + \log k\cdot n_{\mathcal{B}}^2)$. ∎

Lemma A.6.

Alg 6 has complexity $O(n_{\mathcal{B}}^2)$, where $n_{\mathcal{B}}$ is the total number of binary variables.

Proof.

A clique Clq has at most $n_{\mathcal{B}}$ variables. In line 1, checking whether $\mathrm{CG}_{uv}=1$ for a given $u\notin\mathrm{Clq}$ and all $v\in\mathrm{Clq}$ takes at most $O(n_{\mathcal{B}})$ operations, so line 1 has complexity $O(n_{\mathcal{B}}^2)$.

In line 6, the iteration count is $O(n_{\mathcal{B}})$. In line 10, the iteration count is also $O(n_{\mathcal{B}})$. Since lines 11-15 can be completed in $O(1)$, the complexity of lines 6-18 is $O(n_{\mathcal{B}}^2)$.

In line 19, the size of $\mathcal{C}$ is $O(n_{\mathcal{B}})$; thus line 19 can be completed in $O(n_{\mathcal{B}})$.

As a result, Alg 6 has complexity $O(n_{\mathcal{B}}^2)$. ∎

Proposition A.7.

Alg 7 has complexity $O(mn_{\mathcal{B}}^2/k)$, where $n_{\mathcal{B}}$ is the number of binary variables, $m$ is the number of cliques, and $k$ is the number of threads.

Proof.

Randomly shuffling and partitioning in line 2 is $O(m/k)$ [6, 33].

In lines 3-10, there is a parallel for loop; each thread performs Alg 6 on at most $\lceil m/k\rceil$ cliques. By Lemma A.6, the runtime complexity of lines 3-10 is $O(mn_{\mathcal{B}}^2/k)$.

As a result, Alg 7 has complexity $O(mn_{\mathcal{B}}^2/k)$. ∎