
A Customized Augmented Lagrangian Method for Block-Structured Integer Programming

Rui Wang, Chuwen Zhang, Shanwen Pu, Jianjun Gao, Zaiwen Wen. Rui Wang and Zaiwen Wen are with the Beijing International Center for Mathematical Research, Peking University, Beijing 100871, China (email: [email protected]; [email protected]). Chuwen Zhang, Shanwen Pu and Jianjun Gao are with the School of Information Management and Engineering, Shanghai University of Finance and Economics, Shanghai 200433, China (email: [email protected]; [email protected]; [email protected]). This research is supported in part by the National Natural Science Foundation of China (NSFC) grants 12331010, 72150001, 72225009, 72394360, 72394365.
Abstract

Integer programming with block structures has received considerable attention recently and is widely used in many practical applications such as train timetabling and vehicle routing problems. It is known to be NP-hard due to the presence of integer variables. We define a novel augmented Lagrangian function by directly penalizing the inequality constraints and establish strong duality between the primal problem and the augmented Lagrangian dual problem. Then, a customized augmented Lagrangian method is proposed to exploit the block structure. In particular, the minimization of the augmented Lagrangian function is decomposed into multiple subproblems by decoupling the linking constraints, and these subproblems can be efficiently solved using the block coordinate descent method. We also establish the convergence property of the proposed method. To make the algorithm more practical, we further introduce several refinement techniques to identify high-quality feasible solutions. Numerical experiments on a few interesting scenarios show that our proposed algorithm often achieves a satisfactory solution and is quite effective.

Index Terms:
Integer programming, augmented Lagrangian method, block coordinate descent, convergence

1 Introduction

In this paper, we consider a block-structured integer programming problem:

\min~~ \boldsymbol{c}^{\top}\textbf{x}   (1a)
\text{s.t.}~~ \boldsymbol{A}\textbf{x}\leq\boldsymbol{b},   (1b)
~~~~~~~ \textbf{x}_{j}\in\mathcal{X}_{j},~j=1,2,\dots,p,   (1c)

where $\textbf{x}_{j}\in\mathbb{R}^{n_{j}}$ is the $j$-th block variable of $\textbf{x}\in\mathbb{R}^{n}$, i.e., $\textbf{x}=(\textbf{x}_{1};\dots;\textbf{x}_{p})$ for $p\geq 1$ with $n=\sum_{j=1}^{p}n_{j}$. In (1), $\boldsymbol{A}\in\mathbb{R}^{m\times n}$, $\boldsymbol{b}\in\mathbb{R}^{m}$, $\boldsymbol{c}\in\mathbb{R}^{n}$, and each constraint set $\mathcal{X}_{j}$ is the set of $0,1$ vectors in a polyhedron, i.e.,

\mathcal{X}_{j}:=\{\textbf{x}_{j}\in\{0,1\}^{n_{j}}:\boldsymbol{B}_{j}\textbf{x}_{j}\leq\boldsymbol{d}_{j}\},~j=1,2,\dots,p.

The constraints (1c) can be reformulated as $\textbf{x}\in\mathcal{X}:=\{\textbf{x}\in\{0,1\}^{n}:\boldsymbol{B}\textbf{x}\leq\boldsymbol{d}\}$, where the block diagonal matrix $\boldsymbol{B}\in\mathbb{R}^{q\times n}$ is formed by the small submatrices $\boldsymbol{B}_{j}$ on the main diagonal and $\boldsymbol{d}=(\boldsymbol{d}_{1};\dots;\boldsymbol{d}_{p})$. Correspondingly, $\boldsymbol{c}$ and $\boldsymbol{A}$ can be rewritten as $\boldsymbol{c}=(\boldsymbol{c}_{1};\boldsymbol{c}_{2};\dots;\boldsymbol{c}_{p})$ with $\boldsymbol{c}_{j}\in\mathbb{R}^{n_{j}}$ and $\boldsymbol{A}=(\boldsymbol{A}_{1}~\boldsymbol{A}_{2}~\dots~\boldsymbol{A}_{p})$ with $\boldsymbol{A}_{j}\in\mathbb{R}^{m\times n_{j}}$.
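To make the block notation concrete, the following minimal Python sketch (illustrative only, with hypothetical random data, not an instance from the paper) assembles $\boldsymbol{A}$, $\boldsymbol{B}$, $\boldsymbol{c}$ and $\boldsymbol{d}$ from per-block pieces:

```python
import numpy as np
from scipy.linalg import block_diag

# Hypothetical instance: p = 3 blocks, n_j = 2 variables per block,
# m = 2 coupling constraints A x <= b, one local constraint per block.
p, nj, m = 3, 2, 2
rng = np.random.default_rng(0)

A_blocks = [rng.integers(0, 2, size=(m, nj)) for _ in range(p)]   # A_j with 0/1 entries
B_blocks = [np.ones((1, nj), dtype=int) for _ in range(p)]        # B_j x_j <= d_j, e.g. sum(x_j) <= 1
c_blocks = [rng.standard_normal(nj) for _ in range(p)]
d_blocks = [np.ones(1) for _ in range(p)]

A = np.hstack(A_blocks)        # A = (A_1  A_2 ... A_p), shape (m, n)
B = block_diag(*B_blocks)      # block diagonal B, shape (q, n)
c = np.concatenate(c_blocks)   # c = (c_1; ...; c_p)
d = np.concatenate(d_blocks)
b = np.ones(m)
```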

Assume that these constraints (1c) are “nice” in the sense that an integer program with just these constraints is easy. Therefore, if the coupling constraints (1b) are ignored, the remaining problem, which is only composed of the constraints (1c), is easier to solve than the original problem (1). For convenience, we assume $\mathcal{X}$ is not empty and (1) is feasible. Denote by $f^{\text{IP}}$ the optimal value of the problem (1). This block structure is closely related to an important model known as “$n$-fold integer programming (IP)”, studied extensively in computer vision [1, 2], machine learning [3, 4] and theoretical computer science [5, 6], etc. The theoretical foundations of $n$-fold IPs have significant implications for efficient algorithm development in various fields. For example, an algorithmic theory of integer programming based on $n$-fold IP was proposed in [5]. Recent advancements have been provided in [6]. Furthermore, the progress in theory and application of integer programming with a block structure was summarized in [7, 8]. While existing works on $n$-fold IP mainly focus on asymptotic analyses, this work aims to develop an efficient augmented Lagrangian approach tailored to the block structure for practical efficiency with a convergence guarantee.

1.1 Related Work

The branch and bound algorithm for general integer programming (IP) was first introduced by Land and Doig [9]. Gomory [10] developed a cutting plane algorithm for integer programming problems. These two approaches are at the heart of current state-of-the-art software for integer programming. Unfortunately, these methods often suffer from high computational burdens due to the discrete constraints, so they are not good choices for solving some large-scale practical problems. Therefore, it is necessary to develop efficient approaches to obtain feasible and desirable solutions, even if they may not be globally optimal. Considering the block structure of the integer linear program (1), a natural idea is to decompose a large global problem into smaller local subproblems. There are three typical decomposition methods for solving such problems: Benders [11], Dantzig-Wolfe (DW) [12] and Lagrangian decompositions [13]. The Benders decomposition method is used to deal with mixed integer programming problems by decomposing them into a master problem and a primal subproblem. It generates cuts that are added to the master problem by solving the dual of the primal subproblem. This method is not suitable for (1), since our primal subproblem is an integer programming problem, which is still NP-hard. The DW decomposition method can solve the block-structured integer programming problem (1) based on the resolution theorem. To improve the tractability of large-scale problems, the DW decomposition relies on column generation. However, it might be harder or even intractable to solve the master problem by column generation if further constraints are applied to $\mathcal{X}_{j}$ [14, Chapter 8.2]. The Lagrangian decomposition introduces Lagrange multipliers and constructs a sequence of simpler subproblems. Although the method itself has a limitation due to its inherent solution symmetry for certain practical problems, the Lagrangian duality bound is useful in the branch-and-bound procedure. A Lagrangian heuristic algorithm has also been proposed in [15] for solving a real-world train timetabling problem, which decomposed the problem into smaller subproblems using Lagrangian relaxation and developed a heuristic algorithm based on subgradient optimization to generate feasible solutions.

Many techniques from continuous optimization, including the alternating direction method of multipliers (ADMM), have recently been applied to integer programming. The authors in [16] proposed an $\ell_{p}$-box ADMM for solving binary integer programs, where the binary constraint is replaced by the intersection of a box and an $\ell_{p}$-norm sphere. The authors in [17] proposed an augmented Lagrangian method (ALM) and an ADMM based on the $\ell_{1}$ augmented Lagrangian function for two-block mixed integer linear programming (MILP). For the multi-block MILP problem, an extended ADMM was proposed in [18] that employs a release-and-fix approach to solve the subproblems. There are also some heuristic methods based on ADMM [19, 20, 21, 22] that have been successfully applied to various practical integer programming problems.

There are some other widespread approaches based on relaxations of the binary constraint, including linear programming (LP) relaxation [23, 24] and semidefinite relaxation [25, 26]. For the multi-block MILP problem, several inexact methods have been proposed, including a distributed algorithm relying on primal decomposition [27], a dual decomposition method [28], and a decomposition-based outer approximation method [29]. By introducing continuous variables to replace the discrete variables, exact penalty methods [30, 31, 32] have been studied for solving nonlinear IP problems; the problem is then transformed into an equivalent nonlinear continuous optimization problem.

1.2 Contributions

In this paper, we propose a customized ALM for solving (1). Our main contributions are listed below.

  • (i)

    We define a novel augmented Lagrangian (AL) function that differs from the classical AL function and establish strong duality theory for the augmented Lagrangian relaxation of the block-structured integer programming (1), which motivates us to utilize the ALM for solving the problem (1).

  • (ii)

    Based on the special structure of (1), we propose two block coordinate descent (BCD)-type methods that are well-suited for solving the resulting subproblems in our ALM framework. These methods utilize classical update and proximal linear update techniques, denoted as ALM-C and ALM-P, respectively. We also analyze their convergence properties under proper conditions.

  • (iii)

    To address challenges in finding the global optimal solution for practical problems such as train timetabling, we introduce refinement strategies and propose a customized ALM to enhance the quality of solutions generated by the ALM. Our numerical experiments demonstrate the effectiveness of the proposed method in solving large-scale instances.

Note that the ADMM-based method in [21] solves each subproblem only once per iteration for a fixed Lagrangian multiplier. On the other hand, our ALM method involves multiple iterations using the BCD method to minimize each AL function until certain rules are satisfied. Once the AL subproblem is solved exactly, the strong duality guarantees that the ALM can converge to a global minimizer of the problem (1). Therefore, achieving high accuracy in minimizing the AL function allows our ALM to achieve superior solutions with fewer iterations, resulting in significantly reduced computational time compared to the ADMM. This claim is supported by our numerical tests. Additionally, we introduce Assumptions 3.1 and 3.2, derived from the structural characteristics of the practical problem in [20] and [21], and subsequently analyze the theoretical properties of the ALM-C method under these assumptions. However, our ALM-P method has wider applicability and does not require these specific assumptions. Moreover, the ALM can be regarded as a dual ascent algorithm with respect to dual variables, hence ensuring its convergence. In contrast, the convergence analysis of the ADMM for solving this kind of problem remains unclear.

The existing ALM-based methods for solving integer programming in [33] mainly focused on the duality of the augmented Lagrangian dual for MILP with equality constraints, but numerical experiments were not available. Moreover, our approach differs significantly in that we handle inequality constraints directly, without introducing slack variables. This approach has two key benefits: it reduces the number of variables, thus decreasing the computational burden in high-dimensional cases, and it allows for more customized algorithmic design based on the inherent structure of the problem. In [17], the $\ell_{1}$ norm was considered as the augmented term in the AL function for MILP, so that the minimization of the AL function can be decomposed into multiple low-dimensional $\ell_{1}$-penalty subproblems due to the separable blocks. These subproblems were then solved in parallel using the Gurobi solver. In our work, we take a different approach by using a quadratic term in the AL function, which allows the augmented Lagrangian subproblem to be reduced to a linear program under certain conditions. Furthermore, we update the blocks of variables sequentially, one at a time.

1.3 Notation and Organization

Let $\mathbb{N}_{m}:=\{1,2,\dots,m\}$, $\mathbb{N}_{p}:=\{1,2,\dots,p\}$ and $\mathbb{R}^{m}_{+}:=\{x\in\mathbb{R}^{m}:x_{i}\geq 0~\text{for all}~i\}$. The superscript “$\top$” means “transpose”. Denote by $\boldsymbol{a}_{i}$ the $i$-th row of the matrix $\boldsymbol{A}$ and by $b_{i}$ the $i$-th element of the vector $\boldsymbol{b}$. For convenience, we let $\boldsymbol{A}_{\mathcal{I},j}$ denote the submatrix consisting of the rows of $\boldsymbol{A}_{j}$ indexed by $\mathcal{I}$ and $\boldsymbol{b}_{\mathcal{I}}$ denote the subvector consisting of the entries of $\boldsymbol{b}\in\mathbb{R}^{m}$ indexed by $\mathcal{I}$. A neighborhood of a point $\textbf{x}^{*}$ is a set $\mathcal{N}(\textbf{x}^{*},1)$ consisting of all points $\textbf{x}$ such that $\|\textbf{x}-\textbf{x}^{*}\|^{2}\leq 1$. Let $\textbf{1}$ be the vector of all ones.

This paper is structured as follows. An AL function is defined and the strong duality of the problem (1) is discussed in section 2. We propose a customized ALM incorporating BCD methods and refinement strategies to improve the quality of the ALM solutions in section 3. We establish the convergence results of both the BCD methods to minimize the AL function and the ALM applied to the whole problem (1) in section 4. The proposed method is applied to two practical problems in section 5. Concluding remarks are made in the last section.

2 The AL Strong Duality

The Lagrangian relaxation (LR) of (1) with respect to the constraint (1b) has the following form:

\min_{\textbf{x}\in\mathcal{X}}~~\mathcal{L}(\textbf{x},\lambda):=\sum_{j=1}^{p}\boldsymbol{c}_{j}^{\top}\textbf{x}_{j}+\lambda^{\top}\Big(\sum_{j=1}^{p}\boldsymbol{A}_{j}\textbf{x}_{j}-\boldsymbol{b}\Big), \qquad (2)

where $\lambda\in\mathbb{R}^{m}_{+}$ is a Lagrange multiplier associated with the constraint (1b). We can observe that (2) is much easier to solve than the original problem (1), since (2) can be decomposed into $p$ low-dimensional independent subproblems. However, there may exist a non-zero duality gap when certain constraints are relaxed by using the classical Lagrangian dual [33]. To reduce the duality gap, we add a quadratic penalty function to the Lagrangian function in (2) and solve the augmented Lagrangian dual problem, which is defined as follows.

Definition 2.1 (AL Dual)

We define an AL function by

L(\textbf{x},\lambda,\rho)=\sum_{j=1}^{p}\boldsymbol{c}_{j}^{\top}\textbf{x}_{j}+\lambda^{\top}\Big(\sum_{j=1}^{p}\boldsymbol{A}_{j}\textbf{x}_{j}-\boldsymbol{b}\Big)+\frac{\rho}{2}\Big\|\Big(\sum_{j=1}^{p}\boldsymbol{A}_{j}\textbf{x}_{j}-\boldsymbol{b}\Big)_{+}\Big\|^{2}, \qquad (3)

where $\lambda\in\mathbb{R}_{+}^{m}$ and $\rho>0$. The corresponding AL relaxation of (1) is given by

d(\lambda,\rho):=\min_{\textbf{x}\in\mathcal{X}}~L(\textbf{x},\lambda,\rho). \qquad (4)

We call the following maximization problem the AL dual problem:

f_{\rho}^{\text{LD}}:=\max_{\lambda\in\mathbb{R}_{+}^{m}}d(\lambda,\rho). \qquad (5)

Note that the classical AL function of (1) is given by

\hat{L}(\textbf{x},\lambda,\rho)=\boldsymbol{c}^{\top}\textbf{x}+\frac{\rho}{2}\Big\|\Big(\boldsymbol{A}\textbf{x}-\boldsymbol{b}+\frac{\lambda}{\rho}\Big)_{+}\Big\|^{2}-\frac{\|\lambda\|^{2}}{2\rho}. \qquad (6)

We prefer using the form of the AL function (3) rather than the classical version (6) due to the absence of $\lambda/\rho$ inside the max operator of the quadratic term. This makes it possible to convert the AL function into a linear function under certain conditions, making the problem (4) easier to solve, as will be explained in the next section. We also verify that the quadratic term in (3) is an exact penalty and that strong duality holds between the AL dual problem (5) and the primal problem (1).
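As a minimal illustration (not part of the original development), the AL function (3) can be evaluated directly with NumPy; the helper below assumes dense arrays for $\boldsymbol{A}$, $\boldsymbol{b}$, $\boldsymbol{c}$:

```python
import numpy as np

def aug_lagrangian(x, lam, rho, A, b, c):
    """Evaluate (3): c'x + lam'(Ax - b) + (rho/2) ||(Ax - b)_+||^2."""
    r = A @ x - b                                   # coupling-constraint residual
    return c @ x + lam @ r + 0.5 * rho * np.sum(np.maximum(r, 0.0) ** 2)
```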

Lemma 2.1 (Strong Duality)

Suppose the problem (1) is feasible and its optimal value is bounded. If a minimum achievable non-zero slack exists, i.e., there is a $\delta$ such that for any $i\in\mathbb{N}_{m}$,

0<\delta\leq\min_{\textbf{x}\in\mathcal{X}}\{(\boldsymbol{a}_{i}\textbf{x}-b_{i})^{2}:\boldsymbol{a}_{i}\textbf{x}>b_{i}\}, \qquad (7)

then there exists a finite $\rho^{*}\in(0,+\infty)$ such that

f_{\rho^{*}}^{\text{LD}}=\min_{\boldsymbol{A}\textbf{x}\leq\boldsymbol{b},\,\textbf{x}\in\mathcal{X}}\boldsymbol{c}^{\top}\textbf{x}.
Proof:

For any $\lambda\in\mathbb{R}_{+}^{m}$ and $\rho>0$, since $\{\textbf{x}\in\mathcal{X}:\boldsymbol{A}\textbf{x}-\boldsymbol{b}\leq 0\}\subseteq\mathcal{X}$, we have

d(\lambda,\rho)\leq\min_{\textbf{x}\in\mathcal{X},\,\boldsymbol{A}\textbf{x}-\boldsymbol{b}\leq 0}L(\textbf{x},\lambda,\rho)\leq\min_{\textbf{x}\in\mathcal{X},\,\boldsymbol{A}\textbf{x}-\boldsymbol{b}\leq 0}\boldsymbol{c}^{\top}\textbf{x}=f^{\text{IP}}. \qquad (8)

Then $f_{\rho}^{\text{LD}}\leq f^{\text{IP}}$.

Now it suffices to find a finite $\rho^{*}$ such that $f_{\rho^{*}}^{\text{LD}}\geq f^{\text{IP}}$. We first let $\textbf{x}^{0}$ be an arbitrary feasible solution of (1), that is, $\textbf{x}^{0}\in\mathcal{X}$ and $\boldsymbol{A}\textbf{x}^{0}\leq\boldsymbol{b}$. Denote by $f^{\text{LP}}$ the optimal value of the linear programming (LP) relaxation of (1). Since the value of the LP relaxation of (1) is bounded [34], i.e., $-\infty<f^{\text{LP}}\leq\boldsymbol{c}^{\top}\textbf{x}^{0}<+\infty$, we set $\rho^{*}=2(\boldsymbol{c}^{\top}\textbf{x}^{0}-f^{\text{LP}})/\delta$; then $0<\rho^{*}<+\infty$. Moreover,

f_{\rho^{*}}^{\text{LD}}=\max_{\lambda\in\mathbb{R}_{+}^{m}}d(\lambda,\rho^{*})\geq d(\lambda^{*},\rho^{*})=\min_{\textbf{x}\in\mathcal{X}}L(\textbf{x},\lambda^{*},\rho^{*}), \qquad (9)

where $\lambda^{*}\in\mathbb{R}_{+}^{m}$ is a given parameter to be specified below. Let $\mathbb{I}:=\{i\in\mathbb{N}_{m}:\boldsymbol{a}_{i}\textbf{x}-b_{i}>0~\text{for some}~\textbf{x}\in\mathcal{X}\}$. We consider the following two cases:

Case 1: $\mathbb{I}=\emptyset$. In this case we have $\boldsymbol{A}\textbf{x}\leq\boldsymbol{b}$ for all $\textbf{x}\in\mathcal{X}$. By letting $\bar{\lambda}=0$, we obtain that

L(\textbf{x},\bar{\lambda},\rho^{*})=\boldsymbol{c}^{\top}\textbf{x}+\bar{\lambda}^{\top}(\boldsymbol{A}\textbf{x}-\boldsymbol{b})+\frac{\rho^{*}}{2}\|(\boldsymbol{A}\textbf{x}-\boldsymbol{b})_{+}\|^{2}=\boldsymbol{c}^{\top}\textbf{x}\geq f^{\text{IP}}. \qquad (10)

Case 2: $\mathbb{I}\neq\emptyset$. Denote by $\lambda^{\text{LP}}$ a nonnegative optimal vector of dual variables for $\boldsymbol{A}\textbf{x}\leq\boldsymbol{b}$ in the LP relaxation of (1). In this case we get

L(\textbf{x},\lambda^{\text{LP}},\rho^{*})=\boldsymbol{c}^{\top}\textbf{x}+(\lambda^{\text{LP}})^{\top}(\boldsymbol{A}\textbf{x}-\boldsymbol{b})+\frac{\rho^{*}}{2}\sum_{i\in\mathbb{I}}(\boldsymbol{a}_{i}\textbf{x}-b_{i})^{2}+\frac{\rho^{*}}{2}\sum_{i\notin\mathbb{I}}((\boldsymbol{a}_{i}\textbf{x}-b_{i})_{+})^{2}=\boldsymbol{c}^{\top}\textbf{x}+(\lambda^{\text{LP}})^{\top}(\boldsymbol{A}\textbf{x}-\boldsymbol{b})+\frac{\rho^{*}}{2}\sum_{i\in\mathbb{I}}(\boldsymbol{a}_{i}\textbf{x}-b_{i})^{2}.

Since

\frac{\rho^{*}}{2}\sum_{i\in\mathbb{I}}(\boldsymbol{a}_{i}\textbf{x}-b_{i})^{2}\geq\frac{\rho^{*}}{2}\min_{i\in\mathbb{I}}(\boldsymbol{a}_{i}\textbf{x}-b_{i})^{2}\overset{(7)}{\geq}\frac{\rho^{*}}{2}\delta=\boldsymbol{c}^{\top}\textbf{x}^{0}-f^{\text{LP}},

it yields

L(\textbf{x},\lambda^{\text{LP}},\rho^{*})\geq\boldsymbol{c}^{\top}\textbf{x}+(\lambda^{\text{LP}})^{\top}(\boldsymbol{A}\textbf{x}-\boldsymbol{b})+(\boldsymbol{c}^{\top}\textbf{x}^{0}-f^{\text{LP}})\geq f^{\text{LP}}+(\boldsymbol{c}^{\top}\textbf{x}^{0}-f^{\text{LP}})=\boldsymbol{c}^{\top}\textbf{x}^{0}\geq f^{\text{IP}}, \qquad (11)

where the second inequality holds due to the definition of $\lambda^{\text{LP}}$. Thus the inequalities (10) and (11), by letting $\lambda^{*}$ be $\bar{\lambda}$ in Case 1 and $\lambda^{\text{LP}}$ in Case 2, imply that

d(\lambda^{*},\rho^{*})=\min_{\textbf{x}\in\mathcal{X}}~L(\textbf{x},\lambda^{*},\rho^{*})\geq f^{\text{IP}}.

This together with (9) and (8) yields that

f_{\rho^{*}}^{\text{LD}}=d(\lambda^{*},\rho^{*})=f^{\text{IP}}.

Hence we complete the proof. ∎

One can observe from the proof that there exists a finite value $\rho^{*}$ such that strong duality still holds for all $\rho\geq\rho^{*}$. Therefore, given a sufficiently large penalty parameter $\rho$, we can achieve satisfactory feasibility for (1). Specifically, as $\rho$ increases beyond $\rho^{*}$, the augmented Lagrangian method penalizes violations of the constraints (1b) more heavily, thereby pushing the solution closer to the feasible region of the problem. Strong duality allows us to obtain a globally optimal solution to the problem (1) by solving the augmented Lagrangian dual problem (5).

To utilize the concave structure of the dual function $d(\lambda,\rho)$, we apply the projected subgradient method for solving (5), since the dual function is not differentiable. We first give the definition of subgradient and subdifferential.

Definition 2.2

Let $h:\mathbb{R}^{m}\rightarrow\mathbb{R}$ be a convex function. The vector $s\in\mathbb{R}^{m}$ is called a subgradient of $h$ at $\bar{x}\in\mathbb{R}^{m}$ if

h(x)-h(\bar{x})\geq s^{\top}(x-\bar{x}),~~\forall x\in\mathbb{R}^{m}.

The subdifferential of $h$ at $\bar{x}$ is the set of all subgradients of $h$ at $\bar{x}$, which is given by

\partial h(\bar{x})=\{s\in\mathbb{R}^{m}:h(x)-h(\bar{x})\geq s^{\top}(x-\bar{x}),~~\forall x\in\mathbb{R}^{m}\}.

Since $d(\lambda,\rho)$ is concave, we adjust Definition 2.2 to correspond to the set $-\partial(-d(\lambda,\rho))$, allowing us to apply properties of the subdifferential of a convex function to a concave function.

Proposition 2.1

Consider the dual function $d(\lambda,\rho):\mathbb{R}^{m+1}\rightarrow\mathbb{R}$. Then the subdifferentials of $d$ at $\lambda$ and $\rho$ satisfy

\boldsymbol{A}\textbf{x}-\boldsymbol{b}\in\partial_{\lambda}d(\lambda,\rho),\quad\frac{1}{2}\|(\boldsymbol{A}\textbf{x}-\boldsymbol{b})_{+}\|^{2}\in\partial_{\rho}d(\lambda,\rho), \qquad (12)

where $\textbf{x}$ is a solution of (4) with input $(\lambda,\rho)$.

Proof:

Let $\textbf{x}$ be a solution of (4) with input $(\lambda,\rho)$. Then for any pair $(\hat{\lambda},\hat{\rho})\in\mathbb{R}^{m}_{+}\times\mathbb{R}_{+}$, it holds that

\begin{aligned}
d(\hat{\lambda},\hat{\rho})&\leq\boldsymbol{c}^{\top}\textbf{x}+\hat{\lambda}^{\top}(\boldsymbol{A}\textbf{x}-\boldsymbol{b})+\frac{\hat{\rho}}{2}\|(\boldsymbol{A}\textbf{x}-\boldsymbol{b})_{+}\|^{2}\\
&=\boldsymbol{c}^{\top}\textbf{x}+\lambda^{\top}(\boldsymbol{A}\textbf{x}-\boldsymbol{b})+\frac{\rho}{2}\|(\boldsymbol{A}\textbf{x}-\boldsymbol{b})_{+}\|^{2}+(\hat{\lambda}-\lambda)^{\top}(\boldsymbol{A}\textbf{x}-\boldsymbol{b})+\frac{\hat{\rho}-\rho}{2}\|(\boldsymbol{A}\textbf{x}-\boldsymbol{b})_{+}\|^{2}\\
&=d(\lambda,\rho)+(\hat{\lambda}-\lambda)^{\top}(\boldsymbol{A}\textbf{x}-\boldsymbol{b})+\frac{\hat{\rho}-\rho}{2}\|(\boldsymbol{A}\textbf{x}-\boldsymbol{b})_{+}\|^{2},
\end{aligned}

where the last equality holds due to the optimality of $\textbf{x}$. Then by the definition of subgradient and subdifferential, we arrive at (12). ∎

3 Augmented Lagrangian method

In this section, we introduce the augmented Lagrangian method framework. The method consists of two steps for solving the augmented Lagrangian dual problem (5). We first use the primal information $\textbf{x}$ to construct a subgradient of $d$ at $(\lambda,\rho)$. Since $\lambda$ and $\rho$ satisfy $\lambda\geq 0$ and $\rho>0$, we then apply the projected subgradient method to update the parameters $\lambda$ and $\rho$. Starting from $\lambda^{0}\in\mathbb{R}_{+}^{m}$ and $\rho^{0}>0$, we update $\lambda$ and $\rho$ at the $(k+1)$-th iteration by

\lambda^{k+1}=\big(\lambda^{k}+\alpha^{k}(\boldsymbol{A}\textbf{x}^{k+1}-\boldsymbol{b})\big)_{+}, \qquad (14a)
\rho^{k+1}=\Big(\rho^{k}+\frac{\alpha^{k}}{2}\big\|(\boldsymbol{A}\textbf{x}^{k+1}-\boldsymbol{b})_{+}\big\|^{2}\Big)_{+}=\rho^{k}+\frac{\alpha^{k}}{2}\big\|(\boldsymbol{A}\textbf{x}^{k+1}-\boldsymbol{b})_{+}\big\|^{2}, \qquad (14b)

where $\textbf{x}^{k+1}$ is an optimal solution of the augmented Lagrangian relaxation problem (4) at $\lambda^{k}$ and $\rho^{k}$, that is,

\textbf{x}^{k+1}\in\arg\min_{\textbf{x}\in\mathcal{X}}~L(\textbf{x},\lambda^{k},\rho^{k}). \qquad (15)

Therefore, the iterative processes (15), (14a) and (14b), involving the primal and dual variables, make up the ALM framework.
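A minimal sketch of the dual updates (14a)-(14b), assuming the primal iterate x_new has already been obtained by (approximately) solving (15) and that the step size alpha is a constant chosen by the user, is given below:

```python
import numpy as np

def dual_update(lam, rho, x_new, A, b, alpha):
    """Projected subgradient step on (lambda, rho), cf. (14a)-(14b)."""
    r = A @ x_new - b
    lam_new = np.maximum(lam + alpha * r, 0.0)                        # (14a): project onto lambda >= 0
    rho_new = rho + 0.5 * alpha * np.sum(np.maximum(r, 0.0) ** 2)     # (14b): rho is nondecreasing
    return lam_new, rho_new
```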

Solving the $\textbf{x}$-subproblem (15) is an important step in the ALM. While the subproblem (15) has no closed-form solution in general, we can apply an iterative method to solve it exactly or inexactly. Consequently, the ALM for solving the problem (1) consists of outer and inner iterations. The index $k$ is used to denote the outer iteration number, and the index $t$ is used to denote the inner iteration number.

3.1 A BCD method for subproblem (15)

In this subsection, we present a BCD method for solving the $\textbf{x}$-subproblem (15) in the ALM framework. The BCD method minimizes the function $L$ by cyclically updating the blocks in the order $\textbf{x}_{1},\dots,\textbf{x}_{p}$, keeping the other blocks fixed at their most recent values. Denote $\textbf{x}^{t}=(\textbf{x}_{1}^{t};\textbf{x}_{2}^{t};\dots;\textbf{x}_{p}^{t})$, where $\textbf{x}^{t}_{j}$ is the value of $\textbf{x}_{j}$ at its $t$-th update. Let

L^{t}_{j}(\textbf{x}_{j},\lambda,\rho)=L(\textbf{x}^{t+1}_{1},\dots,\textbf{x}^{t+1}_{j-1},\textbf{x}_{j},\textbf{x}^{t}_{j+1},\dots,\textbf{x}^{t}_{p},\lambda,\rho),

and

\textbf{x}^{t}(j)=\big(\textbf{x}_{1}^{t+1},\textbf{x}_{2}^{t+1},\dots,\textbf{x}_{j-1}^{t+1},\textbf{x}_{j}^{t},\textbf{x}_{j+1}^{t},\dots,\textbf{x}_{p}^{t}\big).

Therefore $\textbf{x}^{t}(1)=(\textbf{x}_{1}^{t};\textbf{x}_{2}^{t};\dots;\textbf{x}_{p}^{t})=\textbf{x}^{t}$ and $\textbf{x}^{t}(p+1)=(\textbf{x}_{1}^{t+1};\textbf{x}_{2}^{t+1};\dots;\textbf{x}_{p}^{t+1})=\textbf{x}^{t+1}$. Given fixed parameters $\lambda\geq 0$ and $\rho>0$, we can also calculate the partial gradient of $L(\textbf{x}^{t}(j),\lambda,\rho)$ with respect to $\textbf{x}_{j}$ by

g_{j}(\textbf{x}^{t}):=\nabla_{\textbf{x}_{j}}L(\textbf{x}^{t}(j),\lambda,\rho)=\boldsymbol{c}_{j}+\boldsymbol{A}_{j}^{\top}\lambda+\rho\boldsymbol{A}_{j}^{\top}\big(\boldsymbol{A}\textbf{x}^{t}(j)-\boldsymbol{b}\big)_{+}.

At each step, we consider two types of updates for every $\textbf{x}_{j}\in\mathcal{X}_{j}$:

Classical: \textbf{x}^{t+1}_{j}\in\arg\min_{\textbf{x}_{j}\in\mathcal{X}_{j}}~L^{t}_{j}(\textbf{x}_{j},\lambda,\rho), \qquad (16a)
Proximal linear: \textbf{x}_{j}^{t+1}\in\arg\min_{\textbf{x}_{j}\in\mathcal{X}_{j}}\Big\{\langle\textbf{x}_{j}-\textbf{x}_{j}^{t},g_{j}(\textbf{x}^{t})\rangle+\frac{1}{2\tau}\|\textbf{x}_{j}-\textbf{x}_{j}^{t}\|^{2}\Big\}, \qquad (16b)

where $\tau>0$ is a step size. In fact, if one defines the projection operator by

P_{\mathcal{X}}(v)\in\arg\min\{\|u-v\|:u\in\mathcal{X}\},

then we obtain the following equivalent form of (16b):

\textbf{x}_{j}^{t+1}=P_{\mathcal{X}_{j}}\big(\textbf{x}_{j}^{t}-\tau g_{j}(\textbf{x}^{t})\big).

In general, the classical subproblem (16a) is fundamentally harder to solve because of the quadratic term in the objective function. However, we derive a simplified form of this subproblem under certain conditions, which will be discussed in Subsection 3.1.2. By contrast, the proximal linear subproblem (16b) is relatively easy to solve because its objective function becomes linear with respect to $\textbf{x}_{j}$ (see (17)-(18) below). We now summarize the ALM for solving (1) in Algorithm 1, which allows each $\textbf{x}_{j}$ to be updated by (16a) or (16b); an illustrative sketch follows the algorithm box. We assume that each block $j$ is updated by the same scheme, either (16a) or (16b), for all iterations $t$.

Input: Initial point $\textbf{x}^{0}$, $\lambda^{0},\rho^{0}$.
Output: A feasible solution $\textbf{x}^{k+1}$.
1 for $k=0,1,\dots,k_{\max}$ do
2   for $t=0,1,\dots,t_{\max}$ do
3     for $j=1,2,\dots,p$ do
4       Compute $\textbf{x}_{j}^{t+1}$ by (16a) or (16b);
5     if $\textbf{x}^{t+1}=\textbf{x}^{t}$, then let $\textbf{x}^{k+1}=\textbf{x}^{t+1}$ and break;
6   if $\|(\boldsymbol{A}\textbf{x}^{k+1}-\boldsymbol{b})_{+}\|_{2}=0$, then terminate;
7   Update the Lagrangian multiplier $\lambda^{k+1}$ by (14a) and the penalty coefficient $\rho^{k+1}$ by (14b).
Algorithm 1 ALM with BCD
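A minimal Python sketch of Algorithm 1 is given below; it assumes a hypothetical user-supplied oracle block_argmin(j, x, lam, rho) that returns the block update (16a) or (16b), and a list block_slices giving the index range of each block in x:

```python
import numpy as np

def alm_bcd(x0, lam0, rho0, A, b, block_slices, block_argmin,
            alpha=1.0, k_max=50, t_max=100):
    """Sketch of Algorithm 1: outer ALM iterations with an inner BCD loop."""
    x, lam, rho = x0.copy(), lam0.copy(), rho0
    for k in range(k_max):
        for t in range(t_max):                      # inner BCD sweeps
            x_prev = x.copy()
            for j, sl in enumerate(block_slices):   # cyclic block updates (16a)/(16b)
                x[sl] = block_argmin(j, x, lam, rho)
            if np.array_equal(x, x_prev):           # line 5: BCD iterates unchanged
                break
        r = A @ x - b
        if np.all(r <= 0):                          # line 6: coupling constraints satisfied
            return x, lam, rho
        lam = np.maximum(lam + alpha * r, 0.0)      # (14a)
        rho = rho + 0.5 * alpha * np.sum(np.maximum(r, 0.0) ** 2)  # (14b)
    return x, lam, rho
```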

Subsequently, we provide more details about these two updates (16a) and (16b).

3.1.1 Proximal linear update of BCD

Before considering the proximal linear subproblem (16b), we first give a definition of a linear operator, which is essential in the BCD method.

Definition 3.1

The linear operator associated with a vector $v\in\mathbb{R}^{n}$ is defined by

\mathcal{T}_{\Omega}(v)=\arg\min_{\textbf{u}\in\Omega}~v^{\top}\textbf{u},

where $\Omega$ is a nonempty and closed set.
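The operator $\mathcal{T}_{\mathcal{X}_{j}}$ amounts to a linear integer program over the block set $\mathcal{X}_{j}$. The brute-force sketch below (illustrative only, sensible for very small $n_{j}$; a MILP solver would be used in practice) makes this concrete:

```python
import itertools
import numpy as np

def linear_operator(v, B_j, d_j):
    """T_{X_j}(v): minimize v' u over u in X_j = {u in {0,1}^{n_j} : B_j u <= d_j}."""
    n_j = len(v)
    best_u, best_val = None, np.inf
    for bits in itertools.product((0, 1), repeat=n_j):   # enumerate binary vectors
        u = np.array(bits, dtype=float)
        if np.all(B_j @ u <= d_j) and v @ u < best_val:
            best_u, best_val = u, v @ u
    return best_u
```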

Benefiting from the fact that $\textbf{x}$ is a binary variable, we have

\|\textbf{x}_{j}\|^{2}=\textbf{1}^{\top}\textbf{x}_{j},\quad\forall j\in\mathbb{N}_{p}, \qquad (17)

so the objective function in (16b) is in fact linear in $\textbf{x}_{j}$. Therefore, we can rewrite (16b) as

\textbf{x}_{j}^{t+1}\in\arg\min_{\textbf{x}_{j}\in\mathcal{X}_{j}}\Big\{\textbf{x}_{j}^{\top}g_{j}(\textbf{x}^{t})+\frac{1}{2\tau}\textbf{1}^{\top}\textbf{x}_{j}-\frac{1}{\tau}\textbf{x}_{j}^{\top}\textbf{x}_{j}^{t}\Big\}=\mathcal{T}_{\mathcal{X}_{j}}\Big(\tau g_{j}(\textbf{x}^{t})+\frac{\textbf{1}}{2}-\textbf{x}_{j}^{t}\Big). \qquad (18)

Due to the discrete nature of the feasible set $\mathcal{X}_{j}$, it is crucial to choose the step size $\tau$ appropriately. If $\tau$ is too large, the BCD method may not converge; if $\tau$ is too small, the BCD method may get stuck at some points. The reason will be explained in the convergence analysis.

3.1.2 Classical update of BCD

We start by giving the following assumption on model (1), which is very common in many applications such as train timetabling, vehicle routing, and allocation problems.

Assumption 3.1

The entries of the matrix $\boldsymbol{A}$ are either $1$ or $0$. The vector $\boldsymbol{b}$ equals $\textbf{1}$.

Under this assumption, we consider the subproblem (16a). For each $j\in\mathbb{N}_{p}$, if the condition $\boldsymbol{A}_{j}\textbf{x}^{t+1}_{j}\leq\textbf{1}$ always holds at each iteration of the BCD method, then

\textbf{x}^{t+1}_{j}\in\arg\min_{\textbf{x}_{j}\in\mathcal{X}_{j}}~L^{t}_{j}(\textbf{x}_{j},\lambda,\rho)=\mathcal{T}_{\mathcal{X}_{j}}\Big(\boldsymbol{c}_{j}+\boldsymbol{A}_{j}^{\top}\lambda+\rho\boldsymbol{A}_{j}^{\top}\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}}{2}\Big)_{+}\Big). \qquad (19)

To prove the above derivation, we introduce the following two index sets at the $t$-th update:

\mathcal{I}:=\Big\{i\in\mathbb{N}_{m}:\sum_{l\neq j}^{p}\boldsymbol{A}_{i,l}\textbf{x}_{l}^{t}(j)=0\Big\},\qquad \bar{\mathcal{I}}:=\Big\{i\in\mathbb{N}_{m}:\sum_{l\neq j}^{p}\boldsymbol{A}_{i,l}\textbf{x}_{l}^{t}(j)\geq 1\Big\}, \qquad (22)

where $\textbf{x}^{t}_{l}(j)=\textbf{x}^{t+1}_{l}$ if $l<j$ and $\textbf{x}^{t}_{l}(j)=\textbf{x}^{t}_{l}$ if $l>j$. Obviously, $\mathcal{I}\cap\bar{\mathcal{I}}=\emptyset$ and $\mathcal{I}\cup\bar{\mathcal{I}}=\mathbb{N}_{m}$. Therefore, we have

\sum_{l\neq j}^{p}\boldsymbol{A}_{\bar{\mathcal{I}},l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}_{\bar{\mathcal{I}}}}{2}=\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{\bar{\mathcal{I}},l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}_{\bar{\mathcal{I}}}}{2}\Big)_{+}, \qquad (23)
\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{\mathcal{I},l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}_{\mathcal{I}}}{2}\Big)_{+}=\textbf{0}, \qquad (24)

and

\Big(\boldsymbol{A}_{\mathcal{I},j}\textbf{x}_{j}+\sum_{l\neq j}^{p}\boldsymbol{A}_{\mathcal{I},l}\textbf{x}^{t}_{l}(j)-\textbf{1}_{\mathcal{I}}\Big)_{+}=\big(\boldsymbol{A}_{\mathcal{I},j}\textbf{x}_{j}-\textbf{1}_{\mathcal{I}}\big)_{+}=\textbf{0},\qquad \boldsymbol{A}_{\bar{\mathcal{I}},j}\textbf{x}_{j}+\sum_{l\neq j}^{p}\boldsymbol{A}_{\bar{\mathcal{I}},l}\textbf{x}^{t}_{l}(j)-\textbf{1}_{\bar{\mathcal{I}}}\geq\textbf{0}. \qquad (27)

Since each element of $\boldsymbol{A}_{j}\textbf{x}_{j}$ is either $1$ or $0$, we have

\|\boldsymbol{A}_{j}\textbf{x}_{j}\|^{2}=\textbf{1}^{\top}(\boldsymbol{A}_{j}\textbf{x}_{j}),\quad\forall~j\in\mathbb{N}_{p}. \qquad (28)

Denote $C=\big\|\sum_{l\neq j}^{p}\boldsymbol{A}_{\bar{\mathcal{I}},l}\textbf{x}^{t}_{l}(j)-\textbf{1}_{\bar{\mathcal{I}}}\big\|^{2}$. Then

\begin{aligned}
&\Big\|\Big(\boldsymbol{A}_{j}\textbf{x}_{j}+\sum_{l\neq j}^{p}\boldsymbol{A}_{l}\textbf{x}^{t}_{l}(j)-\textbf{1}\Big)_{+}\Big\|^{2}\\
\overset{(27)}{=}~&\|\boldsymbol{A}_{\bar{\mathcal{I}},j}\textbf{x}_{j}\|^{2}+2(\boldsymbol{A}_{\bar{\mathcal{I}},j}\textbf{x}_{j})^{\top}\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{\bar{\mathcal{I}},l}\textbf{x}^{t}_{l}(j)-\textbf{1}_{\bar{\mathcal{I}}}\Big)+C\\
\overset{(28)}{=}~&2(\boldsymbol{A}_{\bar{\mathcal{I}},j}\textbf{x}_{j})^{\top}\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{\bar{\mathcal{I}},l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}_{\bar{\mathcal{I}}}}{2}\Big)+C\\
\overset{(23)}{=}~&2(\boldsymbol{A}_{\bar{\mathcal{I}},j}\textbf{x}_{j})^{\top}\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{\bar{\mathcal{I}},l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}_{\bar{\mathcal{I}}}}{2}\Big)_{+}+C\\
\overset{(24)}{=}~&2(\boldsymbol{A}_{\bar{\mathcal{I}},j}\textbf{x}_{j})^{\top}\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{\bar{\mathcal{I}},l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}_{\bar{\mathcal{I}}}}{2}\Big)_{+}+2(\boldsymbol{A}_{\mathcal{I},j}\textbf{x}_{j})^{\top}\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{\mathcal{I},l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}_{\mathcal{I}}}{2}\Big)_{+}+C\\
=~&2(\boldsymbol{A}_{j}\textbf{x}_{j})^{\top}\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}}{2}\Big)_{+}+C.
\end{aligned} \qquad (31)

Then the iterative scheme for x-update is given by

\begin{aligned}
\textbf{x}^{t+1}_{j}&=\arg\min_{\textbf{x}_{j}\in\mathcal{X}_{j}}~L^{t}_{j}(\textbf{x}_{j},\lambda,\rho)\\
&=\arg\min_{\textbf{x}_{j}\in\mathcal{X}_{j}}\Big\{\boldsymbol{c}_{j}^{\top}\textbf{x}_{j}+\lambda^{\top}(\boldsymbol{A}_{j}\textbf{x}_{j}-\textbf{1})+\frac{\rho}{2}\Big\|\Big(\boldsymbol{A}_{j}\textbf{x}_{j}+\sum_{l\neq j}^{p}\boldsymbol{A}_{l}\textbf{x}^{t}_{l}(j)-\textbf{1}\Big)_{+}\Big\|^{2}\Big\}\\
&\overset{(31)}{=}\arg\min_{\textbf{x}_{j}\in\mathcal{X}_{j}}\Big\{\boldsymbol{c}_{j}^{\top}\textbf{x}_{j}+\lambda^{\top}(\boldsymbol{A}_{j}\textbf{x}_{j}-\textbf{1})+\rho(\boldsymbol{A}_{j}\textbf{x}_{j})^{\top}\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}}{2}\Big)_{+}\Big\}\\
&=\mathcal{T}_{\mathcal{X}_{j}}\Big(\boldsymbol{c}_{j}+\boldsymbol{A}_{j}^{\top}\lambda+\rho\boldsymbol{A}_{j}^{\top}\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}}{2}\Big)_{+}\Big).
\end{aligned}

We can observe that the condition $\boldsymbol{A}\textbf{x}\leq\textbf{1}$ allows the minimization of the augmented Lagrangian function (3) to be decomposed into a set of subproblems with linear objective functions, which makes (16a) easier to solve. Therefore, the key point in the derivation of the linearization process is the fact that the entries of $\boldsymbol{A}_{j}\textbf{x}_{j}$ are either $1$ or $0$. We next verify that this condition always holds at each iteration of the BCD method under the following assumption.

Assumption 3.2

Denote by $T_{\boldsymbol{A}_{j}}$ the set of indices of columns of the matrix $\boldsymbol{A}_{j}$ containing only zeros, and by $c_{j_{s}}$ the $s$-th element of $\boldsymbol{c}_{j}$. At least one of the following holds.

  • (i)

For all $j\in\mathbb{N}_{p}$, there exists $\textbf{x}_{j}\in\mathcal{X}_{j}$ such that $\boldsymbol{A}_{j}\textbf{x}_{j}\leq\textbf{1}$ and $\{s\in\mathbb{N}_{n_{j}}:c_{j_{s}}\neq 0\}\subseteq T_{\boldsymbol{A}_{j}}$.

  • (ii)

For all $j\in\mathbb{N}_{p}$, there is at most one nonzero element in the vector $\boldsymbol{c}_{j}$ and $\{\textbf{x}_{j}\in\mathbb{R}^{n_{j}}:\boldsymbol{A}_{j}\textbf{x}_{j}\leq\textbf{1}\}\subseteq\{\textbf{x}_{j}\in\mathbb{R}^{n_{j}}:\textbf{x}_{j}\in\mathcal{X}_{j}\}$.

Remark: (a) Assumption 3.2 (i) requires that if an element $\textbf{x}_{j_{s}}$ satisfies $\textbf{x}_{j_{s}}=1$, then at least one of $c_{j_{s}}$ and the $s$-th column of $\boldsymbol{A}_{j}$ is zero. Assumption 3.2 (ii) requires that if the block constraint set $\mathcal{X}_{j}$ is large enough that $\boldsymbol{A}_{j}\textbf{x}_{j}\leq\textbf{1}$ already implies $\textbf{x}_{j}\in\mathcal{X}_{j}$, then the weight vector $\boldsymbol{c}_{j}$ of each block variable has at most one nonzero element. (b) Assumption 3.2 usually holds when $\boldsymbol{A}$ and $\boldsymbol{c}$ are sparse. This is particularly true in the train timetabling problem with the objective of scheduling more trains, as will be shown in Section 5.
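As a small sanity-check sketch (illustrative only, not from the paper), Assumption 3.1 and the sparsity pattern underlying Assumption 3.2 (i) can be verified numerically for given data:

```python
import numpy as np

def check_assumption_31(A, b):
    """Assumption 3.1: A has 0/1 entries and b is the all-ones vector."""
    return np.isin(A, (0, 1)).all() and np.all(b == 1)

def check_assumption_32i_pattern(A_j, c_j):
    """Pattern required by Assumption 3.2 (i): every index s with c_{j_s} != 0
    corresponds to an all-zero column of A_j (i.e. lies in T_{A_j})."""
    zero_cols = np.where(~A_j.any(axis=0))[0]
    nonzero_c = np.where(c_j != 0)[0]
    return set(nonzero_c).issubset(set(zero_cols))
```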

Lemma 3.1

Suppose Assumption 3.2 holds. From any starting point $\textbf{x}^{0}$, the iterates generated by the BCD method with the classical update satisfy $\boldsymbol{A}_{j}\textbf{x}_{j}^{t+1}\leq\textbf{1}$ for all $j\in\mathbb{N}_{p}$ and $t=0,1,2,\dots$.

Proof:

Let $\textbf{x}_{j}^{t+1}$ be the solution generated by the BCD method after the $t$-th classical update. We argue by contradiction and suppose that there exists $i\in\mathbb{N}_{m}$ such that $\boldsymbol{A}_{i,j}\textbf{x}_{j}^{t+1}>1$. Then for any $\bar{\textbf{x}}_{j}\in\mathcal{X}_{j}$ satisfying $\boldsymbol{A}_{j}\bar{\textbf{x}}_{j}\leq\textbf{1}$, we have

L^{t}_{j}(\textbf{x}_{j}^{t+1},\lambda,\rho)<L^{t}_{j}(\bar{\textbf{x}}_{j},\lambda,\rho),~~\forall j\in\mathbb{N}_{p},

which implies that for any $j\in\mathbb{N}_{p}$,

\Big(\boldsymbol{c}_{j}+\boldsymbol{A}_{j}^{\top}\lambda+\rho\boldsymbol{A}_{j}^{\top}\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}}{2}\Big)_{+}\Big)^{\top}(\textbf{x}_{j}^{t+1}-\bar{\textbf{x}}_{j})<0,

then

\boldsymbol{c}_{j}^{\top}(\textbf{x}_{j}^{t+1}-\bar{\textbf{x}}_{j})+\lambda^{\top}(\boldsymbol{A}_{j}\textbf{x}_{j}^{t+1}-\boldsymbol{A}_{j}\bar{\textbf{x}}_{j})+\rho\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}}{2}\Big)_{+}^{\top}(\boldsymbol{A}_{j}\textbf{x}_{j}^{t+1}-\boldsymbol{A}_{j}\bar{\textbf{x}}_{j})<0. \qquad (35)

We consider the following two cases corresponding to Assumption 3.2:

Case 1: Assumption 3.2 (i) implies $c_{j_{s}}=0$ for any $s\in S_{j}:=\{s\in\mathbb{N}_{n_{j}}:\textbf{x}_{j_{s}}^{t+1}=1,\boldsymbol{A}_{i,j_{s}}=1\}$. Taking $\bar{\textbf{x}}_{j}=(\textbf{x}_{j_{1}}^{t+1},\dots,\textbf{x}_{j_{s}}^{t+1}-1,\dots,\textbf{x}_{j_{n_{j}}}^{t+1})$ with $s\in S_{j}$ such that $\boldsymbol{A}_{j}\bar{\textbf{x}}_{j}\leq\textbf{1}$ gives us

\boldsymbol{c}_{j}^{\top}(\textbf{x}_{j}^{t+1}-\bar{\textbf{x}}_{j})=\sum\nolimits_{s}c_{j_{s}}=0~~\text{and}~~\boldsymbol{A}_{i,j}\textbf{x}_{j}^{t+1}-\boldsymbol{A}_{i,j}\bar{\textbf{x}}_{j}\geq 1. \qquad (36)

This contradicts (35) since $\lambda\geq 0$ and $\rho>0$.

Case 2: Assumption 3.2 (ii) implies that there exists

\bar{\textbf{x}}_{j}=(\textbf{x}_{j_{1}}^{t+1},\dots,\bar{\textbf{x}}_{j_{s^{\prime}}},\dots,\textbf{x}_{j_{s}}^{t+1}-1,\dots,\textbf{x}_{j_{n_{j}}}^{t+1})~~\text{with}~~\bar{\textbf{x}}_{j_{s^{\prime}}}=1 \qquad (37)

for any $s\in S_{j}$ and $s^{\prime}\neq s$ such that $\boldsymbol{A}_{j}\bar{\textbf{x}}_{j}\leq\textbf{1}$. If $c_{j_{s^{\prime}}}\geq 0$, we can apply the same procedure as in Case 1 and arrive at a contradiction. If $c_{j_{s^{\prime}}}<0$, we take $\bar{\textbf{x}}_{j}$ as in (37); then $\textbf{x}_{j_{s^{\prime}}}^{t+1}-\bar{\textbf{x}}_{j_{s^{\prime}}}\leq 0$. Hence,

\boldsymbol{c}_{j}^{\top}(\textbf{x}_{j}^{t+1}-\bar{\textbf{x}}_{j})=c_{j_{s^{\prime}}}(\textbf{x}_{j_{s^{\prime}}}^{t+1}-\bar{\textbf{x}}_{j_{s^{\prime}}})\geq 0,\qquad \boldsymbol{A}_{i,j}\textbf{x}_{j}^{t+1}-\boldsymbol{A}_{i,j}\bar{\textbf{x}}_{j}\geq 1,

which is a contradiction; this completes the proof. ∎

Overall, we can reformulate the subproblem (16a) into (19) based on the special structure of the model (1) under certain conditions, and thus make it easier to solve. It is worth noting that if Assumption 3.2 is not satisfied, we can consider (19) as a linear approximation of (16a). This linearization technique is widely used, including in works [20] and [21]. However, they lacked theoretical guarantees. Our main result shows that under specified assumptions, using the update (19) in the ALM-C method is equivalent to solving (16a) exactly, theoretically ensuring the effectiveness of our method.

3.2 Finding a good feasible solution by set packing

Since the BCD method may not always yield a feasible solution [35], we can adopt several refinement strategies that are very useful for finding a feasible solution to the problem (1) in practice. Very often, (1) represents a problem where we optimize under limited resources, which provides a set packing point of view. Since we produce candidate solutions all along, a natural idea is to utilize past iterates to construct a feasible solution. A simple strategy is to construct a solution pool for each $j$ that includes the past BCD iterates $V^{k}_{j}=\{\textbf{x}^{1}_{j},\textbf{x}^{2}_{j},\dots,\textbf{x}^{k}_{j}\}$, $j=1,\dots,p$. Intuitively, the solution pool may include feasible or “nice” solutions in some sense.

To illustrate the above approach, we first introduce the iterative sequential technique.

3.2.1 A sweeping technique

The simplest technique is probably to select a subset of blocks, one by one, until infeasibility is detected. We present this sequential method in Algorithm 2, which can be understood as simply sweeping the blocks and selecting a candidate solution from $V_{j}^{k}$ if feasibility is still guaranteed; otherwise we skip the current block and continue.

Input: The set of past BCD solutions $V^{k}_{j}$, $j=1,\dots,p$.
Output: A feasible solution $\hat{\textbf{x}}^{k}$
1 while the termination criterion is not satisfied do
2   for $j=1,2,\dots,p$ do
3     Select $\textbf{v}_{j}\in V_{j}^{k}$;
4     if $\boldsymbol{A}_{j}\textbf{v}_{j}+\sum_{l=1}^{j-1}\boldsymbol{A}_{l}\hat{\textbf{x}}^{k}_{l}-\boldsymbol{b}\leq 0$ then
5       let $\hat{\textbf{x}}_{j}^{k}=\textbf{v}_{j}$;
6     else
7       let $\hat{\textbf{x}}_{j}^{k}=0$;
Algorithm 2 A sweeping technique
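A minimal sketch of this sweeping idea is given below; it assumes the candidates for each block are simply tried in their stored order, which is one possible instantiation of the selection rule in line 3:

```python
import numpy as np

def sweep_feasible(V, A_blocks, b):
    """Sketch of Algorithm 2: keep one candidate per block while the partial
    sum of coupling constraints stays feasible; V[j] is a list of past BCD
    iterates for block j."""
    p = len(V)
    x_hat = [np.zeros_like(V[j][0]) for j in range(p)]
    load = np.zeros_like(b, dtype=float)          # running value of sum_l A_l x_hat_l
    for j in range(p):
        for v in V[j]:                            # try stored candidates for block j
            if np.all(load + A_blocks[j] @ v - b <= 0):
                x_hat[j] = v
                load = load + A_blocks[j] @ v
                break                             # keep the first feasible candidate
    return x_hat
```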

3.2.2 A packing technique

Let us further explore the idea of selecting solutions in a systematic way. Formally, for each block $j\in\mathbb{N}_{p}$, we introduce a set of binary variables $\mu_{j}$ that defines the current selection:

\hat{\textbf{x}}_{j}^{k}=X_{j}^{k}\mu_{j},~~\mu_{j}\in\{0,1\}^{k},

where $X_{j}^{k}=[\textbf{x}^{1}_{j},\textbf{x}^{2}_{j},\dots,\textbf{x}^{k}_{j}]$ collects the candidates as columns. We consider the following problem:

\min_{\mu_{j}\in\{0,1\}^{k}}~ \sum_{j=1}^{p}\boldsymbol{c}_{j}^{\top}X_{j}^{k}\mu_{j} \qquad (38a)
\text{s.t.}~~ \mu_{j}^{\top}\textbf{1}\leq 1,~\forall j=1,\dots,p, \qquad (38b)
~~~~~~~ \sum_{j=1}^{p}\boldsymbol{A}_{j}X_{j}^{k}\mu_{j}\leq\boldsymbol{b}. \qquad (38c)

In view of (38b), one may recognize the above problem as a restricted master problem appearing in column generation algorithms, where (38c) stands for a set of knapsack constraints.

Specifically, the coupling constraints (38c) in our model are cliques, representing complete subgraphs where each pair of distinct nodes is connected. This allows us to reformulate the model as a maximum independent set problem. Therefore, we can find a feasible solution $\textbf{x}^{*}$ that satisfies $\boldsymbol{A}\textbf{x}^{*}\leq\boldsymbol{b}$ by solving the problem (38). Since this problem is still hard, we only solve a relaxation via a maximal independent set. Now we go into details. Since the knapsack constraints (38c) mean that the candidate solutions may conflict, we can construct a conflict graph $F=(V,E)$. In this graph, $V$ represents the set of nodes, where each node corresponds to a candidate solution generated as the algorithm proceeds, and $E$ is the set of edges connecting two conflicting solutions, i.e., solutions that jointly violate the coupling constraints. In this view, we only have to maintain the graph $F$ and find a maximal independent set $\mathcal{K}$. The output feasible point $\textbf{x}_{j}^{*}$ corresponds to $\textbf{v}_{j}\in\mathcal{K}$ for each $j\in\mathbb{N}_{p}$; if $\textbf{v}_{j}\notin\mathcal{K}$, then $\textbf{x}_{j}^{*}=0$. We summarize the maximal independent set technique in Algorithm 3. We also note that Algorithm 2 can be seen as a special case of Algorithm 3.

Input: The BCD solution $\textbf{x}^{k}$, last conflict graph $F^{k-1}=(V^{k-1},E^{k-1})$
Output: A feasible solution $\textbf{x}_{*}^{k}$
1 Step 1: Conflict graph update
2 for $j=1,2,\dots,p$ do
3   Collect new candidate paths $\tilde{V}^{k}_{j}$ for block $j$;
4   (1.1 node-update) $V^{k}\leftarrow V^{k-1}+\tilde{V}^{k}_{j}$;
5   (1.2 self-check) $E^{k}\leftarrow E^{k-1}+\{(p,p^{\prime})\mid\forall p\neq p^{\prime},~p\in V^{k}_{j},~p^{\prime}\in V^{k}_{j}\}$;
6   (1.3 edge-completion) $E^{k}\leftarrow E^{k}+\{(p,p^{\prime})\mid p,p^{\prime}~\text{are not compatible},~\forall p\in\tilde{V}^{k}_{j},~p^{\prime}\in V^{k}\backslash V^{k}_{j}\}$, where ‘compatible’ means that the two nodes (block variables) jointly satisfy the binding constraints;
7 Step 2: Maximal independent set (MIS) for a feasible solution
8 Select a candidate solution set $K\subset V^{k}$;
9 for $v\in K$ do
10   Compute a maximal independent set $\mathcal{K}(v)$ with respect to $v$ in $O(|E^{k}|)$ iterations;
11 Compute $\textbf{x}^{*}=\arg\max_{v}\mathcal{K}(v)$.
Algorithm 3 A packing (maximal independent set) technique
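A minimal sketch of the conflict-graph idea is given below. It is illustrative only: it assumes the coupling rows behave like clique constraints (Assumption 3.1 with $\boldsymbol{b}=\textbf{1}$), so that pairwise compatibility checks suffice, and it uses a simple greedy maximal independent set rather than the exact routine of Algorithm 3:

```python
import numpy as np

def make_conflicts(A_blocks, b):
    """Two candidates conflict if they belong to the same block or if
    selecting both would violate the coupling constraints pairwise."""
    def conflicts(u, v):
        (j, xu), (l, xv) = u, v
        if j == l:
            return True
        return np.any(A_blocks[j] @ xu + A_blocks[l] @ xv - b > 0)
    return conflicts

def greedy_maximal_independent_set(nodes, conflicts):
    """Greedy maximal independent set on the conflict graph;
    nodes are (block_index, candidate_vector) pairs."""
    selected = []
    for u in nodes:
        if all(not conflicts(u, v) for v in selected):
            selected.append(u)
    return selected
```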

Based on the above-mentioned techniques, we improve the ALM and propose a customized ALM in Algorithm 4.

Input: $\textbf{x}^{0},\lambda^{0},\rho^{0}>0$ and the best objective function value $f^{*}=+\infty$. Set $k=0$.
Output: A local (global) optimal solution $\textbf{x}^{*}$.
1 while the termination criterion is not satisfied do
2   Step 2: Construct a solution using Alg. 1
3   Update the BCD solution $\textbf{x}^{k+1}$ by the procedure in lines 2-5 of Alg. 1;
4   Step 3: Generate a feasible solution
5   if the BCD solution is not feasible then
6     transform the BCD solution into a feasible solution $\bar{\textbf{x}}^{k+1}$ by calling a refinement method in Alg. 2 or 3;
7   else
8     let $\bar{\textbf{x}}^{k+1}=\textbf{x}^{k+1}$;
9   Step 4: Update the best solution
10  if $\boldsymbol{c}^{\top}\bar{\textbf{x}}^{k+1}\leq f^{*}$ then
11    Set $f^{*}=\boldsymbol{c}^{\top}\bar{\textbf{x}}^{k+1}$ and $\textbf{x}^{*}=\bar{\textbf{x}}^{k+1}$;
12  Update the Lagrangian multiplier $\lambda^{k+1}$ by (14a) and the penalty coefficient $\rho^{k+1}$ by (14b). Let $k=k+1$.
Algorithm 4 A customized ALM

Although Algorithms 2 and 3 can help us find a feasible solution to problem (1), the quality of their output remains unjustified. To evaluate and improve the quality of the solution, a simple way is to estimate upper and lower bounds on the objective function value and then calculate the gap between them: the smaller the gap, the better the current solution. Obviously, our method provides an upper bound on the objective function value. For the lower bound, we can use the following methods. (i) LP relaxation: by relaxing the binary variables to continuous variables in $[0,1]$, we solve a linear programming problem exactly to obtain a lower bound on the objective function value. (ii) LR relaxation: since the Lagrangian dual problem (2) is separable, we can solve the decomposed subproblems exactly to obtain a lower bound on the objective function value.

In general, the Lagrangian dual bound is at least as tight as the bound obtained from the usual linear programming relaxation [14]. Note that the relaxation must be solved to optimality to yield a valid bound. Therefore, we can use a combination of the LR method and Alg. 4. To be specific, LR aims at generating the lower bound of the objective function value in (1); its solution is usually infeasible. Steps 2 and 3 in Alg. 4 are used to generate feasible solutions of (1). The smaller the gap between the lower bound and the upper bound, the closer the feasible solution is to the global optimal solution of the problem. It can be seen from the iterative procedure that, after many iterations, this algorithm generates many feasible solutions and the lower bound of the model keeps improving; finally, we select the best solution. The combination of these two methods takes advantage of both the ALM and the LR method: it not only finds a good feasible solution but also evaluates the quality of the solution.
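A minimal sketch of the LP-relaxation lower bound and the resulting optimality gap, assuming SciPy's linprog with the HiGHS backend and an incumbent feasible solution x_feasible produced by Alg. 4, is given below:

```python
import numpy as np
from scipy.optimize import linprog

def lp_lower_bound_and_gap(c, A, b, B, d, x_feasible):
    """Lower bound via LP relaxation (binary vars relaxed to [0, 1]) and the
    relative gap against the incumbent feasible solution."""
    res = linprog(c,
                  A_ub=np.vstack([A, B]),
                  b_ub=np.concatenate([b, d]),
                  bounds=[(0, 1)] * len(c),
                  method="highs")
    lower = res.fun                      # f^LP <= f^IP
    upper = c @ x_feasible               # objective value of the incumbent
    gap = (upper - lower) / max(1.0, abs(upper))
    return lower, upper, gap
```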

Compared with the ADMM-based method in [21], we utilize the BCD method to perform multiple iterations when solving the subproblem (15), until the solutions remain unchanged; this improves the solution accuracy of the subproblem and thereby reduces the total number of iterations. Moreover, we adopt different refinement techniques to further enhance the solution quality.

4 Convergence Analysis

In this section, we present the convergence analysis of the block coordinate descent method for solving the augmented Lagrangian relaxation problem (4) and the augmented Lagrangian method for solving the dual problem (5). Unless otherwise stated, the convergence results presented in this section do not rely on Assumptions 3.1 and 3.2.

4.1 Convergence of BCD

We begin this section with a property of the augmented Lagrangian function (3) that is fundamental in the convergence analysis.

Proposition 4.1

The gradient of $L(\textbf{x},\lambda,\rho)$ with respect to $\textbf{x}$ is Lipschitz continuous with constant $\kappa$ on $\mathcal{X}$, namely,

\|\nabla L(\textbf{x},\lambda,\rho)-\nabla L(\bar{\textbf{x}},\lambda,\rho)\|\leq\kappa\|\textbf{x}-\bar{\textbf{x}}\|

for all $\textbf{x},\bar{\textbf{x}}\in\mathcal{X}$, where $\kappa=\rho\|\boldsymbol{A}\|_{2}^{2}$. Furthermore,

L(\bar{\textbf{x}},\lambda,\rho)\leq L(\textbf{x},\lambda,\rho)+\langle\bar{\textbf{x}}-\textbf{x},\nabla L(\textbf{x},\lambda,\rho)\rangle+\frac{\kappa}{2}\|\bar{\textbf{x}}-\textbf{x}\|^{2} \qquad (39)

for all $\textbf{x},\bar{\textbf{x}}\in\mathcal{X}$.

Proof:

For any $\textbf{x},\bar{\textbf{x}}\in\mathcal{X}$,

\|\nabla L(\textbf{x},\lambda,\rho)-\nabla L(\bar{\textbf{x}},\lambda,\rho)\|=\rho\big\|\boldsymbol{A}^{\top}(\boldsymbol{A}\textbf{x}-\boldsymbol{b})_{+}-\boldsymbol{A}^{\top}(\boldsymbol{A}\bar{\textbf{x}}-\boldsymbol{b})_{+}\big\|\leq\rho\|\boldsymbol{A}\|_{2}\big\|(\boldsymbol{A}\textbf{x}-\boldsymbol{b})_{+}-(\boldsymbol{A}\bar{\textbf{x}}-\boldsymbol{b})_{+}\big\|\leq\rho\|\boldsymbol{A}\|_{2}\|\boldsymbol{A}\textbf{x}-\boldsymbol{A}\bar{\textbf{x}}\|\leq\rho\|\boldsymbol{A}\|^{2}_{2}\|\textbf{x}-\bar{\textbf{x}}\|.

Then we have

\langle\nabla L(\textbf{x},\lambda,\rho)-\nabla L(\bar{\textbf{x}},\lambda,\rho),\textbf{x}-\bar{\textbf{x}}\rangle\leq\|\nabla L(\textbf{x},\lambda,\rho)-\nabla L(\bar{\textbf{x}},\lambda,\rho)\|\,\|\textbf{x}-\bar{\textbf{x}}\|\leq\kappa\|\textbf{x}-\bar{\textbf{x}}\|^{2}.

It follows from the convexity of the function $f(\textbf{x})=\frac{\kappa}{2}\|\textbf{x}\|^{2}-L(\textbf{x},\lambda,\rho)$ that

f(\bar{\textbf{x}})\geq f(\textbf{x})+\nabla f(\textbf{x})^{\top}(\bar{\textbf{x}}-\textbf{x}),

which implies the estimate (39). ∎

Considering the two different updates in the BCD method, we first summarize the convergence property of the BCD method using (16a), which has been presented in [36]. Before that, we give the definition of a blockwise optimal solution, which is also called a coordinatewise minimum point in [37]. Given parameters $\lambda$ and $\rho$, a feasible solution $\textbf{x}^{*}$ is called a blockwise optimal solution of the problem (4) if for each $j\in\mathbb{N}_{p}$ and all $\textbf{x}=(\textbf{x}_{1}^{*},\dots,\textbf{x}_{j-1}^{*},\textbf{x}_{j},\textbf{x}_{j+1}^{*},\dots,\textbf{x}_{p}^{*})\in\mathcal{X}$, we have $L(\textbf{x}^{*},\lambda,\rho)\leq L(\textbf{x},\lambda,\rho)$.

Lemma 4.1

Suppose Assumptions 3.1 and 3.2 hold. If the starting point satisfies x0𝒳\textbf{\rm{x}}^{0}\in\mathcal{X}, then the BCD method for solving (16a) is always executable and terminates after a finite number of iterations with a blockwise optimal solution of the problem (4).

Proof:

If Assumptions 3.1 and 3.2 hold, Lemma 3.1 tells us that the subproblem (16a) can be solved exactly. (i) Since the constraint set of each subproblem (16a) is bounded, every subproblem has an optimal solution; therefore, the BCD method for solving (16a) is executable. (ii) By (i), the sequence {x^t} generated by the BCD method is well defined. Since 𝒳 contains finitely many points, the sequence of objective function values {L(x^t, λ, ρ)} can take only finitely many different values. Together with the monotone decrease of L(x^t, λ, ρ) along the iterations, this implies that L(x^t, λ, ρ) must become constant after finitely many iterations, so the BCD method terminates finitely. (iii) By the definition of a blockwise optimal solution and the fact that the subproblem (16a) is solved exactly, the returned point is blockwise optimal, which completes the proof. ∎∎

A similar conclusion can be found in [36], but it does not clarify how to solve the subproblem. Note that each (global) optimal solution of the model (4) is blockwise optimal, but not vice versa. Subsequently, we analyze the convergence of the update (16b). The following lemma ensures that the function value of L decreases after each iteration if x^{t+1} ≠ x^t and the step size is chosen properly.

Lemma 4.2

Let {x^t}_{t∈ℕ} be a sequence generated by (16b). Then we obtain

(12τκ2)xt+1xt2L(xt,λ,ρ)L(xt+1,λ,ρ).(\frac{1}{2\tau}-\frac{\kappa}{2})\|\textbf{\rm{x}}^{t+1}-\textbf{\rm{x}}^{t}\|^{2}\leq L(\textbf{\rm{x}}^{t},\lambda,\rho)-L(\textbf{\rm{x}}^{t+1},\lambda,\rho). (40)
Proof:

Since

\textbf{x}_{j}^{t+1}=\mathcal{T}_{\mathcal{X}_{j}}\left(\tau g_{j}(\textbf{x}^{t})+\frac{\textbf{1}}{2}-\textbf{x}_{j}^{t}\right)

and xjt𝒳j\textbf{x}_{j}^{t}\in\mathcal{X}_{j} for all j{1,2,,p}j\in\{1,2,...,p\}, we have

(τgj(xt)+12xjt)xjt+1(τgj(xt)+12xjt)xjt,\left(\tau g_{j}(\textbf{x}^{t})+\frac{\textbf{1}}{2}-\textbf{x}_{j}^{t}\right)^{\!\!\top}\!\textbf{x}_{j}^{t+1}\leq\left(\tau g_{j}(\textbf{x}^{t})+\frac{\textbf{1}}{2}-\textbf{x}_{j}^{t}\right)^{\!\!\top}\!\textbf{x}_{j}^{t},

which implies that

2τxjt+1xjt,gj(xt)+xjt+1xjt20.2\tau\langle\textbf{x}_{j}^{t+1}-\textbf{x}_{j}^{t},g_{j}(\textbf{x}^{t})\rangle+\|\textbf{x}_{j}^{t+1}-\textbf{x}_{j}^{t}\|^{2}\leq 0. (41)

Proposition 4.1 tells us that

L(xt+1,λ,ρ)L(xt,λ,ρ)\displaystyle L(\textbf{x}^{t+1},\lambda,\rho)-L(\textbf{x}^{t},\lambda,\rho)
\displaystyle\leq j=1pxjt+1xjt,gj(xt)+κ2j=1pxjt+1xjt2.\displaystyle\sum\nolimits_{j=1}^{p}\langle\textbf{x}_{j}^{t+1}-\textbf{x}_{j}^{t},g_{j}(\textbf{x}^{t})\rangle+\frac{\kappa}{2}\sum\nolimits_{j=1}^{p}\|\textbf{x}_{j}^{t+1}-\textbf{x}_{j}^{t}\|^{2}. (42)

Combining inequalities (41) and (42) yields (40). The proof is completed. ∎∎
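To make the update (16b) concrete, the sketch below performs one BCD pass. It assumes that 𝒯_{𝒳_j}(·) returns a minimizer of the corresponding linear function over 𝒳_j and, for illustration only, takes 𝒳_j = {0,1}^{n_j} without the additional constraints 𝑩_j x_j ≤ 𝒅_j; in the constrained case the linear minimization would instead require a small integer program or a combinatorial solver.

```python
# One proximal-linear BCD pass of the form (16b) for the unconstrained binary case.
import numpy as np

def linear_minimizer_box(coeff):
    """argmin of coeff' x over {0,1}^n: set a coordinate to 1 iff its coefficient is negative."""
    return (coeff < 0).astype(float)

def bcd_step(x, grad, tau, blocks):
    """Update every block j by x_j <- T_{X_j}(tau * g_j(x) + 1/2 - x_j)."""
    x_new = x.copy()
    for idx in blocks:                       # idx: index array of block j
        coeff = tau * grad[idx] + 0.5 - x[idx]
        x_new[idx] = linear_minimizer_box(coeff)
    return x_new

# Toy usage: two blocks of sizes 3 and 2, a made-up gradient and step size.
blocks = [np.arange(0, 3), np.arange(3, 5)]
x = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
grad = np.array([0.4, -2.0, 1.5, -0.3, 0.1])
print(bcd_step(x, grad, tau=0.2, blocks=blocks))
```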

This lemma tells us that a small step size satisfying τ<1/κ\tau<1/\kappa leads to a decrease in the function value LL when xt+1xt\textbf{\rm{x}}^{t+1}\neq\textbf{\rm{x}}^{t}. However, the following lemma states that when the step size is too small, the iteration returns the same result. We let g(x)g(\textbf{x}) denote the gradient of L(x,λ,ρ)L(\textbf{x},\lambda,\rho) at x.

Lemma 4.3

If the step size satisfies 0<τ<12g(xt)0<\tau<\frac{1}{2\|g(\textbf{\rm{x}}^{t})\|} when g(xt)0g(\textbf{\rm{x}}^{t})\neq\textbf{0}, then it holds that

xt=𝒯𝒳(τg(xt)+𝟙2xt).\textbf{\rm{x}}^{t}=\mathcal{T}_{\mathcal{X}}\left(\tau g(\textbf{\rm{x}}^{t})+\frac{\mathbb{1}}{2}-\textbf{\rm{x}}^{t}\right).
Proof:

If 0 < τ < 1/(2‖g(x^t)‖), then for all x ∈ 𝒳 with x ≠ x^t we have

2\tau g(\textbf{x}^{t})^{\top}(\textbf{x}^{t}-\textbf{x})\leq 2\tau\|g(\textbf{x}^{t})\|\|\textbf{x}^{t}-\textbf{x}\|<\|\textbf{x}^{t}-\textbf{x}\|\leq\|\textbf{x}^{t}-\textbf{x}\|^{2},

where the strict inequality uses 2τ‖g(x^t)‖ < 1 and the last inequality holds because ‖x^t − x‖ ≥ 1 for distinct points x^t, x ∈ {0,1}^n. It yields that

(τg(xt)+12xt)xt(τg(xt)+12xt)x.\left(\tau g(\textbf{x}^{t})+\frac{\textbf{1}}{2}-\textbf{x}^{t}\right)^{\top}\textbf{x}^{t}\leq\left(\tau g(\textbf{x}^{t})+\frac{\textbf{1}}{2}-\textbf{x}^{t}\right)^{\top}\textbf{x}.

By the definition of the operator 𝒯𝒳()\mathcal{T}_{\mathcal{X}}(\cdot), we arrive at the conclusion. ∎∎

Lemma 4.3 shows that the behavior of the BCD method relies heavily on the choice of the step size τ. Hence, we give the definition of a τ-stationary point.

Definition 4.1

For the AL relaxation problem (4), if a point x\textbf{\rm{x}}^{*} satisfies

x=𝒯𝒳(τg(x)+12x)\textbf{\rm{x}}^{*}=\mathcal{T}_{\mathcal{X}}\left(\tau g(\textbf{\rm{x}}^{*})+\frac{\textbf{1}}{2}-\textbf{\rm{x}}^{*}\right)

with τ>0\tau>0, then it is called a τ\tau-stationary point.

For the augmented Lagrangian relaxation problem (4), given parameters λ and ρ and an integer δ > 0, we say that x* is a δ-local minimizer if

L(x,λ,ρ)L(x,λ,ρ),for allx𝒩(x,δ)𝒳.L(\textbf{x},\lambda,\rho)\geq L(\textbf{x}^{*},\lambda,\rho),~{}~{}\text{for~{}all}~{}\textbf{x}\in\mathcal{N}(\textbf{x}^{*},\delta)\cap\mathcal{X}.

Note that an n-local minimizer is a global minimizer due to the fact that ‖x − x*‖ ≤ n for all x, x* ∈ 𝒳. The following important result reveals a relationship between a τ-stationary point and a global minimizer of the problem (4).

Theorem 4.1

We have the following relationships between the τ\tau-stationary point and the local minimizer of the problem (4).

  • (i)

    If x\textbf{\rm{x}}^{*} is a local minimizer, then x\textbf{\rm{x}}^{*} is a τ\tau-stationary point for any step size 0<τ<1/κ0<\tau<1/\kappa.

  • (ii)

    If x\textbf{\rm{x}}^{*} is a τ\tau-stationary point with τ>δ/2\tau>\delta/2, then x\textbf{\rm{x}}^{*} is a δ\delta-local minimizer of the problem (4) under the assumption that the entries in 𝑨,𝒃,𝒄,λ\boldsymbol{A},\boldsymbol{b},\boldsymbol{c},\lambda and ρ\rho are integral.

Proof:

(i) If x\textbf{x}^{*} is a local minimizer, then for any x𝒳\textbf{x}\in\mathcal{X} we have

L(x,λ,ρ)\displaystyle L(\textbf{x}^{*},\lambda,\rho)\leq L(x,λ,ρ)\displaystyle L(\textbf{x},\lambda,\rho)
\displaystyle\leq L(x,λ,ρ)+L(x,λ,ρ),xx+κ2xx2.\displaystyle L(\textbf{x}^{*},\lambda,\rho)+\langle\nabla L(\textbf{x}^{*},\lambda,\rho),\textbf{x}-\textbf{x}^{*}\rangle+\frac{\kappa}{2}\|\textbf{x}-\textbf{x}^{*}\|^{2}.

Therefore,

L(x,λ,ρ),xxκ2xx212τxx2,\langle\nabla L(\textbf{x}^{*},\lambda,\rho),\textbf{x}-\textbf{x}^{*}\rangle\geq-\frac{\kappa}{2}\|\textbf{x}-\textbf{x}^{*}\|^{2}\geq-\frac{1}{2\tau}\|\textbf{x}-\textbf{x}^{*}\|^{2},

which implies that x\textbf{x}^{*} is a τ\tau-stationary point.

(ii) Since x* is a τ-stationary point, for any x ∈ 𝒳 we have

(τg(x)+12x)x(τg(x)+12x)x.\left(\tau g(\textbf{x}^{*})+\frac{\textbf{1}}{2}-\textbf{x}^{*}\right)^{\top}\textbf{x}^{*}\leq\left(\tau g(\textbf{x}^{*})+\frac{\textbf{1}}{2}-\textbf{x}^{*}\right)^{\top}\textbf{x}. (43)

For any x𝒳𝒩(x,δ)\textbf{x}\in\mathcal{X}\cap\mathcal{N}(\textbf{x}^{*},\delta), we define the index sets 𝕁:={ln:xl=0,xl=1}\mathbb{J}:=\{l\in\mathbb{N}_{n}:x^{*}_{l}=0,x_{l}=1\} and 𝕁¯:={ln:xl=1,xl=0}\bar{\mathbb{J}}:=\{l\in\mathbb{N}_{n}:x^{*}_{l}=1,x_{l}=0\}. Then

(τg(x)+12x)(xx)\displaystyle\left(\tau g(\textbf{x}^{*})+\frac{\textbf{1}}{2}-\textbf{x}^{*}\right)^{\top}\left(\textbf{x}-\textbf{x}^{*}\right)
=\displaystyle= l𝕁(τgl(x)+12)l𝕁¯(τgl(x)12)\displaystyle\sum\nolimits_{l\in\mathbb{J}}\left(\tau g_{l}(\textbf{x}^{*})+\frac{1}{2}\right)-\sum\nolimits_{l\in\bar{\mathbb{J}}}\left(\tau g_{l}(\textbf{x}^{*})-\frac{1}{2}\right)
\displaystyle\leq τ(l𝕁gl(x)l𝕁¯gl(x))+δ2,\displaystyle\tau\left(\sum\nolimits_{l\in\mathbb{J}}g_{l}(\textbf{x}^{*})-\sum\nolimits_{l\in\bar{\mathbb{J}}}g_{l}(\textbf{x}^{*})\right)+\frac{\delta}{2},

which together with (43) and τ>δ2\tau>\frac{\delta}{2} yields that

l𝕁gl(x)l𝕁¯gl(x)δ2τ>1.\displaystyle\sum\nolimits_{l\in\mathbb{J}}g_{l}(\textbf{x}^{*})-\sum\nolimits_{l\in\bar{\mathbb{J}}}g_{l}(\textbf{x}^{*})\geq-\frac{\delta}{2\tau}>-1.

Since the entries in 𝑨,𝒃,𝒄,λ\boldsymbol{A},\boldsymbol{b},\boldsymbol{c},\lambda and ρ\rho are integral, then gl(x)g_{l}(\textbf{x}^{*}) is an integer for every lnl\in\mathbb{N}_{n}. This implies that

l𝕁gl(x)l𝕁¯gl(x)0.\displaystyle\sum\nolimits_{l\in\mathbb{J}}g_{l}(\textbf{x}^{*})-\sum\nolimits_{l\in\bar{\mathbb{J}}}g_{l}(\textbf{x}^{*})\geq 0.

Then for any x𝒳𝒩(x,δ)\textbf{x}\in\mathcal{X}\cap\mathcal{N}(\textbf{x}^{*},\delta), it follows from the convexity of LL that

L(x,λ,ρ)L(x,λ,ρ)\displaystyle L(\textbf{x},\lambda,\rho)-L(\textbf{x}^{*},\lambda,\rho)
\displaystyle\geq L(x,λ,ρ),xx\displaystyle\langle\nabla L(\textbf{x}^{*},\lambda,\rho),\textbf{x}-\textbf{x}^{*}\rangle
=\displaystyle= l𝕁gl(x)(xlxl)+l𝕁¯gl(x)(xlxl)\displaystyle\sum\nolimits_{l\in\mathbb{J}}g_{l}(\textbf{x}^{*})(x_{l}-x_{l}^{*})+\sum\nolimits_{l\in\bar{\mathbb{J}}}g_{l}(\textbf{x}^{*})(x_{l}-x_{l}^{*})
=\displaystyle= l𝕁gl(x)l𝕁¯gl(x)0.\displaystyle\sum\nolimits_{l\in\mathbb{J}}g_{l}(\textbf{x}^{*})-\sum\nolimits_{l\in\bar{\mathbb{J}}}g_{l}(\textbf{x}^{*})\geq 0.

Therefore, x\textbf{x}^{*} is a δ\delta-local minimizer of the problem (4). ∎∎

We can observe from Theorem 4.1 that every blockwise optimal solution of (4) is a τ-stationary point. Conversely, if the entries in 𝑨, 𝒃, 𝒄, λ and ρ are integral and τ > (1/2) max_j{n_j}, then every τ-stationary point of (4) is blockwise optimal. Figure 1 summarizes these relationships among τ-stationary points, blockwise optimal solutions and local (global) minimizers.

Figure 1: The relationships among the τ-stationary point, blockwise optimal solution and local minimizer. Condition (a): the entries in 𝑨, 𝒃, 𝒄, λ and ρ are integral.

We finally present the main result on the convergence of the BCD method for solving (16b).

Theorem 4.2

(Convergence properties) Let {xt}t\{\textbf{\rm{x}}^{t}\}_{t\in\mathbb{N}} be a sequence generated by (16b). If the step size satisfies 0<τ<12κ0<\tau<\frac{1}{2\kappa}, then we have

  • (i)

    The sequence {L(xt,λ,ρ)}t\{L(\textbf{\rm{x}}^{t},\lambda,\rho)\}_{t\in\mathbb{N}} is nonincreasing and

    κ2xt+1xt2L(xt,λ,ρ)L(xt+1,λ,ρ).\frac{\kappa}{2}\|\textbf{\rm{x}}^{t+1}-\textbf{\rm{x}}^{t}\|^{2}\leq L(\textbf{\rm{x}}^{t},\lambda,\rho)-L(\textbf{\rm{x}}^{t+1},\lambda,\rho). (44)
  • (ii)

    The sequence {xt}\{\textbf{\rm{x}}^{t}\} converges to a τ\tau-stationary point after at most 2Cn+κnκ\lceil\frac{2C\sqrt{n}+\kappa n}{\kappa}\rceil iterations, where C=maxx𝒳L(x,λ,ρ)C=\max_{\textbf{\rm{x}}\in\mathcal{X}}\|\nabla L(\textbf{\rm{x}},\lambda,\rho)\|.

Proof:

(i) It is obvious from (40) that if 0<τ<12κ0<\tau<\frac{1}{2\kappa}, then (44) holds.

(ii) Given the parameters λ+m\lambda\in\mathbb{R}^{m}_{+} and ρ>0\rho>0, we can observe that the model (4) can be equivalently written as

minF(x):=L(x,λ,ρ)+δ𝒳(x),\min F(\textbf{x}):=L(\textbf{x},\lambda,\rho)+\delta_{\mathcal{X}}(\textbf{x}),

where δ_𝒳(x) is the indicator function of the set 𝒳, i.e., δ_𝒳(x) = 0 if x ∈ 𝒳 and +∞ otherwise. To verify that the sequence {x^t} converges to a τ-stationary point, we check that the conditions in [38, Theorem 1] hold.

Firstly, since the set 𝒳 = {x ∈ {0,1}^n : 𝑩x ≤ 𝒅} is semi-algebraic, δ_𝒳(x) is a Kurdyka–Łojasiewicz (KL) function [39, 38]. It is easy to see that L(x, λ, ρ) is also a KL function, and hence the function F(x) satisfies the KL property, which is a crucial condition for ensuring convergence.

Secondly, the Lipschitz constant κ>0\kappa>0 in Proposition 4.1 is bounded if ρ\rho has an upper bound, and the problem (4) is inf-bounded.

Thirdly, for any x,x¯𝒳\textbf{x},\bar{\textbf{x}}\in\mathcal{X} with x¯=(x1,,x¯j,,xp)\bar{\textbf{x}}=(\textbf{x}_{1},...,\bar{\textbf{x}}_{j},...,\textbf{x}_{p}),

xjL(x,λ,ρ)xjL(x¯,λ,ρ)\displaystyle\|\nabla_{\textbf{x}_{j}}L(\textbf{x},\lambda,\rho)-\nabla_{\textbf{x}_{j}}L(\bar{\textbf{x}},\lambda,\rho)\|
=\displaystyle= ρ𝑨j(𝑨x𝒃)+ρ𝑨j(𝑨x¯𝒃)+\displaystyle\left\|\rho\boldsymbol{A}_{j}^{\top}(\boldsymbol{A}\textbf{x}-\boldsymbol{b})_{+}-\rho\boldsymbol{A}_{j}^{\top}(\boldsymbol{A}\bar{\textbf{x}}-\boldsymbol{b})_{+}\right\|
\displaystyle\leq ρ𝑨j2(𝑨x𝒃)+(𝑨x¯𝒃)+\displaystyle\rho\|\boldsymbol{A}_{j}\|_{2}\left\|(\boldsymbol{A}\textbf{x}-\boldsymbol{b})_{+}-(\boldsymbol{A}\bar{\textbf{x}}-\boldsymbol{b})_{+}\right\|
\displaystyle\leq ρ𝑨j2𝑨x𝑨x¯ρ𝑨j22xjx¯j.\displaystyle\rho\|\boldsymbol{A}_{j}\|_{2}\left\|\boldsymbol{A}\textbf{x}-\boldsymbol{A}\bar{\textbf{x}}\right\|\leq\rho\|\boldsymbol{A}_{j}\|^{2}_{2}\left\|\textbf{x}_{j}-\bar{\textbf{x}}_{j}\right\|.

Therefore, the result follows from [38, Theorem 1].

Suppose x* is a τ-stationary point. Then for every x ∈ 𝒳, we have

L(x,λ,ρ)L(x,λ,ρ)\displaystyle L(\textbf{x},\lambda,\rho)-L(\textbf{x}^{*},\lambda,\rho)
\displaystyle\leq L(x,λ,ρ),xx+κ2xx2\displaystyle\langle\nabla L(\textbf{x}^{*},\lambda,\rho),\textbf{x}-\textbf{x}^{*}\rangle+\frac{\kappa}{2}\|\textbf{x}-\textbf{x}^{*}\|^{2}
\displaystyle\leq Cxx+κ2xx2Cn+κn2.\displaystyle C\|\textbf{x}-\textbf{x}^{*}\|+\frac{\kappa}{2}\|\textbf{x}-\textbf{x}^{*}\|^{2}\leq C\sqrt{n}+\frac{\kappa n}{2}. (45)

If each iteration always finds a new point until arriving at x\textbf{x}^{*}, we get

κT2\displaystyle\frac{\kappa T}{2} t=0T(L(xt,λ,ρ)L(xt+1,λ,ρ))\displaystyle\leq\sum\nolimits_{t=0}^{T}\left(L(\textbf{\rm{x}}^{t},\lambda,\rho)-L(\textbf{x}^{t+1},\lambda,\rho)\right)
L(x0,λ,ρ)L(x,λ,ρ),\displaystyle\leq L(\textbf{x}^{0},\lambda,\rho)-L(\textbf{x}^{*},\lambda,\rho), (46)

where the first inequality holds due to the conclusion in (i) and the fact that ‖x^{t+1} − x^t‖ ≥ 1. Combining (45) with (46) yields T ≤ ⌈(2C√n + κn)/κ⌉, i.e., at most ⌈(2C√n + κn)/κ⌉ iterations are needed to arrive at a τ-stationary point. ∎∎

If each iteration of the BCD method always finds a new point, then according to Theorem 4.2 (i) we have Σ_{t=0}^{∞} ‖x^{t+1} − x^t‖² < +∞. By Theorem 1 in [38], we can conclude that lim_{t→∞} x^t = x*, where x* is a limit point of the sequence {x^t}_{t∈ℕ}. If we choose an initial point that is already close to an optimal solution and a step size τ satisfying the conditions of Theorem 4.2, the BCD method can find a global solution of problem (4). The sketch below assembles the pieces of this subsection into a complete BCD loop.
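This sketch is a minimal, self-contained illustration under the same assumptions as the earlier examples: the augmented Lagrangian is taken as L(x, λ, ρ) = 𝒄^⊤x + λ^⊤(𝑨x − 𝒃) + (ρ/2)‖(𝑨x − 𝒃)_+‖² (an assumption consistent with the gradients used in the proofs), each 𝒳_j is the plain binary box, and the data are random placeholders. The loop stops as soon as an iterate repeats, i.e., at a τ-stationary point, illustrating the finite termination guaranteed by Theorem 4.2.

```python
# Full BCD loop (16b) on a toy instance; stops at a tau-stationary point.
import numpy as np

def L_val(x, A, b, c, lam, rho):
    r = np.maximum(A @ x - b, 0.0)
    return c @ x + lam @ (A @ x - b) + 0.5 * rho * (r @ r)

def grad_L(x, A, b, c, lam, rho):
    return c + A.T @ lam + rho * A.T @ np.maximum(A @ x - b, 0.0)

def bcd(x0, A, b, c, lam, rho, blocks, tau, max_iter=1000):
    x = x0.copy()
    for t in range(max_iter):
        g = grad_L(x, A, b, c, lam, rho)
        x_new = x.copy()
        for idx in blocks:
            coeff = tau * g[idx] + 0.5 - x[idx]
            x_new[idx] = (coeff < 0).astype(float)   # linear minimization over {0,1}^{n_j}
        if np.array_equal(x_new, x):                 # tau-stationary: the update returns x itself
            return x, t
        x = x_new
    return x, max_iter

rng = np.random.default_rng(1)
m, n = 4, 6
A = rng.integers(-1, 2, (m, n)).astype(float)
b = rng.integers(0, 3, m).astype(float)
c = rng.integers(-3, 4, n).astype(float)
lam, rho = np.ones(m), 5.0
blocks = [np.arange(0, 3), np.arange(3, 6)]
kappa = rho * np.linalg.norm(A, 2) ** 2
x_star, iters = bcd(np.zeros(n), A, b, c, lam, rho, blocks, tau=0.4 / kappa)
print("tau-stationary point", x_star, "after", iters, "iterations,",
      "objective", L_val(x_star, A, b, c, lam, rho))
```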

4.2 Convergence of ALM

Assume the BCD method returns a global minimizer x\textbf{x}^{*} to the augmented Lagrangian relaxation problem (4) in each inner loop. In this subsection, we focus on the convergence property of the projected subgradient method for solving the dual problem (5). For convenience, we introduce the following constants for subsequent analysis:

S\displaystyle S :=\displaystyle:= argmaxλ+m,ρ>0d(λ,ρ),\displaystyle{\text{arg}\max}_{\lambda\in\mathbb{R}_{+}^{m},\rho>0}~{}d(\lambda,\rho),
θ\displaystyle\theta :=\displaystyle:= min(λ,ρ)Sλ0λ2+(ρ0ρ)2.\displaystyle{\min}_{(\lambda,\rho)\in S}~{}\|\lambda^{0}-\lambda\|^{2}+(\rho^{0}-\rho)^{2}.

Let d_g^k denote a subgradient of d at (λ^k, ρ^k). We estimate the distance between the dual function value in each iteration and the optimal value of the problem (1) in the following theorem.

Theorem 4.3

If we take the step size αk=βkdgk\alpha^{k}=\frac{\beta_{k}}{\|d_{g}^{k}\|} with βk=θK\beta_{k}=\sqrt{\frac{\theta}{K}}, then,

fIPmaxk{1,2,,K}d(λk,ρk)ζ25θK,f^{\text{IP}}-\max_{k\in\{1,2,...,K\}}d(\lambda^{k},\rho^{k})\leq\frac{\zeta}{2}\sqrt{\frac{5\theta}{K}},

where ζ = max_{x∈𝒳} ‖𝑨x − 𝒃‖^4. Furthermore, if the positive sequence {β_k}_{k∈ℕ} is bounded and satisfies Σ_{k∈ℕ} β_k² < +∞, then (λ^k, ρ^k) converges to some (λ*, ρ*) ∈ S.

Proof:

Let {(λ^k, ρ^k)}_{k∈ℕ} be the sequence generated by Algorithm 1. We derive from Lemma 3 in [17] that

fIPmaxk{1,2,,K}d(λk,ρk)θ+k=1K(αkdgk)22k=1Kαk.f^{\text{IP}}-\max_{k\in\{1,2,...,K\}}d(\lambda^{k},\rho^{k})\leq\frac{\theta+\sum_{k=1}^{K}(\alpha^{k}\|d_{g}^{k}\|)^{2}}{2\sum_{k=1}^{K}\alpha^{k}}. (47)

For all kk\in\mathbb{N}, we have

dgk2=𝑨xk𝒃2+14(𝑨xk𝒃)+454𝑨xk𝒃454ζ.\|d_{g}^{k}\|^{2}=\|\boldsymbol{A}\textbf{x}^{k}-\boldsymbol{b}\|^{2}+\frac{1}{4}\|(\boldsymbol{A}\textbf{x}^{k}-\boldsymbol{b})_{+}\|^{4}\leq\frac{5}{4}\|\boldsymbol{A}\textbf{x}^{k}-\boldsymbol{b}\|^{4}\leq\frac{5}{4}\zeta.

Then the right-hand side of (47) satisfies

\frac{\theta+\sum_{k=1}^{K}(\alpha^{k}\|d_{g}^{k}\|)^{2}}{2\sum_{k=1}^{K}\alpha^{k}}\leq\frac{\sqrt{5}\zeta(\theta+\sum_{k=1}^{K}\beta_{k}^{2})}{4\sum_{k=1}^{K}\beta_{k}}=\frac{\zeta}{2}\sqrt{\frac{5\theta}{K}}.

Thus we arrive at the first result.

The second claim follows from [40, Theorem 7.4]; the proof given there for the subgradient method readily extends to the projected subgradient method. ∎∎
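For concreteness, the following sketch implements one projected-subgradient step on the dual variables (λ, ρ) with the step size α^k = β_k/‖d_g^k‖ used in Theorem 4.3. The subgradient components, 𝑨x^k − 𝒃 with respect to λ and ½‖(𝑨x^k − 𝒃)_+‖² with respect to ρ, are inferred from the norm computation in the proof and should be read as an assumption; x^k denotes the minimizer of the augmented Lagrangian returned by the inner BCD loop, and the lower bound rho_min that keeps ρ positive is an implementation detail introduced here.

```python
# One projected-subgradient step on (lambda, rho) for the dual problem (5) (illustrative).
import numpy as np

def dual_step(lam, rho, x_k, A, b, beta_k, rho_min=1e-8):
    r = A @ x_k - b
    d_lam = r                                                     # assumed subgradient w.r.t. lambda
    d_rho = 0.5 * float(np.maximum(r, 0.0) @ np.maximum(r, 0.0))  # assumed subgradient w.r.t. rho
    norm_dg = np.sqrt(d_lam @ d_lam + d_rho ** 2)
    if norm_dg == 0.0:                                   # x_k is feasible: keep the multipliers
        return lam, rho
    alpha = beta_k / norm_dg                             # step size alpha_k = beta_k / ||d_g^k||
    lam_new = np.maximum(lam + alpha * d_lam, 0.0)       # ascent step, projected onto lambda >= 0
    rho_new = max(rho + alpha * d_rho, rho_min)          # ascent step, kept strictly positive
    return lam_new, rho_new
```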

Although the theoretical analysis of ALM-C relies on Assumption 3.1, our tests show that ALM-C is also effective in more general settings. Furthermore, we present the ALM-P method as a versatile alternative that does not rely on this assumption.

5 Applications

In this section, we show the performance of the proposed algorithms on two practical problems: the train timetabling problem and the vehicle routing problem. To evaluate the compared methods, we use three termination criteria: a maximum number of iterations, a time limit, and an optimality gap based on the difference between the objective value generated by our methods and the best-known value. All instances are tested on a MacBook Pro 2019 with 8 GB of memory and an Intel Core i5 (Turbo Boost up to 4.1 GHz) with 128 MB eDRAM.

5.1 Capacitated vehicle routing problem

We consider the capacitated vehicle routing problem with time windows (CVRPTW). The problem is defined on a complete directed graph G=(V,E)G=(V,E), where V={0,1,,n}V=\{0,1,...,n\} is the node set and EE is the edge set. Node 0 represents the depot where the vehicles are based, and nodes 1 to nn represent the customers that need to be served. Each edge (s,t)(s,t) in EE has an associated travel time TstT_{st}. Each customer ss has a demand csc_{s} and a service time window [as,bs][a_{s},b_{s}]. We let dstd_{st} be the distance from node ss to node tt and MM be a large constant. The objective is to construct a set of least-cost vehicle routes starting and ending at the depot, such that each customer is visited exactly once within their time window, and the total demand of customers served in each route does not exceed vehicle capacity CC.

To formulate this problem as an integer program, we define the following decision variables: (i) x_st^j: a binary variable equal to 1 if edge (s,t) is used by vehicle j, and 0 otherwise; (ii) w_s^j: a continuous variable indicating the service start time at customer s for vehicle j. The block-structured integer programming formulation is then:

\min~~ \sum_{j\in\mathbb{N}_{p}}\sum_{(s,t)\in E}d_{st}x_{st}^{j}   (48a)
\text{s.t.}~~ \sum_{j\in\mathbb{N}_{p}}\sum_{t\in V:t\neq s}x_{st}^{j}=1, \quad s\in V\backslash\{0\},   (48b)
\sum_{t\in V\backslash\{s\}}x_{st}^{j}=\sum_{t\in V\backslash\{s\}}x_{ts}^{j}, \quad s\in V,~j\in\mathbb{N}_{p},   (48c)
\sum_{t\in V\backslash\{0\}}x_{0t}^{j}=1, \quad j\in\mathbb{N}_{p},   (48d)
\sum_{s\in V}\sum_{t\in V\backslash\{s\}}c_{s}x_{st}^{j}\leq C, \quad j\in\mathbb{N}_{p},   (48e)
w_{s}^{j}+T_{st}-M(1-x_{st}^{j})\leq w_{t}^{j}, \quad (s,t)\in E,~j\in\mathbb{N}_{p},   (48f)
a_{s}\leq w_{s}^{j}\leq b_{s}, \quad s\in V,~j\in\mathbb{N}_{p},   (48g)
x_{st}^{j}\in\{0,1\}, \quad (s,t)\in E,~j\in\mathbb{N}_{p},   (48h)

The block structure lies in the routing variables xstjx_{st}^{j} for each vehicle jj, which are constrained by the flow balance, capacity and time window constraints. By relaxing the coupling constraints (48b), the problem decomposes into separate routing subproblems per vehicle.
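To illustrate this decomposition pattern in code, the sketch below shows how, once the coupling constraints (48b) are relaxed, each vehicle solves an independent route subproblem whose arc costs combine the travel distances with the current contribution of the relaxed visit constraints. The function `solve_single_vehicle_route` is a hypothetical placeholder for a single-vehicle solver (e.g., a small MIP with the constraints (48c)-(48h) for one vehicle), and the cost-modification rule is a simplified stand-in for the linearized augmented Lagrangian terms, not the paper's exact implementation.

```python
# Per-vehicle decomposition obtained by relaxing the coupling constraints (48b) (illustrative).
def modified_arc_costs(dist, visit_price):
    """Combine the travel distance of each arc with the current price of visiting its tail node.
    dist[(s, t)]: travel distance of arc (s, t); visit_price[s]: linear contribution of the
    relaxed constraint (48b) for customer s (hypothetical placeholder values)."""
    return {(s, t): dist[(s, t)] + visit_price.get(s, 0.0) for (s, t) in dist}

def decomposed_pass(vehicles, dist, visit_price, solve_single_vehicle_route):
    """One pass over all vehicles: each vehicle solves its own route subproblem independently."""
    routes = {}
    for j in vehicles:
        costs_j = modified_arc_costs(dist, visit_price)
        routes[j] = solve_single_vehicle_route(j, costs_j)   # hypothetical single-vehicle solver
    return routes
```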

5.1.1 Parameter Setting

We let ALM-C and ALM-P denote Algorithm 4 using the updates (16a) and (16b) in step 2, respectively. We compare our proposed methods with the Gurobi solver (version 11.0.0) and OR-Tools on all the instances from the Solomon dataset [41]. Since the ADMM in [20] does not adapt to all instances, we do not compare with it. To illustrate the scale of these instances, we present a subset of representative examples in Table I. The notations |F|, |V| and |E| represent the number of vehicles, the number of customers and the number of edges, and n and n_w denote the numbers of variables x and w, respectively. To evaluate the robustness of the compared methods, we conduct experiments on the C1-type instances with various problem sizes, as detailed in Table II. The results for all remaining instances from the Solomon dataset are presented in Table III.

We adopt a “vehicle and route-based” formulation to model the CVRPTW problem, which is different from the space-time network flow formulation used in [20]. Therefore, the subproblem is a routing problem with capacity and time window constraints, and we solve it with the Gurobi solver for convenience. Note that the formulation does not affect Gurobi’s performance. To ensure a fair comparison and maximize Gurobi’s utilization, we have implemented efficient callback functions based on Dantzig’s formulation to fine-tune its performance. We set a time limit of 2000 seconds for the compared methods, except for OR-Tools, where the time limit is set to 500 seconds, and use the symbol “–” to signify that the solver failed to find a feasible solution within the allotted time limit.

TABLE I: A subset of representative examples of the Solomon datasets
No. |F||F| |V||V| |E||E| mm qq nn nwn_{w}
R101-50 12 50 2,550 50 30,636 30,600 612
R201-50 6 50 2,550 50 15,318 15,300 306
RC101-50 8 50 2,550 50 20,424 20,400 408
RC201-50 5 50 2,550 50 12,765 12,750 255
 C101-100 10 100 10,100 100 101,030 101,000 1,010
C201-100 3 100 10,100 100 30,309 30,300 303

5.1.2 Performances of the Proposed Algorithm

In the subsequent tables, the “f*” and “f” columns correspond to the best-known objective values and the objective values of the feasible solutions generated by the compared methods, respectively. The “Time” column denotes the CPU time (in seconds) taken by the algorithms to meet the stopping criteria. The optimality gap is defined by gap¹ = |f − f*|/|f*|. Since OR-Tools has no inherent termination criterion, it runs for the entire time limit and outputs the corresponding feasible solution; therefore, we do not report its solution time. Table II shows the stability of the compared methods under various problem sizes on the C1-type instances. We can observe that our methods outperform OR-Tools and Gurobi, and are more stable than OR-Tools. From Table III, we can find that the ALM-C algorithm outperforms OR-Tools and Gurobi in most instances, achieving significantly lower optimality gaps and competitive computation times. The ALM-P algorithm also exhibits promising results, often outperforming Gurobi in efficiency and solution quality. Overall, both the proposed ALM-C and ALM-P algorithms demonstrate their superiority and robustness in solving the CVRPTW problem, providing high-quality solutions with good computational efficiency compared to existing solvers and heuristic approaches.

TABLE II: Performance comparison on Solomon’s C1 instances with varying problem size and perturbation. A time limit of 500 seconds is set for OR-Tools.
ins. |V||V| |F||F| ff^{*} OR-Tools Gurobi ALM-P ALM-C
ff gap1 ff gap1 Time ff gap1 Time ff gap1 Time
c101 25 3 191.3 632.6 230.7% 191.8 0.3% 0.0 191.8 0.3% 2.4 191.8 0.3% 2.7
50 5 362.4 514.4 41.9% 363.2 0.2% 1.1 363.2 0.2% 5.9 363.2 0.2% 5.0
100 10 827.3 828.9 0.2% 828.9 0.2% 8.9 828.9 0.2% 57.8 828.9 0.2% 55.0
c102 25 3 190.3 786.1 313.1% 190.7 0.2% 1.9 190.7 0.2% 3.7 190.7 0.2% 4.6
50 5 361.4 667.9 84.8% 362.2 0.2% 25.5 362.2 0.2% 16.0 362.2 0.2% 15.3
100 10 827.3 828.9 0.2% - 2000 828.9 0.2% 167.4 828.9 0.2% 196.4
c103 25 3 190.3 786.1 313.1% 190.7 0.2% 7.0 190.7 0.2% 15.9 190.7 0.2% 11.0
50 5 361.4 667.9 84.8% 362.2 0.2% 1584.8 365.3 1.1% 40.0 362.2 0.2% 41.5
100 10 826.3 - - - 2000 839.0 1.5% 362.6 828.1 0.2% 259.9
c104 25 3 186.9 875.6 368.5% 187.4 0.3% 45.1 188.6 0.9% 33.5 188.6 0.9% 17.9
50 5 358.0 606.2 69.3% 360.1 0.6% 2000 362.2 1.2% 57.1 358.9 0.3% 319.8
100 10 822.9 1202.3 46.1% - 2000 890.9 8.3% 667.5 904.9 10.0% 363.3
c105 25 3 191.3 609.6 218.6% 191.8 0.3% 0.1 191.8 0.3% 3.3 191.8 0.3% 3.4
50 5 362.4 482.4 33.1% 363.2 0.2% 0.3 363.2 0.2% 6.4 363.2 0.2% 6.0
100 10 827.3 828.9 0.2% 828.9 0.2% 15.8 828.9 0.2% 53.3 828.9 0.2% 34.2
c106 25 3 191.3 639.6 234.3% 191.8 0.3% 0.1 191.8 0.3% 3.6 191.8 0.3% 3.8
50 5 362.4 430.4 18.8% 363.2 0.2% 1.4 363.2 0.2% 5.8 363.2 0.2% 5.9
100 10 827.3 828.9 0.2% 828.9 0.2% 650.6 828.9 0.2% 69.6 828.9 0.2% 47.0
c107 25 3 191.3 565.6 195.6% 191.8 0.3% 0.1 191.8 0.3% 3.2 191.8 0.3% 4.3
50 5 362.4 457.4 26.2% 363.2 0.2% 0.8 363.2 0.2% 6.1 363.2 0.2% 6.6
100 10 827.3 828.9 0.2% 828.9 0.2% 14.6 828.9 0.2% 65.2 828.9 0.2% 35.2
c108 25 3 191.3 564.6 195.1% 191.8 0.3% 0.9 191.8 0.3% 4.5 191.8 0.3% 4.4
50 5 362.4 - - 363.2 0.2% 18.6 363.2 0.2% 8.4 363.2 0.2% 14.2
100 10 827.3 - - - 2000 828.9 0.2% 98.0 828.9 0.2% 265.4
c109 25 3 191.3 475.6 148.6% 191.8 0.3% 9.6 191.8 0.3% 6.3 191.8 0.3% 9.7
50 5 362.4 363.2 0.2% 363.2 0.2% 1867.4 363.2 0.2% 12.3 363.2 0.2% 13.1
100 10 827.3 828.9 0.2% - 2000 828.9 0.2% 234.5 828.9 0.2% 304.3
TABLE III: Complete results on other Solomon’s instances, all instances are associated with 50 customers. A time limit of 500 seconds is set for OR-Tools.
ins. |J||J| ff^{*} OR-Tools Gurobi ALM-P ALM-C
ff ε\varepsilon ff ε\varepsilon tt ff ε\varepsilon tt ff ε\varepsilon tt
c201.50 3 360.2 1622 350.3% 361.8 0.4% 0.1 361.8 0.4% 4.2 361.8 0.4% 3.4
c202.50 3 360.2 1622 350.3% 361.8 0.4% 5.7 361.8 0.4% 19.1 366.8 1.8% 7.1
c203.50 3 359.8 1713.4 376.2% 361.4 0.4% 113.5 361.4 0.4% 43.7 361.4 0.4% 76.8
c204.50 2 350.1 1435.3 310.0% 351.7 0.5% 2000 366.9 4.8% 55.7 351.7 0.5% 166.3
c205.50 3 359.8 1729 380.5% 361.4 0.4% 1.1 361.4 0.4% 7.9 361.4 0.4% 7.0
c206.50 3 359.8 1741.8 384.1% 361.4 0.4% 1.5 361.4 0.4% 14.4 361.4 0.4% 19.1
c207.50 3 359.6 1622.3 351.1% 361.2 0.4% 11.9 361.2 0.5% 25.0 361.2 0.4% 125.7
c208.50 2 350.5 1629.2 364.8% 352.1 0.5% 8.0 355.9 1.5% 14.2 352.1 0.5% 31.5
r101.50 12 1044 1541.9 47.7% 1046.7 0.3% 2.1 1046.7 0.3% 38.5 1046.7 0.3% 47.3
r102.50 11 909 1374.8 51.2% 911.4 0.3% 2000 939.5 3.4% 248.9 925.3 1.8% 317.2
r103.50 9 772.9 1069.1 38.3% - - 2000 806.4 4.3% 289.8 803.8 4.0% 256.1
r104.50 6 625.4 743.5 18.9% - - 2000 654.4 4.6% 472.1 686.6 9.8% 649.9
r105.50 9 899.3 1101.7 22.5% 901.9 0.3% 268.5 906.13 0.8% 61.1 901.9 0.3% 60.2
r106.50 5 793 937.8 18.3% - - 2000 - - - - - -
r107.50 7 711.1 790.5 11.2% - - 2000 816.47 14.8% 120.5 741.5 4.3% 336.2
r108.50 6 617.7 698.1 13.0% - - 2000 645.7 4.5% 215.1 643.9 4.2% 466.5
r109.50 8 786.8 873.2 11.0% 805.6 2.4% 2000 796.3 1.2% 77.2 788.7 0.2% 324.9
r110.50 7 697 887.1 27.3% - - 2000 754.9 8.3% 560.6 768.1 10.2% 225.4
r111.50 7 707.2 784.5 10.9% - - 2000 750.0 6.0% 297.8 801.9 13.4% 220.8
r112.50 6 630.2 730.6 15.9% - - 2000 665.7 5.6% 476.5 698.0 10.8% 300.2
r201.50 6 791.9 1196.9 51.1% 794.3 0.3% 5.2 803.5 1.5% 27.0 801.3 1.2% 30.4
r202.50 5 698.5 1173.2 68.0% 723.0 3.5% 2000 735.7 5.3% 437.0 726.3 4.0% 337.2
r203.50 5 605.3 1173.2 93.8% 608.0 0.4% 2000 639.4 5.6% 168.6 653.0 7.9% 380.9
r204.50 2 506.4 1127.9 122.7% 512.4 1.2% 2000 524.0 3.5% 75.8 583.8 15.3% 401.6
r205.50 4 690.1 1132.3 64.1% 700.2 1.5% 2000 732.4 6.1% 74.5 731.7 6.0% 87.8
r206.50 4 632.4 1066.9 68.7% 657.2 3.9% 2000 682.1 7.9% 188.4 677.2 7.1% 266.6
r207.50 3 361.6 1046.8 189.5% - - 2000 362.6 0.3% 13.1 364.1 0.7% 25.2
r208.50 1 328.2 1019.6 210.7% - - 2000 329.3 0.3% 11.9 329.3 0.3% 13.3
r209.50 4 600.6 1043.6 73.8% 639.6 6.5% 2000 608.5 1.3% 217.9 609.4 1.5% 58.5
r210.50 4 645.6 1119.5 73.4% 661.0 2.4% 2000 710.2 10.0% 288.6 669.2 3.7% 253.4
r211.50 3 535.5 958.7 79.0% - - 2000 590.4 10.3% 168.8 555.1 3.7% 339.8
rc101.50 8 944 1108.3 17.4% 945.6 0.2% 2000 945.6 0.2% 191.2 945.6 0.2% 78.8
rc102.50 7 822.5 940.7 14.4% - - 2000 823.1 0.1% 510.3 823.1 0.1% 242.9
rc103.50 6 710.9 834.9 17.4% - - 2000 736.4 3.6% 134.7 751.3 5.7% 335.2
rc104.50 5 545.8 641.4 17.5% - - 2000 546.5 0.1% 101.6 546.5 0.1% 91.5
rc105.50 8 855.3 1112.7 30.1% - - 2000 873.2 2.1% 99.0 902.7 5.5% 134.4
rc106.50 6 723.2 793 9.7% - - 2000 733.7 1.5% 79.3 728.1 0.7% 152.4
rc107.50 6 642.7 752.3 17.1% - - 2000 650.3 1.2% 235.2 644.0 0.2% 183.7
rc108.50 6 598.1 690.3 15.4% - - 2000 600.7 0.4% 256.1 599.2 0.2% 276.4
rc201.50 5 684.8 1904.7 178.1% 686.3 0.2% 12.5 687.7 0.4% 31.9 686.3 0.2% 14.8
rc202.50 5 613.6 1172.6 91.1% 615.0 0.2% 2000 615.6 0.3% 62.2 615.0 0.2% 85.9
rc203.50 4 555.3 1163.1 109.5% 556.5 0.2% 2000 590.4 6.3% 256.7 558.5 0.6% 265.8
rc204.50 3 444.2 1090.3 145.5% 509.4 14.7% 2000 451.8 1.7% 218.4 462.3 4.1% 153.7
rc205.50 5 630.2 1222.7 94.0% 632.0 0.3% 2000 632.0 0.3% 89.8 632.0 0.3% 56.0
rc206.50 5 610 1088.5 78.4% 611.7 0.3% 2000 611.7 0.3% 36.8 611.7 0.3% 42.9
rc207.50 4 558.6 998.3 78.7% - - 2000 591.7 5.9% 315.5 608.5 8.9% 281.1
rc208.50 2 269.1 936.9 248.2% - - 2000 269.6 0.2% 65.0 269.6 0.2% 109.1

To demonstrate the advantages of our methods, we show the convergence curves of the primal and dual bounds of Gurobi and compare them with the solution found by our solver on the instance “C109.50”. As we can observe in Figure 2, our methods achieve near-optimal solutions in around 15 seconds, significantly faster than Gurobi, which takes approximately 300 seconds. We also notice that as the objective values increase, the corresponding constraint violation decreases simultaneously, facilitating fast convergence.

Figure 2: Comparison of Gurobi and our methods: (a) objective value; (b) constraint violation; (c) primal and dual bounds of Gurobi.

5.2 Train timetabling problem

We consider the following space-time network model for the train timetabling problem (TTP) at a macroscopic level, based on the model in [42]. We use a directed acyclic multigraph G = (V, E) to characterize the train timetabling problem, where V and E denote the set of all nodes and the set of all arcs, respectively. For each train j ∈ ℕ_p, sets or parameters carrying the superscript or subscript j refer to the corresponding objects of train j. For each arc e ∈ E_j, we introduce a binary variable x_e equal to 1 if the arc e is selected. For each node v ∈ V, let δ_j^+(v) and δ_j^-(v) be the sets of arcs in E_j leaving and entering node v, respectively. Then the integer programming model of the TTP is given by

max\displaystyle\max jpeEjpexe\displaystyle\sum\nolimits_{j\in\mathbb{N}_{p}}\sum\nolimits_{e\in E^{j}}p_{e}x_{e} (49a)
s.t. eδj+(σ)xe1,jp\displaystyle\sum\nolimits_{e\in\delta_{j}^{+}(\sigma)}x_{e}\leq 1,\hskip 88.2037ptj\in\mathbb{N}_{p} (49b)
eδj(v)xe=eδj+(v)xe,jp,vV\{σ,τ}\displaystyle\sum_{e\in\delta_{j}^{-}(v)}x_{e}=\sum_{e\in\delta_{j}^{+}(v)}x_{e},~{}j\in\mathbb{N}_{p},\hskip 6.25958ptv\in V\backslash\{\sigma,\tau\} (49c)
eδj(τ)xe1,jp\displaystyle\sum\nolimits_{e\in\delta_{j}^{-}(\tau)}x_{e}\leq 1,\hskip 91.04872ptj\in\mathbb{N}_{p} (49d)
v𝒩(v)j𝒯(v)eδj(v)xe1,vV\displaystyle\sum\nolimits_{v^{\prime}\in\mathcal{N}(v)}\sum\nolimits_{j\in\mathcal{T}(v^{\prime})}\sum\nolimits_{e\in\delta_{j}^{-}(v^{\prime})}x_{e}\leq 1,v\in V (49e)
eCxe1,C𝒞\displaystyle\sum\nolimits_{e\in C}x_{e}\leq 1,\hskip 105.2751ptC\in\mathcal{C} (49f)
xe{0,1},eE,\displaystyle x_{e}\in\{0,1\},\hskip 120.92421pte\in E, (49g)

where p_e is the “profit” of using a certain arc e, and σ and τ denote the artificial origin and destination nodes, respectively. 𝒯(v) and 𝒩(v) denote the set of trains that may pass through node v and the set of nodes in conflict with node v, respectively. 𝒞 denotes the (exponentially large) family of maximal subsets C of pairwise incompatible arcs. In this model, (49b), (49c) and (49d) imply that the arcs selected for train j form a valid path in G; (49e) represents the headway constraints; and (49f) forbids the simultaneous selection of incompatible arcs, imposing the track capacity constraints. A toy illustration of such a per-train space-time graph is sketched below.
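The following toy sketch assembles such a per-train space-time graph: nodes are (station, time) pairs together with the artificial origin σ and destination τ, and arcs model departures, travel and dwelling. The station list, travel times, horizon and maximal dwell time are illustrative assumptions, not data from the Jinghu instances.

```python
# Toy construction of a per-train space-time graph (illustrative assumptions only).
def build_space_time_graph(stations, travel_time, horizon, dwell_max=2):
    """Return arcs (u, v) with u, v in {("sigma",), ("tau",)} or (station, time)."""
    arcs = []
    for t in range(horizon):                               # departure arcs from the origin sigma
        arcs.append((("sigma",), (stations[0], t)))
    for i, s in enumerate(stations[:-1]):
        nxt = stations[i + 1]
        for t in range(horizon):
            t_arr = t + travel_time[(s, nxt)]
            if t_arr < horizon:
                arcs.append(((s, t), (nxt, t_arr)))        # travel arc to the next station
            for d in range(1, dwell_max + 1):              # dwell arcs at station s
                if t + d < horizon:
                    arcs.append(((s, t), (s, t + d)))
    for t in range(horizon):                               # arrival arcs into the destination tau
        arcs.append(((stations[-1], t), ("tau",)))
    return arcs

arcs = build_space_time_graph(["A", "B", "C"], {("A", "B"): 2, ("B", "C"): 3}, horizon=8)
print(len(arcs), "arcs in the toy per-train graph")
```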

Let x_j = {x_e ∣ e ∈ E_j}. Then we can rewrite this space-time network model in the general form (1). Our goal is to show that the proposed Algorithm 4 is fully capable of providing implementable timetables for the Jinghu railway. Specifically, our algorithms are tested on the timetabling problem for the Beijing-Shanghai high-speed railway (Jinghu high-speed railway in Mandarin). As one of the busiest railways in the world, the Beijing-Shanghai high-speed railway transported over 210 million passengers in 2019 (for details, see https://en.wikipedia.org/wiki/Beijing-Shanghai_high-speed_railway). In our test case, the problem consists of 29 stations and 292 trains in both directions (up and down), including two major speed levels: 300 km/h and 350 km/h. Several numerical experiments are carried out based on the data of the Beijing-Shanghai high-speed railway to demonstrate the feasibility and effectiveness of the proposed strategy.

We compare our proposed methods with the Gurobi solver and ADMM [21] on small and real-world instances, as presented in Tables IV and V. The notations |F|, |S| and |T| represent the number of trains, the number of stations, and the length of the time window. In most large-scale cases, the Gurobi solver takes much longer to solve, so we set a time limit of two hours. Since (49a) maximizes the total revenue, reversing the objective into a minimization problem yields negative arc costs; both subproblems (16a) and (16b) are then solved by the Bellman-Ford algorithm, an efficient tool for shortest-path problems (a minimal sketch is given below).
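The following is a minimal Bellman-Ford sketch for the per-train shortest-path subproblem referenced above: arc profits p_e are negated so that maximizing the revenue along a σ-τ path becomes a shortest-path problem with (possibly negative) arc costs. The toy edge list is purely illustrative and is not data from the Jinghu instances.

```python
# Bellman-Ford on a small per-train graph; profits are negated into costs (illustrative).
import math

def bellman_ford(num_nodes, edges, source):
    """edges: list of (u, v, cost). Returns (dist, pred) of shortest paths from `source`."""
    dist = [math.inf] * num_nodes
    pred = [None] * num_nodes
    dist[source] = 0.0
    for _ in range(num_nodes - 1):            # at most |V| - 1 relaxation rounds
        updated = False
        for u, v, cost in edges:
            if dist[u] + cost < dist[v]:
                dist[v], pred[v] = dist[u] + cost, u
                updated = True
        if not updated:
            break
    return dist, pred

# Toy graph: node 0 = artificial origin sigma, node 4 = destination tau; costs = -profits.
edges = [(0, 1, -3.0), (0, 2, -1.0), (1, 3, -4.0), (2, 3, -2.0), (3, 4, 0.0)]
dist, pred = bellman_ford(5, edges, source=0)

path, v = [], 4                               # recover the most profitable sigma-tau path
while v is not None:
    path.append(v)
    v = pred[v]
print("best path:", list(reversed(path)), "total profit:", -dist[4])
```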

5.2.1 Performances on small instances

We first validate our algorithm on a smaller sub-network using a subset of stations of the Jinghu railway. We set the revenue of each arc proportional to the distance between two stations, which reflects the intuition that longer rides should generate higher revenue. For our proposed methods, we uniformly set ρ⁰ = 20 and σ = 1.2 for all instances, and we set a time limit of 100 seconds for our algorithms. Since the optimal value is unknown, we report the upper bounds (UB) obtained by the Gurobi solver as a reference for comparison in Table IV. We can observe that our ALM-C performs competitively with the Gurobi solver in terms of both the objective value and the computation time.

TABLE IV: Performance on four small examples on sub-networks of the Jinghu railway for maximizing the total revenue
|F||F| |S||S| |T||T| Gurobi ADMM ALM-P ALM-C
UB ff Time ff Time ff Time ff^{*} Time
20 15 200 20952 20952 5.0 20952 14.2 20952 13.1 20952 10.0
25 15 200 23396 23396 11.5 23396 31.3 21085 22.1 23396 11.4
30 15 300 25707 25707 14.4 24552 14.1 25707 23.0 25707 9.2
40 15 300 34305 34305 24.3 31817 100.0 32061 100.0 34305 17.0

5.2.2 Performances on real-world instances

To verify the efficiency of ALM-C and ALM-P on large-scale data with all stations involved, we consider the five examples of ascending problem size in Table V. Similarly, we try to maximize the total revenue of the schedule. In practice, the revenue of an arc may be ambiguous. Besides, since the Jinghu high-speed railway has a high level of utilization, introducing new trains is always beneficial since it covers otherwise unmet demand. In this view, we introduce an alternative objective function that simply maximizes the number of scheduled trains. Perhaps not surprisingly, this simplified objective can dramatically speed up our methods in practice, because the coefficient p_e in the objective function is sparse, which leads to faster identification of the optimal solution under this model.

For our proposed methods, we set ρ⁰ = 10⁻³ and σ = 2 for the instances No. 1-2, ρ⁰ = 10⁻² and σ = 1.1 for the instances No. 3 and No. 5, and ρ⁰ = 10⁻³ and σ = 1.1 for the instance No. 4. We set x⁰ and λ⁰ to zero for all instances. Both the number of variables and the number of constraints of the model (49) are very large for these practical TTP instances, with dimensions up to tens of millions. Tables VI and VII show the performance results of the compared methods, where gap² := (UB − f)/f.

TABLE V: Five examples of the large practical railway network
No. |F||F| |S||S| |T||T| mm qq nn
1 50 29 300 49,837 231,722 308,177
2 50 29 600 113,737 1,418,182 1,396,572
3 100 29 720 145,135 1,788,088 4,393,230
4 150 29 960 199,211 3,655,151 9,753,590
5 292 29 1080 228,584 13,625,558 26,891,567
TABLE VI: Performance comparison between the Gurobi solver, ADMM, ALM-P and ALM-C on the large networks for maximizing total revenue in Table V.
No. Gurobi ADMM ALM-P ALM-C
ff gap2 Time ff gap2 Time ff gap2 Time ff gap2 Time
1 6.18e+04 13.6% 7200.0 6.94e+04 3.0% 54.1 6.93e+04 3.1% 84.3 6.98e+04 2.5% 72.2
2 7.61e+04 45.3% 7200.0 1.34e+05 3.5% 96.2 1.34e+05 3.4% 144.4 1.35e+05 3.0% 126.1
3 9.69e+04 61.5% 7200.0 2.33e+05 7.3% 558.3 2.36e+05 6.3% 414.9 2.36e+05 6.3% 396.6
4 1.23e+05 74.5% 7200.0 4.38e+05 9.6% 1422.7 4.42e+05 8.7% 924.3 4.45e+05 8.1% 714.9
5 2.95e+05 74.0% 7200.0 9.15e+05 19.5% 3600.0 9.60e+05 15.6% 2094.2 9.63e+05 15.3% 1722.7
TABLE VII: Performance comparison between the Gurobi solver, ADMM, ALM-P and ALM-C on the large networks for maximizing the number of trains in the timetable in Table V
No. ff^{*} Gurobi ADMM ALM-P ALM-C
ff gap1 Time ff gap1 Time ff gap1 Time ff gap1 Time
1 27 3836.0 27 3.4 27 3.4 27 3.5
2 50 50 0 2965.8 50 0 8.2 50 0 8.1 50 0 8.2
3 100 11 89.0% 7200.0 89 11.0% 3001.4 100 0 127.4 100 0 96.4
4 150 14 90.7% 7200.0 148 1.3% 3005.3 150 0 99.5 150 0 96.4
5 292 30 89.7% 7200.0 286 2.1% 3003.8 292 0 203.1 292 0 172.1

The results clearly demonstrate the effectiveness of our methods compared to Gurobi, especially when dealing with large-scale data. For the instance No. 1, both ALM-C and ALM-P can schedule 27 trains in a few seconds, which is at least 1000 times faster than the Gurobi solver, while obtaining satisfactory accuracy. In particular, when the scale of the data grows to tens of millions in instance No. 5, our ALM-C and ALM-P successfully schedule all 292 trains. In contrast, the Gurobi solver schedules only 30 trains and takes much longer. The results of the other instances also illustrate the effectiveness of this technique. Therefore, we can conclude that the customized ALM provides fast and globally optimal solutions for this practical problem.

6 Conclusion

In this paper, we study general integer programming with block structure. Benefiting from its special structure, we extend the augmented Lagrangian method, originally designed for continuous problems, to effectively solve the problem (1). By introducing a novel augmented Lagrangian function, we establish the strong duality and optimality for the problem (1). Furthermore, we provide the convergence results of the proposed methods for both the augmented Lagrangian relaxation and dual problems. To obtain high-quality feasible solutions, we develop a customized ALM combined with refinement techniques to iteratively improve the primal and dual solution quality simultaneously. The numerical experiments demonstrate that the customized ALM is time-saving and performs well for finding optimal solutions to a wide variety of practical problems.

References

  • [1] P. Wang, C. Shen, A. van den Hengel, and P. H. Torr, “Large-scale binary quadratic optimization using semidefinite relaxation and applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 3, pp. 470–485, 2016.
  • [2] B. S. Y. Lam and A. W.-C. Liew, “A fast binary quadratic programming solver based on stochastic neighborhood search,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 1, pp. 32–49, 2020.
  • [3] X. Lin, Z. J. Hou, H. Ren, and F. Pan, “Approximate mixed-integer programming solution with machine learning technique and linear programming relaxation,” in 2019 3rd International Conference on Smart Grid and Smart Cities (ICSGSC).   IEEE, 2019, pp. 101–107.
  • [4] R. Anderson, J. Huchette, W. Ma, C. Tjandraatmadja, and J. P. Vielma, “Strong mixed-integer programming formulations for trained neural networks,” Mathematical Programming, vol. 183, no. 1, pp. 3–39, 2020.
  • [5] F. Eisenbrand, C. Hunkenschröder, K.-M. Klein, M. Kouteckỳ, A. Levin, and S. Onn, “An algorithmic theory of integer programming,” arXiv preprint arXiv:1904.01361, 2019.
  • [6] J. Cslovjecsek, M. Kouteckỳ, A. Lassota, M. Pilipczuk, and A. Polak, “Parameterized algorithms for block-structured integer programs with large entries,” in Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA).   SIAM, 2024, pp. 740–751.
  • [7] J. A. De Loera, R. Hemmecke, S. Onn, and R. Weismantel, “N-fold integer programming,” Discrete Optimization, vol. 5, no. 2, pp. 231–241, 2008.
  • [8] L. Chen, On Block-Structured Integer Programming and Its Applications.   Cham: Springer International Publishing, 2019, pp. 153–177.
  • [9] A. H. Land and A. G. Doig, “An automatic method for solving discrete programming problems,” in 50 Years of Integer Programming 1958-2008.   Springer, 2010, pp. 105–132.
  • [10] R. E. Gomory, “Outline of an algorithm for integer solutions to linear programs,” Bulletin of the American Mathematical Society, vol. 64, pp. 275–278, 1958.
  • [11] J. F. Benders, “Partitioning procedures for solving mixed-variables programming problems,” Numerische Mathematik, vol. 4, no. 1, pp. 238–252, 1962.
  • [12] G. B. Dantzig and P. Wolfe, “Decomposition principle for linear programs,” Operations Research, vol. 8, no. 1, pp. 101–111, 1960.
  • [13] A. M. Geoffrion, “Lagrangean relaxation for integer programming,” in Approaches to Integer Programming.   Springer, 1974, pp. 82–114.
  • [14] M. Conforti, G. Cornuéjols, and G. Zambelli, Integer programming.   Springer, 2014, vol. 271.
  • [15] A. Caprara, M. Monaci, P. Toth, and P. L. Guida, “A lagrangian heuristic algorithm for a real-world train timetabling problem,” Discrete Applied Mathematics, vol. 154, no. 5, pp. 738–753, 2006.
  • [16] B. Wu and B. Ghanem, “ℓ_p-box ADMM: A versatile framework for integer programming,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 7, pp. 1695–1708, 2018.
  • [17] K. Sun, M. Sun, and W. Yin, “Decomposition methods for global solutions of mixed-integer linear programs,” arXiv preprint arXiv:2102.11980, 2021.
  • [18] M. Feizollahi, “Large-scale unit commitment: Decentralized mixed integer programming approaches,” Ph.D. dissertation, Georgia Institute of Technology, 2015.
  • [19] R. Takapoui, “The alternating direction method of multipliers for mixed-integer optimization applications,” Ph.D. dissertation, Stanford University, 2017.
  • [20] Y. Yao, X. Zhu, H. Dong, S. Wu, H. Wu, L. C. Tong, and X. Zhou, “ADMM-based problem decomposition scheme for vehicle routing problem with time windows,” Transportation Research Part B: Methodological, vol. 129, pp. 156–174, 2019.
  • [21] Q. Zhang, R. M. Lusby, P. Shang, and X. Zhu, “Simultaneously re-optimizing timetables and platform schedules under planned track maintenance for a high-speed railway network,” Transportation Research Part C: Emerging Technologies, vol. 121, p. 102823, 2020.
  • [22] Y. Kanno and S. Kitayama, “Alternating direction method of multipliers as a simple effective heuristic for mixed-integer nonlinear optimization,” Structural and Multidisciplinary Optimization, vol. 58, no. 3, pp. 1291–1295, 2018.
  • [23] E. L. Johnson, G. L. Nemhauser, and M. W. Savelsbergh, “Progress in linear programming-based algorithms for integer programming: An exposition,” Informs Journal on Computing, vol. 12, no. 1, pp. 2–23, 2000.
  • [24] J. Feldman, M. J. Wainwright, and D. R. Karger, “Using linear programming to decode binary linear codes,” IEEE Transactions on Information Theory, vol. 51, no. 3, pp. 954–972, 2005.
  • [25] M. Laurent and F. Rendl, “Semidefinite programming and integer programming,” Handbooks in Operations Research and Management Science, vol. 12, pp. 393–514, 2005.
  • [26] P. Wang, C. Shen, and A. Van Den Hengel, “A fast semidefinite approach to solving binary quadratic problems,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1312–1319.
  • [27] A. Camisa, I. Notarnicola, and G. Notarstefano, “A primal decomposition method with suboptimality bounds for distributed mixed-integer linear programming,” in 2018 IEEE Conference on Decision and Control (CDC).   IEEE, 2018, pp. 3391–3396.
  • [28] R. Vujanic, P. M. Esfahani, P. J. Goulart, S. Mariéthoz, and M. Morari, “A decomposition method for large scale MILPs, with performance guarantees and a power system application,” Automatica, vol. 67, pp. 144–156, 2016.
  • [29] P. Muts, I. Nowak, and E. M. Hendrix, “The decomposition-based outer approximation algorithm for convex mixed-integer nonlinear programming,” Journal of Global Optimization, vol. 77, no. 1, pp. 75–96, 2020.
  • [30] W. Murray and K.-M. Ng, “An algorithm for nonlinear optimization problems with binary variables,” Computational Optimization and Applications, vol. 47, no. 2, pp. 257–288, 2010.
  • [31] S. Lucidi and F. Rinaldi, “Exact penalty functions for nonlinear integer programming problems,” Journal of Optimization Theory and Applications, vol. 145, no. 3, pp. 479–488, 2010.
  • [32] C. Yu, K. L. Teo, and Y. Bai, “An exact penalty function method for nonlinear mixed discrete programming problems,” Optimization Letters, vol. 7, no. 1, pp. 23–38, 2013.
  • [33] M. J. Feizollahi, S. Ahmed, and A. Sun, “Exact augmented Lagrangian duality for mixed integer linear programming,” Mathematical Programming, vol. 161, no. 1, pp. 365–387, 2017.
  • [34] C. E. Blair and R. G. Jeroslow, “The value function of a mixed integer program: I,” Discrete Mathematics, vol. 19, no. 2, pp. 121–138, 1977.
  • [35] J. A. González and J. Castro, “A heuristic block coordinate descent approach for controlled tabular adjustment,” Computers & Operations Research, vol. 38, no. 12, pp. 1826–1835, 2011.
  • [36] S. Jäger and A. Schöbel, “The blockwise coordinate descent method for integer programs,” Mathematical Methods of Operations Research, vol. 91, no. 2, pp. 357–381, 2020.
  • [37] P. Tseng, “Convergence of a block coordinate descent method for nondifferentiable minimization,” Journal of Optimization Theory and Applications, vol. 109, no. 3, pp. 475–494, 2001.
  • [38] J. Bolte, S. Sabach, and M. Teboulle, “Proximal alternating linearized minimization for nonconvex and nonsmooth problems,” Mathematical Programming, vol. 146, no. 1, pp. 459–494, 2014.
  • [39] H. Attouch, J. Bolte, and B. F. Svaiter, “Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods,” Mathematical Programming, vol. 137, no. 1-2, pp. 91–129, 2013.
  • [40] A. Ruszczynski, Nonlinear Optimization.   Princeton University Press, 2011.
  • [41] M. M. Solomon, “Algorithms for the vehicle routing and scheduling problems with time window constraints,” Operations research, vol. 35, no. 2, pp. 254–265, 1987.
  • [42] A. Caprara, M. Fischetti, and P. Toth, “Modeling and solving the train timetabling problem,” Operations Research, vol. 50, no. 5, pp. 851–861, 2002.