
A Customized Augmented Lagrangian Method for Block-Structured Integer Programming

Rui Wang, Chuwen Zhang, Shanwen Pu, Jianjun Gao, Zaiwen Wen. Rui Wang and Zaiwen Wen are with the Beijing International Center for Mathematical Research, Peking University, Beijing 100871, China (email: [email protected]; [email protected]). Chuwen Zhang, Shanwen Pu and Jianjun Gao are with the School of Information Management and Engineering, Shanghai University of Finance and Economics, Shanghai 200433, China (email: [email protected]; [email protected]; [email protected]). This research is supported in part by the National Natural Science Foundation of China (NSFC) grants 12331010, 72150001, 72225009, 72394360, 72394365.
Abstract

Integer programming with block structures has received considerable attention recently and is widely used in many practical applications such as train timetabling and vehicle routing problems. It is known to be NP-hard due to the presence of integer variables. We define a novel augmented Lagrangian function by directly penalizing the inequality constraints and establish strong duality between the primal problem and the augmented Lagrangian dual problem. Then, a customized augmented Lagrangian method is proposed to exploit the block structure. In particular, the minimization of the augmented Lagrangian function is decomposed into multiple subproblems by decoupling the linking constraints, and these subproblems can be efficiently solved using the block coordinate descent method. We also establish the convergence property of the proposed method. To make the algorithm more practical, we further introduce several refinement techniques to identify high-quality feasible solutions. Numerical experiments on a few interesting scenarios show that our proposed algorithm often achieves a satisfactory solution and is quite effective.

Index Terms:
Integer programming, augmented Lagrangian method, block coordinate descent, convergence

1 Introduction

In this paper, we consider a block-structured integer programming problem:

\min~~ \boldsymbol{c}^{\top}\textbf{x}   (1a)
\text{s.t.}~~ \boldsymbol{A}\textbf{x}\leq\boldsymbol{b},   (1b)
~~~~~~~ \textbf{x}_{j}\in\mathcal{X}_{j},~j=1,2,\dots,p,   (1c)

where $\textbf{x}_{j}\in\mathbb{R}^{n_{j}}$ is the $j$-th block variable of $\textbf{x}\in\mathbb{R}^{n}$, i.e., $\textbf{x}=(\textbf{x}_{1};\dots;\textbf{x}_{p})$ for $p\geq 1$ with $n=\sum_{j=1}^{p}n_{j}$. In (1), $\boldsymbol{A}\in\mathbb{R}^{m\times n}$, $\boldsymbol{b}\in\mathbb{R}^{m}$, $\boldsymbol{c}\in\mathbb{R}^{n}$, and each constraint set $\mathcal{X}_{j}$ is the set of $0,1$ vectors in a polyhedron, i.e.,

\mathcal{X}_{j}:=\{\textbf{x}_{j}\in\{0,1\}^{n_{j}}:\boldsymbol{B}_{j}\textbf{x}_{j}\leq\boldsymbol{d}_{j}\},~j=1,2,\dots,p.

The constraints (1c) can be reformulated as $\textbf{x}\in\mathcal{X}:=\{\textbf{x}\in\{0,1\}^{n}:\boldsymbol{B}\textbf{x}\leq\boldsymbol{d}\}$, where the block diagonal matrix $\boldsymbol{B}\in\mathbb{R}^{q\times n}$ is formed by the small submatrices $\boldsymbol{B}_{j}$ on the main diagonal and $\boldsymbol{d}=(\boldsymbol{d}_{1};\dots;\boldsymbol{d}_{p})$. Correspondingly, $\boldsymbol{c}$ and $\boldsymbol{A}$ can be rewritten as $\boldsymbol{c}=(\boldsymbol{c}_{1};\boldsymbol{c}_{2};\dots;\boldsymbol{c}_{p})$ with $\boldsymbol{c}_{j}\in\mathbb{R}^{n_{j}}$ and $\boldsymbol{A}=(\boldsymbol{A}_{1}~\boldsymbol{A}_{2}~\dots~\boldsymbol{A}_{p})$ with $\boldsymbol{A}_{j}\in\mathbb{R}^{m\times n_{j}}$.
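To make the block notation concrete, the following minimal Python sketch (illustrative only, with hypothetical random data, not an instance from the paper) assembles $\boldsymbol{A}$, $\boldsymbol{B}$, $\boldsymbol{c}$ and $\boldsymbol{d}$ from per-block pieces:

```python
import numpy as np
from scipy.linalg import block_diag

# Hypothetical instance: p = 3 blocks, n_j = 2 variables per block,
# m = 2 coupling constraints A x <= b, one local constraint per block.
p, nj, m = 3, 2, 2
rng = np.random.default_rng(0)

A_blocks = [rng.integers(0, 2, size=(m, nj)) for _ in range(p)]   # A_j with 0/1 entries
B_blocks = [np.ones((1, nj), dtype=int) for _ in range(p)]        # B_j x_j <= d_j, e.g. sum(x_j) <= 1
c_blocks = [rng.standard_normal(nj) for _ in range(p)]
d_blocks = [np.ones(1) for _ in range(p)]

A = np.hstack(A_blocks)        # A = (A_1  A_2 ... A_p), shape (m, n)
B = block_diag(*B_blocks)      # block diagonal B, shape (q, n)
c = np.concatenate(c_blocks)   # c = (c_1; ...; c_p)
d = np.concatenate(d_blocks)
b = np.ones(m)
```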

Assume that these constraints (1c) are “nice” in the sense that an integer program with just these constraints is easy. Therefore, if the coupling constraints (1b) are ignored, the remaining problem, which is only composed of the constraints (1c), is easier to solve than the original problem (1). For convenience, we assume $\mathcal{X}$ is not empty and (1) is feasible. Denote by $f^{\text{IP}}$ the optimal value of the problem (1). This block structure is closely related to an important model known as “$n$-fold integer programming (IP)”, studied extensively in computer vision [1, 2], machine learning [3, 4] and theoretical computer science [5, 6], etc. The theoretical foundations of $n$-fold IPs have significant implications for efficient algorithm development in various fields. For example, an algorithmic theory of integer programming based on $n$-fold IP was proposed in [5]. Recent advancements have been provided in [6]. Furthermore, the progress in theory and application of integer programming with a block structure was summarized in [7, 8]. While existing works on $n$-fold IP mainly focus on asymptotic analyses, this work aims to develop an efficient augmented Lagrangian approach tailored to the block structure for practical efficiency with a convergence guarantee.

1.1 Related Work

The branch and bound algorithm for general integer programming (IP) was first introduced by Land and Doig [9]. Gomory [10] developed a cutting plane algorithm for integer programming problems. These two approaches are at the heart of current state-of-the-art software for integer programming. Unfortunately, these methods often suffer from high computational burdens due to the discrete constraints, so they are not good choices for solving some large-scale practical problems. Therefore, it is necessary to develop efficient approaches to obtain feasible and desirable solutions, even if they may not be globally optimal. Considering the block structure of the integer linear program (1), a natural idea is to decompose a large global problem into smaller local subproblems. There are three typical decomposition methods for solving such problems: Benders [11], Dantzig-Wolfe (DW) [12] and Lagrangian decompositions [13]. The Benders decomposition method is used to deal with mixed integer programming problems by decomposing them into a master problem and a primal subproblem. It generates cuts that are added to the master problem by solving the dual of the primal subproblem. This method is not suitable for (1), since our primal subproblem is an integer programming problem, which is still NP-hard. The DW decomposition method can solve the block-structured integer programming problem (1) based on the resolution theorem. To improve the tractability of large-scale problems, the DW decomposition relies on column generation. However, it might be harder or even intractable to solve the master problem by column generation if further constraints are applied to $\mathcal{X}_{j}$ [14, Chapter 8.2]. The Lagrangian decomposition introduces Lagrange multipliers and constructs a sequence of simpler subproblems. Although the method itself has a limitation due to its inherent solution symmetry for certain practical problems, the Lagrangian duality bound is useful in the branch-and-bound procedure. A Lagrangian heuristic algorithm has also been proposed in [15] for solving a real-world train timetabling problem, which decomposed the problem into smaller subproblems using Lagrangian relaxation and developed a heuristic algorithm based on subgradient optimization to generate feasible solutions.

Many techniques from continuous optimization, including the alternating direction method of multipliers (ADMM), have recently been applied to integer programming. The authors in [16] proposed an $\ell_{p}$-box ADMM for solving binary integer programs, where the binary constraint is replaced by the intersection of a box and an $\ell_{p}$-norm sphere. The authors in [17] proposed an augmented Lagrangian method (ALM) and an ADMM based on the $\ell_{1}$ augmented Lagrangian function for two-block mixed integer linear programming (MILP). For the multi-block MILP problem, an extended ADMM was proposed in [18] that employs a release-and-fix approach to solve the subproblems. There are also some heuristic methods based on ADMM [19, 20, 21, 22] that have been successfully applied to various practical integer programming problems.

There are some other widespread approaches based on relaxations of the binary constraint, including linear programming (LP) relaxation [23, 24] and semidefinite relaxation [25, 26]. For the multi-block MILP problem, several inexact methods have been proposed, including a distributed algorithm relying on primal decomposition [27], a dual decomposition method [28], and a decomposition-based outer approximation method [29]. By introducing continuous variables to replace the discrete variables, exact penalty methods [30, 31, 32] have been studied for solving nonlinear IP problems; the problem is then transformed into an equivalent nonlinear continuous optimization problem.

1.2 Contributions

In this paper, we propose a customized ALM for solving (1). Our main contributions are listed below.

  • (i)

    We define a novel augmented Lagrangian (AL) function that differs from the classical AL function and establish strong duality theory for the augmented Lagrangian relaxation of the block-structured integer programming (1), which motivates us to utilize the ALM for solving the problem (1).

  • (ii)

    Based on the special structure of (1), we propose two block coordinate descent (BCD)-type methods that are well-suited for solving the resulting subproblems in our ALM framework. These methods utilize classical update and proximal linear update techniques, denoted as ALM-C and ALM-P, respectively. We also analyze their convergence properties under proper conditions.

  • (iii)

    To address challenges in finding the global optimal solution for practical problems such as train timetabling, we introduce refinement strategies and propose a customized ALM to enhance the quality of solutions generated by the ALM. Our numerical experiments demonstrate the effectiveness of the proposed method in solving large-scale instances.

Note that the ADMM-based method in [21] solves each subproblem only once per iteration for a fixed Lagrangian multiplier. On the other hand, our ALM method involves multiple iterations using the BCD method to minimize each AL function until certain rules are satisfied. Once the AL subproblem is solved exactly, the strong duality guarantees that the ALM can converge to a global minimizer of the problem (1). Therefore, achieving high accuracy in minimizing the AL function allows our ALM to achieve superior solutions with fewer iterations, resulting in significantly reduced computational time compared to the ADMM. This claim is supported by our numerical tests. Additionally, we introduce Assumptions 3.1 and 3.2, derived from the structural characteristics of the practical problem in [20] and [21], and subsequently analyze the theoretical properties of the ALM-C method under these assumptions. However, our ALM-P method has wider applicability and does not require these specific assumptions. Moreover, the ALM can be regarded as a dual ascent algorithm with respect to dual variables, hence ensuring its convergence. In contrast, the convergence analysis of the ADMM for solving this kind of problem remains unclear.

The existing ALM-based methods for solving integer programming in [33] mainly focused on the duality of the augmented Lagrangian dual for MILP with equality constraints, but numerical experiments were not available. Moreover, our approach differs significantly in that we handle inequality constraints directly, without introducing slack variables. This approach has two key benefits: it reduces the number of variables, thus decreasing the computational burden in high-dimensional cases, and it allows for more customized algorithmic design based on the inherent structure of the problem. In [17], the $\ell_{1}$ norm was considered as the augmented term in the AL function for MILP, so that the minimization of the AL function can be decomposed into multiple low-dimensional $\ell_{1}$-penalty subproblems due to the separable blocks. These subproblems were then solved in parallel using the Gurobi solver. In our work, we take a different approach by using a quadratic term in the AL function, which allows the augmented Lagrangian subproblem to be reduced to a linear program under certain conditions. Furthermore, we update the blocks of variables sequentially, one at a time.

1.3 Notation and Organization

Let $\mathbb{N}_{m}:=\{1,2,\dots,m\}$, $\mathbb{N}_{p}:=\{1,2,\dots,p\}$ and $\mathbb{R}^{m}_{+}:=\{x\in\mathbb{R}^{m}:x_{i}\geq 0~\text{for all}~i\}$. The superscript “$\top$” means “transpose”. Denote by $\boldsymbol{a}_{i}$ the $i$-th row of the matrix $\boldsymbol{A}$ and by $b_{i}$ the $i$-th element of the vector $\boldsymbol{b}$. For convenience, we let $\boldsymbol{A}_{\mathcal{I},j}$ denote the submatrix consisting of the rows of $\boldsymbol{A}_{j}$ indexed by $\mathcal{I}$ and $\boldsymbol{b}_{\mathcal{I}}$ denote the subvector consisting of the entries of $\boldsymbol{b}\in\mathbb{R}^{m}$ indexed by $\mathcal{I}$. A neighborhood of a point $\textbf{x}^{*}$ is a set $\mathcal{N}(\textbf{x}^{*},1)$ consisting of all points $\textbf{x}$ such that $\|\textbf{x}-\textbf{x}^{*}\|^{2}\leq 1$. Let $\textbf{1}$ be the vector of all ones.

This paper is structured as follows. An AL function is defined and the strong duality of the problem (1) is discussed in section 2. We propose a customized ALM incorporating BCD methods and refinement strategies to improve the quality of the ALM solutions in section 3. We establish the convergence results of both the BCD methods to minimize the AL function and the ALM applied to the whole problem (1) in section 4. The proposed method is applied to two practical problems in section 5. Concluding remarks are made in the last section.

2 The AL Strong Duality

The Lagrangian relaxation (LR) of (1) with respect to the constraint (1b) has the following form:

\min_{\textbf{x}\in\mathcal{X}}~~\mathcal{L}(\textbf{x},\lambda):=\sum_{j=1}^{p}\boldsymbol{c}_{j}^{\top}\textbf{x}_{j}+\lambda^{\top}\Big(\sum_{j=1}^{p}\boldsymbol{A}_{j}\textbf{x}_{j}-\boldsymbol{b}\Big), \qquad (2)

where $\lambda\in\mathbb{R}^{m}_{+}$ is a Lagrange multiplier associated with the constraint (1b). We can observe that (2) is much easier to solve than the original problem (1), since (2) can be decomposed into $p$ low-dimensional independent subproblems. However, there may exist a non-zero duality gap when certain constraints are relaxed by using the classical Lagrangian dual [33]. To reduce the duality gap, we add a quadratic penalty function to the Lagrangian function in (2) and solve the augmented Lagrangian dual problem, which is defined as follows.

Definition 2.1 (AL Dual)

We define an AL function by

L(\textbf{x},\lambda,\rho)=\sum_{j=1}^{p}\boldsymbol{c}_{j}^{\top}\textbf{x}_{j}+\lambda^{\top}\Big(\sum_{j=1}^{p}\boldsymbol{A}_{j}\textbf{x}_{j}-\boldsymbol{b}\Big)+\frac{\rho}{2}\Big\|\Big(\sum_{j=1}^{p}\boldsymbol{A}_{j}\textbf{x}_{j}-\boldsymbol{b}\Big)_{+}\Big\|^{2}, \qquad (3)

where $\lambda\in\mathbb{R}_{+}^{m}$ and $\rho>0$. The corresponding AL relaxation of (1) is given by

d(\lambda,\rho):=\min_{\textbf{x}\in\mathcal{X}}~L(\textbf{x},\lambda,\rho). \qquad (4)

We call the following maximization problem the AL dual problem:

f_{\rho}^{\text{LD}}:=\max_{\lambda\in\mathbb{R}_{+}^{m}}d(\lambda,\rho). \qquad (5)

Note that the classical AL function of (1) is given by

\hat{L}(\textbf{x},\lambda,\rho)=\boldsymbol{c}^{\top}\textbf{x}+\frac{\rho}{2}\Big\|\Big(\boldsymbol{A}\textbf{x}-\boldsymbol{b}+\frac{\lambda}{\rho}\Big)_{+}\Big\|^{2}-\frac{\|\lambda\|^{2}}{2\rho}. \qquad (6)

We prefer using the form of the AL function (3) rather than the classical version (6) due to the absence of $\lambda/\rho$ inside the max operator of the quadratic term. This makes it possible to convert the AL function into a linear function under certain conditions, making the problem (4) easier to solve, as will be explained in the next section. We also verify that the quadratic term in (3) is an exact penalty and that strong duality holds between the AL dual problem (5) and the primal problem (1).
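As a minimal illustration (not part of the original development), the AL function (3) can be evaluated directly with NumPy; the helper below assumes dense arrays for $\boldsymbol{A}$, $\boldsymbol{b}$, $\boldsymbol{c}$:

```python
import numpy as np

def aug_lagrangian(x, lam, rho, A, b, c):
    """Evaluate (3): c'x + lam'(Ax - b) + (rho/2) ||(Ax - b)_+||^2."""
    r = A @ x - b                                   # coupling-constraint residual
    return c @ x + lam @ r + 0.5 * rho * np.sum(np.maximum(r, 0.0) ** 2)
```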

Lemma 2.1 (Strong Duality)

Suppose the problem (1) is feasible and its optimal value is bounded. If a minimum achievable non-zero slack exists, i.e., there is a $\delta$ such that for any $i\in\mathbb{N}_{m}$,

0<\delta\leq\min_{\textbf{x}\in\mathcal{X}}\{(\boldsymbol{a}_{i}\textbf{x}-b_{i})^{2}:\boldsymbol{a}_{i}\textbf{x}>b_{i}\}, \qquad (7)

then there exists a finite $\rho^{*}\in(0,+\infty)$ such that

f_{\rho^{*}}^{\text{LD}}=\min_{\boldsymbol{A}\textbf{x}\leq\boldsymbol{b},\,\textbf{x}\in\mathcal{X}}\boldsymbol{c}^{\top}\textbf{x}.
Proof:

For any $\lambda\in\mathbb{R}_{+}^{m}$ and $\rho>0$, since $\{\textbf{x}\in\mathcal{X}:\boldsymbol{A}\textbf{x}-\boldsymbol{b}\leq 0\}\subseteq\mathcal{X}$, we have

d(\lambda,\rho)\leq\min_{\textbf{x}\in\mathcal{X},\,\boldsymbol{A}\textbf{x}-\boldsymbol{b}\leq 0}L(\textbf{x},\lambda,\rho)\leq\min_{\textbf{x}\in\mathcal{X},\,\boldsymbol{A}\textbf{x}-\boldsymbol{b}\leq 0}\boldsymbol{c}^{\top}\textbf{x}=f^{\text{IP}}. \qquad (8)

Then $f_{\rho}^{\text{LD}}\leq f^{\text{IP}}$.

Now it suffices to find a finite $\rho^{*}$ such that $f_{\rho^{*}}^{\text{LD}}\geq f^{\text{IP}}$. We first let $\textbf{x}^{0}$ be an arbitrary feasible solution of (1), that is, $\textbf{x}^{0}\in\mathcal{X}$ and $\boldsymbol{A}\textbf{x}^{0}\leq\boldsymbol{b}$. Denote by $f^{\text{LP}}$ the optimal value of the linear programming (LP) relaxation of (1). Since the value of the LP relaxation of (1) is bounded [34], i.e., $-\infty<f^{\text{LP}}\leq\boldsymbol{c}^{\top}\textbf{x}^{0}<+\infty$, we set $\rho^{*}=2(\boldsymbol{c}^{\top}\textbf{x}^{0}-f^{\text{LP}})/\delta$; then $0<\rho^{*}<+\infty$. Moreover,

f_{\rho^{*}}^{\text{LD}}=\max_{\lambda\in\mathbb{R}_{+}^{m}}d(\lambda,\rho^{*})\geq d(\lambda^{*},\rho^{*})=\min_{\textbf{x}\in\mathcal{X}}L(\textbf{x},\lambda^{*},\rho^{*}), \qquad (9)

where $\lambda^{*}\in\mathbb{R}_{+}^{m}$ is a given parameter to be specified below. Let $\mathbb{I}:=\{i\in\mathbb{N}_{m}:\boldsymbol{a}_{i}\textbf{x}-b_{i}>0~\text{for some}~\textbf{x}\in\mathcal{X}\}$. We consider the following two cases:

Case 1: $\mathbb{I}=\emptyset$. In this case we have $\boldsymbol{A}\textbf{x}\leq\boldsymbol{b}$ for all $\textbf{x}\in\mathcal{X}$. By letting $\bar{\lambda}=0$, we obtain that

L(\textbf{x},\bar{\lambda},\rho^{*})=\boldsymbol{c}^{\top}\textbf{x}+\bar{\lambda}^{\top}(\boldsymbol{A}\textbf{x}-\boldsymbol{b})+\frac{\rho^{*}}{2}\|(\boldsymbol{A}\textbf{x}-\boldsymbol{b})_{+}\|^{2}=\boldsymbol{c}^{\top}\textbf{x}\geq f^{\text{IP}}. \qquad (10)

Case 2: $\mathbb{I}\neq\emptyset$. Denote by $\lambda^{\text{LP}}$ a nonnegative optimal vector of dual variables for $\boldsymbol{A}\textbf{x}\leq\boldsymbol{b}$ in the LP relaxation of (1). In this case we get

L(\textbf{x},\lambda^{\text{LP}},\rho^{*})=\boldsymbol{c}^{\top}\textbf{x}+(\lambda^{\text{LP}})^{\top}(\boldsymbol{A}\textbf{x}-\boldsymbol{b})+\frac{\rho^{*}}{2}\sum_{i\in\mathbb{I}}(\boldsymbol{a}_{i}\textbf{x}-b_{i})^{2}+\frac{\rho^{*}}{2}\sum_{i\notin\mathbb{I}}((\boldsymbol{a}_{i}\textbf{x}-b_{i})_{+})^{2}=\boldsymbol{c}^{\top}\textbf{x}+(\lambda^{\text{LP}})^{\top}(\boldsymbol{A}\textbf{x}-\boldsymbol{b})+\frac{\rho^{*}}{2}\sum_{i\in\mathbb{I}}(\boldsymbol{a}_{i}\textbf{x}-b_{i})^{2}.

Since

\frac{\rho^{*}}{2}\sum_{i\in\mathbb{I}}(\boldsymbol{a}_{i}\textbf{x}-b_{i})^{2}\geq\frac{\rho^{*}}{2}\min_{i\in\mathbb{I}}(\boldsymbol{a}_{i}\textbf{x}-b_{i})^{2}\overset{(7)}{\geq}\frac{\rho^{*}}{2}\delta=\boldsymbol{c}^{\top}\textbf{x}^{0}-f^{\text{LP}},

it yields

L(\textbf{x},\lambda^{\text{LP}},\rho^{*})\geq\boldsymbol{c}^{\top}\textbf{x}+(\lambda^{\text{LP}})^{\top}(\boldsymbol{A}\textbf{x}-\boldsymbol{b})+(\boldsymbol{c}^{\top}\textbf{x}^{0}-f^{\text{LP}})\geq f^{\text{LP}}+(\boldsymbol{c}^{\top}\textbf{x}^{0}-f^{\text{LP}})=\boldsymbol{c}^{\top}\textbf{x}^{0}\geq f^{\text{IP}}, \qquad (11)

where the second inequality holds due to the definition of $\lambda^{\text{LP}}$. Thus the inequalities (10) and (11), by letting $\lambda^{*}$ be $\bar{\lambda}$ in Case 1 and $\lambda^{\text{LP}}$ in Case 2, imply that

d(\lambda^{*},\rho^{*})=\min_{\textbf{x}\in\mathcal{X}}~L(\textbf{x},\lambda^{*},\rho^{*})\geq f^{\text{IP}}.

This together with (9) and (8) yields that

f_{\rho^{*}}^{\text{LD}}=d(\lambda^{*},\rho^{*})=f^{\text{IP}}.

Hence we complete the proof. ∎

One can observe from the proof that there exists a finite value $\rho^{*}$ such that strong duality still holds for all $\rho\geq\rho^{*}$. Therefore, given a sufficiently large penalty parameter $\rho$, we can achieve satisfactory feasibility for (1). Specifically, as $\rho$ increases beyond $\rho^{*}$, the augmented Lagrangian method penalizes violations of the constraints (1b) more heavily, thereby pushing the solution closer to the feasible region of the problem. Strong duality allows us to obtain a globally optimal solution to the problem (1) by solving the augmented Lagrangian dual problem (5).

To utilize the concave structure of the dual function $d(\lambda,\rho)$, we apply the projected subgradient method for solving (5), since the dual function is not differentiable. We first give the definition of subgradient and subdifferential.

Definition 2.2

Let $h:\mathbb{R}^{m}\rightarrow\mathbb{R}$ be a convex function. The vector $s\in\mathbb{R}^{m}$ is called a subgradient of $h$ at $\bar{x}\in\mathbb{R}^{m}$ if

h(x)-h(\bar{x})\geq s^{\top}(x-\bar{x}),~~\forall x\in\mathbb{R}^{m}.

The subdifferential of $h$ at $\bar{x}$ is the set of all subgradients of $h$ at $\bar{x}$, which is given by

\partial h(\bar{x})=\{s\in\mathbb{R}^{m}:h(x)-h(\bar{x})\geq s^{\top}(x-\bar{x}),~~\forall x\in\mathbb{R}^{m}\}.

Since $d(\lambda,\rho)$ is concave, we adjust Definition 2.2 to correspond to the set $-\partial(-d(\lambda,\rho))$, allowing us to apply properties of the subdifferential of a convex function to a concave function.

Proposition 2.1

Consider the dual function $d(\lambda,\rho):\mathbb{R}^{m+1}\rightarrow\mathbb{R}$. Then the subdifferentials of $d$ at $\lambda$ and $\rho$ satisfy

\boldsymbol{A}\textbf{x}-\boldsymbol{b}\in\partial_{\lambda}d(\lambda,\rho),\quad\frac{1}{2}\|(\boldsymbol{A}\textbf{x}-\boldsymbol{b})_{+}\|^{2}\in\partial_{\rho}d(\lambda,\rho), \qquad (12)

where $\textbf{x}$ is a solution of (4) with input $(\lambda,\rho)$.

Proof:

Let $\textbf{x}$ be a solution of (4) with input $(\lambda,\rho)$. Then for any pair $(\hat{\lambda},\hat{\rho})\in\mathbb{R}^{m}_{+}\times\mathbb{R}_{+}$, it holds that

\begin{aligned}
d(\hat{\lambda},\hat{\rho})&\leq\boldsymbol{c}^{\top}\textbf{x}+\hat{\lambda}^{\top}(\boldsymbol{A}\textbf{x}-\boldsymbol{b})+\frac{\hat{\rho}}{2}\|(\boldsymbol{A}\textbf{x}-\boldsymbol{b})_{+}\|^{2}\\
&=\boldsymbol{c}^{\top}\textbf{x}+\lambda^{\top}(\boldsymbol{A}\textbf{x}-\boldsymbol{b})+\frac{\rho}{2}\|(\boldsymbol{A}\textbf{x}-\boldsymbol{b})_{+}\|^{2}+(\hat{\lambda}-\lambda)^{\top}(\boldsymbol{A}\textbf{x}-\boldsymbol{b})+\frac{\hat{\rho}-\rho}{2}\|(\boldsymbol{A}\textbf{x}-\boldsymbol{b})_{+}\|^{2}\\
&=d(\lambda,\rho)+(\hat{\lambda}-\lambda)^{\top}(\boldsymbol{A}\textbf{x}-\boldsymbol{b})+\frac{\hat{\rho}-\rho}{2}\|(\boldsymbol{A}\textbf{x}-\boldsymbol{b})_{+}\|^{2},
\end{aligned}

where the last equality holds due to the optimality of $\textbf{x}$. Then by the definition of subgradient and subdifferential, we arrive at (12). ∎

3 Augmented Lagrangian method

In this section, we introduce the augmented Lagrangian method framework. The method consists of two steps for solving the augmented Lagrangian dual problem (5). We first use the primal information $\textbf{x}$ to construct a subgradient of $d$ at $(\lambda,\rho)$. Since $\lambda$ and $\rho$ satisfy $\lambda\geq 0$ and $\rho>0$, we then apply the projected subgradient method to update the parameters $\lambda$ and $\rho$. Starting from $\lambda^{0}\in\mathbb{R}_{+}^{m}$ and $\rho^{0}>0$, we update $\lambda$ and $\rho$ at the $(k+1)$-th iteration by

\lambda^{k+1}=\big(\lambda^{k}+\alpha^{k}(\boldsymbol{A}\textbf{x}^{k+1}-\boldsymbol{b})\big)_{+}, \qquad (14a)
\rho^{k+1}=\Big(\rho^{k}+\frac{\alpha^{k}}{2}\big\|(\boldsymbol{A}\textbf{x}^{k+1}-\boldsymbol{b})_{+}\big\|^{2}\Big)_{+}=\rho^{k}+\frac{\alpha^{k}}{2}\big\|(\boldsymbol{A}\textbf{x}^{k+1}-\boldsymbol{b})_{+}\big\|^{2}, \qquad (14b)

where $\textbf{x}^{k+1}$ is an optimal solution of the augmented Lagrangian relaxation problem (4) at $\lambda^{k}$ and $\rho^{k}$, that is,

\textbf{x}^{k+1}\in\arg\min_{\textbf{x}\in\mathcal{X}}~L(\textbf{x},\lambda^{k},\rho^{k}). \qquad (15)

Therefore, the iterative processes (15), (14a) and (14b), involving the primal and dual variables, make up the ALM framework.
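A minimal sketch of the dual updates (14a)-(14b), assuming the primal iterate x_new has already been obtained by (approximately) solving (15) and that the step size alpha is a constant chosen by the user, is given below:

```python
import numpy as np

def dual_update(lam, rho, x_new, A, b, alpha):
    """Projected subgradient step on (lambda, rho), cf. (14a)-(14b)."""
    r = A @ x_new - b
    lam_new = np.maximum(lam + alpha * r, 0.0)                        # (14a): project onto lambda >= 0
    rho_new = rho + 0.5 * alpha * np.sum(np.maximum(r, 0.0) ** 2)     # (14b): rho is nondecreasing
    return lam_new, rho_new
```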

Solving the $\textbf{x}$-subproblem (15) is an important step in the ALM. While the subproblem (15) has no closed-form solution in general, we can apply an iterative method to solve it exactly or inexactly. Consequently, the ALM for solving the problem (1) consists of outer and inner iterations. The index $k$ is used to denote the outer iteration number, and the index $t$ is used to denote the inner iteration number.

3.1 A BCD method for subproblem (15)

In this subsection, we present a BCD method for solving the $\textbf{x}$-subproblem (15) in the ALM framework. The BCD method minimizes the function $L$ by cyclically updating the blocks in the order $\textbf{x}_{1},\dots,\textbf{x}_{p}$, keeping the other blocks fixed at their most recent values. Denote $\textbf{x}^{t}=(\textbf{x}_{1}^{t};\textbf{x}_{2}^{t};\dots;\textbf{x}_{p}^{t})$, where $\textbf{x}^{t}_{j}$ is the value of $\textbf{x}_{j}$ at its $t$-th update. Let

L^{t}_{j}(\textbf{x}_{j},\lambda,\rho)=L(\textbf{x}^{t+1}_{1},\dots,\textbf{x}^{t+1}_{j-1},\textbf{x}_{j},\textbf{x}^{t}_{j+1},\dots,\textbf{x}^{t}_{p},\lambda,\rho),

and

\textbf{x}^{t}(j)=\big(\textbf{x}_{1}^{t+1},\textbf{x}_{2}^{t+1},\dots,\textbf{x}_{j-1}^{t+1},\textbf{x}_{j}^{t},\textbf{x}_{j+1}^{t},\dots,\textbf{x}_{p}^{t}\big).

Therefore $\textbf{x}^{t}(1)=(\textbf{x}_{1}^{t};\textbf{x}_{2}^{t};\dots;\textbf{x}_{p}^{t})=\textbf{x}^{t}$ and $\textbf{x}^{t}(p+1)=(\textbf{x}_{1}^{t+1};\textbf{x}_{2}^{t+1};\dots;\textbf{x}_{p}^{t+1})=\textbf{x}^{t+1}$. Given fixed parameters $\lambda\geq 0$ and $\rho>0$, we can also calculate the partial gradient of $L(\textbf{x}^{t}(j),\lambda,\rho)$ with respect to $\textbf{x}_{j}$ by

g_{j}(\textbf{x}^{t}):=\nabla_{\textbf{x}_{j}}L(\textbf{x}^{t}(j),\lambda,\rho)=\boldsymbol{c}_{j}+\boldsymbol{A}_{j}^{\top}\lambda+\rho\boldsymbol{A}_{j}^{\top}\big(\boldsymbol{A}\textbf{x}^{t}(j)-\boldsymbol{b}\big)_{+}.

At each step, we consider two types of updates for every $\textbf{x}_{j}\in\mathcal{X}_{j}$:

Classical: \textbf{x}^{t+1}_{j}\in\arg\min_{\textbf{x}_{j}\in\mathcal{X}_{j}}~L^{t}_{j}(\textbf{x}_{j},\lambda,\rho), \qquad (16a)
Proximal linear: \textbf{x}_{j}^{t+1}\in\arg\min_{\textbf{x}_{j}\in\mathcal{X}_{j}}\Big\{\langle\textbf{x}_{j}-\textbf{x}_{j}^{t},g_{j}(\textbf{x}^{t})\rangle+\frac{1}{2\tau}\|\textbf{x}_{j}-\textbf{x}_{j}^{t}\|^{2}\Big\}, \qquad (16b)

where $\tau>0$ is a step size. In fact, if one defines the projection operator by

P_{\mathcal{X}}(v)\in\arg\min\{\|u-v\|:u\in\mathcal{X}\},

then we obtain the following equivalent form of (16b):

\textbf{x}_{j}^{t+1}=P_{\mathcal{X}_{j}}\big(\textbf{x}_{j}^{t}-\tau g_{j}(\textbf{x}^{t})\big).

In general, the classical subproblem (16a) is fundamentally harder to solve because of the quadratic term in the objective function. However, we derive a simplified form of this subproblem under certain conditions, which will be discussed in Subsection 3.1.2. By contrast, the proximal linear subproblem (16b) is relatively easy to solve because its objective function becomes linear with respect to $\textbf{x}_{j}$ (see (17)-(18) below). We now summarize the ALM for solving (1) in Algorithm 1, which allows each $\textbf{x}_{j}$ to be updated by (16a) or (16b); an illustrative sketch follows the algorithm box. We assume that each block $j$ is updated by the same scheme, either (16a) or (16b), for all iterations $t$.

Input: Initial point $\textbf{x}^{0}$, $\lambda^{0},\rho^{0}$.
Output: A feasible solution $\textbf{x}^{k+1}$.
1 for $k=0,1,\dots,k_{\max}$ do
2   for $t=0,1,\dots,t_{\max}$ do
3     for $j=1,2,\dots,p$ do
4       Compute $\textbf{x}_{j}^{t+1}$ by (16a) or (16b);
5     if $\textbf{x}^{t+1}=\textbf{x}^{t}$, then let $\textbf{x}^{k+1}=\textbf{x}^{t+1}$ and break;
6   if $\|(\boldsymbol{A}\textbf{x}^{k+1}-\boldsymbol{b})_{+}\|_{2}=0$, then terminate;
7   Update the Lagrangian multiplier $\lambda^{k+1}$ by (14a) and the penalty coefficient $\rho^{k+1}$ by (14b).
Algorithm 1 ALM with BCD
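A minimal Python sketch of Algorithm 1 is given below; it assumes a hypothetical user-supplied oracle block_argmin(j, x, lam, rho) that returns the block update (16a) or (16b), and a list block_slices giving the index range of each block in x:

```python
import numpy as np

def alm_bcd(x0, lam0, rho0, A, b, block_slices, block_argmin,
            alpha=1.0, k_max=50, t_max=100):
    """Sketch of Algorithm 1: outer ALM iterations with an inner BCD loop."""
    x, lam, rho = x0.copy(), lam0.copy(), rho0
    for k in range(k_max):
        for t in range(t_max):                      # inner BCD sweeps
            x_prev = x.copy()
            for j, sl in enumerate(block_slices):   # cyclic block updates (16a)/(16b)
                x[sl] = block_argmin(j, x, lam, rho)
            if np.array_equal(x, x_prev):           # line 5: BCD iterates unchanged
                break
        r = A @ x - b
        if np.all(r <= 0):                          # line 6: coupling constraints satisfied
            return x, lam, rho
        lam = np.maximum(lam + alpha * r, 0.0)      # (14a)
        rho = rho + 0.5 * alpha * np.sum(np.maximum(r, 0.0) ** 2)  # (14b)
    return x, lam, rho
```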

Subsequently, we provide more details about these two updates (16a) and (16b).

3.1.1 Proximal linear update of BCD

Before considering the proximal linear subproblem (16b), we first give a definition of a linear operator, which is essential in the BCD method.

Definition 3.1

The linear operator associated with a vector $v\in\mathbb{R}^{n}$ is defined by

\mathcal{T}_{\Omega}(v)=\arg\min_{\textbf{u}\in\Omega}~v^{\top}\textbf{u},

where $\Omega$ is a nonempty and closed set.
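The operator $\mathcal{T}_{\mathcal{X}_{j}}$ amounts to a linear integer program over the block set $\mathcal{X}_{j}$. The brute-force sketch below (illustrative only, sensible for very small $n_{j}$; a MILP solver would be used in practice) makes this concrete:

```python
import itertools
import numpy as np

def linear_operator(v, B_j, d_j):
    """T_{X_j}(v): minimize v' u over u in X_j = {u in {0,1}^{n_j} : B_j u <= d_j}."""
    n_j = len(v)
    best_u, best_val = None, np.inf
    for bits in itertools.product((0, 1), repeat=n_j):   # enumerate binary vectors
        u = np.array(bits, dtype=float)
        if np.all(B_j @ u <= d_j) and v @ u < best_val:
            best_u, best_val = u, v @ u
    return best_u
```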

Benefiting from the fact that $\textbf{x}$ is a binary variable, we have

\|\textbf{x}_{j}\|^{2}=\textbf{1}^{\top}\textbf{x}_{j},\quad\forall j\in\mathbb{N}_{p}, \qquad (17)

so the objective function in (16b) is in fact linear in $\textbf{x}_{j}$. Therefore, we can rewrite (16b) as

\textbf{x}_{j}^{t+1}\in\arg\min_{\textbf{x}_{j}\in\mathcal{X}_{j}}\Big\{\textbf{x}_{j}^{\top}g_{j}(\textbf{x}^{t})+\frac{1}{2\tau}\textbf{1}^{\top}\textbf{x}_{j}-\frac{1}{\tau}\textbf{x}_{j}^{\top}\textbf{x}_{j}^{t}\Big\}=\mathcal{T}_{\mathcal{X}_{j}}\Big(\tau g_{j}(\textbf{x}^{t})+\frac{\textbf{1}}{2}-\textbf{x}_{j}^{t}\Big). \qquad (18)

Due to the discrete nature of the feasible set $\mathcal{X}_{j}$, it is crucial to choose the step size $\tau$ appropriately. If $\tau$ is too large, the BCD method may not converge; if $\tau$ is too small, the BCD method may get stuck at some points. The reason will be explained in the convergence analysis.

3.1.2 Classical update of BCD

We start by giving the following assumption on model (1), which is very common in many applications such as train timetabling, vehicle routing, and allocation problems.

Assumption 3.1

The entries of the matrix $\boldsymbol{A}$ are either $1$ or $0$. The vector $\boldsymbol{b}$ equals $\textbf{1}$.

Under this assumption, we consider the subproblem (16a). For each $j\in\mathbb{N}_{p}$, if the condition $\boldsymbol{A}_{j}\textbf{x}^{t+1}_{j}\leq\textbf{1}$ always holds at each iteration of the BCD method, then

\textbf{x}^{t+1}_{j}\in\arg\min_{\textbf{x}_{j}\in\mathcal{X}_{j}}~L^{t}_{j}(\textbf{x}_{j},\lambda,\rho)=\mathcal{T}_{\mathcal{X}_{j}}\Big(\boldsymbol{c}_{j}+\boldsymbol{A}_{j}^{\top}\lambda+\rho\boldsymbol{A}_{j}^{\top}\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}}{2}\Big)_{+}\Big). \qquad (19)

To prove the above derivation, we introduce the following two index sets at the $t$-th update:

\mathcal{I}:=\Big\{i\in\mathbb{N}_{m}:\sum_{l\neq j}^{p}\boldsymbol{A}_{i,l}\textbf{x}_{l}^{t}(j)=0\Big\},\qquad \bar{\mathcal{I}}:=\Big\{i\in\mathbb{N}_{m}:\sum_{l\neq j}^{p}\boldsymbol{A}_{i,l}\textbf{x}_{l}^{t}(j)\geq 1\Big\}, \qquad (22)

where $\textbf{x}^{t}_{l}(j)=\textbf{x}^{t+1}_{l}$ if $l<j$ and $\textbf{x}^{t}_{l}(j)=\textbf{x}^{t}_{l}$ if $l>j$. Obviously, $\mathcal{I}\cap\bar{\mathcal{I}}=\emptyset$ and $\mathcal{I}\cup\bar{\mathcal{I}}=\mathbb{N}_{m}$. Therefore, we have

\sum_{l\neq j}^{p}\boldsymbol{A}_{\bar{\mathcal{I}},l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}_{\bar{\mathcal{I}}}}{2}=\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{\bar{\mathcal{I}},l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}_{\bar{\mathcal{I}}}}{2}\Big)_{+}, \qquad (23)
\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{\mathcal{I},l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}_{\mathcal{I}}}{2}\Big)_{+}=\textbf{0}, \qquad (24)

and

\Big(\boldsymbol{A}_{\mathcal{I},j}\textbf{x}_{j}+\sum_{l\neq j}^{p}\boldsymbol{A}_{\mathcal{I},l}\textbf{x}^{t}_{l}(j)-\textbf{1}_{\mathcal{I}}\Big)_{+}=\big(\boldsymbol{A}_{\mathcal{I},j}\textbf{x}_{j}-\textbf{1}_{\mathcal{I}}\big)_{+}=\textbf{0},\qquad \boldsymbol{A}_{\bar{\mathcal{I}},j}\textbf{x}_{j}+\sum_{l\neq j}^{p}\boldsymbol{A}_{\bar{\mathcal{I}},l}\textbf{x}^{t}_{l}(j)-\textbf{1}_{\bar{\mathcal{I}}}\geq\textbf{0}. \qquad (27)

Since each element of $\boldsymbol{A}_{j}\textbf{x}_{j}$ is either $1$ or $0$, we have

\|\boldsymbol{A}_{j}\textbf{x}_{j}\|^{2}=\textbf{1}^{\top}(\boldsymbol{A}_{j}\textbf{x}_{j}),\quad\forall~j\in\mathbb{N}_{p}. \qquad (28)

Denote $C=\big\|\sum_{l\neq j}^{p}\boldsymbol{A}_{\bar{\mathcal{I}},l}\textbf{x}^{t}_{l}(j)-\textbf{1}_{\bar{\mathcal{I}}}\big\|^{2}$. Then

\begin{aligned}
&\Big\|\Big(\boldsymbol{A}_{j}\textbf{x}_{j}+\sum_{l\neq j}^{p}\boldsymbol{A}_{l}\textbf{x}^{t}_{l}(j)-\textbf{1}\Big)_{+}\Big\|^{2}\\
\overset{(27)}{=}~&\|\boldsymbol{A}_{\bar{\mathcal{I}},j}\textbf{x}_{j}\|^{2}+2(\boldsymbol{A}_{\bar{\mathcal{I}},j}\textbf{x}_{j})^{\top}\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{\bar{\mathcal{I}},l}\textbf{x}^{t}_{l}(j)-\textbf{1}_{\bar{\mathcal{I}}}\Big)+C\\
\overset{(28)}{=}~&2(\boldsymbol{A}_{\bar{\mathcal{I}},j}\textbf{x}_{j})^{\top}\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{\bar{\mathcal{I}},l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}_{\bar{\mathcal{I}}}}{2}\Big)+C\\
\overset{(23)}{=}~&2(\boldsymbol{A}_{\bar{\mathcal{I}},j}\textbf{x}_{j})^{\top}\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{\bar{\mathcal{I}},l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}_{\bar{\mathcal{I}}}}{2}\Big)_{+}+C\\
\overset{(24)}{=}~&2(\boldsymbol{A}_{\bar{\mathcal{I}},j}\textbf{x}_{j})^{\top}\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{\bar{\mathcal{I}},l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}_{\bar{\mathcal{I}}}}{2}\Big)_{+}+2(\boldsymbol{A}_{\mathcal{I},j}\textbf{x}_{j})^{\top}\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{\mathcal{I},l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}_{\mathcal{I}}}{2}\Big)_{+}+C\\
=~&2(\boldsymbol{A}_{j}\textbf{x}_{j})^{\top}\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}}{2}\Big)_{+}+C.
\end{aligned} \qquad (31)

Then the iterative scheme for x-update is given by

\begin{aligned}
\textbf{x}^{t+1}_{j}&=\arg\min_{\textbf{x}_{j}\in\mathcal{X}_{j}}~L^{t}_{j}(\textbf{x}_{j},\lambda,\rho)\\
&=\arg\min_{\textbf{x}_{j}\in\mathcal{X}_{j}}\Big\{\boldsymbol{c}_{j}^{\top}\textbf{x}_{j}+\lambda^{\top}(\boldsymbol{A}_{j}\textbf{x}_{j}-\textbf{1})+\frac{\rho}{2}\Big\|\Big(\boldsymbol{A}_{j}\textbf{x}_{j}+\sum_{l\neq j}^{p}\boldsymbol{A}_{l}\textbf{x}^{t}_{l}(j)-\textbf{1}\Big)_{+}\Big\|^{2}\Big\}\\
&\overset{(31)}{=}\arg\min_{\textbf{x}_{j}\in\mathcal{X}_{j}}\Big\{\boldsymbol{c}_{j}^{\top}\textbf{x}_{j}+\lambda^{\top}(\boldsymbol{A}_{j}\textbf{x}_{j}-\textbf{1})+\rho(\boldsymbol{A}_{j}\textbf{x}_{j})^{\top}\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}}{2}\Big)_{+}\Big\}\\
&=\mathcal{T}_{\mathcal{X}_{j}}\Big(\boldsymbol{c}_{j}+\boldsymbol{A}_{j}^{\top}\lambda+\rho\boldsymbol{A}_{j}^{\top}\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}}{2}\Big)_{+}\Big).
\end{aligned}

We can observe that the condition $\boldsymbol{A}\textbf{x}\leq\textbf{1}$ allows the minimization of the augmented Lagrangian function (3) to be decomposed into a set of subproblems with linear objective functions, which makes (16a) easier to solve. Therefore, the key point in the derivation of the linearization process is the fact that the entries of $\boldsymbol{A}_{j}\textbf{x}_{j}$ are either $1$ or $0$. We next verify that this condition always holds at each iteration of the BCD method under the following assumption.

Assumption 3.2

Denote by $T_{\boldsymbol{A}_{j}}$ the set of indices of columns of the matrix $\boldsymbol{A}_{j}$ containing only zeros, and by $c_{j_{s}}$ the $s$-th element of $\boldsymbol{c}_{j}$. At least one of the following holds.

  • (i)

For all $j\in\mathbb{N}_{p}$, there exists $\textbf{x}_{j}\in\mathcal{X}_{j}$ such that $\boldsymbol{A}_{j}\textbf{x}_{j}\leq\textbf{1}$ and $\{s\in\mathbb{N}_{n_{j}}:c_{j_{s}}\neq 0\}\subseteq T_{\boldsymbol{A}_{j}}$.

  • (ii)

For all $j\in\mathbb{N}_{p}$, there is at most one nonzero element in the vector $\boldsymbol{c}_{j}$ and $\{\textbf{x}_{j}\in\mathbb{R}^{n_{j}}:\boldsymbol{A}_{j}\textbf{x}_{j}\leq\textbf{1}\}\subseteq\{\textbf{x}_{j}\in\mathbb{R}^{n_{j}}:\textbf{x}_{j}\in\mathcal{X}_{j}\}$.

Remark: (a) Assumption 3.2 (i) requires that if an element $\textbf{x}_{j_{s}}$ satisfies $\textbf{x}_{j_{s}}=1$, then at least one of $c_{j_{s}}$ and the $s$-th column of $\boldsymbol{A}_{j}$ is zero. Assumption 3.2 (ii) requires that if the block constraint set $\mathcal{X}_{j}$ is large enough that $\boldsymbol{A}_{j}\textbf{x}_{j}\leq\textbf{1}$ already implies $\textbf{x}_{j}\in\mathcal{X}_{j}$, then the weight vector $\boldsymbol{c}_{j}$ of each block variable has at most one nonzero element. (b) Assumption 3.2 usually holds when $\boldsymbol{A}$ and $\boldsymbol{c}$ are sparse. This is particularly true in the train timetabling problem with the objective of scheduling more trains, as will be shown in Section 5.
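As a small sanity-check sketch (illustrative only, not from the paper), Assumption 3.1 and the sparsity pattern underlying Assumption 3.2 (i) can be verified numerically for given data:

```python
import numpy as np

def check_assumption_31(A, b):
    """Assumption 3.1: A has 0/1 entries and b is the all-ones vector."""
    return np.isin(A, (0, 1)).all() and np.all(b == 1)

def check_assumption_32i_pattern(A_j, c_j):
    """Pattern required by Assumption 3.2 (i): every index s with c_{j_s} != 0
    corresponds to an all-zero column of A_j (i.e. lies in T_{A_j})."""
    zero_cols = np.where(~A_j.any(axis=0))[0]
    nonzero_c = np.where(c_j != 0)[0]
    return set(nonzero_c).issubset(set(zero_cols))
```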

Lemma 3.1

Suppose Assumption 3.2 holds. From any starting point $\textbf{x}^{0}$, the iterates generated by the BCD method with the classical update satisfy $\boldsymbol{A}_{j}\textbf{x}_{j}^{t+1}\leq\textbf{1}$ for all $j\in\mathbb{N}_{p}$ and $t=0,1,2,\dots$.

Proof:

Let $\textbf{x}_{j}^{t+1}$ be the solution generated by the BCD method after the $t$-th classical update. We argue by contradiction and suppose that there exists $i\in\mathbb{N}_{m}$ such that $\boldsymbol{A}_{i,j}\textbf{x}_{j}^{t+1}>1$. Then for any $\bar{\textbf{x}}_{j}\in\mathcal{X}_{j}$ satisfying $\boldsymbol{A}_{j}\bar{\textbf{x}}_{j}\leq\textbf{1}$, we have

L^{t}_{j}(\textbf{x}_{j}^{t+1},\lambda,\rho)<L^{t}_{j}(\bar{\textbf{x}}_{j},\lambda,\rho),~~\forall j\in\mathbb{N}_{p},

which implies that for any $j\in\mathbb{N}_{p}$,

\Big(\boldsymbol{c}_{j}+\boldsymbol{A}_{j}^{\top}\lambda+\rho\boldsymbol{A}_{j}^{\top}\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}}{2}\Big)_{+}\Big)^{\top}(\textbf{x}_{j}^{t+1}-\bar{\textbf{x}}_{j})<0,

then

\boldsymbol{c}_{j}^{\top}(\textbf{x}_{j}^{t+1}-\bar{\textbf{x}}_{j})+\lambda^{\top}(\boldsymbol{A}_{j}\textbf{x}_{j}^{t+1}-\boldsymbol{A}_{j}\bar{\textbf{x}}_{j})+\rho\Big(\sum_{l\neq j}^{p}\boldsymbol{A}_{l}\textbf{x}^{t}_{l}(j)-\frac{\textbf{1}}{2}\Big)_{+}^{\top}(\boldsymbol{A}_{j}\textbf{x}_{j}^{t+1}-\boldsymbol{A}_{j}\bar{\textbf{x}}_{j})<0. \qquad (35)

We consider the following two cases corresponding to Assumption 3.2:

Case 1: Assumption 3.2 (i) implies $c_{j_{s}}=0$ for any $s\in S_{j}:=\{s\in\mathbb{N}_{n_{j}}:\textbf{x}_{j_{s}}^{t+1}=1,\boldsymbol{A}_{i,j_{s}}=1\}$. Taking $\bar{\textbf{x}}_{j}=(\textbf{x}_{j_{1}}^{t+1},\dots,\textbf{x}_{j_{s}}^{t+1}-1,\dots,\textbf{x}_{j_{n_{j}}}^{t+1})$ with $s\in S_{j}$ such that $\boldsymbol{A}_{j}\bar{\textbf{x}}_{j}\leq\textbf{1}$ gives us

\boldsymbol{c}_{j}^{\top}(\textbf{x}_{j}^{t+1}-\bar{\textbf{x}}_{j})=\sum\nolimits_{s}c_{j_{s}}=0~~\text{and}~~\boldsymbol{A}_{i,j}\textbf{x}_{j}^{t+1}-\boldsymbol{A}_{i,j}\bar{\textbf{x}}_{j}\geq 1. \qquad (36)

This contradicts (35) since $\lambda\geq 0$ and $\rho>0$.

Case 2: Assumption 3.2 (ii) implies that there exists

\bar{\textbf{x}}_{j}=(\textbf{x}_{j_{1}}^{t+1},\dots,\bar{\textbf{x}}_{j_{s^{\prime}}},\dots,\textbf{x}_{j_{s}}^{t+1}-1,\dots,\textbf{x}_{j_{n_{j}}}^{t+1})~~\text{with}~~\bar{\textbf{x}}_{j_{s^{\prime}}}=1 \qquad (37)

for any $s\in S_{j}$ and $s^{\prime}\neq s$ such that $\boldsymbol{A}_{j}\bar{\textbf{x}}_{j}\leq\textbf{1}$. If $c_{j_{s^{\prime}}}\geq 0$, we can apply the same procedure as in Case 1 and arrive at a contradiction. If $c_{j_{s^{\prime}}}<0$, we take $\bar{\textbf{x}}_{j}$ as in (37); then $\textbf{x}_{j_{s^{\prime}}}^{t+1}-\bar{\textbf{x}}_{j_{s^{\prime}}}\leq 0$. Hence,

\boldsymbol{c}_{j}^{\top}(\textbf{x}_{j}^{t+1}-\bar{\textbf{x}}_{j})=c_{j_{s^{\prime}}}(\textbf{x}_{j_{s^{\prime}}}^{t+1}-\bar{\textbf{x}}_{j_{s^{\prime}}})\geq 0,\qquad \boldsymbol{A}_{i,j}\textbf{x}_{j}^{t+1}-\boldsymbol{A}_{i,j}\bar{\textbf{x}}_{j}\geq 1,

which is a contradiction; this completes the proof. ∎

Overall, we can reformulate the subproblem (16a) into (19) based on the special structure of the model (1) under certain conditions, and thus make it easier to solve. It is worth noting that if Assumption 3.2 is not satisfied, we can consider (19) as a linear approximation of (16a). This linearization technique is widely used, including in works [20] and [21]. However, they lacked theoretical guarantees. Our main result shows that under specified assumptions, using the update (19) in the ALM-C method is equivalent to solving (16a) exactly, theoretically ensuring the effectiveness of our method.

3.2 Finding a good feasible solution by set packing

Since the BCD method may not always yield a feasible solution [35], we can adopt several refinement strategies that are very useful for finding a feasible solution to the problem (1) in practice. Very often, (1) represents a problem where we optimize under limited resources, which provides a set packing point of view. Since we produce candidate solutions all along, a natural idea is to utilize past iterates to construct a feasible solution. A simple strategy is to construct a solution pool for each $j$ that includes the past BCD iterates $V^{k}_{j}=\{\textbf{x}^{1}_{j},\textbf{x}^{2}_{j},\dots,\textbf{x}^{k}_{j}\}$, $j=1,\dots,p$. Intuitively, the solution pool may include feasible or “nice” solutions in some sense.

To illustrate the above approach, we first introduce the iterative sequential technique.

3.2.1 A sweeping technique

The simplest technique is probably to select a subset of blocks, one by one, until infeasibility is detected. We present this sequential method in Algorithm 2, which can be understood as simply sweeping the blocks and selecting a candidate solution from $V_{j}^{k}$ if feasibility is still guaranteed; otherwise we skip the current block and continue.

Input: The set of past BCD solutions $V^{k}_{j}$, $j=1,\dots,p$.
Output: A feasible solution $\hat{\textbf{x}}^{k}$
1 while the termination criterion is not satisfied do
2   for $j=1,2,\dots,p$ do
3     Select $\textbf{v}_{j}\in V_{j}^{k}$;
4     if $\boldsymbol{A}_{j}\textbf{v}_{j}+\sum_{l=1}^{j-1}\boldsymbol{A}_{l}\hat{\textbf{x}}^{k}_{l}-\boldsymbol{b}\leq 0$ then
5       let $\hat{\textbf{x}}_{j}^{k}=\textbf{v}_{j}$;
6     else
7       let $\hat{\textbf{x}}_{j}^{k}=0$;
Algorithm 2 A sweeping technique
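A minimal sketch of this sweeping idea is given below; it assumes the candidates for each block are simply tried in their stored order, which is one possible instantiation of the selection rule in line 3:

```python
import numpy as np

def sweep_feasible(V, A_blocks, b):
    """Sketch of Algorithm 2: keep one candidate per block while the partial
    sum of coupling constraints stays feasible; V[j] is a list of past BCD
    iterates for block j."""
    p = len(V)
    x_hat = [np.zeros_like(V[j][0]) for j in range(p)]
    load = np.zeros_like(b, dtype=float)          # running value of sum_l A_l x_hat_l
    for j in range(p):
        for v in V[j]:                            # try stored candidates for block j
            if np.all(load + A_blocks[j] @ v - b <= 0):
                x_hat[j] = v
                load = load + A_blocks[j] @ v
                break                             # keep the first feasible candidate
    return x_hat
```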

3.2.2 A packing technique

Let us further explore the idea of selecting solutions in a systematic way. Formally, for each block $j\in\mathbb{N}_{p}$, we introduce a set of binary variables $\mu_{j}$ that defines the current selection:

\hat{\textbf{x}}_{j}^{k}=X_{j}^{k}\mu_{j},~~\mu_{j}\in\{0,1\}^{k},

where $X_{j}^{k}=[\textbf{x}^{1}_{j},\textbf{x}^{2}_{j},\dots,\textbf{x}^{k}_{j}]$ collects the candidates as columns. We consider the following problem:

\min_{\mu_{j}\in\{0,1\}^{k}}~ \sum_{j=1}^{p}\boldsymbol{c}_{j}^{\top}X_{j}^{k}\mu_{j} \qquad (38a)
\text{s.t.}~~ \mu_{j}^{\top}\textbf{1}\leq 1,~\forall j=1,\dots,p, \qquad (38b)
~~~~~~~ \sum_{j=1}^{p}\boldsymbol{A}_{j}X_{j}^{k}\mu_{j}\leq\boldsymbol{b}. \qquad (38c)

In view of (38b), one may recognize the above problem as a restricted master problem appearing in column generation algorithms, where (38c) stands for a set of knapsack constraints.

Specifically, the coupling constraints (38c) in our model are cliques, representing complete subgraphs where each pair of distinct nodes is connected. This allows us to reformulate the model as a maximum independent set problem. Therefore, we can find a feasible solution $\textbf{x}^{*}$ that satisfies $\boldsymbol{A}\textbf{x}^{*}\leq\boldsymbol{b}$ by solving the problem (38). Since this problem is still hard, we only solve a relaxation via a maximal independent set. Now we go into details. Since the knapsack constraints (38c) mean that the candidate solutions may conflict, we can construct a conflict graph $F=(V,E)$. In this graph, $V$ represents the set of nodes, where each node corresponds to a candidate solution generated as the algorithm proceeds, and $E$ is the set of edges connecting two conflicting solutions, i.e., solutions that jointly violate the coupling constraints. In this view, we only have to maintain the graph $F$ and find a maximal independent set $\mathcal{K}$. The output feasible point $\textbf{x}_{j}^{*}$ corresponds to $\textbf{v}_{j}\in\mathcal{K}$ for each $j\in\mathbb{N}_{p}$; if $\textbf{v}_{j}\notin\mathcal{K}$, then $\textbf{x}_{j}^{*}=0$. We summarize the maximal independent set technique in Algorithm 3. We also note that Algorithm 2 can be seen as a special case of Algorithm 3.

Input: The BCD solution $\textbf{x}^{k}$, last conflict graph $F^{k-1}=(V^{k-1},E^{k-1})$
Output: A feasible solution $\textbf{x}_{*}^{k}$
1 Step 1: Conflict graph update
2 for $j=1,2,\dots,p$ do
3   Collect new candidate paths $\tilde{V}^{k}_{j}$ for block $j$;
4   (1.1 node-update) $V^{k}\leftarrow V^{k-1}+\tilde{V}^{k}_{j}$;
5   (1.2 self-check) $E^{k}\leftarrow E^{k-1}+\{(p,p^{\prime})\mid\forall p\neq p^{\prime},~p\in V^{k}_{j},~p^{\prime}\in V^{k}_{j}\}$;
6   (1.3 edge-completion) $E^{k}\leftarrow E^{k}+\{(p,p^{\prime})\mid p,p^{\prime}~\text{are not compatible},~\forall p\in\tilde{V}^{k}_{j},~p^{\prime}\in V^{k}\backslash V^{k}_{j}\}$, where ‘compatible’ means that the two nodes (block variables) jointly satisfy the binding constraints;
7 Step 2: Maximal independent set (MIS) for a feasible solution
8 Select a candidate solution set $K\subset V^{k}$;
9 for $v\in K$ do
10   Compute a maximal independent set $\mathcal{K}(v)$ with respect to $v$ in $O(|E^{k}|)$ iterations;
11 Compute $\textbf{x}^{*}=\arg\max_{v}\mathcal{K}(v)$.
Algorithm 3 A packing (maximal independent set) technique
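A minimal sketch of the conflict-graph idea is given below. It is illustrative only: it assumes the coupling rows behave like clique constraints (Assumption 3.1 with $\boldsymbol{b}=\textbf{1}$), so that pairwise compatibility checks suffice, and it uses a simple greedy maximal independent set rather than the exact routine of Algorithm 3:

```python
import numpy as np

def make_conflicts(A_blocks, b):
    """Two candidates conflict if they belong to the same block or if
    selecting both would violate the coupling constraints pairwise."""
    def conflicts(u, v):
        (j, xu), (l, xv) = u, v
        if j == l:
            return True
        return np.any(A_blocks[j] @ xu + A_blocks[l] @ xv - b > 0)
    return conflicts

def greedy_maximal_independent_set(nodes, conflicts):
    """Greedy maximal independent set on the conflict graph;
    nodes are (block_index, candidate_vector) pairs."""
    selected = []
    for u in nodes:
        if all(not conflicts(u, v) for v in selected):
            selected.append(u)
    return selected
```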

Based on the above-mentioned techniques, we improve the ALM and propose a customized ALM in Algorithm 4.

Input: $\textbf{x}^{0},\lambda^{0},\rho^{0}>0$ and the best objective function value $f^{*}=+\infty$. Set $k=0$.
Output: A local (global) optimal solution $\textbf{x}^{*}$.
1 while the termination criterion is not satisfied do
2   Step 2: Construct a solution using Alg. 1
3   Update the BCD solution $\textbf{x}^{k+1}$ by the procedure in lines 2-5 of Alg. 1;
4   Step 3: Generate a feasible solution
5   if the BCD solution is not feasible then
6     transform the BCD solution into a feasible solution $\bar{\textbf{x}}^{k+1}$ by calling a refinement method in Alg. 2 or 3;
7   else
8     let $\bar{\textbf{x}}^{k+1}=\textbf{x}^{k+1}$;
9   Step 4: Update the best solution
10  if $\boldsymbol{c}^{\top}\bar{\textbf{x}}^{k+1}\leq f^{*}$ then
11    Set $f^{*}=\boldsymbol{c}^{\top}\bar{\textbf{x}}^{k+1}$ and $\textbf{x}^{*}=\bar{\textbf{x}}^{k+1}$;
12  Update the Lagrangian multiplier $\lambda^{k+1}$ by (14a) and the penalty coefficient $\rho^{k+1}$ by (14b). Let $k=k+1$.
Algorithm 4 A customized ALM

Although Algorithms 2 and 3 can help us find a feasible solution to problem (1), the quality of their output remains unjustified. To evaluate and improve the quality of the solution, a simple way is to estimate upper and lower bounds on the objective function value and then calculate the gap between them: the smaller the gap, the better the current solution. Obviously, our method provides an upper bound on the objective function value. For the lower bound, we can use the following methods. (i) LP relaxation: by relaxing the binary variables to continuous variables in $[0,1]$, we solve a linear programming problem exactly to obtain a lower bound on the objective function value. (ii) LR relaxation: since the Lagrangian dual problem (2) is separable, we can solve the decomposed subproblems exactly to obtain a lower bound on the objective function value.

In general, the Lagrangian dual bound is at least as tight as the bound obtained from the usual linear programming relaxation [14]. Note that the relaxation must be solved to optimality to yield a valid bound. Therefore, we can use a combination of the LR method and Alg. 4. To be specific, LR aims at generating the lower bound of the objective function value in (1); its solution is usually infeasible. Steps 2 and 3 in Alg. 4 are used to generate feasible solutions of (1). The smaller the gap between the lower bound and the upper bound, the closer the feasible solution is to the global optimal solution of the problem. It can be seen from the iterative procedure that, after many iterations, this algorithm generates many feasible solutions and the lower bound of the model keeps improving; finally, we select the best solution. The combination of these two methods takes advantage of both the ALM and the LR method: it not only finds a good feasible solution but also evaluates the quality of the solution.
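A minimal sketch of the LP-relaxation lower bound and the resulting optimality gap, assuming SciPy's linprog with the HiGHS backend and an incumbent feasible solution x_feasible produced by Alg. 4, is given below:

```python
import numpy as np
from scipy.optimize import linprog

def lp_lower_bound_and_gap(c, A, b, B, d, x_feasible):
    """Lower bound via LP relaxation (binary vars relaxed to [0, 1]) and the
    relative gap against the incumbent feasible solution."""
    res = linprog(c,
                  A_ub=np.vstack([A, B]),
                  b_ub=np.concatenate([b, d]),
                  bounds=[(0, 1)] * len(c),
                  method="highs")
    lower = res.fun                      # f^LP <= f^IP
    upper = c @ x_feasible               # objective value of the incumbent
    gap = (upper - lower) / max(1.0, abs(upper))
    return lower, upper, gap
```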

Compared with the ADMM-based method in [21], we utilize the BCD method to perform multiple iterations when solving the subproblem (15), until the solutions remain unchanged; this improves the solution accuracy of the subproblem and thereby reduces the total number of iterations. Moreover, we adopt different refinement techniques to further enhance the solution quality.

4 Convergence Analysis

In this section, we present the convergence analysis of the block coordinate descent method for solving the augmented Lagrangian relaxation problem (4) and the augmented Lagrangian method for solving the dual problem (5). Unless otherwise stated, the convergence results presented in this section do not rely on Assumptions 3.1 and 3.2.

4.1 Convergence of BCD

We begin this section with a property of the augmented Lagrangian function (3) that is fundamental in the convergence analysis.

Proposition 4.1

The gradient of $L(\textbf{x},\lambda,\rho)$ with respect to $\textbf{x}$ is Lipschitz continuous with constant $\kappa$ on $\mathcal{X}$, namely,

\|\nabla L(\textbf{x},\lambda,\rho)-\nabla L(\bar{\textbf{x}},\lambda,\rho)\|\leq\kappa\|\textbf{x}-\bar{\textbf{x}}\|

for all $\textbf{x},\bar{\textbf{x}}\in\mathcal{X}$, where $\kappa=\rho\|\boldsymbol{A}\|_{2}^{2}$. Furthermore,

L(\bar{\textbf{x}},\lambda,\rho)\leq L(\textbf{x},\lambda,\rho)+\langle\bar{\textbf{x}}-\textbf{x},\nabla L(\textbf{x},\lambda,\rho)\rangle+\frac{\kappa}{2}\|\bar{\textbf{x}}-\textbf{x}\|^{2} \qquad (39)

for all $\textbf{x},\bar{\textbf{x}}\in\mathcal{X}$.

Proof:

For any $\textbf{x},\bar{\textbf{x}}\in\mathcal{X}$,

\|\nabla L(\textbf{x},\lambda,\rho)-\nabla L(\bar{\textbf{x}},\lambda,\rho)\|=\rho\big\|\boldsymbol{A}^{\top}(\boldsymbol{A}\textbf{x}-\boldsymbol{b})_{+}-\boldsymbol{A}^{\top}(\boldsymbol{A}\bar{\textbf{x}}-\boldsymbol{b})_{+}\big\|\leq\rho\|\boldsymbol{A}\|_{2}\big\|(\boldsymbol{A}\textbf{x}-\boldsymbol{b})_{+}-(\boldsymbol{A}\bar{\textbf{x}}-\boldsymbol{b})_{+}\big\|\leq\rho\|\boldsymbol{A}\|_{2}\|\boldsymbol{A}\textbf{x}-\boldsymbol{A}\bar{\textbf{x}}\|\leq\rho\|\boldsymbol{A}\|^{2}_{2}\|\textbf{x}-\bar{\textbf{x}}\|.

Then we have

\langle\nabla L(\textbf{x},\lambda,\rho)-\nabla L(\bar{\textbf{x}},\lambda,\rho),\textbf{x}-\bar{\textbf{x}}\rangle\leq\|\nabla L(\textbf{x},\lambda,\rho)-\nabla L(\bar{\textbf{x}},\lambda,\rho)\|\,\|\textbf{x}-\bar{\textbf{x}}\|\leq\kappa\|\textbf{x}-\bar{\textbf{x}}\|^{2}.

It follows from the convexity of the function $f(\textbf{x})=\frac{\kappa}{2}\|\textbf{x}\|^{2}-L(\textbf{x},\lambda,\rho)$ that

f(\bar{\textbf{x}})\geq f(\textbf{x})+\nabla f(\textbf{x})^{\top}(\bar{\textbf{x}}-\textbf{x}),

which implies the estimate (39). ∎

Considering the two different updates in the BCD method, we first summarize the convergence property of the BCD method using (16a), which has been presented in [36]. Before that, we give the definition of a blockwise optimal solution, which is also called a coordinatewise minimum point in [37]. Given parameters $\lambda$ and $\rho$, a feasible solution $\textbf{x}^{*}$ is called a blockwise optimal solution of the problem (4) if for each $j\in\mathbb{N}_{p}$ and all $\textbf{x}=(\textbf{x}_{1}^{*},\dots,\textbf{x}_{j-1}^{*},\textbf{x}_{j},\textbf{x}_{j+1}^{*},\dots,\textbf{x}_{p}^{*})\in\mathcal{X}$, we have $L(\textbf{x}^{*},\lambda,\rho)\leq L(\textbf{x},\lambda,\rho)$.

Lemma 4.1

Suppose Assumptions 3.1 and 3.2 hold. If the starting point satisfies x0𝒳\textbf{\rm{x}}^{0}\in\mathcal{X}, then the BCD method for solving (16a) is always executable and terminates after a finite number of iterations with a blockwise optimal solution of the problem (4).

Proof:

If Assumptions 3.1 and 3.2 hold, Lemma 3.1 tells us that the subproblem (16a) can be solved exactly. (i) Since the constraint set of each subproblem (16a) is bounded, every subproblem has an optimal solution; therefore, the BCD method for solving (16a) is executable. (ii) By (i), the sequence {x^t} generated by the BCD method is well defined. Since 𝒳 contains finitely many points, the sequence of objective function values {L(x^t, λ, ρ)} can take only finitely many different values. Together with the monotone decrease of L(x^t, λ, ρ) along the iterations, this implies that L(x^t, λ, ρ) must become constant after finitely many iterations, so the BCD method terminates finitely. (iii) By the definition of a blockwise optimal solution and the fact that the subproblem (16a) is solved exactly, the returned point is blockwise optimal, which completes the proof. ∎∎

A similar conclusion can be found in [36], but it does not clarify how to solve the subproblem. Note that each (global) optimal solution of the model (4) is blockwise optimal, but not vice versa. Subsequently, we analyze the convergence of the update (16b). The following lemma ensures that the function value of L decreases after each iteration if x^{t+1} ≠ x^t and the step size is chosen properly.

Lemma 4.2

Let {x^t}_{t∈ℕ} be a sequence generated by (16b). Then we obtain

(12τκ2)xt+1xt2L(xt,λ,ρ)L(xt+1,λ,ρ).(\frac{1}{2\tau}-\frac{\kappa}{2})\|\textbf{\rm{x}}^{t+1}-\textbf{\rm{x}}^{t}\|^{2}\leq L(\textbf{\rm{x}}^{t},\lambda,\rho)-L(\textbf{\rm{x}}^{t+1},\lambda,\rho). (40)
Proof:

Since

\textbf{x}_{j}^{t+1}=\mathcal{T}_{\mathcal{X}_{j}}\left(\tau g_{j}(\textbf{x}^{t})+\frac{\textbf{1}}{2}-\textbf{x}_{j}^{t}\right)

and xjt𝒳j\textbf{x}_{j}^{t}\in\mathcal{X}_{j} for all j{1,2,,p}j\in\{1,2,...,p\}, we have

(τgj(xt)+12xjt)xjt+1(τgj(xt)+12xjt)xjt,\left(\tau g_{j}(\textbf{x}^{t})+\frac{\textbf{1}}{2}-\textbf{x}_{j}^{t}\right)^{\!\!\top}\!\textbf{x}_{j}^{t+1}\leq\left(\tau g_{j}(\textbf{x}^{t})+\frac{\textbf{1}}{2}-\textbf{x}_{j}^{t}\right)^{\!\!\top}\!\textbf{x}_{j}^{t},

which implies that

2τxjt+1xjt,gj(xt)+xjt+1xjt20.2\tau\langle\textbf{x}_{j}^{t+1}-\textbf{x}_{j}^{t},g_{j}(\textbf{x}^{t})\rangle+\|\textbf{x}_{j}^{t+1}-\textbf{x}_{j}^{t}\|^{2}\leq 0. (41)

Proposition 4.1 tells us that

L(xt+1,λ,ρ)L(xt,λ,ρ)\displaystyle L(\textbf{x}^{t+1},\lambda,\rho)-L(\textbf{x}^{t},\lambda,\rho)
\displaystyle\leq j=1pxjt+1xjt,gj(xt)+κ2j=1pxjt+1xjt2.\displaystyle\sum\nolimits_{j=1}^{p}\langle\textbf{x}_{j}^{t+1}-\textbf{x}_{j}^{t},g_{j}(\textbf{x}^{t})\rangle+\frac{\kappa}{2}\sum\nolimits_{j=1}^{p}\|\textbf{x}_{j}^{t+1}-\textbf{x}_{j}^{t}\|^{2}. (42)

Combining inequalities (41) and (42) yields (40). The proof is completed. ∎∎
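To make the update (16b) concrete, the sketch below performs one BCD pass. It assumes that 𝒯_{𝒳_j}(·) returns a minimizer of the corresponding linear function over 𝒳_j and, for illustration only, takes 𝒳_j = {0,1}^{n_j} without the additional constraints 𝑩_j x_j ≤ 𝒅_j; in the constrained case the linear minimization would instead require a small integer program or a combinatorial solver.

```python
# One proximal-linear BCD pass of the form (16b) for the unconstrained binary case.
import numpy as np

def linear_minimizer_box(coeff):
    """argmin of coeff' x over {0,1}^n: set a coordinate to 1 iff its coefficient is negative."""
    return (coeff < 0).astype(float)

def bcd_step(x, grad, tau, blocks):
    """Update every block j by x_j <- T_{X_j}(tau * g_j(x) + 1/2 - x_j)."""
    x_new = x.copy()
    for idx in blocks:                       # idx: index array of block j
        coeff = tau * grad[idx] + 0.5 - x[idx]
        x_new[idx] = linear_minimizer_box(coeff)
    return x_new

# Toy usage: two blocks of sizes 3 and 2, a made-up gradient and step size.
blocks = [np.arange(0, 3), np.arange(3, 5)]
x = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
grad = np.array([0.4, -2.0, 1.5, -0.3, 0.1])
print(bcd_step(x, grad, tau=0.2, blocks=blocks))
```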

This lemma tells us that a small step size satisfying τ<1/κ\tau<1/\kappa leads to a decrease in the function value LL when xt+1xt\textbf{\rm{x}}^{t+1}\neq\textbf{\rm{x}}^{t}. However, the following lemma states that when the step size is too small, the iteration returns the same result. We let g(x)g(\textbf{x}) denote the gradient of L(x,λ,ρ)L(\textbf{x},\lambda,\rho) at x.

Lemma 4.3

If the step size satisfies 0<τ<12g(xt)0<\tau<\frac{1}{2\|g(\textbf{\rm{x}}^{t})\|} when g(xt)0g(\textbf{\rm{x}}^{t})\neq\textbf{0}, then it holds that

xt=𝒯𝒳(τg(xt)+𝟙2xt).\textbf{\rm{x}}^{t}=\mathcal{T}_{\mathcal{X}}\left(\tau g(\textbf{\rm{x}}^{t})+\frac{\mathbb{1}}{2}-\textbf{\rm{x}}^{t}\right).
Proof:

If 0 < τ < 1/(2‖g(x^t)‖), then for all x ∈ 𝒳 with x ≠ x^t we have

2\tau g(\textbf{x}^{t})^{\top}(\textbf{x}^{t}-\textbf{x})\leq 2\tau\|g(\textbf{x}^{t})\|\|\textbf{x}^{t}-\textbf{x}\|<\|\textbf{x}^{t}-\textbf{x}\|\leq\|\textbf{x}^{t}-\textbf{x}\|^{2},

where the strict inequality uses 2τ‖g(x^t)‖ < 1 and the last inequality holds because ‖x^t − x‖ ≥ 1 for distinct points x^t, x ∈ {0,1}^n. It yields that

(τg(xt)+12xt)xt(τg(xt)+12xt)x.\left(\tau g(\textbf{x}^{t})+\frac{\textbf{1}}{2}-\textbf{x}^{t}\right)^{\top}\textbf{x}^{t}\leq\left(\tau g(\textbf{x}^{t})+\frac{\textbf{1}}{2}-\textbf{x}^{t}\right)^{\top}\textbf{x}.

By the definition of the operator 𝒯𝒳()\mathcal{T}_{\mathcal{X}}(\cdot), we arrive at the conclusion. ∎∎

Lemma 4.3 shows that the behavior of the BCD method relies heavily on the choice of the step size τ. Hence, we give the definition of a τ-stationary point.

Definition 4.1

For the AL relaxation problem (4), if a point x\textbf{\rm{x}}^{*} satisfies

x=𝒯𝒳(τg(x)+12x)\textbf{\rm{x}}^{*}=\mathcal{T}_{\mathcal{X}}\left(\tau g(\textbf{\rm{x}}^{*})+\frac{\textbf{1}}{2}-\textbf{\rm{x}}^{*}\right)

with τ>0\tau>0, then it is called a τ\tau-stationary point.

For the augmented Lagrangian relaxation problem (4), given parameters λ and ρ and an integer δ > 0, we say that x* is a δ-local minimizer if

L(x,λ,ρ)L(x,λ,ρ),for allx𝒩(x,δ)𝒳.L(\textbf{x},\lambda,\rho)\geq L(\textbf{x}^{*},\lambda,\rho),~{}~{}\text{for~{}all}~{}\textbf{x}\in\mathcal{N}(\textbf{x}^{*},\delta)\cap\mathcal{X}.

Note that an n-local minimizer is a global minimizer due to the fact that ‖x − x*‖ ≤ n for all x, x* ∈ 𝒳. The following important result reveals a relationship between a τ-stationary point and a global minimizer of the problem (4).

Theorem 4.1

We have the following relationships between the τ\tau-stationary point and the local minimizer of the problem (4).

  • (i)

    If x\textbf{\rm{x}}^{*} is a local minimizer, then x\textbf{\rm{x}}^{*} is a τ\tau-stationary point for any step size 0<τ<1/κ0<\tau<1/\kappa.

  • (ii)

    If x\textbf{\rm{x}}^{*} is a τ\tau-stationary point with τ>δ/2\tau>\delta/2, then x\textbf{\rm{x}}^{*} is a δ\delta-local minimizer of the problem (4) under the assumption that the entries in 𝑨,𝒃,𝒄,λ\boldsymbol{A},\boldsymbol{b},\boldsymbol{c},\lambda and ρ\rho are integral.

Proof:

(i) If x\textbf{x}^{*} is a local minimizer, then for any x𝒳\textbf{x}\in\mathcal{X} we have

L(x,λ,ρ)\displaystyle L(\textbf{x}^{*},\lambda,\rho)\leq L(x,λ,ρ)\displaystyle L(\textbf{x},\lambda,\rho)
\displaystyle\leq L(x,λ,ρ)+L(x,λ,ρ),xx+κ2xx2.\displaystyle L(\textbf{x}^{*},\lambda,\rho)+\langle\nabla L(\textbf{x}^{*},\lambda,\rho),\textbf{x}-\textbf{x}^{*}\rangle+\frac{\kappa}{2}\|\textbf{x}-\textbf{x}^{*}\|^{2}.

Therefore,

L(x,λ,ρ),xxκ2xx212τxx2,\langle\nabla L(\textbf{x}^{*},\lambda,\rho),\textbf{x}-\textbf{x}^{*}\rangle\geq-\frac{\kappa}{2}\|\textbf{x}-\textbf{x}^{*}\|^{2}\geq-\frac{1}{2\tau}\|\textbf{x}-\textbf{x}^{*}\|^{2},

which implies that x\textbf{x}^{*} is a τ\tau-stationary point.

(ii) Since x* is a τ-stationary point, for any x ∈ 𝒳 we have

(τg(x)+12x)x(τg(x)+12x)x.\left(\tau g(\textbf{x}^{*})+\frac{\textbf{1}}{2}-\textbf{x}^{*}\right)^{\top}\textbf{x}^{*}\leq\left(\tau g(\textbf{x}^{*})+\frac{\textbf{1}}{2}-\textbf{x}^{*}\right)^{\top}\textbf{x}. (43)

For any x𝒳𝒩(x,δ)\textbf{x}\in\mathcal{X}\cap\mathcal{N}(\textbf{x}^{*},\delta), we define the index sets 𝕁:={ln:xl=0,xl=1}\mathbb{J}:=\{l\in\mathbb{N}_{n}:x^{*}_{l}=0,x_{l}=1\} and 𝕁¯:={ln:xl=1,xl=0}\bar{\mathbb{J}}:=\{l\in\mathbb{N}_{n}:x^{*}_{l}=1,x_{l}=0\}. Then

(τg(x)+12x)(xx)\displaystyle\left(\tau g(\textbf{x}^{*})+\frac{\textbf{1}}{2}-\textbf{x}^{*}\right)^{\top}\left(\textbf{x}-\textbf{x}^{*}\right)
=\displaystyle= l𝕁(τgl(x)+12)l𝕁¯(τgl(x)12)\displaystyle\sum\nolimits_{l\in\mathbb{J}}\left(\tau g_{l}(\textbf{x}^{*})+\frac{1}{2}\right)-\sum\nolimits_{l\in\bar{\mathbb{J}}}\left(\tau g_{l}(\textbf{x}^{*})-\frac{1}{2}\right)
\displaystyle\leq τ(l𝕁gl(x)l𝕁¯gl(x))+δ2,\displaystyle\tau\left(\sum\nolimits_{l\in\mathbb{J}}g_{l}(\textbf{x}^{*})-\sum\nolimits_{l\in\bar{\mathbb{J}}}g_{l}(\textbf{x}^{*})\right)+\frac{\delta}{2},

which together with (43) and τ>δ2\tau>\frac{\delta}{2} yields that

l𝕁gl(x)l𝕁¯gl(x)δ2τ>1.\displaystyle\sum\nolimits_{l\in\mathbb{J}}g_{l}(\textbf{x}^{*})-\sum\nolimits_{l\in\bar{\mathbb{J}}}g_{l}(\textbf{x}^{*})\geq-\frac{\delta}{2\tau}>-1.

Since the entries in 𝑨,𝒃,𝒄,λ\boldsymbol{A},\boldsymbol{b},\boldsymbol{c},\lambda and ρ\rho are integral, then gl(x)g_{l}(\textbf{x}^{*}) is an integer for every lnl\in\mathbb{N}_{n}. This implies that

l𝕁gl(x)l𝕁¯gl(x)0.\displaystyle\sum\nolimits_{l\in\mathbb{J}}g_{l}(\textbf{x}^{*})-\sum\nolimits_{l\in\bar{\mathbb{J}}}g_{l}(\textbf{x}^{*})\geq 0.

Then for any x𝒳𝒩(x,δ)\textbf{x}\in\mathcal{X}\cap\mathcal{N}(\textbf{x}^{*},\delta), it follows from the convexity of LL that

L(x,λ,ρ)L(x,λ,ρ)\displaystyle L(\textbf{x},\lambda,\rho)-L(\textbf{x}^{*},\lambda,\rho)
\displaystyle\geq L(x,λ,ρ),xx\displaystyle\langle\nabla L(\textbf{x}^{*},\lambda,\rho),\textbf{x}-\textbf{x}^{*}\rangle
=\displaystyle= l𝕁gl(x)(xlxl)+l𝕁¯gl(x)(xlxl)\displaystyle\sum\nolimits_{l\in\mathbb{J}}g_{l}(\textbf{x}^{*})(x_{l}-x_{l}^{*})+\sum\nolimits_{l\in\bar{\mathbb{J}}}g_{l}(\textbf{x}^{*})(x_{l}-x_{l}^{*})
=\displaystyle= l𝕁gl(x)l𝕁¯gl(x)0.\displaystyle\sum\nolimits_{l\in\mathbb{J}}g_{l}(\textbf{x}^{*})-\sum\nolimits_{l\in\bar{\mathbb{J}}}g_{l}(\textbf{x}^{*})\geq 0.

Therefore, x\textbf{x}^{*} is a δ\delta-local minimizer of the problem (4). ∎∎

We can observe from Theorem 4.1 that every blockwise optimal solution of (4) is a τ-stationary point. Conversely, if the entries in 𝑨, 𝒃, 𝒄, λ and ρ are integral and τ > (1/2) max_j{n_j}, then every τ-stationary point of (4) is blockwise optimal. Figure 1 summarizes these relationships among τ-stationary points, blockwise optimal solutions and local (global) minimizers.

Figure 1: The relationships among the τ-stationary point, blockwise optimal solution and local minimizer. Condition (a): the entries in 𝑨, 𝒃, 𝒄, λ and ρ are integral.

We finally present the main result on the convergence of the BCD method for solving (16b).

Theorem 4.2

(Convergence properties) Let {xt}t\{\textbf{\rm{x}}^{t}\}_{t\in\mathbb{N}} be a sequence generated by (16b). If the step size satisfies 0<τ<12κ0<\tau<\frac{1}{2\kappa}, then we have

  • (i)

    The sequence {L(xt,λ,ρ)}t\{L(\textbf{\rm{x}}^{t},\lambda,\rho)\}_{t\in\mathbb{N}} is nonincreasing and

    κ2xt+1xt2L(xt,λ,ρ)L(xt+1,λ,ρ).\frac{\kappa}{2}\|\textbf{\rm{x}}^{t+1}-\textbf{\rm{x}}^{t}\|^{2}\leq L(\textbf{\rm{x}}^{t},\lambda,\rho)-L(\textbf{\rm{x}}^{t+1},\lambda,\rho). (44)
  • (ii)

    The sequence {xt}\{\textbf{\rm{x}}^{t}\} converges to a τ\tau-stationary point after at most 2Cn+κnκ\lceil\frac{2C\sqrt{n}+\kappa n}{\kappa}\rceil iterations, where C=maxx𝒳L(x,λ,ρ)C=\max_{\textbf{\rm{x}}\in\mathcal{X}}\|\nabla L(\textbf{\rm{x}},\lambda,\rho)\|.

Proof:

(i) It is obvious from (40) that if 0<τ<12κ0<\tau<\frac{1}{2\kappa}, then (44) holds.

(ii) Given the parameters λ+m\lambda\in\mathbb{R}^{m}_{+} and ρ>0\rho>0, we can observe that the model (4) can be equivalently written as

minF(x):=L(x,λ,ρ)+δ𝒳(x),\min F(\textbf{x}):=L(\textbf{x},\lambda,\rho)+\delta_{\mathcal{X}}(\textbf{x}),

where δ_𝒳(x) is the indicator function of the set 𝒳, i.e., δ_𝒳(x) = 0 if x ∈ 𝒳 and +∞ otherwise. To verify that the sequence {x^t} converges to a τ-stationary point, we check that the conditions in [38, Theorem 1] hold.

Firstly, since the set 𝒳 = {x ∈ {0,1}^n : 𝑩x ≤ 𝒅} is semi-algebraic, δ_𝒳(x) is a Kurdyka–Łojasiewicz (KL) function [39, 38]. It is easy to see that L(x, λ, ρ) is also a KL function, and hence the function F(x) satisfies the KL property, which is a crucial condition for ensuring convergence.

Secondly, the Lipschitz constant κ>0\kappa>0 in Proposition 4.1 is bounded if ρ\rho has an upper bound, and the problem (4) is inf-bounded.

Thirdly, for any x,x¯𝒳\textbf{x},\bar{\textbf{x}}\in\mathcal{X} with x¯=(x1,,x¯j,,xp)\bar{\textbf{x}}=(\textbf{x}_{1},...,\bar{\textbf{x}}_{j},...,\textbf{x}_{p}),

xjL(x,λ,ρ)xjL(x¯,λ,ρ)\displaystyle\|\nabla_{\textbf{x}_{j}}L(\textbf{x},\lambda,\rho)-\nabla_{\textbf{x}_{j}}L(\bar{\textbf{x}},\lambda,\rho)\|
=\displaystyle= ρ𝑨j(𝑨x𝒃)+ρ𝑨j(𝑨x¯𝒃)+\displaystyle\left\|\rho\boldsymbol{A}_{j}^{\top}(\boldsymbol{A}\textbf{x}-\boldsymbol{b})_{+}-\rho\boldsymbol{A}_{j}^{\top}(\boldsymbol{A}\bar{\textbf{x}}-\boldsymbol{b})_{+}\right\|
\displaystyle\leq ρ𝑨j2(𝑨x𝒃)+(𝑨x¯𝒃)+\displaystyle\rho\|\boldsymbol{A}_{j}\|_{2}\left\|(\boldsymbol{A}\textbf{x}-\boldsymbol{b})_{+}-(\boldsymbol{A}\bar{\textbf{x}}-\boldsymbol{b})_{+}\right\|
\displaystyle\leq ρ𝑨j2𝑨x𝑨x¯ρ𝑨j22xjx¯j.\displaystyle\rho\|\boldsymbol{A}_{j}\|_{2}\left\|\boldsymbol{A}\textbf{x}-\boldsymbol{A}\bar{\textbf{x}}\right\|\leq\rho\|\boldsymbol{A}_{j}\|^{2}_{2}\left\|\textbf{x}_{j}-\bar{\textbf{x}}_{j}\right\|.

Therefore, the result follows from [38, Theorem 1].

Suppose x* is a τ-stationary point. Then for every x ∈ 𝒳, we have

L(x,λ,ρ)L(x,λ,ρ)\displaystyle L(\textbf{x},\lambda,\rho)-L(\textbf{x}^{*},\lambda,\rho)
\displaystyle\leq L(x,λ,ρ),xx+κ2xx2\displaystyle\langle\nabla L(\textbf{x}^{*},\lambda,\rho),\textbf{x}-\textbf{x}^{*}\rangle+\frac{\kappa}{2}\|\textbf{x}-\textbf{x}^{*}\|^{2}
\displaystyle\leq Cxx+κ2xx2Cn+κn2.\displaystyle C\|\textbf{x}-\textbf{x}^{*}\|+\frac{\kappa}{2}\|\textbf{x}-\textbf{x}^{*}\|^{2}\leq C\sqrt{n}+\frac{\kappa n}{2}. (45)

If each iteration always finds a new point until arriving at x\textbf{x}^{*}, we get

κT2\displaystyle\frac{\kappa T}{2} t=0T(L(xt,λ,ρ)L(xt+1,λ,ρ))\displaystyle\leq\sum\nolimits_{t=0}^{T}\left(L(\textbf{\rm{x}}^{t},\lambda,\rho)-L(\textbf{x}^{t+1},\lambda,\rho)\right)
L(x0,λ,ρ)L(x,λ,ρ),\displaystyle\leq L(\textbf{x}^{0},\lambda,\rho)-L(\textbf{x}^{*},\lambda,\rho), (46)

where the first inequality holds due to the conclusion in (i) and the fact that ‖x^{t+1} − x^t‖ ≥ 1. Combining (45) with (46) yields T ≤ ⌈(2C√n + κn)/κ⌉, i.e., at most ⌈(2C√n + κn)/κ⌉ iterations are needed to arrive at a τ-stationary point. ∎∎

If each iteration of the BCD method always finds a new point, then according to Theorem 4.2 (i) we have Σ_{t=0}^{∞} ‖x^{t+1} − x^t‖² < +∞. By Theorem 1 in [38], we can conclude that lim_{t→∞} x^t = x*, where x* is a limit point of the sequence {x^t}_{t∈ℕ}. If we choose an initial point that is already close to an optimal solution and a step size τ satisfying the conditions of Theorem 4.2, the BCD method can find a global solution of problem (4). The sketch below assembles the pieces of this subsection into a complete BCD loop.
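This sketch is a minimal, self-contained illustration under the same assumptions as the earlier examples: the augmented Lagrangian is taken as L(x, λ, ρ) = 𝒄^⊤x + λ^⊤(𝑨x − 𝒃) + (ρ/2)‖(𝑨x − 𝒃)_+‖² (an assumption consistent with the gradients used in the proofs), each 𝒳_j is the plain binary box, and the data are random placeholders. The loop stops as soon as an iterate repeats, i.e., at a τ-stationary point, illustrating the finite termination guaranteed by Theorem 4.2.

```python
# Full BCD loop (16b) on a toy instance; stops at a tau-stationary point.
import numpy as np

def L_val(x, A, b, c, lam, rho):
    r = np.maximum(A @ x - b, 0.0)
    return c @ x + lam @ (A @ x - b) + 0.5 * rho * (r @ r)

def grad_L(x, A, b, c, lam, rho):
    return c + A.T @ lam + rho * A.T @ np.maximum(A @ x - b, 0.0)

def bcd(x0, A, b, c, lam, rho, blocks, tau, max_iter=1000):
    x = x0.copy()
    for t in range(max_iter):
        g = grad_L(x, A, b, c, lam, rho)
        x_new = x.copy()
        for idx in blocks:
            coeff = tau * g[idx] + 0.5 - x[idx]
            x_new[idx] = (coeff < 0).astype(float)   # linear minimization over {0,1}^{n_j}
        if np.array_equal(x_new, x):                 # tau-stationary: the update returns x itself
            return x, t
        x = x_new
    return x, max_iter

rng = np.random.default_rng(1)
m, n = 4, 6
A = rng.integers(-1, 2, (m, n)).astype(float)
b = rng.integers(0, 3, m).astype(float)
c = rng.integers(-3, 4, n).astype(float)
lam, rho = np.ones(m), 5.0
blocks = [np.arange(0, 3), np.arange(3, 6)]
kappa = rho * np.linalg.norm(A, 2) ** 2
x_star, iters = bcd(np.zeros(n), A, b, c, lam, rho, blocks, tau=0.4 / kappa)
print("tau-stationary point", x_star, "after", iters, "iterations,",
      "objective", L_val(x_star, A, b, c, lam, rho))
```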

4.2 Convergence of ALM

Assume the BCD method returns a global minimizer x\textbf{x}^{*} to the augmented Lagrangian relaxation problem (4) in each inner loop. In this subsection, we focus on the convergence property of the projected subgradient method for solving the dual problem (5). For convenience, we introduce the following constants for subsequent analysis:

S\displaystyle S :=\displaystyle:= argmaxλ+m,ρ>0d(λ,ρ),\displaystyle{\text{arg}\max}_{\lambda\in\mathbb{R}_{+}^{m},\rho>0}~{}d(\lambda,\rho),
θ\displaystyle\theta :=\displaystyle:= min(λ,ρ)Sλ0λ2+(ρ0ρ)2.\displaystyle{\min}_{(\lambda,\rho)\in S}~{}\|\lambda^{0}-\lambda\|^{2}+(\rho^{0}-\rho)^{2}.

Let d_g^k denote a subgradient of d at (λ^k, ρ^k). We estimate the distance between the dual function value in each iteration and the optimal value of the problem (1) in the following theorem.

Theorem 4.3

If we take the step size αk=βkdgk\alpha^{k}=\frac{\beta_{k}}{\|d_{g}^{k}\|} with βk=θK\beta_{k}=\sqrt{\frac{\theta}{K}}, then,

fIPmaxk{1,2,,K}d(λk,ρk)ζ25θK,f^{\text{IP}}-\max_{k\in\{1,2,...,K\}}d(\lambda^{k},\rho^{k})\leq\frac{\zeta}{2}\sqrt{\frac{5\theta}{K}},

where ζ = max_{x∈𝒳} ‖𝑨x − 𝒃‖^4. Furthermore, if the positive sequence {β_k}_{k∈ℕ} is bounded and satisfies Σ_{k∈ℕ} β_k² < +∞, then (λ^k, ρ^k) converges to some (λ*, ρ*) ∈ S.

Proof:

Let {(λ^k, ρ^k)}_{k∈ℕ} be the sequence generated by Algorithm 1. We derive from Lemma 3 in [17] that

fIPmaxk{1,2,,K}d(λk,ρk)θ+k=1K(αkdgk)22k=1Kαk.f^{\text{IP}}-\max_{k\in\{1,2,...,K\}}d(\lambda^{k},\rho^{k})\leq\frac{\theta+\sum_{k=1}^{K}(\alpha^{k}\|d_{g}^{k}\|)^{2}}{2\sum_{k=1}^{K}\alpha^{k}}. (47)

For all kk\in\mathbb{N}, we have

dgk2=𝑨xk𝒃2+14(𝑨xk𝒃)+454𝑨xk𝒃454ζ.\|d_{g}^{k}\|^{2}=\|\boldsymbol{A}\textbf{x}^{k}-\boldsymbol{b}\|^{2}+\frac{1}{4}\|(\boldsymbol{A}\textbf{x}^{k}-\boldsymbol{b})_{+}\|^{4}\leq\frac{5}{4}\|\boldsymbol{A}\textbf{x}^{k}-\boldsymbol{b}\|^{4}\leq\frac{5}{4}\zeta.

Then the right-hand side of (47) satisfies

\frac{\theta+\sum_{k=1}^{K}(\alpha^{k}\|d_{g}^{k}\|)^{2}}{2\sum_{k=1}^{K}\alpha^{k}}\leq\frac{\sqrt{5}\zeta(\theta+\sum_{k=1}^{K}\beta_{k}^{2})}{4\sum_{k=1}^{K}\beta_{k}}=\frac{\zeta}{2}\sqrt{\frac{5\theta}{K}}.

Thus we arrive at the first result.

The second claim follows from [40, Theorem 7.4]; the proof given there for the subgradient method readily extends to the projected subgradient method. ∎∎
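For concreteness, the following sketch implements one projected-subgradient step on the dual variables (λ, ρ) with the step size α^k = β_k/‖d_g^k‖ used in Theorem 4.3. The subgradient components, 𝑨x^k − 𝒃 with respect to λ and ½‖(𝑨x^k − 𝒃)_+‖² with respect to ρ, are inferred from the norm computation in the proof and should be read as an assumption; x^k denotes the minimizer of the augmented Lagrangian returned by the inner BCD loop, and the lower bound rho_min that keeps ρ positive is an implementation detail introduced here.

```python
# One projected-subgradient step on (lambda, rho) for the dual problem (5) (illustrative).
import numpy as np

def dual_step(lam, rho, x_k, A, b, beta_k, rho_min=1e-8):
    r = A @ x_k - b
    d_lam = r                                                     # assumed subgradient w.r.t. lambda
    d_rho = 0.5 * float(np.maximum(r, 0.0) @ np.maximum(r, 0.0))  # assumed subgradient w.r.t. rho
    norm_dg = np.sqrt(d_lam @ d_lam + d_rho ** 2)
    if norm_dg == 0.0:                                   # x_k is feasible: keep the multipliers
        return lam, rho
    alpha = beta_k / norm_dg                             # step size alpha_k = beta_k / ||d_g^k||
    lam_new = np.maximum(lam + alpha * d_lam, 0.0)       # ascent step, projected onto lambda >= 0
    rho_new = max(rho + alpha * d_rho, rho_min)          # ascent step, kept strictly positive
    return lam_new, rho_new
```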

Although the theoretical analysis of ALM-C relies on Assumption 3.1, our tests show that ALM-C is also effective in more general settings. Furthermore, we present the ALM-P method as a versatile alternative that does not rely on this assumption.

5 Applications

In this section, we show the performance of the proposed algorithms on two practical problems: the train timetabling problem and the vehicle routing problem. To evaluate the compared methods, we use three termination criteria: a maximum number of iterations, a time limit, and an optimality gap based on the difference between the objective value generated by our methods and the best-known value. All instances are tested on a MacBook Pro 2019 with 8 GB of memory and an Intel Core i5 (Turbo Boost up to 4.1 GHz) with 128 MB eDRAM.

5.1 Capacitated vehicle routing problem

We consider the capacitated vehicle routing problem with time windows (CVRPTW). The problem is defined on a complete directed graph G=(V,E)G=(V,E), where V={0,1,,n}V=\{0,1,...,n\} is the node set and EE is the edge set. Node 0 represents the depot where the vehicles are based, and nodes 1 to nn represent the customers that need to be served. Each edge (s,t)(s,t) in EE has an associated travel time TstT_{st}. Each customer ss has a demand csc_{s} and a service time window [as,bs][a_{s},b_{s}]. We let dstd_{st} be the distance from node ss to node tt and MM be a large constant. The objective is to construct a set of least-cost vehicle routes starting and ending at the depot, such that each customer is visited exactly once within their time window, and the total demand of customers served in each route does not exceed vehicle capacity CC.

To formulate this problem as an integer program, we define the following decision variables: (i) x_st^j: a binary variable equal to 1 if edge (s,t) is used by vehicle j, and 0 otherwise; (ii) w_s^j: a continuous variable indicating the service start time at customer s for vehicle j. The block-structured integer programming formulation is then:

\min~~ \sum_{j\in\mathbb{N}_{p}}\sum_{(s,t)\in E}d_{st}x_{st}^{j}   (48a)
\text{s.t.}~~ \sum_{j\in\mathbb{N}_{p}}\sum_{t\in V:t\neq s}x_{st}^{j}=1, \quad s\in V\backslash\{0\},   (48b)
\sum_{t\in V\backslash\{s\}}x_{st}^{j}=\sum_{t\in V\backslash\{s\}}x_{ts}^{j}, \quad s\in V,~j\in\mathbb{N}_{p},   (48c)
\sum_{t\in V\backslash\{0\}}x_{0t}^{j}=1, \quad j\in\mathbb{N}_{p},   (48d)
\sum_{s\in V}\sum_{t\in V\backslash\{s\}}c_{s}x_{st}^{j}\leq C, \quad j\in\mathbb{N}_{p},   (48e)
w_{s}^{j}+T_{st}-M(1-x_{st}^{j})\leq w_{t}^{j}, \quad (s,t)\in E,~j\in\mathbb{N}_{p},   (48f)
a_{s}\leq w_{s}^{j}\leq b_{s}, \quad s\in V,~j\in\mathbb{N}_{p},   (48g)
x_{st}^{j}\in\{0,1\}, \quad (s,t)\in E,~j\in\mathbb{N}_{p},   (48h)

The block structure lies in the routing variables xstjx_{st}^{j} for each vehicle jj, which are constrained by the flow balance, capacity and time window constraints. By relaxing the coupling constraints (48b), the problem decomposes into separate routing subproblems per vehicle.
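To illustrate this decomposition pattern in code, the sketch below shows how, once the coupling constraints (48b) are relaxed, each vehicle solves an independent route subproblem whose arc costs combine the travel distances with the current contribution of the relaxed visit constraints. The function `solve_single_vehicle_route` is a hypothetical placeholder for a single-vehicle solver (e.g., a small MIP with the constraints (48c)-(48h) for one vehicle), and the cost-modification rule is a simplified stand-in for the linearized augmented Lagrangian terms, not the paper's exact implementation.

```python
# Per-vehicle decomposition obtained by relaxing the coupling constraints (48b) (illustrative).
def modified_arc_costs(dist, visit_price):
    """Combine the travel distance of each arc with the current price of visiting its tail node.
    dist[(s, t)]: travel distance of arc (s, t); visit_price[s]: linear contribution of the
    relaxed constraint (48b) for customer s (hypothetical placeholder values)."""
    return {(s, t): dist[(s, t)] + visit_price.get(s, 0.0) for (s, t) in dist}

def decomposed_pass(vehicles, dist, visit_price, solve_single_vehicle_route):
    """One pass over all vehicles: each vehicle solves its own route subproblem independently."""
    routes = {}
    for j in vehicles:
        costs_j = modified_arc_costs(dist, visit_price)
        routes[j] = solve_single_vehicle_route(j, costs_j)   # hypothetical single-vehicle solver
    return routes
```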

5.1.1 Parameter Setting

We let ALM-C and ALM-P denote Algorithm 4 using the updates (16a) and (16b) in step 2, respectively. We compare our proposed methods with the Gurobi solver (version 11.0.0) and OR-Tools on all the instances from the Solomon dataset [41]. Since the ADMM in [20] does not adapt to all instances, we do not compare with it. To illustrate the scale of these instances, we present a subset of representative examples in Table I. The notations |F|, |V| and |E| represent the number of vehicles, the number of customers and the number of edges, and n and n_w denote the numbers of variables x and w, respectively. To evaluate the robustness of the compared methods, we conduct experiments on the C1-type instances with various problem sizes, as detailed in Table II. The results for all remaining instances from the Solomon dataset are presented in Table III.

We adopt a “vehicle and route-based” formulation to model the CVRPTW problem, which is different from the space-time network flow formulation used in [20]. Therefore, the subproblem is a routing problem with capacity and time window constraints, and we solve it with the Gurobi solver for convenience. Note that the formulation does not affect Gurobi’s performance. To ensure a fair comparison and maximize Gurobi’s utilization, we have implemented efficient callback functions based on Dantzig’s formulation to fine-tune its performance. We set a time limit of 2000 seconds for the compared methods, except for OR-Tools, where the time limit is set to 500 seconds, and use the symbol “–” to signify that the solver failed to find a feasible solution within the allotted time limit.

TABLE I: A subset of representative examples of the Solomon datasets
No. |F||F| |V||V| |E||E| mm qq nn nwn_{w}
R101-50 12 50 2,550 50 30,636 30,600 612
R201-50 6 50 2,550 50 15,318 15,300 306
RC101-50 8 50 2,550 50 20,424 20,400 408
RC201-50 5 50 2,550 50 12,765 12,750 255
 C101-100 10 100 10,100 100 101,030 101,000 1,010
C201-100 3 100 10,100 100 30,309 30,300 303

5.1.2 Performances of the Proposed Algorithm

In the subsequent tables, the “f*” and “f” columns correspond to the best-known objective values and the objective values of the feasible solutions generated by the compared methods, respectively. The “Time” column denotes the CPU time (in seconds) taken by the algorithms to meet the stopping criteria. The optimality gap is defined by gap¹ = |f − f*|/|f*|. Since OR-Tools has no inherent termination criterion, it runs for the entire time limit and outputs the corresponding feasible solution; therefore, we do not report its solution time. Table II shows the stability of the compared methods under various problem sizes on the C1-type instances. We can observe that our methods outperform OR-Tools and Gurobi, and are more stable than OR-Tools. From Table III, we can find that the ALM-C algorithm outperforms OR-Tools and Gurobi in most instances, achieving significantly lower optimality gaps and competitive computation times. The ALM-P algorithm also exhibits promising results, often outperforming Gurobi in efficiency and solution quality. Overall, both the proposed ALM-C and ALM-P algorithms demonstrate their superiority and robustness in solving the CVRPTW problem, providing high-quality solutions with good computational efficiency compared to existing solvers and heuristic approaches.

TABLE II: Performance comparison on Solomon’s C1 instances with varying problem size and perturbation. A time limit of 500 seconds is set for OR-Tools.
ins. |V||V| |F||F| ff^{*} OR-Tools Gurobi ALM-P ALM-C
ff gap1 ff gap1 Time ff gap1 Time ff gap1 Time
c101 25 3 191.3 632.6 230.7% 191.8 0.3% 0.0 191.8 0.3% 2.4 191.8 0.3% 2.7
50 5 362.4 514.4 41.9% 363.2 0.2% 1.1 363.2 0.2% 5.9 363.2 0.2% 5.0
100 10 827.3 828.9 0.2% 828.9 0.2% 8.9 828.9 0.2% 57.8 828.9 0.2% 55.0
c102 25 3 190.3 786.1 313.1% 190.7 0.2% 1.9 190.7 0.2% 3.7 190.7 0.2% 4.6
50 5 361.4 667.9 84.8% 362.2 0.2% 25.5 362.2 0.2% 16.0 362.2 0.2% 15.3
100 10 827.3 828.9 0.2% - 2000 828.9 0.2% 167.4 828.9 0.2% 196.4
c103 25 3 190.3 786.1 313.1% 190.7 0.2% 7.0 190.7 0.2% 15.9 190.7 0.2% 11.0
50 5 361.4 667.9 84.8% 362.2 0.2% 1584.8 365.3 1.1% 40.0 362.2 0.2% 41.5
100 10 826.3 - - - 2000 839.0 1.5% 362.6 828.1 0.2% 259.9
c104 25 3 186.9 875.6 368.5% 187.4 0.3% 45.1 188.6 0.9% 33.5 188.6 0.9% 17.9
50 5 358.0 606.2 69.3% 360.1 0.6% 2000 362.2 1.2% 57.1 358.9 0.3% 319.8
100 10 822.9 1202.3 46.1% - 2000 890.9 8.3% 667.5 904.9 10.0% 363.3
c105 25 3 191.3 609.6 218.6% 191.8 0.3% 0.1 191.8 0.3% 3.3 191.8 0.3% 3.4
50 5 362.4 482.4 33.1% 363.2 0.2% 0.3 363.2 0.2% 6.4 363.2 0.2% 6.0
100 10 827.3 828.9 0.2% 828.9 0.2% 15.8 828.9 0.2% 53.3 828.9 0.2% 34.2
c106 25 3 191.3 639.6 234.3% 191.8 0.3% 0.1 191.8 0.3% 3.6 191.8 0.3% 3.8
50 5 362.4 430.4 18.8% 363.2 0.2% 1.4 363.2 0.2% 5.8 363.2 0.2% 5.9
100 10 827.3 828.9 0.2% 828.9 0.2% 650.6 828.9 0.2% 69.6 828.9 0.2% 47.0
c107 25 3 191.3 565.6 195.6% 191.8 0.3% 0.1 191.8 0.3% 3.2 191.8 0.3% 4.3
50 5 362.4 457.4 26.2% 363.2 0.2% 0.8 363.2 0.2% 6.1 363.2 0.2% 6.6
100 10 827.3 828.9 0.2% 828.9 0.2% 14.6 828.9 0.2% 65.2 828.9 0.2% 35.2
c108 25 3 191.3 564.6 195.1% 191.8 0.3% 0.9 191.8 0.3% 4.5 191.8 0.3% 4.4
50 5 362.4 - - 363.2 0.2% 18.6 363.2 0.2% 8.4 363.2 0.2% 14.2
100 10 827.3 - - - 2000 828.9 0.2% 98.0 828.9 0.2% 265.4
c109 25 3 191.3 475.6 148.6% 191.8 0.3% 9.6 191.8 0.3% 6.3 191.8 0.3% 9.7
50 5 362.4 363.2 0.2% 363.2 0.2% 1867.4 363.2 0.2% 12.3 363.2 0.2% 13.1
100 10 827.3 828.9 0.2% - 2000 828.9 0.2% 234.5 828.9 0.2% 304.3
TABLE III: Complete results on other Solomon’s instances, all instances are associated with 50 customers. A time limit of 500 seconds is set for OR-Tools.
ins. |J||J| ff^{*} OR-Tools Gurobi ALM-P ALM-C
ff ε\varepsilon ff ε\varepsilon tt ff ε\varepsilon tt ff ε\varepsilon tt
c201.50 3 360.2 1622 350.3% 361.8 0.4% 0.1 361.8 0.4% 4.2 361.8 0.4% 3.4
c202.50 3 360.2 1622 350.3% 361.8 0.4% 5.7 361.8 0.4% 19.1 366.8 1.8% 7.1
c203.50 3 359.8 1713.4 376.2% 361.4 0.4% 113.5 361.4 0.4% 43.7 361.4 0.4% 76.8
c204.50 2 350.1 1435.3 310.0% 351.7 0.5% 2000 366.9 4.8% 55.7 351.7 0.5% 166.3
c205.50 3 359.8 1729 380.5% 361.4 0.4% 1.1 361.4 0.4% 7.9 361.4 0.4% 7.0
c206.50 3 359.8 1741.8 384.1% 361.4 0.4% 1.5 361.4 0.4% 14.4 361.4 0.4% 19.1
c207.50 3 359.6 1622.3 351.1% 361.2 0.4% 11.9 361.2 0.5% 25.0 361.2 0.4% 125.7
c208.50 2 350.5 1629.2 364.8% 352.1 0.5% 8.0 355.9 1.5% 14.2 352.1 0.5% 31.5
r101.50 12 1044 1541.9 47.7% 1046.7 0.3% 2.1 1046.7 0.3% 38.5 1046.7 0.3% 47.3
r102.50 11 909 1374.8 51.2% 911.4 0.3% 2000 939.5 3.4% 248.9 925.3 1.8% 317.2
r103.50 9 772.9 1069.1 38.3% - - 2000 806.4 4.3% 289.8 803.8 4.0% 256.1
r104.50 6 625.4 743.5 18.9% - - 2000 654.4 4.6% 472.1 686.6 9.8% 649.9
r105.50 9 899.3 1101.7 22.5% 901.9 0.3% 268.5 906.13 0.8% 61.1 901.9 0.3% 60.2
r106.50 5 793 937.8 18.3% - - 2000 - - - - - -
r107.50 7 711.1 790.5 11.2% - - 2000 816.47 14.8% 120.5 741.5 4.3% 336.2
r108.50 6 617.7 698.1 13.0% - - 2000 645.7 4.5% 215.1 643.9 4.2% 466.5
r109.50 8 786.8 873.2 11.0% 805.6 2.4% 2000 796.3 1.2% 77.2 788.7 0.2% 324.9
r110.50 7 697 887.1 27.3% - - 2000 754.9 8.3% 560.6 768.1 10.2% 225.4
r111.50 7 707.2 784.5 10.9% - - 2000 750.0 6.0% 297.8 801.9 13.4% 220.8
r112.50 6 630.2 730.6 15.9% - - 2000 665.7 5.6% 476.5 698.0 10.8% 300.2
r201.50 6 791.9 1196.9 51.1% 794.3 0.3% 5.2 803.5 1.5% 27.0 801.3 1.2% 30.4
r202.50 5 698.5 1173.2 68.0% 723.0 3.5% 2000 735.7 5.3% 437.0 726.3 4.0% 337.2
r203.50 5 605.3 1173.2 93.8% 608.0 0.4% 2000 639.4 5.6% 168.6 653.0 7.9% 380.9
r204.50 2 506.4 1127.9 122.7% 512.4 1.2% 2000 524.0 3.5% 75.8 583.8 15.3% 401.6
r205.50 4 690.1 1132.3 64.1% 700.2 1.5% 2000 732.4 6.1% 74.5 731.7 6.0% 87.8
r206.50 4 632.4 1066.9 68.7% 657.2 3.9% 2000 682.1 7.9% 188.4 677.2 7.1% 266.6
r207.50 3 361.6 1046.8 189.5% - - 2000 362.6 0.3% 13.1 364.1 0.7% 25.2
r208.50 1 328.2 1019.6 210.7% - - 2000 329.3 0.3% 11.9 329.3 0.3% 13.3
r209.50 4 600.6 1043.6 73.8% 639.6 6.5% 2000 608.5 1.3% 217.9 609.4 1.5% 58.5
r210.50 4 645.6 1119.5 73.4% 661.0 2.4% 2000 710.2 10.0% 288.6 669.2 3.7% 253.4
r211.50 3 535.5 958.7 79.0% - - 2000 590.4 10.3% 168.8 555.1 3.7% 339.8
rc101.50 8 944 1108.3 17.4% 945.6 0.2% 2000 945.6 0.2% 191.2 945.6 0.2% 78.8
rc102.50 7 822.5 940.7 14.4% - - 2000 823.1 0.1% 510.3 823.1 0.1% 242.9
rc103.50 6 710.9 834.9 17.4% - - 2000 736.4 3.6% 134.7 751.3 5.7% 335.2
rc104.50 5 545.8 641.4 17.5% - - 2000 546.5 0.1% 101.6 546.5 0.1% 91.5
rc105.50 8 855.3 1112.7 30.1% - - 2000 873.2 2.1% 99.0 902.7 5.5% 134.4
rc106.50 6 723.2 793 9.7% - - 2000 733.7 1.5% 79.3 728.1 0.7% 152.4
rc107.50 6 642.7 752.3 17.1% - - 2000 650.3 1.2% 235.2 644.0 0.2% 183.7
rc108.50 6 598.1 690.3 15.4% - - 2000 600.7 0.4% 256.1 599.2 0.2% 276.4
rc201.50 5 684.8 1904.7 178.1% 686.3 0.2% 12.5 687.7 0.4% 31.9 686.3 0.2% 14.8
rc202.50 5 613.6 1172.6 91.1% 615.0 0.2% 2000 615.6 0.3% 62.2 615.0 0.2% 85.9
rc203.50 4 555.3 1163.1 109.5% 556.5 0.2% 2000 590.4 6.3% 256.7 558.5 0.6% 265.8
rc204.50 3 444.2 1090.3 145.5% 509.4 14.7% 2000 451.8 1.7% 218.4 462.3 4.1% 153.7
rc205.50 5 630.2 1222.7 94.0% 632.0 0.3% 2000 632.0 0.3% 89.8 632.0 0.3% 56.0
rc206.50 5 610 1088.5 78.4% 611.7 0.3% 2000 611.7 0.3% 36.8 611.7 0.3% 42.9
rc207.50 4 558.6 998.3 78.7% - - 2000 591.7 5.9% 315.5 608.5 8.9% 281.1
rc208.50 2 269.1 936.9 248.2% - - 2000 269.6 0.2% 65.0 269.6 0.2% 109.1

To demonstrate the advantages of our methods, we show the convergence curves of the primal and dual bounds of Gurobi and compare them with the solution found by our solver on the instance “C109.50”. As we can observe in Figure 2, our methods achieve near-optimal solutions in around 15 seconds, significantly faster than Gurobi, which takes approximately 300 seconds. We also notice that as the objective values increase, the corresponding constraint violation decreases simultaneously, facilitating fast convergence.

Figure 2: Comparison of Gurobi and our methods: (a) objective value; (b) constraint violation; (c) primal and dual bounds of Gurobi.

5.2 Train timetabling problem

We consider the following space-time network model for the train timetabling problem (TTP) at a macroscopic level, based on the model in [42]. We use a directed acyclic multigraph G = (V, E) to characterize the train timetabling problem, where V and E denote the set of all nodes and the set of all arcs, respectively. For each train j ∈ ℕ_p, sets or parameters carrying the superscript or subscript j refer to the corresponding objects of train j. For each arc e ∈ E_j, we introduce a binary variable x_e equal to 1 if the arc e is selected. For each node v ∈ V, let δ_j^+(v) and δ_j^-(v) be the sets of arcs in E_j leaving and entering node v, respectively. Then the integer programming model of the TTP is given by

max\displaystyle\max jpeEjpexe\displaystyle\sum\nolimits_{j\in\mathbb{N}_{p}}\sum\nolimits_{e\in E^{j}}p_{e}x_{e} (49a)
s.t. eδj+(σ)xe1,jp\displaystyle\sum\nolimits_{e\in\delta_{j}^{+}(\sigma)}x_{e}\leq 1,\hskip 88.2037ptj\in\mathbb{N}_{p} (49b)
eδj(v)xe=eδj+(v)xe,jp,vV\{σ,τ}\displaystyle\sum_{e\in\delta_{j}^{-}(v)}x_{e}=\sum_{e\in\delta_{j}^{+}(v)}x_{e},~{}j\in\mathbb{N}_{p},\hskip 6.25958ptv\in V\backslash\{\sigma,\tau\} (49c)
eδj(τ)xe1,jp\displaystyle\sum\nolimits_{e\in\delta_{j}^{-}(\tau)}x_{e}\leq 1,\hskip 91.04872ptj\in\mathbb{N}_{p} (49d)
v𝒩(v)j𝒯(v)eδj(v)xe1,vV\displaystyle\sum\nolimits_{v^{\prime}\in\mathcal{N}(v)}\sum\nolimits_{j\in\mathcal{T}(v^{\prime})}\sum\nolimits_{e\in\delta_{j}^{-}(v^{\prime})}x_{e}\leq 1,v\in V (49e)
eCxe1,C𝒞\displaystyle\sum\nolimits_{e\in C}x_{e}\leq 1,\hskip 105.2751ptC\in\mathcal{C} (49f)
xe{0,1},eE,\displaystyle x_{e}\in\{0,1\},\hskip 120.92421pte\in E, (49g)

where p_e is the “profit” of using a certain arc e, and σ and τ denote the artificial origin and destination nodes, respectively. 𝒯(v) and 𝒩(v) denote the set of trains that may pass through node v and the set of nodes in conflict with node v, respectively. 𝒞 denotes the (exponentially large) family of maximal subsets C of pairwise incompatible arcs. In this model, (49b), (49c) and (49d) imply that the arcs selected for train j form a valid path in G; (49e) represents the headway constraints; and (49f) forbids the simultaneous selection of incompatible arcs, imposing the track capacity constraints. A toy illustration of such a per-train space-time graph is sketched below.
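The following toy sketch assembles such a per-train space-time graph: nodes are (station, time) pairs together with the artificial origin σ and destination τ, and arcs model departures, travel and dwelling. The station list, travel times, horizon and maximal dwell time are illustrative assumptions, not data from the Jinghu instances.

```python
# Toy construction of a per-train space-time graph (illustrative assumptions only).
def build_space_time_graph(stations, travel_time, horizon, dwell_max=2):
    """Return arcs (u, v) with u, v in {("sigma",), ("tau",)} or (station, time)."""
    arcs = []
    for t in range(horizon):                               # departure arcs from the origin sigma
        arcs.append((("sigma",), (stations[0], t)))
    for i, s in enumerate(stations[:-1]):
        nxt = stations[i + 1]
        for t in range(horizon):
            t_arr = t + travel_time[(s, nxt)]
            if t_arr < horizon:
                arcs.append(((s, t), (nxt, t_arr)))        # travel arc to the next station
            for d in range(1, dwell_max + 1):              # dwell arcs at station s
                if t + d < horizon:
                    arcs.append(((s, t), (s, t + d)))
    for t in range(horizon):                               # arrival arcs into the destination tau
        arcs.append(((stations[-1], t), ("tau",)))
    return arcs

arcs = build_space_time_graph(["A", "B", "C"], {("A", "B"): 2, ("B", "C"): 3}, horizon=8)
print(len(arcs), "arcs in the toy per-train graph")
```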

Let x_j = {x_e ∣ e ∈ E_j}. Then we can rewrite this space-time network model in the general form (1). Our goal is to show that the proposed Algorithm 4 is fully capable of providing implementable timetables for the Jinghu railway. Specifically, our algorithms are tested on the timetabling problem for the Beijing-Shanghai high-speed railway (Jinghu high-speed railway in Mandarin). As one of the busiest railways in the world, the Beijing-Shanghai high-speed railway transported over 210 million passengers in 2019 (for details, see https://en.wikipedia.org/wiki/Beijing-Shanghai_high-speed_railway). In our test case, the problem consists of 29 stations and 292 trains in both directions (up and down), including two major speed levels: 300 km/h and 350 km/h. Several numerical experiments are carried out based on the data of the Beijing-Shanghai high-speed railway to demonstrate the feasibility and effectiveness of the proposed strategy.

We compare our proposed methods with the Gurobi solver and ADMM [21] on small and real-world instances, as presented in Tables IV and V. The notations |F|, |S| and |T| represent the number of trains, the number of stations, and the length of the time window. In most large-scale cases, the Gurobi solver takes much longer to solve, so we set a time limit of two hours. Since (49a) maximizes the total revenue, reversing the objective into a minimization problem yields negative arc costs; both subproblems (16a) and (16b) are then solved by the Bellman-Ford algorithm, an efficient tool for shortest-path problems (a minimal sketch is given below).
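The following is a minimal Bellman-Ford sketch for the per-train shortest-path subproblem referenced above: arc profits p_e are negated so that maximizing the revenue along a σ-τ path becomes a shortest-path problem with (possibly negative) arc costs. The toy edge list is purely illustrative and is not data from the Jinghu instances.

```python
# Bellman-Ford on a small per-train graph; profits are negated into costs (illustrative).
import math

def bellman_ford(num_nodes, edges, source):
    """edges: list of (u, v, cost). Returns (dist, pred) of shortest paths from `source`."""
    dist = [math.inf] * num_nodes
    pred = [None] * num_nodes
    dist[source] = 0.0
    for _ in range(num_nodes - 1):            # at most |V| - 1 relaxation rounds
        updated = False
        for u, v, cost in edges:
            if dist[u] + cost < dist[v]:
                dist[v], pred[v] = dist[u] + cost, u
                updated = True
        if not updated:
            break
    return dist, pred

# Toy graph: node 0 = artificial origin sigma, node 4 = destination tau; costs = -profits.
edges = [(0, 1, -3.0), (0, 2, -1.0), (1, 3, -4.0), (2, 3, -2.0), (3, 4, 0.0)]
dist, pred = bellman_ford(5, edges, source=0)

path, v = [], 4                               # recover the most profitable sigma-tau path
while v is not None:
    path.append(v)
    v = pred[v]
print("best path:", list(reversed(path)), "total profit:", -dist[4])
```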

5.2.1 Performances on small instances

We first validate our algorithm on a smaller sub-network using a subset of stations of the Jinghu railway. We set the revenue of each arc proportional to the distance between two stations, which reflects the intuition that longer rides should generate higher revenue. For our proposed methods, we uniformly set ρ⁰ = 20 and σ = 1.2 for all instances, and we set a time limit of 100 seconds for our algorithms. Since the optimal value is unknown, we report the upper bounds (UB) obtained by the Gurobi solver as a reference for comparison in Table IV. We can observe that our ALM-C performs competitively with the Gurobi solver in terms of both the objective value and the computation time.

TABLE IV: Performance on four small examples on sub-networks of the Jinghu railway for maximizing the total revenue
|F||F| |S||S| |T||T| Gurobi ADMM ALM-P ALM-C
UB ff Time ff Time ff Time ff^{*} Time
20 15 200 20952 20952 5.0 20952 14.2 20952 13.1 20952 10.0
25 15 200 23396 23396 11.5 23396 31.3 21085 22.1 23396 11.4
30 15 300 25707 25707 14.4 24552 14.1 25707 23.0 25707 9.2
40 15 300 34305 34305 24.3 31817 100.0 32061 100.0 34305 17.0

5.2.2 Performances on real-world instances

To verify the efficiency of ALM-C and ALM-P on large-scale data with all stations involved, we consider the five examples of ascending problem size in Table V. Similarly, we try to maximize the total revenue of the schedule. In practice, the revenue of an arc may be ambiguous. Besides, since the Jinghu high-speed railway has a high level of utilization, introducing new trains is always beneficial since it covers otherwise unmet demand. In this view, we introduce an alternative objective function that simply maximizes the number of scheduled trains. Perhaps not surprisingly, this simplified objective can dramatically speed up our methods in practice, because the coefficient p_e in the objective function is sparse, which leads to faster identification of the optimal solution under this model.

For our proposed methods, we set ρ⁰ = 10⁻³ and σ = 2 for the instances No. 1-2, ρ⁰ = 10⁻² and σ = 1.1 for the instances No. 3 and No. 5, and ρ⁰ = 10⁻³ and σ = 1.1 for the instance No. 4. We set x⁰ and λ⁰ to zero for all instances. Both the number of variables and the number of constraints of the model (49) are very large for these practical TTP instances, with dimensions up to tens of millions. Tables VI and VII show the performance results of the compared methods, where gap² := (UB − f)/f.

TABLE V: Five examples of the large practical railway network
No. |F||F| |S||S| |T||T| mm qq nn
1 50 29 300 49,837 231,722 308,177
2 50 29 600 113,737 1,418,182 1,396,572
3 100 29 720 145,135 1,788,088 4,393,230
4 150 29 960 199,211 3,655,151 9,753,590
5 292 29 1080 228,584 13,625,558 26,891,567
TABLE VI: Performance comparison between the Gurobi solver, ADMM, ALM-P and ALM-C on the large networks for maximizing total revenue in Table V.
No. Gurobi ADMM ALM-P ALM-C
ff gap2 Time ff gap2 Time ff gap2 Time ff gap2 Time
1 6.18e+04 13.6% 7200.0 6.94e+04 3.0% 54.1 6.93e+04 3.1% 84.3 6.98e+04 2.5% 72.2
2 7.61e+04 45.3% 7200.0 1.34e+05 3.5% 96.2 1.34e+05 3.4% 144.4 1.35e+05 3.0% 126.1
3 9.69e+04 61.5% 7200.0 2.33e+05 7.3% 558.3 2.36e+05 6.3% 414.9 2.36e+05 6.3% 396.6
4 1.23e+05 74.5% 7200.0 4.38e+05 9.6% 1422.7 4.42e+05 8.7% 924.3 4.45e+05 8.1% 714.9
5 2.95e+05 74.0% 7200.0 9.15e+05 19.5% 3600.0 9.60e+05 15.6% 2094.2 9.63e+05 15.3% 1722.7
TABLE VII: Performance comparison between the Gurobi solver, ADMM, ALM-P and ALM-C on the large networks for maximizing the number of trains in the timetable in Table V
No. ff^{*} Gurobi ADMM ALM-P ALM-C
ff gap1 Time ff gap1 Time ff gap1 Time ff gap1 Time
1 27 3836.0 27 3.4 27 3.4 27 3.5
2 50 50 0 2965.8 50 0 8.2 50 0 8.1 50 0 8.2
3 100 11 89.0% 7200.0 89 11.0% 3001.4 100 0 127.4 100 0 96.4
4 150 14 90.7% 7200.0 148 1.3% 3005.3 150 0 99.5 150 0 96.4
5 292 30 89.7% 7200.0 286 2.1% 3003.8 292 0 203.1 292 0 172.1

The results clearly demonstrate the effectiveness of our methods compared to Gurobi, especially when dealing with large-scale data. For the instance No. 1, both ALM-C and ALM-P can schedule 27 trains in a few seconds, which is at least 1000 times faster than the Gurobi solver, while obtaining satisfactory accuracy. In particular, when the scale of the data grows to tens of millions in instance No. 5, our ALM-C and ALM-P successfully schedule all 292 trains. In contrast, the Gurobi solver schedules only 30 trains and takes much longer. The results of the other instances also illustrate the effectiveness of this technique. Therefore, we can conclude that the customized ALM provides fast and globally optimal solutions for this practical problem.

6 Conclusion

In this paper, we study general integer programming with block structure. Benefiting from its special structure, we extend the augmented Lagrangian method, originally designed for continuous problems, to effectively solve the problem (1). By introducing a novel augmented Lagrangian function, we establish the strong duality and optimality for the problem (1). Furthermore, we provide the convergence results of the proposed methods for both the augmented Lagrangian relaxation and dual problems. To obtain high-quality feasible solutions, we develop a customized ALM combined with refinement techniques to iteratively improve the primal and dual solution quality simultaneously. The numerical experiments demonstrate that the customized ALM is time-saving and performs well for finding optimal solutions to a wide variety of practical problems.

References

  • [1] P. Wang, C. Shen, A. van den Hengel, and P. H. Torr, “Large-scale binary quadratic optimization using semidefinite relaxation and applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 3, pp. 470–485, 2016.
  • [2] B. S. Y. Lam and A. W.-C. Liew, “A fast binary quadratic programming solver based on stochastic neighborhood search,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 1, pp. 32–49, 2020.
  • [3] X. Lin, Z. J. Hou, H. Ren, and F. Pan, “Approximate mixed-integer programming solution with machine learning technique and linear programming relaxation,” in 2019 3rd International Conference on Smart Grid and Smart Cities (ICSGSC).   IEEE, 2019, pp. 101–107.
  • [4] R. Anderson, J. Huchette, W. Ma, C. Tjandraatmadja, and J. P. Vielma, “Strong mixed-integer programming formulations for trained neural networks,” Mathematical Programming, vol. 183, no. 1, pp. 3–39, 2020.
  • [5] F. Eisenbrand, C. Hunkenschröder, K.-M. Klein, M. Kouteckỳ, A. Levin, and S. Onn, “An algorithmic theory of integer programming,” arXiv preprint arXiv:1904.01361, 2019.
  • [6] J. Cslovjecsek, M. Kouteckỳ, A. Lassota, M. Pilipczuk, and A. Polak, “Parameterized algorithms for block-structured integer programs with large entries,” in Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA).   SIAM, 2024, pp. 740–751.
  • [7] J. A. De Loera, R. Hemmecke, S. Onn, and R. Weismantel, “N-fold integer programming,” Discrete Optimization, vol. 5, no. 2, pp. 231–241, 2008.
  • [8] L. Chen, On Block-Structured Integer Programming and Its Applications.   Cham: Springer International Publishing, 2019, pp. 153–177.
  • [9] A. H. Land and A. G. Doig, “An automatic method for solving discrete programming problems,” in 50 Years of Integer Programming 1958-2008.   Springer, 2010, pp. 105–132.
  • [10] R. E. Gomory, “Outline of an algorithm for integer solutions to linear programs,” Bulletin of the American Mathematical Society, vol. 64, pp. 275–278, 1958.
  • [11] J. F. Benders, “Partitioning procedures for solving mixed-variables programming problems,” Numerische Mathematik, vol. 4, no. 1, pp. 238–252, 1962.
  • [12] G. B. Dantzig and P. Wolfe, “Decomposition principle for linear programs,” Operations Research, vol. 8, no. 1, pp. 101–111, 1960.
  • [13] A. M. Geoffrion, “Lagrangean relaxation for integer programming,” in Approaches to Integer Programming.   Springer, 1974, pp. 82–114.
  • [14] M. Conforti, G. Cornuéjols, and G. Zambelli, Integer programming.   Springer, 2014, vol. 271.
  • [15] A. Caprara, M. Monaci, P. Toth, and P. L. Guida, “A lagrangian heuristic algorithm for a real-world train timetabling problem,” Discrete Applied Mathematics, vol. 154, no. 5, pp. 738–753, 2006.
  • [16] B. Wu and B. Ghanem, “ℓ_p-box ADMM: A versatile framework for integer programming,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 7, pp. 1695–1708, 2018.
  • [17] K. Sun, M. Sun, and W. Yin, “Decomposition methods for global solutions of mixed-integer linear programs,” arXiv preprint arXiv:2102.11980, 2021.
  • [18] M. Feizollahi, “Large-scale unit commitment: Decentralized mixed integer programming approaches,” Ph.D. dissertation, Georgia Institute of Technology, 2015.
  • [19] R. Takapoui, “The alternating direction method of multipliers for mixed-integer optimization applications,” Ph.D. dissertation, Stanford University, 2017.
  • [20] Y. Yao, X. Zhu, H. Dong, S. Wu, H. Wu, L. C. Tong, and X. Zhou, “ADMM-based problem decomposition scheme for vehicle routing problem with time windows,” Transportation Research Part B: Methodological, vol. 129, pp. 156–174, 2019.
  • [21] Q. Zhang, R. M. Lusby, P. Shang, and X. Zhu, “Simultaneously re-optimizing timetables and platform schedules under planned track maintenance for a high-speed railway network,” Transportation Research Part C: Emerging Technologies, vol. 121, p. 102823, 2020.
  • [22] Y. Kanno and S. Kitayama, “Alternating direction method of multipliers as a simple effective heuristic for mixed-integer nonlinear optimization,” Structural and Multidisciplinary Optimization, vol. 58, no. 3, pp. 1291–1295, 2018.
  • [23] E. L. Johnson, G. L. Nemhauser, and M. W. Savelsbergh, “Progress in linear programming-based algorithms for integer programming: An exposition,” Informs Journal on Computing, vol. 12, no. 1, pp. 2–23, 2000.
  • [24] J. Feldman, M. J. Wainwright, and D. R. Karger, “Using linear programming to decode binary linear codes,” IEEE Transactions on Information Theory, vol. 51, no. 3, pp. 954–972, 2005.
  • [25] M. Laurent and F. Rendl, “Semidefinite programming and integer programming,” Handbooks in Operations Research and Management Science, vol. 12, pp. 393–514, 2005.
  • [26] P. Wang, C. Shen, and A. Van Den Hengel, “A fast semidefinite approach to solving binary quadratic problems,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1312–1319.
  • [27] A. Camisa, I. Notarnicola, and G. Notarstefano, “A primal decomposition method with suboptimality bounds for distributed mixed-integer linear programming,” in 2018 IEEE Conference on Decision and Control (CDC).   IEEE, 2018, pp. 3391–3396.
  • [28] R. Vujanic, P. M. Esfahani, P. J. Goulart, S. Mariéthoz, and M. Morari, “A decomposition method for large scale MILPs, with performance guarantees and a power system application,” Automatica, vol. 67, pp. 144–156, 2016.
  • [29] P. Muts, I. Nowak, and E. M. Hendrix, “The decomposition-based outer approximation algorithm for convex mixed-integer nonlinear programming,” Journal of Global Optimization, vol. 77, no. 1, pp. 75–96, 2020.
  • [30] W. Murray and K.-M. Ng, “An algorithm for nonlinear optimization problems with binary variables,” Computational Optimization and Applications, vol. 47, no. 2, pp. 257–288, 2010.
  • [31] S. Lucidi and F. Rinaldi, “Exact penalty functions for nonlinear integer programming problems,” Journal of Optimization Theory and Applications, vol. 145, no. 3, pp. 479–488, 2010.
  • [32] C. Yu, K. L. Teo, and Y. Bai, “An exact penalty function method for nonlinear mixed discrete programming problems,” Optimization Letters, vol. 7, no. 1, pp. 23–38, 2013.
  • [33] M. J. Feizollahi, S. Ahmed, and A. Sun, “Exact augmented Lagrangian duality for mixed integer linear programming,” Mathematical Programming, vol. 161, no. 1, pp. 365–387, 2017.
  • [34] C. E. Blair and R. G. Jeroslow, “The value function of a mixed integer program: I,” Discrete Mathematics, vol. 19, no. 2, pp. 121–138, 1977.
  • [35] J. A. González and J. Castro, “A heuristic block coordinate descent approach for controlled tabular adjustment,” Computers & Operations Research, vol. 38, no. 12, pp. 1826–1835, 2011.
  • [36] S. Jäger and A. Schöbel, “The blockwise coordinate descent method for integer programs,” Mathematical Methods of Operations Research, vol. 91, no. 2, pp. 357–381, 2020.
  • [37] P. Tseng, “Convergence of a block coordinate descent method for nondifferentiable minimization,” Journal of Optimization Theory and Applications, vol. 109, no. 3, pp. 475–494, 2001.
  • [38] J. Bolte, S. Sabach, and M. Teboulle, “Proximal alternating linearized minimization for nonconvex and nonsmooth problems,” Mathematical Programming, vol. 146, no. 1, pp. 459–494, 2014.
  • [39] H. Attouch, J. Bolte, and B. F. Svaiter, “Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods,” Mathematical Programming, vol. 137, no. 1-2, pp. 91–129, 2013.
  • [40] A. Ruszczynski, Nonlinear Optimization.   Princeton University Press, 2011.
  • [41] M. M. Solomon, “Algorithms for the vehicle routing and scheduling problems with time window constraints,” Operations research, vol. 35, no. 2, pp. 254–265, 1987.
  • [42] A. Caprara, M. Fischetti, and P. Toth, “Modeling and solving the train timetabling problem,” Operations Research, vol. 50, no. 5, pp. 851–861, 2002.