
E2E-AT: A Unified Framework for Tackling Uncertainty
in Task-aware End-to-end Learning

Wangkun Xu1, Jianhong Wang2, Fei Teng1
Abstract

Successful machine learning involves a complete pipeline of data, model, and downstream applications. Instead of treating them separately, there has been a prominent increase of attention within the constrained optimization (CO) and machine learning (ML) communities towards combining prediction and optimization models. The so-called end-to-end (E2E) learning captures the task-based objective that the predictions will ultimately serve in decision making. Although a large variety of E2E algorithms have been presented, it has not been fully investigated how to systematically address the uncertainties involved in such models. Most of the existing work considers the uncertainties of ML in the input space and improves robustness through adversarial training. We extend this idea to E2E learning and prove that a robustness certification procedure exists, obtained by solving an augmented integer program. Furthermore, we show that neglecting the uncertainty of COs during training causes a new trigger for generalization errors. To include all these components, we propose a unified framework that covers the uncertainties emerging in both the input feature space of the ML models and the COs. The framework is described as a robust optimization problem and is practically solved via end-to-end adversarial training (E2E-AT). Finally, the performance of E2E-AT is evaluated by a real-world end-to-end power system operation problem, including load forecasting and sequential scheduling tasks. (Accepted by AAAI-24. Our code is available at https://github.com/xuwkk/E2E-AT.)

Introduction

ML-based prediction algorithms are widely used in real-world applications, such as load forecasting in power system operation, demand forecasting in retailing, and inventory stock forecasting in commerce. Although training such models is straightforwardly a supervised learning problem, the training criteria may not be aligned with the ultimate goal of the downstream decision-making tasks. For instance, the forecast load is further used for generator dispatch in the power system, and demand forecasts can be used to guide manufacturing. That is, the system designer first trains an ML forecaster using a standard statistical training loss, e.g. the mean squared error (MSE) or cross entropy (CE), and then applies the forecast as input to the decision making. In fact, the task-aware cost, which is modeled through a constrained optimization (CO) problem, can benefit from training a more economical forecast model. Therefore, task-aware end-to-end learning is proposed by combining ML and COs. However, the uncertainties involved in E2E learning have not been fully understood. This paper aims to bridge the gap from the perspective of robust optimization and adversarial robustness.

In the literature, E2E learning is also known as (smart) predict-then-optimize (Elmachtoub and Grigas 2022), integrated learning and optimization (Sadana et al. 2023), and decision-focused learning (Wilder, Dilkina, and Tambe 2019). In the paradigm of contextual optimization, it appears as early as (Bengio 1997) and has recently seen a surge of interest. The E2E model is integrated to minimize the task-aware cost subject to task constraints. As a result, the CO is first solved in the forward pass to obtain the optimal decision. In the backward pass, the Jacobian of the optimal decision with respect to the forecast value needs to be calculated. Among many approaches, this paper adopts the implicit differentiation method, which is more efficient than the unrolling approach (Domke 2012) and more accurate than the surrogate loss function approach (Elmachtoub and Grigas 2022). Practically, OptNet (Amos and Kolter 2017) applies the implicit function theorem (Krantz and Parks 2002) to derive the Jacobian after formulating the Karush-Kuhn-Tucker (KKT) conditions of parametric quadratic programming (QP), which is further extended to disciplined convex programming in CvxpyLayers (Agrawal et al. 2019).

E2E learning has been applied to many critical industrial activities, such as better management of the power system (Stratigakos et al. 2022; Vohra, Rajaei, and Cremer 2023), constraint satisfaction in control (Chen et al. 2021), and routing behavior of the transportation network (Liu et al. 2023). Therefore, it is essential to understand the vulnerability of this learning framework and to study the associated robustness enhancement. In detail, the E2E model contains three parts: (a) a parametric forecast model that maps the contextual information (e.g. the input feature) to the forecast of interest; (b) CO models that take the forecast as input and return decisions; and (c) a task-aware loss function that encodes the ultimate goal of decision making. Meanwhile, the parameters of COs and of the task-aware loss can be classified into predictable and unpredictable parameters. While the uncertainty of predictable parameters can be modeled by the forecast model, the uncertainty of unpredictable parameters is overlooked in the literature. Conventionally, the uncertainty of the data is defined only on the input of the forecast model, whose worst case can be found by adversarial attacks and treated by adversarial training (Madry et al. 2017). In the E2E model, the definition of data is augmented by the unpredictable parameters of COs (we merge the unpredictable parameters of the task-aware cost into the COs). Our main contribution is to treat multiple sources of uncertainty uniformly as a robust optimization problem, which can be practically solved by end-to-end adversarial training (E2E-AT). Finally, our method can be viewed as a natural interconnection between adversarial training, the implicit layer, and E2E learning.

Preliminaries

Adversarial Training and Certified Robustness

In supervised learning, a parametric model $\bm{f}(\bm{x};\bm{\theta})$ can be defined to map input $\bm{x}$ to label $\bm{y}$ for $(\bm{x},\bm{y})\in\mathcal{D}$ by minimizing the following empirical loss:

\min_{\bm{\theta}}\sum_{i\in\mathcal{D}}\mathcal{L}(\bm{f}(\bm{x}^{i};\bm{\theta}),\bm{y}^{i}) \qquad (1)

In this paper, we use $(\bm{x},\bm{y})\in\mathcal{D}$ and $i\in\mathcal{D}$ interchangeably to denote a sample of the dataset.

It has been well studied that deep neural networks are vulnerable to small perturbations of their inputs, and in security-critical applications robustness has become an essential factor. Adversarial training has been shown to improve the robustness of NNs by transforming (1) into a robust optimization (Madry et al. 2017):

\min_{\bm{\theta}}\sum_{i\in\mathcal{D}}\max_{\bm{\delta}^{i}\in\Delta}\mathcal{L}(\bm{f}(\bm{x}^{i}+\bm{\delta}^{i};\bm{\theta}),\bm{y}^{i}) \qquad (2)

where $\Delta=\{\bm{\delta}:\|\bm{\delta}\|_{\infty}\leq\epsilon\}$ is the attack budget set for some small $\epsilon>0$.

Robust optimization (2) is practically solved by iterative approaches in which the adversarial perturbation is first found by solving the inner maximization through gradient ascent. To keep the attack within $\Delta$, projected gradient descent (PGD) (Madry et al. 2017) is adopted:

\bm{\delta}^{i}_{t+1}=\mathcal{P}_{\Delta}\left(\bm{\delta}^{i}_{t}+\gamma\cdot\operatorname{sign}\left(\nabla_{\bm{\delta}}\mathcal{L}\left(\bm{f}\left(\bm{x}^{i}+\bm{\delta}^{i}_{t}\right),\bm{y}^{i}\right)\right)\right) \qquad (3)

where $\mathcal{P}_{\Delta}$ is the projection onto $\Delta$ and $\gamma$ is the step size.
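As a concrete illustration, the following is a minimal PyTorch sketch of the PGD update (3) under an $\ell_{\infty}$ budget; `model` and `loss_fn` stand for the forecaster $\bm{f}$ and the loss $\mathcal{L}$, and the zero initialization of the perturbation is an assumption rather than an exact implementation.

```python
import torch

def pgd_attack(model, loss_fn, x, y, eps=0.02, step=0.005, n_steps=7):
    """PGD (3): gradient ascent on the perturbation, projected onto
    the l-infinity ball Delta = {delta : ||delta||_inf <= eps}."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(n_steps):
        loss = loss_fn(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += step * grad.sign()   # ascent step of size gamma
            delta.clamp_(-eps, eps)       # projection P_Delta
    return delta.detach()
```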

Although adversarial training is effective in improving robustness, it may give a false sense of security, as the inner maximization is solved only inexactly. A certification approach can be applied to feedforward neural networks with ReLU activations, i.e., piecewise linear neural networks (convolutional layers and other piecewise linear activations, such as leaky ReLU, are also piecewise linear). Using the big-M method, a neural network with $d$ layers can be represented by the following set (Tjeng, Xiao, and Tedrake 2017):

\mathcal{C}_{\text{nn}}(\bm{x};\bm{\theta})=\left\{\bm{y}:\begin{array}{l}\bm{z}_{1}=\bm{x},\;\bm{z}_{i+1}\geq\bm{W}_{i}\bm{z}_{i}+\bm{b}_{i},\;\bm{z}_{i+1}\geq\bm{0},\\ \bm{u}_{i}\cdot\bm{v}_{i}\geq\bm{z}_{i+1},\\ \bm{W}_{i}\bm{z}_{i}+\bm{b}_{i}\geq\bm{z}_{i+1}+(\bm{1}-\bm{v}_{i})\cdot\bm{l}_{i},\\ \bm{v}_{i}\in\{0,1\}^{|\bm{v}_{i}|},\quad i=1,\cdots,d-2,\\ \bm{y}=\bm{W}_{d-1}\bm{z}_{d-1}+\bm{b}_{d-1}\end{array}\right\} \qquad (4)

where $\bm{\theta}_{i}=(\bm{W}_{i},\bm{b}_{i})$ are the weights of the $i$-th layer, and $\bm{u}_{i}$ and $\bm{l}_{i}$ are the upper and lower bounds of the output of the $i$-th layer, which can be efficiently estimated by interval bound propagation (IBP) (Gowal et al. 2018). $\bm{v}_{i}$ is a binary vector that controls the activation of the ReLUs.

Based on (4), the inner maximization of (2) becomes:

\max_{\bm{\delta}^{i}}\mathcal{L}(\hat{\bm{y}}^{i},\bm{y}^{i}),\ \text{subject to}\ \bm{\delta}^{i}\in\Delta,\;\hat{\bm{y}}^{i}\in\mathcal{C}_{\text{nn}}(\bm{x}^{i}+\bm{\delta}^{i};\bm{\theta}) \qquad (5)

As $\mathcal{C}_{\text{nn}}(\bm{x};\bm{\theta})$ is defined by mixed-integer linear constraints, (5) can be solved exactly for certain types of loss functions. When the cross-entropy loss is used, one can maximize each logit in turn; if no logit can exceed the ground-truth logit within the attack budget, the NN is certified robust at the candidate sample. For a regression task, (5) can be formulated as a mixed-integer linear program (MILP) that maximizes or minimizes the forecast value (Xu and Teng 2023).
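To make this concrete, below is a hedged gurobipy sketch that encodes (4) with big-M constraints and maximizes a single network output over the $\ell_{\infty}$ ball, in the spirit of (5); the function name, the list-based weight format, and the IBP bounds argument are illustrative assumptions, not the paper's exact implementation.

```python
import gurobipy as gp
from gurobipy import GRB

def certify_max_output(weights, biases, x, eps, bounds, out_idx):
    """Maximize output `out_idx` of a ReLU network over ||delta||_inf <= eps
    using the big-M encoding (4); `bounds[i]` holds the (assumed) IBP
    pre-activation bounds (l_i, u_i) of hidden layer i."""
    m = gp.Model("certify")
    z = [m.addVar(lb=xk - eps, ub=xk + eps) for xk in x]  # z_1 = x + delta
    for i, (W, b) in enumerate(zip(weights[:-1], biases[:-1])):
        l, u = bounds[i]
        z_next = []
        for j in range(len(b)):
            pre = gp.quicksum(W[j][k] * z[k] for k in range(len(z))) + b[j]
            zj = m.addVar(lb=0.0)                   # post-ReLU activation
            vj = m.addVar(vtype=GRB.BINARY)         # ReLU on/off indicator
            m.addConstr(zj >= pre)                  # z_{i+1} >= W z + b
            m.addConstr(zj <= pre - l[j] * (1 - vj))
            m.addConstr(zj <= u[j] * vj)            # u * v >= z_{i+1}
            z_next.append(zj)
        z = z_next
    W, b = weights[-1], biases[-1]
    m.setObjective(gp.quicksum(W[out_idx][k] * z[k] for k in range(len(z)))
                   + b[out_idx], GRB.MAXIMIZE)
    m.optimize()
    return m.ObjVal
```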

End-to-End Machine Learning

End-to-end learning fuses prediction (ML) and decision making (CO): it aims to learn the map from data to the optimal decision so that the trained ML model optimally reflects the CO. E2E learning thus distinguishes itself from separate supervised learning followed by the CO, in which the prediction divergence and the task-aware cost can have a large mismatch (Sadana et al. 2023).

Referring to a recent review (Kotary et al. 2021), the CO can be modelled as

\bm{z}^{\star}=\operatorname{argmin}_{\bm{z}}\ \ell(\bm{z};\bm{y})\ \text{ subject to }\ \bm{z}\in\mathcal{C}(\bm{z};\bm{y}) \qquad (6)

where $\ell(\cdot)$ is the objective function, $\bm{z}$ is the decision variable, $\bm{y}$ is the parameter, and $\mathcal{C}$ is the constraint set.

Let $\hat{\bm{y}}$ be the output of $\bm{f}(\bm{x};\bm{\theta})$ and let $\hat{\bm{z}}^{\star}$ be the minimizer of (6) parameterized by $\hat{\bm{y}}$. There are two options to learn the E2E model guided by (6). In the supervised setting, a suitable loss function $\mathcal{M}$ can be chosen to minimize the difference to the ground-truth decision, e.g. $\mathcal{M}(\hat{\bm{z}}^{\star},\bm{z}^{\star})$ or $\mathcal{M}(\ell(\hat{\bm{z}}^{\star};\hat{\bm{y}}),\ell(\bm{z}^{\star};\bm{y}))$ (Kong et al. 2022). Alternatively, the regret function (Elmachtoub and Grigas 2022) can be implemented:

\operatorname{regret}=\ell(\hat{\bm{z}}^{\star};\hat{\bm{y}})-\ell(\bm{z}^{\star};\bm{y}) \qquad (7)

The ground-truth decision $\bm{z}^{\star}$ in (7) need not be computed, as $\ell(\bm{z}^{\star};\bm{y})$ is a constant (Wilder, Dilkina, and Tambe 2019). Therefore, we argue that the regret function (7) is more intrinsic, as the training loss has the same form as the CO objective (6).

A Heuristic Example

Before we move into a detailed formulation of the robustness of E2E learning, we first highlight a misleading formulation, which violates the intention of E2E learning and introduces model uncertainties between training and inference.

Consider two E2E learning problems based on the supervised setting:

\begin{array}{rl}\operatorname{argmin}_{\bm{\theta},\hat{\bm{z}}} & M(\hat{\bm{z}}):=\sum_{i\in\mathcal{D}}\mathcal{M}(\ell(\hat{\bm{z}}^{i};\hat{\bm{y}}^{i}),\ell(\bm{z}^{i\star};\bm{y}^{i}))\\ \text{subject to} & \hat{\bm{z}}^{i}\in\mathcal{C}(\bm{z}^{i};\hat{\bm{y}}^{i}),\quad i\in\mathcal{D}\end{array} \qquad (8)

whose minimum point is denoted as $(\bm{\theta}^{\star}_{1},\hat{\bm{z}}^{\star}_{1})$. And,

\begin{array}{rl}\operatorname{argmin}_{\bm{\theta},\hat{\bm{z}}} & M(\hat{\bm{z}}):=\sum_{i\in\mathcal{D}}\mathcal{M}(\ell(\hat{\bm{z}}^{i};\hat{\bm{y}}^{i}),\ell(\bm{z}^{i\star};\bm{y}^{i}))\\ \text{subject to} & \hat{\bm{z}}^{i}\in\operatorname{argmin}_{\bm{z}^{i}}\{\ell(\bm{z}^{i};\hat{\bm{y}}^{i}):\bm{z}^{i}\in\mathcal{C}(\bm{z}^{i};\hat{\bm{y}}^{i})\},\quad i\in\mathcal{D}\end{array} \qquad (9)

with minimum point $(\bm{\theta}^{\star}_{2},\hat{\bm{z}}^{\star}_{2})$.

E2E learning problem (8) takes the sample-wise constraints of (6) as its own constraints, while (9) is a bilevel optimization problem whose upper level minimizes the difference to the ground-truth objective $\ell(\bm{z}^{i\star};\bm{y}^{i})$ and whose lower levels solve the sample-wise decision-making problems in parallel.

In the inference stage, we first predict $\hat{\bm{y}}=\bm{f}(\bm{x};\bm{\theta}^{\star})$ and then solve the optimization problem (6) parameterized by $\hat{\bm{y}}$. The optimal decision variables are denoted as $\hat{\bm{z}}^{\star}_{r}(\bm{\theta}^{\star}_{1})$ and $\hat{\bm{z}}^{\star}_{r}(\bm{\theta}^{\star}_{2})$, respectively.

Proposition 1

Given the two formulations (8) and (9) of E2E learning, $M(\hat{\bm{z}}_{1}^{\star})\leq M(\hat{\bm{z}}_{2}^{\star})=M(\hat{\bm{z}}_{r}^{\star}(\bm{\theta}_{2}^{\star}))\leq M(\hat{\bm{z}}_{r}^{\star}(\bm{\theta}_{1}^{\star}))$.

The proof can be found in the appendix.

It shows that formulation (8) can result in different decisions at the training and inference stages. Therefore, Proposition 1 demonstrates that a misformulated CO can mislead decision making at inference. Broadly speaking, ignoring the optimization model introduces a new source of uncertainty, causing generalization error, in addition to the well-studied uncertainties in the dataset (e.g. out-of-distribution samples, adversarial attacks, etc.). This motivates us to account for the uncertainties in both the input space and the CO formulation.

Sequential E2E Learning as Multi-Level Optimization

Proposition 1 implies formulating E2E learning as a multilevel optimization problem that respects the sequence of downstream tasks. Given $\bm{\theta}$, the sequential decision-making tasks can be denoted as lower-level problems following the prediction.

Formally, at the inference stage, the cost of one-time decision making on $(\bm{x},\bm{y})\in\mathcal{D}$ can be written as:

\begin{array}{rl}\min & \mathcal{L}(\hat{\bm{z}}_{1},\cdots,\hat{\bm{z}}_{m};\bm{y},\bm{\phi}_{0})\\ \text{subject to} & \hat{\bm{z}}_{i}\in\operatorname{argmin}_{\bm{z}_{i}}\{\ell_{i}(\bm{z}_{i};\bm{y},\bm{\phi}_{i}):\bm{z}_{i}\in\mathcal{C}_{i}(\bm{z}_{i};\bm{y},\hat{\bm{z}}_{i-1},\bm{\phi}_{i})\},\quad i=2,\cdots,m\\ & \hat{\bm{z}}_{1}\in\operatorname{argmin}_{\bm{z}_{1}}\{\ell_{1}(\bm{z}_{1};\bm{y},\bm{\phi}_{1}):\bm{z}_{1}\in\mathcal{C}_{1}(\bm{z}_{1};\bm{y},\hat{\bm{y}},\bm{\phi}_{1})\}\\ & \hat{\bm{y}}=\bm{f}(\bm{x};\bm{\theta})\end{array} \qquad (10)

Inference problem (10) contains $m$ downstream tasks with objectives $\ell_{i}(\cdot)$, constraint sets $\mathcal{C}_{i}(\cdot)$, and unpredictable parameters $\bm{\phi}_{i}$. It can be compactly denoted as

\begin{array}{ll}\min_{\hat{\bm{z}}} & \mathcal{L}(\hat{\bm{z}};\bm{\theta},\bm{x},\bm{y},\bm{\phi})\\ \text{subject to} & \hat{\bm{z}}\in\mathcal{C}_{\text{E2E}}(\bm{z};\bm{\theta},\bm{x},\bm{y},\bm{\phi})\end{array} \qquad (11)

where $\hat{\bm{z}}=[\hat{\bm{z}}_{1},\cdots,\hat{\bm{z}}_{m}]$ and $\bm{\phi}=[\bm{\phi}_{0},\cdots,\bm{\phi}_{m}]$. $\mathcal{C}_{\text{E2E}}(\cdot)$ collects all the constraints of (10). Note that we treat the COs as parametric functions that are uniquely determined by $(\bm{\theta},\bm{x},\bm{y},\bm{\phi})$.

To train $\bm{f}(\cdot;\bm{\theta})$, the empirical training loss can be minimized:

\begin{array}{ll}\min_{\bm{\theta}} & \sum_{i\in\mathcal{D}}\mathcal{L}(\hat{\bm{z}}^{i};\bm{\theta},\bm{x}^{i},\bm{y}^{i},\bm{\phi})\\ \text{subject to} & \hat{\bm{z}}^{i}\in\mathcal{C}_{\text{E2E}}(\bm{z}^{i};\bm{\theta},\bm{x}^{i},\bm{y}^{i},\bm{\phi}),\quad i\in\mathcal{D}\end{array} \qquad (12)

Since this paper does not focus on solving optimizations, we restrict the downstream optimizations to quadratic programming (QP). QP has been widely implemented in many industrial applications and is the most discussed class in the E2E learning literature (Kotary et al. 2021). Since QP is convex, if the Slater condition holds, the KKT conditions are sufficient for optimality (Boyd and Vandenberghe 2004). The lower-level optimizations can thus be replaced by their KKT conditions, yielding a mathematical program with equilibrium constraints (MPEC) (Luo, Pang, and Ralph 1996). Therefore, if a linear parametric model is considered, it is possible to solve (12) exactly using optimization software.

For an NN model, in contrast to the linear case, stochastic gradient descent (SGD) needs to be applied on mini-batches of $\mathcal{D}$. SGD requires 1) a forward pass in which the optimizations are solved and 2) a backward pass to update the NN parameters. Denote the equality part of the KKT conditions of the $i$-th optimization as

\bm{g}(\bm{z}^{\star}_{i};\hat{\bm{z}}_{i-1},\bm{y},\bm{\phi}_{i})=\bm{0} \qquad (13)

which can be viewed as a differentiable layer (OptNet) (Amos and Kolter 2017) by the implicit function theorem (Krantz and Parks 2002):

\frac{\partial\bm{z}_{i}^{\star}}{\partial\hat{\bm{z}}_{i-1}}=-\left(\frac{\partial\bm{g}(\bm{z}^{\star}_{i};\hat{\bm{z}}_{i-1},\bm{y},\bm{\phi}_{i})}{\partial\bm{z}_{i}}\right)^{-1}\frac{\partial\bm{g}(\bm{z}^{\star}_{i};\hat{\bm{z}}_{i-1},\bm{y},\bm{\phi}_{i})}{\partial\hat{\bm{z}}_{i-1}} \qquad (14)

Note that in (13) and (14), the optimal dual variables $(\bm{\lambda}^{\star},\bm{\nu}^{\star})$ of the KKT conditions are omitted for simplicity. As long as the Jacobian matrix is not singular, the gradient of the output $\bm{z}_{i}^{\star}$ with respect to the input $\hat{\bm{z}}_{i-1}$ exists, allowing backpropagation through the differentiable layers.
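This forward/backward pattern is what CvxpyLayers exposes. The toy QP layer below is an illustrative sketch, not the paper's dispatch problem: the problem is affine in its parameters, so the layer solves the QP in the forward pass and backpropagates through the KKT system via implicit differentiation (14).

```python
import cvxpy as cp
import torch
from cvxpylayers.torch import CvxpyLayer

n, m = 5, 3
z = cp.Variable(n)
q = cp.Parameter(n)                        # predictable parameter (forecast)
y = cp.Parameter(m)                        # right-hand side
A = torch.randn(m, n).numpy()
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(z) + q @ z),
                  [A @ z <= y])
layer = CvxpyLayer(prob, parameters=[q, y], variables=[z])

q_t = torch.randn(n, requires_grad=True)
y_t = torch.ones(m, requires_grad=True)    # y = 1 keeps the QP feasible
z_star, = layer(q_t, y_t)                  # forward: solve the QP
z_star.sum().backward()                    # backward: implicit differentiation
```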

Unified Robustness Framework

Figure 1: An illustration of E2E learning. We consider the uncertainties in the input data $(\bm{x},\bm{y})\in\mathcal{D}$ and the uncertainties in the COs, specifically, the unpredictable parameter $\bm{\phi}$.

When treating the E2E framework as an integrated model, the data source includes both conventionally defined data samples $(\bm{x},\bm{y})\in\mathcal{D}$ and the unpredictable parameter $\bm{\phi}$ of the COs. Small input uncertainties have been shown to cause a significant performance drop, and it is reasonable to draw a similar conclusion for the unpredictable parameter. In fact, Proposition 1 shows that any mismatch between the COs used for training and inference should be explicitly considered. Although COs can take an infinite number of formulations, without loss of generality, we restrict the uncertainties of COs to the parameters of the objective and constraints. We argue that the unpredictable parameter used during training may not be the same as that used for real-time decision making. For example, in power system operation, the production costs of the generators can vary over time, and the resistance and susceptance of transmission lines can be altered both intentionally and unintentionally. These parameters are usually not known to the system operator in advance, or at least not fully known when training the forecast model. See Fig. 1 for an illustration.

Formulation

Consider the uncertainty of the input, $\bm{x}+\bm{\delta}_{x}\in\mathcal{X}$, and of the unpredictable parameter, $\bm{\phi}+\bm{\delta}_{\phi}\in\Phi$. Denote $\bm{\psi}=(\bm{x},\bm{\phi})\in\Psi:=\mathcal{X}\times\Phi$. The worst-case scenario, which maximizes the task-aware objective, can be formulated as

\begin{array}{ll}\max_{\bm{\psi}\in\Psi} & \mathcal{L}(\hat{\bm{z}};\bm{\theta},\bm{\psi},\bm{y})\\ \text{subject to} & \hat{\bm{z}}\in\mathcal{C}_{\text{E2E}}(\bm{z};\bm{\theta},\bm{\psi},\bm{y})\end{array} \qquad (15)

in which $\bm{\theta}$ is fixed.

Consequently, a robust optimization can be formulated. A unified E2E adversarial training (E2E-AT) considering both input and CO uncertainties becomes

\begin{array}{rl}\min_{\bm{\theta}} & \sum_{i\in\mathcal{D}}\max_{\bm{\psi}^{i}\in\Psi^{i}}\mathcal{L}(\hat{\bm{z}}^{i};\bm{\theta},\bm{\psi}^{i},\bm{y}^{i})\\ \text{subject to} & \hat{\bm{z}}^{i}\in\mathcal{C}_{\text{E2E}}(\bm{z}^{i};\bm{\theta},\bm{\psi}^{i},\bm{y}^{i}),\quad i\in\mathcal{D}\end{array} \qquad (16)

where the constraints apply to both the minimization and the maximization. Adversarial training can be adopted to solve (16). By the implicit function theorem (14), the gradient through the constraints exists, regardless of minimization or maximization. Therefore, Danskin's theorem can be used: first solve the inner maximization by gradient ascent (with $\bm{\theta}$ fixed), then take an outer gradient descent step (with $\bm{\psi}$ fixed). Although Danskin's theorem strictly requires the inner maximization to be solved exactly, a suboptimal $\bm{\psi}$, e.g. obtained by PGD (3), still gives a descent direction, and the scheme is applicable to various adversarial training algorithms (Madry et al. 2017; Zhang et al. 2019; Dong et al. 2020).
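The following is a minimal sketch of one E2E-AT step for (16): PGD ascent on both the input and the unpredictable parameter, followed by a model update via Danskin's theorem. The helper `e2e_cost`, which differentiably maps (model, x, y, phi) to the task-aware cost through the forecaster and the CO layers, is a hypothetical placeholder; the proportional budget on phi mirrors the experimental setting described later.

```python
import torch

def e2e_at_step(model, opt, e2e_cost, x, y, phi, eps_x, eps_phi, n_pgd=7):
    """One mini-batch step of E2E-AT (16)."""
    dx = torch.zeros_like(x, requires_grad=True)
    dphi = torch.zeros_like(phi, requires_grad=True)
    for _ in range(n_pgd):                          # inner maximization
        cost = e2e_cost(model, x + dx, y, phi + dphi)
        gx, gphi = torch.autograd.grad(cost, (dx, dphi))
        with torch.no_grad():
            dx += (2 * eps_x / n_pgd) * gx.sign()
            dx.clamp_(-eps_x, eps_x)                # l-inf input budget
            dphi += (2 * eps_phi / n_pgd) * gphi.sign()
            dphi.clamp_(-eps_phi * phi.abs(), eps_phi * phi.abs())
    opt.zero_grad()                                 # outer minimization
    e2e_cost(model, x + dx.detach(), y, phi + dphi.detach()).backward()
    opt.step()
```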

Certified Robustness

Although PGD (3) is effective for E2E-AT, it cannot verify robustness, as it only finds a local maximum (Xiao et al. 2018). In particular, a robustness certification on (11) verifies whether an adversarial example exists within the budget $\Delta$ such that the task-aware objective is altered by a certain amount. The key is to find the exact adversarial attack in (15). We show that for specific types of objectives and COs (e.g. affine-parametric QPs), the optimal solution to (15) can be found exactly, which extends the certified robustness of piecewise linear neural networks (4).

Proposition 2

The affine-parametric QP:

\begin{array}{rll}\bm{z}_{i+1}:= & \operatorname{argmin}_{\bm{z}} & \frac{1}{2}\bm{z}^{T}\bm{Q}\bm{z}+\bm{q}^{T}\bm{z}\\ & \text{subject to} & \bm{A}\bm{z}+\bm{G}\bm{z}_{i}\leq\bm{b}\\ & & \bm{C}\bm{z}+\bm{H}\bm{z}_{i}=\bm{d}\end{array} \qquad (17)

can be equivalently written as the set of mixed integer linear constraints:

\begin{array}{l}\bm{Q}\bm{z}_{i+1}+\bm{q}+\bm{A}^{T}\bm{\lambda}_{i+1}+\bm{C}^{T}\bm{\nu}_{i+1}=\bm{0}\\ \bm{C}\bm{z}_{i+1}+\bm{H}\bm{z}_{i}-\bm{d}=\bm{0}\\ \bm{\lambda}_{i+1}\geq\bm{0}\\ \bm{A}\bm{z}_{i+1}+\bm{G}\bm{z}_{i}-\bm{b}\leq\bm{0}\\ \bm{\lambda}_{i+1}\leq\bm{\varphi}M,\quad \bm{A}\bm{z}_{i+1}+\bm{G}\bm{z}_{i}-\bm{b}\geq(\bm{\varphi}-\bm{1})M,\quad \bm{\varphi}\in\{0,1\}^{|\bm{\varphi}|}\end{array} \qquad (18)

where $\bm{Q},\bm{q},\bm{A},\bm{G},\bm{b},\bm{C},\bm{H},\bm{d}$ are parameters of proper dimensions, $\bm{\varphi}$ is a binary vector, and $M$ is a large positive number. The equalities and inequalities are element-wise.

The proof can be found in the appendix.

Due to the complexity of integer programming, we restrict the original setting in (Amos and Kolter 2017) by assuming linearity in the uncertain terms for certified robustness. For example, when the uncertainty in the input feature is considered, the parameter $\bm{z}_{i}$, which represents the optimal decision of the previous task, is decoupled from the variable $\bm{z}$ and enters affinely, so that the reformulation is linear. When the uncertainty of CO is considered, we assume that the uncertain unpredictable parameter is decoupled from the variable as well. We note that this setting follows disciplined parametrized programming (DPP) (Agrawal et al. 2019).

Since both the affine-parametric QP (17) and the neural network (4) can be represented by mixed-integer linear constraints, maximizing an affine function subject to these constraints becomes a mixed-integer linear program (MILP), which can be solved effectively and used to certify the worst possible cost.

Discussion on the Robustness

Previously, the uncertainty involved in COs has been considered in many E2E learning algorithms. From the perspective of contextual optimization, E2E learning applies ML to predict the uncertain parameter (Sadana et al. 2023). In such a setting, probabilistic forecasts and stochastic programs can be implemented (Donti, Amos, and Kolter 2017). However, we view the entire E2E pipeline as an integrated model such that the uncertainty of the intermediate variable can be merged into the E2E training objective. Alternatively, we separate the parameters of COs into two parts: the first part (predictable parameters) is forecast by the ML model, while the second part is unpredictable. Indirectly, the uncertainty of the predictable parameters is tackled by adversarial training on the input feature, while the uncertainty of the unpredictable parameters needs to be tackled as well.

We note that the physical meaning of the uncertainty of the unpredictable parameter differs from that of the adversarial attack in the input feature space. Although the ultimate goal is to improve the robustness of the ML model, the adversarial attack assumes that a malicious party exists that can find the worst attack, whereas the uncertainty of the unpredictable parameter exists regardless of any malicious party. Inspired by previous work (Donti et al. 2021; Agarwal et al. 2022), we can alternatively view E2E-AT on the unpredictable parameter as the following optimization problem:

\begin{array}{rl}\min_{\bm{\theta}} & \alpha\sum_{i\in\mathcal{D}}\mathcal{L}(\bar{\bm{z}}^{i};\bm{\theta},\bm{x}^{i},\bm{y}^{i},\bar{\bm{\phi}})+(1-\alpha)\sum_{i\in\mathcal{D}}\mathbf{E}_{(\hat{\bm{z}}^{i},\bm{\phi}^{i})}[\mathcal{L}(\hat{\bm{z}}^{i};\bm{\theta},\bm{x}^{i},\bm{y}^{i},\bm{\phi}^{i})]\\ \text{subject to} & \bar{\bm{z}}^{i}\in\mathcal{C}_{\text{E2E}}(\bm{z}^{i};\bm{\theta},\bm{x}^{i},\bm{y}^{i},\bar{\bm{\phi}}),\quad i\in\mathcal{D}\\ & \hat{\bm{z}}^{i}\in\mathcal{C}_{\text{E2E}}(\bm{z}^{i};\bm{\theta},\bm{x}^{i},\bm{y}^{i},\bm{\phi}^{i}),\quad \bm{\phi}^{i}\in\Phi^{i},\;i\in\mathcal{D}\end{array} \qquad (19)

where $\bar{\bm{z}}^{i}$ is the decision variable of the COs parameterized by the nominal unpredictable parameter $\bar{\bm{\phi}}$.

That is, the NN forecaster is trained by considering the expected task-aware cost under uncertainty. The learning objective (19) takes the nominal unpredictable parameter (denoted as $\bar{\bm{\phi}}$) and the expected uncertainties over $\Phi^{i}$ into account, balanced by the hyperparameter $\alpha$. As shown in (19), this stochastic program can be solved by sampling $\bm{\phi}^{i}\in\Phi^{i}$ during training, as sketched below. In addition, if the uncertainty set of the unpredictable parameter is independent of the sample, $\Phi$ is not subject to the index $i$.
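One draw of $\bm{\phi}^{i}$ in this sampling step could look as follows; the proportional box-shaped uncertainty set is an assumption matching the susceptance budget used later.

```python
import torch

def sample_phi(phi_nominal, eps_phi):
    """Draw phi uniformly from a box around the nominal unpredictable
    parameter, with a half-width proportional to each nominal value."""
    noise = (2 * torch.rand_like(phi_nominal) - 1) * eps_phi
    return phi_nominal * (1 + noise)
```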

Final Training Objective

In E2E-AT, (19) is solved as a robust optimization by taking the maximum over $\bm{\psi}^{i}\in\Psi^{i}$, as in (16). We also take the input uncertainty into account:

\begin{array}{rl}\min_{\bm{\theta}} & \alpha\cdot\sum_{i\in\mathcal{D}}\mathcal{L}(\bar{\bm{z}}^{i};\bm{\theta},\bar{\bm{\psi}}^{i},\bm{y}^{i})+(1-\alpha)\cdot\sum_{i\in\mathcal{D}}\max_{\bm{\psi}^{i}\in\Psi^{i}}\mathcal{L}(\hat{\bm{z}}^{i};\bm{\theta},\bm{\psi}^{i},\bm{y}^{i})\\ \text{subject to} & \bar{\bm{z}}^{i}\in\mathcal{C}_{\text{E2E}}(\bm{z}^{i};\bm{\theta},\bar{\bm{\psi}}^{i},\bm{y}^{i}),\quad i\in\mathcal{D}\\ & \hat{\bm{z}}^{i}\in\mathcal{C}_{\text{E2E}}(\bm{z}^{i};\bm{\theta},\bm{\psi}^{i},\bm{y}^{i}),\quad i\in\mathcal{D}\end{array} \qquad (20)

The new adversarial training objective provides an upper bound on the expectation part of (19). Meanwhile, similar to adversarial training on image tasks (Zhang et al. 2019), $\alpha$ can be used to balance the clean and adversarial accuracies. When $\alpha\rightarrow 1$, (20) becomes the original E2E learning, and when $\alpha\rightarrow 0$, it becomes pure adversarial training. In image tasks, the clean and adversarial training losses may not directly reflect the clean and robust accuracy, causing an unbalanced training objective; in E2E-AT, the two objectives are defined by the task itself, which is the exact metric used during decision making.

Previously, (Donti et al. 2021; Agarwal et al. 2022) reformulated a bilevel optimization problem into a robust optimization and used the implicit function theorem for constraint satisfaction. Connecting (19) to (20), we extend (Donti et al. 2021) to training ML models: for each mini-batch in E2E-AT, a robust optimization similar to (Donti et al. 2021) is solved, and the implicit function theorem is likewise used to learn the task-aware objective while satisfying the constraints.

‘Free’ E2E Adversarial Training

In E2E-AT, the number of forward and backward passes equals $\texttt{no\_batch}\times(\texttt{no\_pgd}+1)\times\texttt{no\_epoch}$ (assuming that all mini-batches have the same size). Adversarial training is computationally expensive due to intensive backpropagation (controlled by the complexity of the neural network). The burden is even higher in E2E-AT, as the COs need to be solved in each forward pass (controlled by the complexity of the COs). To save training time, we adopt the gradient-reuse strategy of 'adversarial training for free' (Shafahi et al. 2019): the attack vector and the model parameters are repeatedly updated within the same mini-batch for no_pgd times, and no_epoch is divided by no_pgd to keep the total number of model updates unchanged. This results in $\texttt{no\_batch}\times\texttt{no\_epoch}$ forward and backward passes, the same as clean E2E training.
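A simplified sketch of this 'free' scheme is shown below. Unlike (Shafahi et al. 2019), which carries the perturbation across mini-batches, this variant resets it per batch for clarity, and `e2e_cost` is the same hypothetical differentiable task-cost helper as above.

```python
import torch

def free_e2e_at_epoch(model, opt, e2e_cost, loader, eps, n_replay=7):
    """'Free' E2E-AT: replay each mini-batch n_replay times; one backward
    pass provides the gradient for both the attack and the model update."""
    for x, y, phi in loader:
        delta = torch.zeros_like(x, requires_grad=True)
        for _ in range(n_replay):
            opt.zero_grad()
            e2e_cost(model, x + delta, y, phi).backward()
            with torch.no_grad():
                delta += eps * delta.grad.sign()   # reuse gradient for attack
                delta.clamp_(-eps, eps)
                delta.grad.zero_()
            opt.step()                             # model update, same pass
    # run for no_epoch / n_replay epochs so model updates stay unchanged
```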

Experiment

In the experiment, the NN is trained to forecast the load in the power system. The robustness of various E2E-AT settings is explored. Detailed experiment settings and results can be found in the Appendix.

Power System Operations

A practical power system operation problem, network-constrained economic dispatch (NCED), is considered; it has been widely used in the US and can be formulated as a two-stage QP or LP (Conejo and Baringo 2018). In stage one (also known as dispatch), the generator set points are determined based on the forecast load; the goal is to minimize the generation cost while meeting the physical constraints of the grid. Once the generators have been dispatched, we consider a realization of the actual load by solving the second-stage problem (also known as redispatch), in which any mismatch between load and stage-one generation, as well as any violation of the grid's physical constraints, is penalized by an extra cost:

\bm{P}_{g}^{\star}=\text{Dispatch}(\bm{f}(\bm{x};\bm{\theta}),\bm{b}) \qquad (21a)
\bm{P}_{ls}^{\star},\bm{P}_{gs}^{\star}=\text{Redispatch}(\bm{y},\bm{P}_{g}^{\star},\bm{b}) \qquad (21b)

where $\bm{P}_{g}^{\star}$ is the optimal generator dispatch, $\bm{P}_{ls}^{\star}$ is the load shedding, and $\bm{P}_{gs}^{\star}$ is the power storage. $\bm{b}$ is the susceptance of the transmission lines (when the resistance is close to zero, the susceptance is the reciprocal of the reactance). The task-aware objective is defined as

\mathcal{L}(\bm{\theta})=\bm{c}_{g}^{T}\bm{P}_{g}+\bm{c}_{ls}^{T}\bm{P}_{ls}+\bm{c}_{gs}^{T}\bm{P}_{gs} \qquad (22)

where $\bm{c}_{g}$, $\bm{c}_{ls}$, and $\bm{c}_{gs}$ are cost coefficients such that $\bm{c}_{ls}\gg\bm{c}_{gs}\gg\bm{c}_{g}$; that is, load shedding is the most costly.

Training Settings

Table 1: Performances of the E2E-AT (task-aware costs). Each row gives the training setting $(\epsilon_{x},\epsilon_{\phi},\alpha)$; the column groups give the clean cost, the input attack costs ($\epsilon_{x}=0.01, 0.02, 0.03$), the CO attack costs ($\epsilon_{\phi}=0.05, 0.10, 0.15$), and the integrated attack costs ($(\epsilon_{x},\epsilon_{\phi})=(0.01,0.05), (0.02,0.10), (0.03,0.15)$).

$\epsilon_{x}$, $\epsilon_{\phi}$, $\alpha$ | Clean | Input Attack | CO Attack | Integrated Attack

NAT: Natural Training with Task Loss
0, 0, 0 | 201.6 | 653.6, 1188.5, 1792.1 | 754.6, 2044.3, 3468.4 | 1153.9, 3090.8, 5214.4

AT-MSE: Adversarial Training with MSE Loss
0.02, N/A, N/A | 396.5 | 676.6, 978.3, 1278.4 | 1223.8, 2576.7, 3923.6 | 1475.4, 3120.1, 4778.3
0.03, N/A, N/A | 437.0 | 653.6, 886.3, 1125.5 | 1208.8, 2517.2, 3881.6 | 1410.5, 2956.5, 4624.0

AT-INPUT: Adversarial Training with Task Loss on the Input Uncertainty
0.02, 0, 1.0 | 219.1 | 445.2, 745.0, 1106.1 | 672.4, 1929.0, 3334.0 | 934.4, 2706.3, 4854.5
0.02, 0, 0.5 | 203.1 | 438.9, 790.1, 1246.7 | 660.7, 1932.7, 3349.6 | 914.8, 2763.9, 4930.5
0.03, 0, 1.0 | 239.4 | 448.6, 705.0, 1022.1 | 638.7, 1787.5, 3220.6 | 855.5, 2445.9, 4507.0
0.03, 0, 0.5 | 227.6 | 556.8, 976.6, 1435.3 | 783.7, 2074.9, 3503.1 | 1106.8, 2948.5, 4999.8

AT-PARA: Adversarial Training with Task Loss on the CO Uncertainty
0, 0.05, 1.0 | 212.3 | 398.8, 726.8, 1197.4 | 214.9, 590.1, 1931.3 | 399.1, 939.1, 2620.0
0, 0.05, 0.5 | 206.7 | 419.3, 798.2, 1299.9 | 215.6, 741.2, 2120.4 | 419.8, 1114.8, 3072.6
0, 0.15, 1.0 | 214.0 | 488.7, 900.1, 1434.5 | 214.0, 221.0, 453.2 | 493.0, 925.8, 1502.0
0, 0.15, 0.5 | 216.5 | 423.5, 804.6, 1309.8 | 216.5, 223.1, 465.6 | 429.6, 817.8, 1324.6

AT-BOTH: Adversarial Training with Task Loss on the Integrated Uncertainties
0.02, 0.05, 1.0 | 228.2 | 399.5, 655.7, 992.0 | 254.2, 823.5, 2156.7 | 400.5, 1128.7, 2956.2
0.02, 0.05, 0.5 | 212.1 | 490.1, 879.9, 1336.5 | 289.4, 1257.9, 2711.8 | 516.0, 1793.9, 3903.8
0.03, 0.15, 1.0 | 274.7 | 551.0, 856.5, 1226.3 | 274.7, 279.0, 495.7 | 550.6, 883.4, 1274.0
0.03, 0.15, 0.5 | 239.2 | 429.1, 666.3, 977.6 | 239.2, 250.8, 508.3 | 429.0, 667.6, 1053.7

We use an open-source load forecasting dataset from the Texas Backbone Power System (Lu et al. 2023) on a modified IEEE bus-14 system. We randomly collect 1.0k samples and use a feedforward neural network with three hidden layers to forecast the load (more experiments can be found in the appendix). We run E2E-AT with (a) input feature uncertainties $\bm{\delta}_{x}$, (b) uncertainties of the unpredictable parameters $\bm{\delta}_{\phi}$, and (c) integrated uncertainties of both $(\bm{\delta}_{x},\bm{\delta}_{\phi})$. We first run natural (clean) E2E learning, from which we warm-start the E2E-ATs. The Adam optimizer is used, and 'adversarial training for free' (Shafahi et al. 2019) is applied to reuse the gradients for PGD.

In detail: (a) Since the meteorological features have been normalized into $[0,1]$, we attack with a normalized budget $\epsilon_{x}\in\{0.02,0.03\}$. The inner maximization is solved with 7 PGD steps, and the step size is dynamically set as $\epsilon_{x}/7\times 2$; we adopt this setting from (Zhang et al. 2019). We clamp the attacked input into $[0,1]$ whenever it is updated. (b) We consider uncertainties on the susceptance $\bm{b}$ in the redispatch problem. Since each transmission line can have a different nominal susceptance, we set the budget $\epsilon_{\phi}\in\{0.05,0.15\}$ as a proportion of the individual nominal value, which is consistent with the common operating range of susceptance (Xu, Jaimoukha, and Teng 2022). (c) We run E2E-AT with $(\epsilon_{x},\epsilon_{\phi})\in\{(0.02,0.05),(0.03,0.15)\}$ for the integrated uncertainties; the other settings are the same as in (a) and (b).

Performance of E2E-AT

Figure 2: Input space adversarial attack using the exact MILP formulation, for (a) natural training, (b) adversarial training with $\bm{\delta}=0.02,\alpha=1$, and (c) adversarial training with $\bm{\delta}=0.03,\alpha=1$. 'Clean': cost of the clean sample; 'PGD-30': cost of the adversarial sample found by 30 PGD steps; 'Exact': cost of the adversarial sample found by the exact MILP; 'Verify': cost of the adversarial sample obtained by feeding the MILP solution into the dispatch and redispatch problems.
Table 2: CO uncertainties with random sampling of $\bm{b}$. The adversarial attack reported in Table 1 is denoted as PGD-7. Each cell group gives (Random, PGD-7) costs for the stated budget.

Training Method | Clean | $\epsilon_{\phi}=0.05$ | $\epsilon_{\phi}=0.10$ | $\epsilon_{\phi}=0.15$
NAT | 201.6 | 267.2, 754.6 | 396.6, 2044.3 | 655.5, 3468.4
AT-PARA ($\epsilon_{\phi}=0.05,\alpha=1.0$) | 212.3 | 212.3, 214.9 | 218.4, 590.1 | 313.3, 1931.3
AT-PARA ($\epsilon_{\phi}=0.05,\alpha=0.5$) | 206.7 | 206.7, 215.6 | 222.6, 741.2 | 313.1, 2020.4
AT-PARA ($\epsilon_{\phi}=0.15,\alpha=1.0$) | 214.0 | 214.0, 214.0 | 214.0, 221.0 | 222.9, 453.2
AT-PARA ($\epsilon_{\phi}=0.15,\alpha=0.5$) | 216.5 | 216.5, 216.5 | 216.5, 223.1 | 219.4, 465.6

Multi-run adversarial attacks are evaluated in Table 1. For each sample and attack scenario, we randomly select three starting points within the attack budget and report the worst task-aware cost (22) to reduce variance. Five training algorithms are compared: NAT, natural training with task-aware loss; AT-MSE, adversarial training with MSE loss; AT-INPUT, adversarial training with task-aware loss and input uncertainties; AT-PARA, adversarial training with task-aware loss and unpredictable CO uncertainties; and AT-BOTH, adversarial training with task-aware loss and integrated uncertainties.

We first highlight that uncertainties in COs can significantly increase the cost, e.g. around 15 times higher than the clean cost when $\epsilon_{\phi}=0.15$ for NAT. AT-MSE performs better than NAT under input attacks but poorly under CO attacks, because it is trained only with input uncertainties. Second, the behavior of E2E-AT resembles conventional adversarial training for image classification (Madry et al. 2017; Zhang et al. 2019): training with a larger attack budget yields better robustness against smaller-budget attacks but inevitably increases the clean cost. Meanwhile, the hyperparameter $\alpha$ trades off clean and robust accuracy in most cases.

Beyond the common findings on adversarial robustness in image tasks, some findings unique to E2E-AT stand out. First, AT-INPUT and AT-PARA are most effective against the uncertainty they are trained with. However, E2E-AT based on one source of uncertainty can also improve robustness against the other: for example, under AT-PARA, the cost of the input attack is even lower than under AT-INPUT when $\epsilon_{x}$ is small. Moreover, AT-PARA is more effective against the integrated attack than AT-INPUT. In fact, any input uncertainty eventually feeds into the COs, where it becomes uncertainty in the predictable parameters. Finally, AT-BOTH not only improves robustness against the integrated uncertainty but also against each individual source. All these findings demonstrate that both sources of uncertainty can be treated together in a unified way.

Certified Robustness

Only certification of the input-space robustness is considered, as the susceptance $\bm{b}$ is coupled with the decision variable, so Proposition 2 is not applicable to the CO uncertainty. Using Proposition 2 and the mixed-integer reformulation of the NN (4), the exact input-space adversarial attack on a sample $(\bm{x},\bm{y})$ can be found by a mixed-integer linear program (MILP). Interval bound propagation (IBP) (Gowal et al. 2018) is used to estimate the bounds of the NN layers. As for the COs, we set $M=10^{5}$ by experience. Due to the large computational burden, we randomly sample 15 samples and solve the robustness certification using Gurobi.

The results are shown in Fig. 2. First, the branch-and-bound algorithm is applied, whose optimality is guaranteed whenever a feasible solution exists. Meanwhile, the MILP formulation is verified, as the same task-aware cost is achieved when the downstream COs are solved with the optimal attack vector. Second, the exact attack vector causes worse cost degradation than the PGD-30 attacks. Finally, AT-INPUT effectively reduces the task-aware cost under the exact input-space attack.

Parameter Uncertainties in COs

As shown in (19), the uncertainties of COs can be modeled by a stochastic CO. To model stochastic susceptance $\bm{b}$, we randomly alter the susceptance (random attack) for each sample and report the task-aware cost in Table 2. Under the random attack, the task-aware cost of NAT increases, clearly demonstrating the need for adversarial training. After E2E-AT, the task-aware cost under random attacks is significantly reduced and equals the corresponding clean cost when the attack budget is small. Although the clean cost increases, it is still lower than the average cost under random attacks. The task-aware cost is also empirically upper bounded by the adversarial attack (e.g. PGD-7), which verifies our argument connecting stochastic COs with E2E-AT in (16) and (20).

Conclusion

This paper proposes a unified framework for tackling uncertainties in task-aware E2E learning. We argue that uncertainties occur in both the input features of the ML model and the unpredictable parameters of the COs. A robust program is formulated and practically solved by adversarial training (E2E-AT). Through theoretical analysis and experiments, we demonstrate that 1) CO uncertainty can cause significant generalization degradation, which has been overlooked before; 2) the optimal adversarial attack on affine-parametric QPs can be found by solving a mixed-integer (linear) program; and 3) adversarial training can effectively improve the robustness of E2E learning in a unified way.

Acknowledgments

This research is supported by the Engineering and Physical Sciences Research Council (EPSRC) under Grant EP/Y025946/1. Wangkun Xu is also supported by the PhD scholarship of the Department of EEE, Imperial College London. Jianhong Wang is fully supported by the UKRI Turing AI World-Leading Researcher Fellowship, EP/W002973/1.

Appendix

Proofs

Proof to Proposition 1

First, $M(\hat{\bm{z}}_{1}^{\star})\leq M(\hat{\bm{z}}_{2}^{\star})$ can be directly verified, since (9) has a tighter constraint on the lower-level optimality condition than (8). For fixed $\bm{\theta}_{2}^{\star}$, the optimal decision $\hat{\bm{z}}_{r}^{\star}(\bm{\theta}_{2}^{\star})$ is achieved when each subproblem, represented as a lower-level problem in (9), attains its optimum. Therefore, $\ell(\hat{\bm{z}}_{2}^{i\star};\bm{y}^{i})=\ell(\hat{\bm{z}}_{r}^{i\star}(\bm{\theta}_{2}^{\star});\bm{y}^{i})$ and $M(\hat{\bm{z}}_{2}^{\star})=M(\hat{\bm{z}}_{r}^{\star}(\bm{\theta}_{2}^{\star}))$. This also shows that $(\bm{\theta}_{2}^{\star},\hat{\bm{z}}_{r}^{\star}(\bm{\theta}_{2}^{\star}))$ is a minimizer of (9); the argument also holds when there are multiple global minimizers. Since $(\bm{\theta}_{1}^{\star},\hat{\bm{z}}_{r}^{\star}(\bm{\theta}_{1}^{\star}))$ is also feasible for (9), it gives $M(\hat{\bm{z}}_{r}^{\star}(\bm{\theta}_{2}^{\star}))\leq M(\hat{\bm{z}}_{r}^{\star}(\bm{\theta}_{1}^{\star}))$, which completes the proof.

Proof to Proposition 2

We give the proof by first introducing the complementarity linearization (Fortuny-Amat and McCarl 1981; Kazempour, Conejo, and Ruiz 2010):

Proposition 3

The complementarity condition $a\geq 0,\;b\geq 0,\;a\cdot b=0$ can be replaced by

a\geq 0,\quad b\geq 0,\quad a\leq\psi M,\quad b\leq(1-\psi)M,\quad \psi\in\{0,1\}

where $M$ is a large enough constant.
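A small gurobipy sketch of Proposition 3, with illustrative variable names:

```python
import gurobipy as gp
from gurobipy import GRB

# Linearize a >= 0, b >= 0, a * b = 0 with one binary psi and big-M
# (Fortuny-Amat and McCarl 1981).
m = gp.Model("complementarity")
a = m.addVar(lb=0.0)
b = m.addVar(lb=0.0)
psi = m.addVar(vtype=GRB.BINARY)
M = 1e5
m.addConstr(a <= psi * M)          # psi = 0 forces a = 0
m.addConstr(b <= (1 - psi) * M)    # psi = 1 forces b = 0
```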

For the optimal primal-dual pair $(\bm{z}_{i+1},\bm{\lambda}_{i+1},\bm{\nu}_{i+1})$, the Karush-Kuhn-Tucker (KKT) conditions (Boyd and Vandenberghe 2004) give that

\bm{Q}\bm{z}_{i+1}+\bm{q}+\bm{A}^{T}\bm{\lambda}_{i+1}+\bm{C}^{T}\bm{\nu}_{i+1}=\bm{0} \qquad (23a)
\bm{C}\bm{z}_{i+1}+\bm{H}\bm{z}_{i}-\bm{d}=\bm{0} \qquad (23b)
\operatorname{diag}(\bm{\lambda}_{i+1})(\bm{A}\bm{z}_{i+1}+\bm{G}\bm{z}_{i}-\bm{b})=\bm{0} \qquad (23c)
\bm{A}\bm{z}_{i+1}+\bm{G}\bm{z}_{i}-\bm{b}\leq\bm{0} \qquad (23d)
\bm{\lambda}_{i+1}\geq\bm{0} \qquad (23e)

Each of the lower-level problems can be equivalently written in this form. Since the complementarity condition states that, element-wise, either the inequality constraint is active or the corresponding dual variable is zero, (23c), (23d), and (23e) can be rewritten by Proposition 3, which completes the proof.

Network Constrained Economic Dispatch

The stage-one problem is a generator dispatch problem, which can be modelled as:

\begin{array}{rl}(\bm{P}_{g}^{\star},\bm{\vartheta}^{\star},\bm{s}^{\star})=\operatorname{argmin}_{\bm{P}_{g},\bm{\vartheta},\bm{s}} & \bm{c}_{g}^{T}\bm{P}_{g}+c_{ls}\bm{1}^{T}\bm{s}\\ \text{subject to} & \underline{\bm{P}}_{g}\leq\bm{P}_{g}\leq\bar{\bm{P}}_{g}\\ & \bm{A}^{T}\operatorname{diag}(\bm{b})\bm{A}\bm{\vartheta}=\bm{C}_{g}\bm{P}_{g}-\bm{C}_{l}(\hat{\bm{y}}-\bm{s})\\ & \underline{\bm{P}}_{f}\leq\operatorname{diag}(\bm{b})\bm{A}\bm{\vartheta}\leq\bar{\bm{P}}_{f}\\ & \bm{s}\geq\bm{0}\\ & \bm{\vartheta}_{\text{ref}}=0\end{array} \qquad (24)

In (24), $\bm{P}_{g}$ and $\bm{\vartheta}$ are the vectors of generator dispatch and voltage angles at each bus, and a linear cost function is considered. All equality and inequality signs in the constraints are element-wise. The first constraint gives the upper and lower bounds of each generator. The second constraint enforces the power balance at each bus, where $\bm{b}$ is the line susceptance, $\bm{A}$ is the incidence matrix of the power system, and $\bm{C}_{g}$ and $\bm{C}_{l}$ are the incidence matrices of the generators and loads, respectively. $\hat{\bm{y}}$ represents the forecast load, which is parameterized by the neural network model, i.e. $\hat{\bm{y}}=\bm{f}(\bm{x};\bm{\theta})$. The third constraint enforces the thermal limit on the power flow of each transmission line. We also add a slack variable $\bm{s}\geq\bm{0}$ as compensation for infeasibility, with a large cost coefficient $c_{ls}$. The last constraint fixes the voltage angle at the reference (slack) bus. A cvxpy sketch of this problem is given below.
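For illustration, a cvxpy formulation of (24) might look as follows; the function signature, shapes, and the symmetric flow limits are assumptions, not the repository's exact implementation.

```python
import cvxpy as cp
import numpy as np

def dispatch_problem(A, b_sus, Cg, Cl, cg, c_ls, Pg_min, Pg_max, Pf_max, ref):
    """Stage-one dispatch (24) with the forecast load as a parameter,
    assuming symmetric flow limits -Pf_max <= flow <= Pf_max."""
    n_bus, n_gen, n_load = A.shape[1], Cg.shape[1], Cl.shape[1]
    Pg = cp.Variable(n_gen)
    theta = cp.Variable(n_bus)
    s = cp.Variable(n_load, nonneg=True)     # slack against infeasibility
    y_hat = cp.Parameter(n_load)             # forecast load f(x; theta)
    B = A.T @ np.diag(b_sus) @ A             # DC power-flow matrix
    flow = np.diag(b_sus) @ A @ theta
    cons = [Pg_min <= Pg, Pg <= Pg_max,
            B @ theta == Cg @ Pg - Cl @ (y_hat - s),   # power balance
            -Pf_max <= flow, flow <= Pf_max,           # thermal limits
            theta[ref] == 0]                           # reference bus
    obj = cp.Minimize(cg @ Pg + c_ls * cp.sum(s))
    return cp.Problem(obj, cons), y_hat
```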

Once the generators have been dispatched as $\bm{P}_{g}^{\star}$, we consider a realization of the actual load $\bm{y}$ by solving the second-stage (redispatch) problem:

\begin{array}{rl}(\bm{P}_{ls}^{\star},\bm{P}_{gs}^{\star},\bm{\vartheta}^{\star})=\operatorname{argmin}_{\bm{P}_{ls},\bm{P}_{gs},\bm{\vartheta}} & \bm{c}_{ls}^{T}\bm{P}_{ls}+\bm{c}_{gs}^{T}\bm{P}_{gs}\\ \text{subject to} & \bm{A}^{T}\operatorname{diag}(\bm{b})\bm{A}\bm{\vartheta}=\bm{C}_{g}(\bm{P}_{g}^{\star}-\bm{P}_{gs})-\bm{C}_{l}(\bm{y}-\bm{P}_{ls})\\ & \underline{\bm{P}}_{f}\leq\operatorname{diag}(\bm{b})\bm{A}\bm{\vartheta}\leq\bar{\bm{P}}_{f}\\ & \bm{P}_{ls}\geq\bm{0},\;\bm{P}_{gs}\geq\bm{0}\\ & \bm{\vartheta}_{\text{ref}}=0\end{array} \qquad (25)

The objective of stage two is to balance the load and resolve any violation of the grid's physical constraints, using load shedding $\bm{P}_{ls}$ if the generator dispatch is lower than the actual load and energy storage $\bm{P}_{gs}$ if it is higher. Since load shedding (similar to a blackout) is more critical and should be avoided as much as possible, it is assigned a larger penalty, i.e. $\bm{c}_{ls}\gg\bm{c}_{gs}$.

We highlight that our formulation of the power system operation is more realistic than (Donti, Amos, and Kolter 2017), which ignores the behavior of individual loads and the network constraints. This results in more complex COs: in total, the two COs have more than 60 decision variables and 150 constraints, which need to be solved exactly in every forward pass.

Certifying the Load Forecasting E2E Learning

Using Proposition 2 and the mixed-integer reformulation of the NN (4), an exact adversarial attack on the input $(\bm{x},\bm{y})$ can be found by an MILP. It is known that the values of the 'big M' constants used for linearization (e.g. the lower and upper bounds of the output of each NN layer in (4), the upper bounds of the dual variables of active inequalities, and the lower bounds of the inactive inequality constraints of the COs) are essential to the performance of the MILP solution. Therefore, interval bound propagation (IBP) (Gowal et al. 2018) is used to estimate the layer bounds of the NN. As for the COs, we set $M=10^{5}$ by experience.

Formulation

By Proposition 2 and the mixed-integer reformulation of the NN (4), the exact input-space adversarial attack on a sample $(\bm{x},\bm{y})$ can be found by the following MILP:

\begin{array}{rl}\bm{\delta}^{\star}=\operatorname{argmax}_{\bm{\delta}} & \bm{c}_{g}^{T}\bm{P}_{g}+\bm{c}_{ls}^{T}\bm{P}_{ls}+\bm{c}_{gs}^{T}\bm{P}_{gs}\\ \text{subject to} & \bm{P}_{ls},\bm{P}_{gs}\in\mathcal{C}_{\text{Lin-KKT}}^{\text{Redispatch}}(\bm{P}_{ls},\bm{P}_{gs};\bm{P}_{g})\\ & \bm{P}_{g}\in\mathcal{C}_{\text{Lin-KKT}}^{\text{Dispatch}}(\bm{P}_{g};\hat{\bm{y}})\\ & \hat{\bm{y}}\in\mathcal{C}_{\text{nn}}(\bm{x}+\bm{\delta};\bm{\theta})\end{array} \qquad (26)

where $\mathcal{C}_{\text{nn}}(\cdot)$ is the mixed-integer linear representation (4) of the trained neural network, and $\mathcal{C}_{\text{Lin-KKT}}^{\text{Dispatch}}(\cdot)$ and $\mathcal{C}_{\text{Lin-KKT}}^{\text{Redispatch}}(\cdot)$ are the linearized KKT conditions of the lower-level dispatch and redispatch problems, respectively, obtained by Proposition 2. Note that (26) is a simplified representation that omits the detailed formulation of the constraints, the integer variables in $\mathcal{C}_{\text{nn}}(\cdot)$, $\mathcal{C}_{\text{Lin-KKT}}^{\text{Dispatch}}$, and $\mathcal{C}_{\text{Lin-KKT}}^{\text{Redispatch}}$, the dual variables, and the associated lower and upper bounds (big-M) of the integer constraints. Nonetheless, (26) is an MILP that can be solved by solvers such as Gurobi.

Interval Bound Propagation (IBP)

Based on (4), the lower and upper bounds of each layer in NN can be estimated linearly as:

\begin{split}\hat{\bm{l}}_{i}&=\max\{\bm{0},\bm{l}_{i}\}\\ \hat{\bm{u}}_{i}&=\max\{\bm{0},\bm{u}_{i}\}\\ \bm{l}_{i+1}&=\max\{\bm{0},\bm{W}_{i}\}\cdot\hat{\bm{l}}_{i}+\min\{\bm{0},\bm{W}_{i}\}\cdot\hat{\bm{u}}_{i}+\bm{b}_{i}\\ \bm{u}_{i+1}&=\min\{\bm{0},\bm{W}_{i}\}\cdot\hat{\bm{l}}_{i}+\max\{\bm{0},\bm{W}_{i}\}\cdot\hat{\bm{u}}_{i}+\bm{b}_{i}\end{split} \qquad (27)

for $i=1,\cdots,d-1$. The initial bound is determined by the attack budget $\epsilon$, i.e. $\hat{\bm{l}}_{1}=\bm{x}-\epsilon\cdot\bm{1}$ and $\hat{\bm{u}}_{1}=\bm{x}+\epsilon\cdot\bm{1}$.
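A numpy sketch of (27), assuming the weights and biases are given as arrays:

```python
import numpy as np

def ibp_bounds(weights, biases, x, eps):
    """IBP (27): propagate elementwise pre-activation bounds (l_i, u_i)
    of every layer for inputs within ||delta||_inf <= eps."""
    l_hat, u_hat = x - eps, x + eps                  # input interval
    bounds = []
    for W, b in zip(weights, biases):
        Wp, Wn = np.maximum(W, 0.0), np.minimum(W, 0.0)
        l = Wp @ l_hat + Wn @ u_hat + b              # lower bound l_{i+1}
        u = Wn @ l_hat + Wp @ u_hat + b              # upper bound u_{i+1}
        bounds.append((l, u))
        l_hat = np.maximum(l, 0.0)                   # ReLU: l_hat_{i+1}
        u_hat = np.maximum(u, 0.0)                   # ReLU: u_hat_{i+1}
    return bounds
```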

Further Analysis on the Relationship between Input and CO Uncertainties

Based on the experimental results in Tables 1 and 3, we speculate that robustness to one source of uncertainty can improve robustness to the other; that is, the two uncertainties are not contradictory in E2E-AT. We give a more detailed discussion below.

To start, the two sources of uncertainty, i.e. input and CO uncertainties, are handled by training the same robust NN, a parametric model that forecasts the predictable parameters of the COs. According to sensitivity analysis (Boyd and Vandenberghe 2004), a robust E2E model under uncertain COs implies that the robust NN forecasts a parameter such that the set of active constraints does not change significantly. From the structure of E2E learning, the input uncertainty is amplified by the NN (Gowal et al. 2018) and becomes an additional uncertain parameter of the COs. Similarly, an input-robust E2E model also requires that the NN forecast does not trigger significant constraint violations. Therefore, we speculate that the E2E-AT training objectives under input and CO uncertainties are not contradictory, as both are reflected and controlled by the behavior of the COs.

However, the effectiveness of AT-INPUT against CO uncertainties (or of AT-PARA against input uncertainties) depends on the individual attack budget, which is discussed later in the extra experiments.

Details on experiment settings

Reproducibility

Our experiments are reproducible and open source on GitHub: https://github.com/xuwkk/E2E-AT.

Data Source

The IEEE bus-14 system is modified from PyPower: https://github.com/rwl/PYPOWER/blob/master/pypower/case14.py.

The meteorological features in the Texas Backbone Power System (Lu et al. 2023) include temperature (K), longwave radiation (W/m²), shortwave radiation (W/m²), zonal wind speed (m/s), meridional wind speed (m/s), and wind speed (m/s), which are normalized into $[0,1]$. The calendar features include the cosine and sine of the weekday within a week and of the hour within a day, according to their individual periods. We pack the meteorological features of the 14 buses together with the 4 calendric features, so a single datum is $(\bm{x}^{i},\bm{y}^{i})\in\mathbf{R}^{4+6\times 14}\times\mathbf{R}^{14}$. We rescale the dataset to a scale suitable for the bus-14 system: in detail, we start from a small ground-truth load profile and gradually increase it until the dispatch and redispatch problems are just feasible.

Packages

During inference, we formulate the dispatch and redispatch problems in Cvxpy (Diamond and Boyd 2016) and solve them by calling Gurobi (https://www.gurobi.com). When calculating the gradients, we use the PyTorch automatic differentiation package and CvxpyLayers (Agrawal et al. 2019) to implement fast batched forward and backward passes.

Network structure

The forecast neural network has three hidden layers with output sizes of 200, 200, and 100, with ReLU activations between layers. We also add a ReLU activation before the convex layer so that the forecast load is guaranteed to be nonnegative and an adversarial attack cannot produce a negative forecast load either. This ReLU layer is also included when evaluating the certification approach.

Training Settings

For all experiments, we set the batch size to 32 and use the Adam optimizer (Kingma and Ba 2014). We train the NN with the MSE loss for 250 epochs, with learning rate $10^{-3}$ and cosine annealing. We store the NN state at epoch 200 and warm-start natural E2E learning from it for 50 epochs with learning rate $10^{-5}$. For E2E-AT, we warm-start from the state trained by natural E2E learning for 100 epochs with learning rate $10^{-5}$; since the PGD step count is 7, this is equivalent to 14 epochs under the 'adversarial training for free' setting.

For the input-space attack, we only attack the meteorological features, as the calendric features are discrete and can easily be verified by the operator. For uncertainties in the unpredictable parameters of the COs, the susceptance of the transmission lines is attacked. This is a realistic setting, as the susceptance may vary with temperature changes or can be intentionally altered by the system operator via electronic devices (Xu, Jaimoukha, and Teng 2022); however, such changes cannot be detected when forecasting the load or during the dispatch stage.

In addition, we noticed that the objective of E2E adversarial training is nonsmooth; for example, the training objective can increase significantly when certain inactive CO constraints become active, or vice versa. Therefore, we clip the 1-norm of the gradient to at most 2.0 when updating the network, which we found results in more stable training.

Extra Experiment Results

Robust Performance on the Test Dataset

In the main text, we train on 1,000 random samples from the Texas backbone power system and report the clean and robust task-aware costs. Here, we train on the entire dataset of 8,760 samples (i.e., hourly resolution for the year 2019) with a random 8:2 train-test split. The number of training epochs is increased to 200 (i.e., 28 epochs in the 'adversarial training for free' setting) and the batch size is increased to 64; the remaining training settings are unchanged. We then attack the trained model and report the performance on the test dataset.

To obtain additional insights, we increase the input-space attack budgets to $\epsilon_x\in\{0.015,0.04,0.06\}$ so that the resulting task-aware costs are comparable to those of the parameter attack, and set the parameter attack budgets to $\epsilon_\phi\in\{0.05,0.10,0.15\}$. Furthermore, we report the worst task-aware cost over three runs, selecting the worst attack vector for each sample in the minibatch (a sketch of this selection follows).
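The per-sample worst-case selection can be sketched as below; `attack` and `per_sample_cost` are assumed callables (one randomized attack run and a per-sample task-aware cost, respectively), not names from our repository.

```python
import torch

def worst_of_k(attack, per_sample_cost, x, k=3):
    best_x, best_cost = x, per_sample_cost(x)
    for _ in range(k):
        x_adv = attack(x)                              # one randomized attack run
        cost = per_sample_cost(x_adv)                  # shape: (batch,)
        worse = cost > best_cost                       # runs that hurt the model more
        best_x = torch.where(worse.unsqueeze(-1), x_adv, best_x)
        best_cost = torch.maximum(cost, best_cost)
    return best_x
```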

The extra experiment results are shown in Table 3. Overall, the experiments on the larger dataset show results similar to those in Table 1; the main differences are caused by the increased input attack budget $\epsilon_x$.

Table 3: Performance of E2E-AT. The model is trained on the larger dataset and evaluated on the test dataset.

Training Method                              Clean   Input Attack, $\epsilon_x$    CO Attack, $\epsilon_\phi$    Integrated Attack, $(\epsilon_x,\epsilon_\phi)$
$\epsilon_x$  $\epsilon_\phi$  $\alpha$      N/A     0.015   0.04    0.06          0.05    0.10    0.15          (0.015,0.05)  (0.04,0.10)  (0.06,0.15)

NAT: Natural Training with Task Loss
0        0        0        220.3   943.8   2366.2  3528.0  852.8   2254.5  3679.3  1526.4  4287.7  6477.6

AT-MSE: Adversarial Training with MSE Loss
0.015    N/A      N/A      356.9   836.0   1792.3  2552.5  1294.3  2705.0  4069.1  1727.0  3864.4  5844.9
0.04     N/A      N/A      462.3   745.3   1300.0  1820.6  1292.0  2647.1  4024.5  1558.3  3414.7  5273.2
0.06     N/A      N/A      646.4   846.5   1229.5  1586.8  1352.7  2652.2  4022.7  1540.6  3169.1  4967.4

AT-INPUT: Adversarial Training with Task Loss on the Input Uncertainty
0.015    0        1.0      302.1   430.3   692.3   981.3   369.2   884.5   1939.9  488.3   1374.5  2981.5
0.015    0        0.5      240.9   544.4   1322.8  2009.5  756.5   2038.0  3455.0  1074.3  3361.1  5566.8
0.04     0        1.0      429.1   539.4   755.7   950.9   449.5   517.3   656.0   552.1   802.2   1050.6
0.04     0        0.5      280.9   509.7   972.0   1403.5  739.3   1875.1  3259.2  940.9   2687.7  4647.4
0.06     0        1.0      436.4   528.2   706.5   859.0   449.6   490.7   584.6   538.1   729.9   940.9
0.06     0        0.5      307.8   432.1   659.9   863.5   553.0   1350.5  2500.5  660.0   1788.9  3268.0

AT-PARA: Adversarial Training with Task Loss on the CO Uncertainty
0        0.05     1.0      223.0   1003.9  2532.6  3717.9  231.1   967.3   2526.4  1014.9  2908.2  5028.1
0        0.05     0.5      224.8   1019.5  2538.8  3717.7  239.0   1096.5  2677.8  1028.5  3000.9  5222.0
0        0.10     1.0      227.3   934.2   2456.0  3653.9  227.3   262.7   956.7   949.4   2483.2  3912.2
0        0.10     0.5      226.9   927.3   2325.5  3392.5  227.1   269.6   1072.2  928.5   2326.8  3760.9
0        0.15     1.0      244.8   1042.6  2552.3  3737.4  244.8   245.5   269.1   1039.3  2545.7  3847.1
0        0.15     0.5      237.1   974.3   2470.4  3646.2  237.1   237.7   286.6   978.2   2455.3  3733.9

AT-BOTH: Adversarial Training with Task Loss on the Integrated Uncertainties
0.015    0.05     1.0      305.9   447.8   734.3   1063.7  308.7   389.2   643.5   450.2   816.0   1296.1
0.015    0.05     0.5      267.2   743.7   1734.4  2593.4  404.5   1671.6  3150.6  806.5   2865.1  5163.7
0.04     0.10     1.0      385.4   462.6   600.0   731.7   385.8   394.0   438.8   461.4   608.8   782.4
0.04     0.10     0.5      293.3   516.2   961.2   1376.6  296.4   456.6   1069.9  516.2   1061.1  1849.6

We summarize some of the findings below.

  1. In all E2E-AT variants, the clean accuracy decreases as the attack budget increases, consistent with conventional adversarial training. Moreover, a model trained with a larger attack budget also improves robustness against attacks with smaller budgets.

  2. The hyperparameter $\alpha$ balances clean accuracy and robust accuracy, again consistent with conventional adversarial training.

  3. AT-MSE improves robust accuracy under input attacks (although its task-aware cost remains much higher than that of E2E-AT), but it cannot improve the robustness of the CO, which actually becomes worse than under NAT. This is because AT-MSE only captures the input uncertainties and fits the MSE loss; it therefore does not generalize to the uncertainty of the CO.

  4. Both input and CO uncertainties must be considered when designing the E2E learning task. E2E-AT is an effective approach to improving model robustness, and the unified training (AT-BOTH) achieves the best task-aware costs in most cases.

  5. In the new experiment, training under input-space adversaries improves not only the input robustness but also the CO robustness. This is the opposite of the result in Table 1, where improving the CO robustness improved the input robustness. We argue that the main reason is the differing influence of the input and CO uncertainties under different attack budgets, which also explains why AT-BOTH significantly improves the CO robustness but yields only a limited improvement in the input robustness. We discuss this point in the next section.

Gradient Analysis of E2E-AT

From Table 3, we compare the uncertainties of the input and of the unpredictable parameters in the CO by choosing the attack budgets $\epsilon_x$ and $\epsilon_\phi$ so that they yield similar task-aware costs on the clean E2E model. We conclude that the robustness of the CO is easier to improve than the robustness of the input, and that improving the input robustness can also improve the CO robustness. To verify this point, we calculate the average $\ell_1$-norm of the NN gradient when the input and unpredictable-parameter attacks are generated separately. We evaluate both the NN without E2E-AT (warm-started from E2E learning) and the NN updated by AT-BOTH (with $\epsilon_x=0.04$ and $\epsilon_\phi=0.10$).

First, as shown in Table 4, the gradients in E2E-AT are very large, which explains why gradient clipping is required during training. Second, although the initial task-aware costs are similar for the input and parameter adversaries (2366 vs. 2354), the initial gradient under input adversaries is 10 times larger than that under parameter adversaries; even after AT-BOTH, the gap remains more than tenfold. This implies that, under the current attack budgets, the CO uncertainties affect the task-aware cost far less than the input uncertainties.

Table 4: Unclipped average gradients for E2E-AT in $\ell_1$-norm.

                Before E2E-AT         After E2E-AT
Input Attack    $4.5\times 10^{5}$    $8.4\times 10^{4}$
CO Attack       $5.8\times 10^{4}$    $5.7\times 10^{3}$

These findings do not contradict our initial goal of unifying the uncertainties in E2E learning. In fact, we can show that the improvements under the two uncertainties are not contradictory; otherwise, improving robustness to one source could not improve robustness to the other. To verify this, we plot the cosine similarities between the NN gradients induced by the two attacks for all mini-batches of size 16 over the entire training dataset.

Figure 3: Cosine similarities of the NN gradients under the input-space adversarial attack and the unpredictable-parameter attack in the CO. (a) Before E2E-AT (warm-started from E2E learning); (b) after E2E-AT.

As shown in Fig. 3, the cosine similarities are close to one, meaning that training against one source of uncertainty also improves robustness to the other, in line with the experimental results.
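The diagnostic underlying Fig. 3 can be sketched as follows; `loss_under_input_attack` and `loss_under_co_attack` are assumed callables returning the task-aware cost of a minibatch under the respective worst-case perturbation.

```python
import torch
import torch.nn.functional as F

def grad_vector(model, loss):
    model.zero_grad()
    loss.backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

def grad_cosine(model, loss_under_input_attack, loss_under_co_attack, batch):
    g_in = grad_vector(model, loss_under_input_attack(batch))
    g_co = grad_vector(model, loss_under_co_attack(batch))
    return F.cosine_similarity(g_in, g_co, dim=0)   # close to 1 => aligned objectives
```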

From the experimental results in Tables 1 and 3, as well as the discussion above, we make the following conjectures, assuming that the training objectives for improving robustness to the different uncertainty sources are not contradictory:

  1. The NN gradients under adversaries/uncertainties from different sources can be used to model their impact on E2E-AT. They can also guide the choice of the individual attack budgets: to obtain more balanced adversarial training across uncertainty sources, the budgets might be set according to the initial NN gradients rather than the task-aware costs.

  2. The dominating uncertainty, e.g., the one with the larger NN gradient, can dominate the training process and also be effective against the remaining uncertainties. Conversely, an uncertainty with a smaller impact cannot significantly improve the robustness against the others.

Future Work

Beyond these conjectures, a thorough investigation of the relationship between multiple sources of uncertainty is an interesting direction for future work, since conventional adversarial training usually involves a single source of uncertainty. Theoretical analysis is needed, and multi-task learning (Yu et al. 2020), which simultaneously trains for various objectives, could be borrowed to handle multiple uncertainties. In addition, solving for the exact attack vector is an MILP, which is too computationally expensive to embed in adversarial training. Developing certified and tractable E2E-AT is therefore also important for security purposes.

References

  • Agarwal et al. (2022) Agarwal, A.; Donti, P. L.; Kolter, J. Z.; and Pileggi, L. 2022. Employing adversarial robustness techniques for large-scale stochastic optimal power flow. Electric Power Systems Research, 212: 108497.
  • Agrawal et al. (2019) Agrawal, A.; Amos, B.; Barratt, S.; Boyd, S.; Diamond, S.; and Kolter, J. Z. 2019. Differentiable convex optimization layers. Advances in neural information processing systems, 32.
  • Amos and Kolter (2017) Amos, B.; and Kolter, J. Z. 2017. Optnet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning, 136–145. PMLR.
  • Bengio (1997) Bengio, Y. 1997. Using a financial training criterion rather than a prediction criterion. International journal of neural systems, 8(04): 433–443.
  • Boyd and Vandenberghe (2004) Boyd, S. P.; and Vandenberghe, L. 2004. Convex optimization. Cambridge university press.
  • Chen et al. (2021) Chen, B.; Donti, P. L.; Baker, K.; Kolter, J. Z.; and Bergés, M. 2021. Enforcing policy feasibility constraints through differentiable projection for energy optimization. In Proceedings of the Twelfth ACM International Conference on Future Energy Systems, 199–210.
  • Conejo and Baringo (2018) Conejo, A. J.; and Baringo, L. 2018. Power system operations, volume 11. Springer.
  • Diamond and Boyd (2016) Diamond, S.; and Boyd, S. 2016. CVXPY: A Python-embedded modeling language for convex optimization. The Journal of Machine Learning Research, 17(1): 2909–2913.
  • Domke (2012) Domke, J. 2012. Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics, 318–326. PMLR.
  • Dong et al. (2020) Dong, Y.; Deng, Z.; Pang, T.; Zhu, J.; and Su, H. 2020. Adversarial distributional training for robust deep learning. Advances in Neural Information Processing Systems, 33: 8270–8283.
  • Donti et al. (2021) Donti, P.; Agarwal, A.; Bedmutha, N. V.; Pileggi, L.; and Kolter, J. Z. 2021. Adversarially robust learning for security-constrained optimal power flow. Advances in Neural Information Processing Systems, 34: 28677–28689.
  • Donti, Amos, and Kolter (2017) Donti, P.; Amos, B.; and Kolter, J. Z. 2017. Task-based end-to-end model learning in stochastic optimization. Advances in neural information processing systems, 30.
  • Elmachtoub and Grigas (2022) Elmachtoub, A. N.; and Grigas, P. 2022. Smart “predict, then optimize”. Management Science, 68(1): 9–26.
  • Fortuny-Amat and McCarl (1981) Fortuny-Amat, J.; and McCarl, B. 1981. A representation and economic interpretation of a two-level programming problem. Journal of the operational Research Society, 32: 783–792.
  • Gowal et al. (2018) Gowal, S.; Dvijotham, K.; Stanforth, R.; Bunel, R.; Qin, C.; Uesato, J.; Arandjelovic, R.; Mann, T.; and Kohli, P. 2018. On the effectiveness of interval bound propagation for training verifiably robust models. arXiv preprint arXiv:1810.12715.
  • Kazempour, Conejo, and Ruiz (2010) Kazempour, S. J.; Conejo, A. J.; and Ruiz, C. 2010. Strategic generation investment using a complementarity approach. IEEE transactions on power systems, 26(2): 940–948.
  • Kingma and Ba (2014) Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Kong et al. (2022) Kong, L.; Cui, J.; Zhuang, Y.; Feng, R.; Prakash, B. A.; and Zhang, C. 2022. End-to-end stochastic optimization with energy-based model. Advances in Neural Information Processing Systems, 35: 11341–11354.
  • Kotary et al. (2021) Kotary, J.; Fioretto, F.; Van Hentenryck, P.; and Wilder, B. 2021. End-to-end constrained optimization learning: A survey. arXiv preprint arXiv:2103.16378.
  • Krantz and Parks (2002) Krantz, S. G.; and Parks, H. R. 2002. The implicit function theorem: history, theory, and applications. Springer Science & Business Media.
  • Liu et al. (2023) Liu, Z.; Yin, Y.; Bai, F.; and Grimm, D. K. 2023. End-to-end learning of user equilibrium with implicit neural networks. Transportation Research Part C: Emerging Technologies, 150: 104085.
  • Lu et al. (2023) Lu, J.; Li, X.; Li, H.; Chegini, T.; Gamarra, C.; Yang, Y.; Cook, M.; and Dillingham, G. 2023. A Synthetic Texas Backbone Power System with Climate-Dependent Spatio-Temporal Correlated Profiles. arXiv preprint arXiv:2302.13231.
  • Luo, Pang, and Ralph (1996) Luo, Z.-Q.; Pang, J.-S.; and Ralph, D. 1996. Mathematical programs with equilibrium constraints. Cambridge University Press.
  • Madry et al. (2017) Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; and Vladu, A. 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.
  • Sadana et al. (2023) Sadana, U.; Chenreddy, A.; Delage, E.; Forel, A.; Frejinger, E.; and Vidal, T. 2023. A Survey of Contextual Optimization Methods for Decision Making under Uncertainty. arXiv preprint arXiv:2306.10374.
  • Shafahi et al. (2019) Shafahi, A.; Najibi, M.; Ghiasi, M. A.; Xu, Z.; Dickerson, J.; Studer, C.; Davis, L. S.; Taylor, G.; and Goldstein, T. 2019. Adversarial training for free! Advances in Neural Information Processing Systems, 32.
  • Stratigakos et al. (2022) Stratigakos, A.; Camal, S.; Michiorri, A.; and Kariniotakis, G. 2022. Prescriptive trees for integrated forecasting and optimization applied in trading of renewable energy. IEEE Transactions on Power Systems, 37(6): 4696–4708.
  • Tjeng, Xiao, and Tedrake (2017) Tjeng, V.; Xiao, K.; and Tedrake, R. 2017. Evaluating robustness of neural networks with mixed integer programming. arXiv preprint arXiv:1711.07356.
  • Vohra, Rajaei, and Cremer (2023) Vohra, R.; Rajaei, A.; and Cremer, J. L. 2023. End-to-end learning with multiple modalities for system-optimised renewables nowcasting. arXiv preprint arXiv:2304.07151.
  • Wilder, Dilkina, and Tambe (2019) Wilder, B.; Dilkina, B.; and Tambe, M. 2019. Melding the data-decisions pipeline: Decision-focused learning for combinatorial optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 1658–1665.
  • Xiao et al. (2018) Xiao, K. Y.; Tjeng, V.; Shafiullah, N. M.; and Madry, A. 2018. Training for faster adversarial robustness verification via inducing relu stability. arXiv preprint arXiv:1809.03008.
  • Xu, Jaimoukha, and Teng (2022) Xu, W.; Jaimoukha, I. M.; and Teng, F. 2022. Robust moving target defence against false data injection attacks in power grids. IEEE Transactions on Information Forensics and Security, 18: 29–40.
  • Xu and Teng (2023) Xu, W.; and Teng, F. 2023. Availability Adversarial Attack and Countermeasures for Deep Learning-based Load Forecasting. In 2023 IEEE Belgrade PowerTech, 01–06.
  • Yu et al. (2020) Yu, T.; Kumar, S.; Gupta, A.; Levine, S.; Hausman, K.; and Finn, C. 2020. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems, 33: 5824–5836.
  • Zhang et al. (2019) Zhang, H.; Yu, Y.; Jiao, J.; Xing, E.; El Ghaoui, L.; and Jordan, M. 2019. Theoretically principled trade-off between robustness and accuracy. In International conference on machine learning, 7472–7482. PMLR.