
¹ Los Alamos National Laboratory, Los Alamos, NM 87544, USA ({rbparker,mjgarcia,rbent}@lanl.gov)
² Dowson Farms ([email protected])
³ Texas A&M University, College Station, TX 77843, USA ([email protected])

Formulations and scalability of neural network surrogates in nonlinear optimization problems

Robert B. Parker¹, Oscar Dowson², Nicole LoGiudice³, Manuel Garcia¹, Russell Bent¹
Abstract

We compare full-space, reduced-space, and gray-box formulations for representing trained neural networks in nonlinear optimization problems. We test these formulations on a transient stability-constrained, security-constrained alternating current optimal power flow (SCOPF) problem where the transient stability criteria are represented by a trained neural network surrogate. Optimization problems are implemented in JuMP and trained neural networks are embedded using a new Julia package: MathOptAI.jl. To study the bottlenecks of the three formulations, we use neural networks with up to 590 million trained parameters. The full-space formulation is bottlenecked by the linear solver used by the optimization algorithm, while the reduced-space formulation is bottlenecked by the algebraic modeling environment and derivative computations. The gray-box formulation is the most scalable and is capable of solving with the largest neural networks tested. It is bottlenecked by evaluation of the neural network’s outputs and their derivatives, which may be accelerated with a graphics processing unit (GPU). Leveraging the gray-box formulation and GPU acceleration, we solve our test problem with our largest neural network surrogate in 2.5× the time required for a simpler SCOPF problem without the stability constraint.

Keywords:
Surrogate modeling · Neural networks · Nonlinear optimization

1 Introduction

Nonlinear local optimization is a powerful tool for engineers and operations researchers for its ability to handle accurate physical models, respect explicit constraints, and solve large-scale problems [2]. However, it is often the case that practitioners wish to include components that do not easily fit into the differentiable and algebraic frameworks of nonlinear optimization. Examples include when a mechanistic model is not available [21], is time-consuming to simulate [7], or renders the optimization problem too complicated to solve reliably [5].

A recent trend has been to replace troublesome components by a trained neural network surrogate and then embed the trained neural network into a nonlinear optimization model [16]. Several open-source software packages, e.g., OMLT [7] and gurobi-machinelearning [12], make it easy to embed neural network models in Python-based modeling environments for nonlinear optimization. A trained neural network may be represented in an optimization problem with different formulations, e.g., full-space and reduced-space formulations [7, 20], which have been compared by Kilwein [13] for a security-constrained AC optimal power flow (SCOPF) problem. A third approach, first suggested by Casas [6], is called a gray-box formulation, in which function and derivative evaluations of the surrogate are handled by the neural network modeling software (here, PyTorch [18]), rather than the algebraic modeling environment (here, JuMP [15]).

While full-space, reduced-space, and gray-box formulations have been compared in [6], the bottlenecks of these formulations have not been carefully identified. This paper profiles all three formulations on an SCOPF problem in which a neural network predicts transient feasibility. We demonstrate that the gray-box formulation is the most scalable, and that it can naturally take advantage of GPU acceleration built into PyTorch. We find that the full-space formulation is bottlenecked by the solution of a linear system of equations, and the reduced-space formulation is bottlenecked by JuMP and its automatic differentiation system. Our work provides a clear benchmark and direction for future work. Additionally, we provide MathOptAI.jl, a new open-source library for embedding trained machine learning predictors into optimization models built with JuMP [15]. MathOptAI.jl is available at https://github.com/lanl-ansi/MathOptAI.jl under a BSD-3 license.

2 Background

2.1 Nonlinear optimization

We study nonlinear optimization problems in the form given by Equation (1):

\min_{x} f(x) \quad \text{s.t.} \quad g(x) = 0, \; x \geq 0. (1)

We consider interior-point methods, such as IPOPT [22], for solving (1). These methods iteratively compute search directions d by solving the linear system (2):

\begin{bmatrix} \nabla^{2}\mathcal{L}(x) + \alpha & \nabla g(x)^{T} \\ \nabla g(x) & 0 \end{bmatrix} d = -\begin{bmatrix} \nabla f(x) + \nabla g(x)^{T}\lambda + \beta \\ g(x) \end{bmatrix}, (2)

where α and β are additional interior-point terms whose details are omitted for simplicity. The matrix on the left-hand side is referred to as the Karush-Kuhn-Tucker (KKT) matrix. To construct this system, solvers rely on callbacks that provide the Jacobian ∇g and, optionally, the Hessian of the Lagrangian, ∇²ℒ. These are typically provided by the automatic differentiation system of an algebraic modeling environment. If the Hessian ∇²ℒ is not available, a limited-memory quasi-Newton approximation is used [17].
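
To make the step computation in (2) concrete, the following minimal Julia sketch (illustrative only; not part of any solver discussed here) assembles and solves the KKT system for a toy equality-constrained problem, with the interior-point terms α and β omitted:

```julia
using LinearAlgebra

# Toy problem: min x₁² + x₂²  s.t.  x₁ + x₂ - 1 = 0.
f(x) = x[1]^2 + x[2]^2
g(x) = [x[1] + x[2] - 1.0]
∇f(x) = [2x[1], 2x[2]]
∇g(x) = [1.0 1.0]                       # 1×2 constraint Jacobian
∇²L(x, λ) = [2.0 0.0; 0.0 2.0]          # Hessian of the Lagrangian

x, λ = [2.0, -3.0], [0.0]
K = [∇²L(x, λ) ∇g(x)'; ∇g(x) zeros(1, 1)]    # KKT matrix from (2), with α = 0
rhs = -vcat(∇f(x) + ∇g(x)' * λ, g(x))        # right-hand side of (2), with β = 0
d = K \ rhs                                   # search direction (Δx, Δλ)
```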

2.2 Neural network predictors

A neural network predictor is a function denoted y = NN(x). We consider neural networks defined by the repeated application of an affine transformation and a nonlinear activation function σ over L layers:

y_{l} = \sigma_{l}(W_{l} y_{l-1} + b_{l}), \quad l \in \{1,\dots,L\}, (3)

where y_0 = x and y = y_L. The weights W_l and biases b_l are parameters that are optimized to minimize error on a set of training data representing desired inputs and outputs of the neural network. To a nonlinear optimization solver using a trained neural network, W_l and b_l are constants. To fit the assumptions made by these solvers, we consider only smooth activation functions, e.g., the sigmoid and hyperbolic tangent functions.
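
As a concrete illustration of (3), the following minimal Julia sketch (with random placeholder weights, sized to match the 117-input, 37-output surrogate used later in this paper) evaluates the layered predictor; for brevity the same activation is applied at every layer, although in practice the output layer is often linear:

```julia
# Forward pass y_l = σ_l(W_l y_{l-1} + b_l), l = 1, …, L, with y_0 = x.
function nn_forward(W, b, σ, x)
    y = x
    for l in 1:length(W)
        y = σ.(W[l] * y .+ b[l])
    end
    return y
end

# Random placeholder parameters: 117 inputs, one hidden layer of 50 neurons, 37 outputs.
W = [randn(50, 117), randn(37, 50)]
b = [randn(50), randn(37)]
y = nn_forward(W, b, tanh, randn(117))    # y = NN(x)
```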

2.3 Algebraic representations of a neural network

In this section, we explain the three ways in which we encode pre-trained neural network predictors into the constraints of a nonlinear optimization model of the form (1). We refer to the three approaches as full-space, reduced-space, and gray-box.

2.3.1 Full-space

In the full-space formulation, we add an intermediate vector-valued decision variable z_l to represent the output of the affine transformation in each layer l, and we add a vector-valued decision variable y_l to represent the output of each nonlinear activation function. We then add a linear equality constraint to enforce the relationship between y_{l-1} and z_l and a nonlinear equality constraint to enforce the relationship between z_l and y_l. Thus, the neural network in (3) is encoded by the constraints:

W_{l} y_{l-1} - z_{l} = -b_{l}, \quad l \in \{1,\dots,L\}, (4)
y_{l} - \sigma_{l}(z_{l}) = 0, \quad l \in \{1,\dots,L\}.

The full-space approach prioritizes small expressions and small, sparse nonlinear constraints at the cost of introducing additional variables and constraints for each layer of the neural network. This formulation conforms to the assumptions of JuMP’s reverse-mode automatic differentiation algorithm: 1) nonlinear constraints can be written as a set of scalar-valued functions, and 2) they are sparse in the sense that each scalar constraint contains relatively few variables and has a simple functional form.
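
The following Julia sketch shows one way the full-space encoding (4) could be written with JuMP; it is illustrative only and is not MathOptAI.jl's implementation. Here W and b hold the constant trained parameters, x is a vector of existing JuMP variables, and σ is a smooth activation function; the weights in the usage line are random placeholders.

```julia
using JuMP, Ipopt

function add_full_space!(model, x, W, b, σ)
    y = x
    for l in 1:length(W)
        n = length(b[l])
        z = @variable(model, [1:n], base_name = "z_$l")        # affine outputs z_l
        y_next = @variable(model, [1:n], base_name = "y_$l")   # activation outputs y_l
        @constraint(model, W[l] * y .- z .== -b[l])            # W_l y_{l-1} - z_l = -b_l
        @constraint(model, y_next .- σ.(z) .== 0)              # y_l - σ_l(z_l) = 0
        y = y_next
    end
    return y
end

model = Model(Ipopt.Optimizer)
@variable(model, x[1:117])
# Placeholder weights; in practice these come from the trained network.
y = add_full_space!(model, x, [randn(50, 117), randn(37, 50)], [randn(50), randn(37)], tanh)
```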

2.3.2 Reduced-space

In the reduced-space formulation, we add a single vector-valued decision variable y to represent the output of the final activation function, and we add a single vector-valued nonlinear equality constraint that encodes the complete network. Thus, the neural network in (3) is encoded as the vector-valued constraint:

y - \sigma_{L}(W_{L}(\cdots\sigma_{l}(W_{l}(\cdots\sigma_{1}(W_{1}x + b_{1})\cdots) + b_{l})\cdots) + b_{L}) = 0. (5)

The benefit of the reduced-space approach is that we add only a single vector-valued decision variable y and a single vector-valued nonlinear equality constraint (each with the dimension of the final layer's output). The downside is that the nonlinear constraint is a complicated expression with a very large number of terms. This is made worse by the fact that JuMP scalarizes vector-valued expressions in nonlinear constraints. Thus, instead of efficiently representing the affine relationship W_1 x + b_1 by storing the matrix and vector, JuMP represents the expression as a sum of scalar products.
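
For comparison, a hedged sketch of the reduced-space encoding (5) is shown below; again, this is illustrative rather than MathOptAI.jl's implementation. Writing the affine step as an explicit sum of scalar products mirrors how JuMP scalarizes the expression.

```julia
using JuMP

function add_reduced_space!(model, x, W, b, σ)
    expr = x
    for l in 1:length(W)
        # Each row of W_l becomes a sum of scalar products in the expression graph.
        affine = [sum(W[l][i, j] * expr[j] for j in axes(W[l], 2)) + b[l][i]
                  for i in axes(W[l], 1)]
        expr = σ.(affine)                    # σ_l applied symbolically, layer by layer
    end
    y = @variable(model, [1:length(expr)], base_name = "y")
    @constraint(model, y .== expr)           # single vector-valued constraint (5)
    return y
end
```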

2.3.3 Gray-box

In the gray-box formulation, we do not attempt to encode the neural network algebraically. Instead, we exploit the fact that nonlinear local solvers such as IPOPT require only callback oracles to evaluate the constraint function g(x) and the Jacobian ∇g(x) (and, optionally, the Hessian ∇²ℒ). Using JuMP’s support for user-defined nonlinear operators, we implement the evaluation of the neural network as a nonlinear operator NN(x), and we use PyTorch’s built-in automatic differentiation support to compute the Jacobian ∇NN(x) and Hessians ∇²NN(x). Thus, the neural network in (3) is encoded as the vector-valued constraint:

y - {\rm NN}(x) = 0. (6)

Like the reduced-space formulation, the gray-box approach adds only a small number of variables and constraints to the optimization problem. Unlike the reduced-space formulation, however, it uses the automatic differentiation system of the neural network modeling software, which is better suited to the dense, nested, vector-valued expressions that define the neural network. Because an explicit algebraic representation of the constraints is not exposed to the solver, the gray-box formulation cannot support the relaxations used by nonlinear global optimization solvers.
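
A hedged sketch of how the gray-box constraint (6) can be wired up through JuMP's user-defined nonlinear operators is shown below. The callbacks nn_output(i, v) and nn_jacobian_row(i, v) are hypothetical placeholders standing in for PyTorch's forward pass and automatic differentiation; this is not MathOptAI.jl's implementation, which handles these details internally.

```julia
using JuMP

function add_gray_box!(model, x, nn_output, nn_jacobian_row, n_outputs)
    n_inputs = length(x)
    y = @variable(model, [1:n_outputs], base_name = "y")
    for i in 1:n_outputs
        # Scalar-valued operator for the i-th network output and its gradient.
        f = (args...) -> nn_output(i, collect(args))
        ∇f = (g, args...) -> (copyto!(g, nn_jacobian_row(i, collect(args))); nothing)
        # A Hessian callback may be supplied as an additional positional argument;
        # without one, the solver falls back to a quasi-Newton approximation.
        op = add_nonlinear_operator(model, n_inputs, f, ∇f; name = Symbol("op_nn_$i"))
        expr = op(x...)                          # expression wrapping the operator call
        @constraint(model, y[i] == expr)         # row i of y - NN(x) = 0
    end
    return y
end
```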

2.4 MathOptAI.jl

Encoding a trained neural network into an optimization model using the forms described in Section 2.3 is tedious and error-prone. To simplify experimentation, we developed a new Julia package, MathOptAI.jl, which is a JuMP extension for embedding a range of machine learning models into a JuMP model. In addition to supporting neural networks trained using PyTorch, which are the focus of this paper, MathOptAI.jl also supports Julia-based deep-learning libraries, as well as other machine learning models such as decision trees and Gaussian Processes. MathOptAI.jl is provided as an open-source package at https://github.com/lanl-ansi/MathOptAI.jl under a BSD-3 license.
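
As a brief illustration of the intended workflow, the sketch below embeds a small Flux.jl network into a JuMP model with MathOptAI.add_predictor. It is based on the package documentation; the exact return values and the keyword options that select the full-space, reduced-space, or gray-box formulation may differ between versions, and the paper's experiments embed PyTorch models rather than Flux models.

```julia
using JuMP, Flux, MathOptAI

# Small Flux.jl predictor standing in for the trained stability surrogate.
predictor = Flux.Chain(Flux.Dense(117 => 50, tanh), Flux.Dense(50 => 37))

model = Model()
@variable(model, x[1:117])
# add_predictor encodes the trained network and returns JuMP variables for its
# outputs, along with an object describing the added variables and constraints.
y, formulation = MathOptAI.add_predictor(model, predictor, x)
@constraint(model, y .>= 59.4)    # e.g., a minimum-frequency requirement on the outputs
```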

3 Test problem

We compare the three different neural network formulations on a transient stability-constrained, security-constrained ACOPF problem [11].

Stability-constrained optimal power flow  Security-constrained optimal power flow (SCOPF) is a well-established problem for dispatching generators in an electric power network in which feasibility of the network (i.e., the ability to meet demand) is enforced for a set of contingencies [1]. Each contingency k represents the loss of a set of generators and/or power lines. We consider a variant of this problem where, in addition to enforcing steady-state feasibility, we enforce feasibility of the transient response resulting from the contingency. In particular, we enforce that the transient frequency at each bus is at least η = 59.4 Hz for the 30-second interval following each contingency. This problem is given by Equation (7):

\min_{S^{g},V} c(\mathbb{R}(S^{g})) \quad \text{s.t.} \quad \begin{cases} F_{k}(S^{g}, V, \mathbf{S^{d}}) \leq 0, & k \in \{0,\dots,K\} \\ G_{k}(S^{g}, \mathbf{S^{d}}) \geq \eta\mathbb{1}, & k \in \{1,\dots,K\}. \end{cases} (7)

Here, S^g is a vector of complex AC power generations for each generator in the network, V is a vector of complex bus voltages, c is a quadratic cost function, and S^d is a constant vector of complex power demands. The constraints F_k ≤ 0 enforce feasibility of the power network for contingency k, where k = 0 refers to the base network, and G_k maps generations and demands to the minimum frequency at each bus over the interval considered.

In this work, we consider an instance of Problem (7) defined on a 37-bus synthetic test network [4, 3]. In this case, G_k has 117 inputs and 37 outputs. We consider a single contingency that outages generator 5 on bus 23. We choose a small network model with a single contingency because our goal is to test the different neural network formulations, not the SCOPF formulation itself.

Stability surrogate model  Instead of considering the differential equations describing the transient behavior of the power network directly in the optimization problem, we approximate G_k with a neural network trained on data from 110 high-fidelity simulations in PowerWorld [19], with generations and loads uniformly sampled from within ±20% of each nominal value. We use sequential neural networks with tanh activation functions, between two and 20 layers, and between 50 and 4,000 neurons per layer. These networks have between 7,000 and 592 million trained parameters. The networks are trained to minimize mean squared error using the Adam optimizer [14] until the training loss is below 0.01 for 1,000 consecutive epochs. We use a simple training procedure and a small amount of data because our goal is to test optimization formulations with embedded neural networks, rather than the neural networks themselves.
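
For reference, the sketch below reproduces the shape of this training setup as a Flux.jl analogue (the actual surrogates in this paper were trained in PyTorch). The data matrices are random placeholders, a two-hidden-layer network is shown, and a fixed epoch count replaces the stopping rule described above.

```julia
using Flux

# Placeholder data: 117 sampled generation/load inputs and 37 minimum bus
# frequencies per simulation, for 110 simulations (columns are samples).
X, Y = rand(Float32, 117, 110), rand(Float32, 37, 110)

surrogate = Flux.Chain(
    Flux.Dense(117 => 50, tanh),
    Flux.Dense(50 => 50, tanh),
    Flux.Dense(50 => 37),
)

opt_state = Flux.setup(Flux.Adam(), surrogate)
for epoch in 1:5_000                       # fixed epoch count for brevity
    Flux.train!(surrogate, [(X, Y)], opt_state) do m, x, y
        Flux.Losses.mse(m(x), y)           # mean squared error loss
    end
end
```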

4 Results

4.1 Computational setting

We model the SCOPF problem using PowerModels [9], PowerModelsSecurityConstrained [8], and JuMP [15]. Neural networks are modeled using PyTorch [18] and embedded into the optimization problem using MathOptAI.jl. Optimization problems are solved using the IPOPT nonlinear optimization solver [22] with MA27 [10] as the linear solver. The full-space and reduced-space models support evaluation on a CPU but not on a GPU. Because gray-box models use PyTorch, they can be evaluated on a CPU or GPU. We run our experiments on the Venado supercomputer. CPU-only experiments use compute nodes with two 3.4 GHz NVIDIA Grace CPUs and 240 GB of RAM, while CPU+GPU experiments use nodes with a Grace CPU with 120 GB of RAM and an NVIDIA H100 GPU.

4.2 Structural results

Table 1 shows the numbers of variables, constraints, and nonzero entries of the derivative matrices for the optimization problem with different neural networks and formulations. We note that the reduced-space and gray-box formulations have approximately the same numbers of constraints and variables as the original problem, but more nonzero entries in the Jacobian and Hessian matrices due to the dense, nonlinear stability constraints. With these formulations, the structure of the optimization problem does not change as the neural network surrogate adds more interior layers. By contrast, the full-space formulation grows in numbers of variables, constraints, and nonzeros as the neural network gets larger. These problem structures suggest that the full-space formulation will lead to expensive KKT matrix factorizations, while this will not be an issue for reduced-space and gray-box formulations.

Table 1: Numbers of variables, constraints, and nonzeros for different networks and formulations
| Parameters | Formulation | N. Variables | N. Constraints | Jacobian NNZ | Hessian NNZ |
| — | No surrogate | 1155 | 1460 | 5822 | 1398 |
| 7k | Full-space | 1292 | 1634 | 11796 | 1448 |
| 25k | Full-space | 1592 | 1934 | 27996 | 1598 |
| 578k | Full-space | 4192 | 4534 | 567896 | 2898 |
| 7M | Full-space | 17192 | 17534 | 7144896 | 9398 |
| All networks | Reduced-space | 1155 | 1497 | 8708 | 4479 |
| All networks | Gray-box | 1192 | 1534 | 8782 | 4479 |

4.3 Runtime results

Table 2: Solve times with different neural networks and formulations
| Parameters | Formulation | Hessian | Platform | Build time | Solve time | Iterations | Time/iter. |
| — | No surrogate | Exact | CPU | 45 ms | 0.4 s | 41 | 9 ms |
| 7k | Full-space | Exact | CPU | 0.1 s | 2 s | 468 | 4 ms |
| 25k | Full-space | Exact | CPU | 0.3 s | 5 s | 642 | 8 ms |
| 578k | Full-space | Exact | CPU | 0.3 s | 699 s | 755 | 0.9 s |
| 7M | Full-space | Exact | CPU | — (a) | — | — | — |
| 7k | Reduced-space | Exact | CPU | 0.1 s | 7 s | 49 | 0.1 s |
| 25k | Reduced-space | Exact | CPU | 1 s | 1125 s | 41 | 27 s |
| 578k | Reduced-space | Exact | CPU | — (b) | — | — | — |
| 7M | Reduced-space | Exact | CPU | — (b) | — | — | — |
| 7k | Gray-box | Exact | CPU | 0.1 s | 8 s | 41 | 0.2 s |
| 25k | Gray-box | Exact | CPU | 0.1 s | 9 s | 42 | 0.2 s |
| 578k | Gray-box | Exact | CPU | 0.1 s | 11 s | 42 | 0.3 s |
| 7M | Gray-box | Exact | CPU | 0.1 s | 22 s | 42 | 0.5 s |
| 68M | Gray-box | Exact | CPU | 0.1 s | 140 s | 42 | 3 s |
| 592M | Gray-box | Exact | CPU | 0.6 s | 748 s | 42 | 18 s |
| 7k | Gray-box | Exact | CPU+GPU | 0.1 s | 7 s | 41 | 0.2 s |
| 25k | Gray-box | Exact | CPU+GPU | 0.1 s | 7 s | 42 | 0.2 s |
| 578k | Gray-box | Exact | CPU+GPU | 0.1 s | 8 s | 42 | 0.2 s |
| 7M | Gray-box | Exact | CPU+GPU | 0.1 s | 8 s | 42 | 0.2 s |
| 68M | Gray-box | Exact | CPU+GPU | 0.2 s | 9 s | 42 | 0.2 s |
| 592M | Gray-box | Exact | CPU+GPU | 0.7 s | 15 s | 42 | 0.3 s |
| 7k | Gray-box | Approx. | CPU | 0.1 s | 0.3 s | 61 | 6 ms |
| 25k | Gray-box | Approx. | CPU | 48 ms | 0.3 s | 57 | 6 ms |
| 578k | Gray-box | Approx. | CPU | 0.1 s | 1 s | 66 | 15 ms |
| 7M | Gray-box | Approx. | CPU | 0.1 s | 6 s | 57 | 0.1 s |
| 68M | Gray-box | Approx. | CPU | 0.1 s | 7 s | 56 | 0.1 s |
| 592M | Gray-box | Approx. | CPU | 0.9 s | 17 s | 56 | 0.3 s |
| 7k | Gray-box | Approx. | CPU+GPU | 50 ms | 0.5 s | 63 | 7 ms |
| 25k | Gray-box | Approx. | CPU+GPU | 48 ms | 0.4 s | 58 | 7 ms |
| 578k | Gray-box | Approx. | CPU+GPU | 0.1 s | 0.5 s | 62 | 8 ms |
| 7M | Gray-box | Approx. | CPU+GPU | 0.1 s | 0.5 s | 57 | 9 ms |
| 68M | Gray-box | Approx. | CPU+GPU | 0.2 s | 1 s | 56 | 21 ms |
| 592M | Gray-box | Approx. | CPU+GPU | 0.7 s | 1 s | 56 | 23 ms |
(a) Fails with a segfault, possibly due to the memory requirements of MA27.
(b) Exceeds the resource manager’s memory limits or the ten-hour time limit.

Runtimes for the different formulations with neural network surrogates of increasing size are given in Table 2. For gray-box formulations, we compare optimization solves using exact and approximate Hessian evaluations and different hardware platforms. The results immediately show that full-space and reduced-space formulations are not scalable to neural networks with more than a few million trained parameters. The full-space formulation fails with a segmentation fault—likely due to memory issues in MA27—while the reduced-space formulation exceeds time and memory limits building the constraint expressions in JuMP. A breakdown of solve times, given in Table 3, confirms the bottlenecks in these formulations. The full-space formulation spends almost all of its solve time in the IPOPT algorithm, which we assume is dominated by KKT matrix factorization, while the reduced-space formulation spends most of its solve time evaluating the Hessian.

By contrast, the gray-box formulation is capable of solving the optimization problem with the largest neural network surrogates tested. While a CPU-only solve with exact Hessian matrices takes an unacceptably long 748 s, a GPU-accelerated solve with approximate Hessian matrices completes in only one second. This is slower than the original SCOPF problem (with no stability constraint) by a factor of 2.5, which may be acceptable for some applications. In all cases, the solve time with the gray-box formulation is dominated by function and Hessian evaluations, which explains the large speed-ups obtained with the GPU (50× and 17× for the 592M-parameter network with exact and approximate Hessians).

Approximating the Hessian matrix also speeds up the solve significantly. Hessian approximation is not a common approach when exact Hessians are available because it can lead to slow and unreliable convergence. In the 592M-parameter case, approximating the Hessian increases the iteration count by 14, but it makes up for this by decreasing the time per iteration by a factor of 60. These results suggest that this is an appropriate trade-off for optimization problems constrained by large neural networks.

Table 3: Solve time breakdowns for selected neural networks and formulations
| Formulation | Parameters | Hessian | Platform | Solve time | Function (%) | Jacobian (%) | Hessian (%) | Solver (%) | Other (%) |
| Full-space | 578k | Exact | CPU | 699 s | 0.1 | <0.1 | 0.2 | 99+ | <0.1 |
| Reduced-space | 25k | Exact | CPU | 1125 s | 2 | 0.5 | 97 | 0.4 | 0.3 |
| Gray-box | 592M | Exact | CPU | 748 s | 97 | 2 | 1 | <0.1 | <0.1 |
| Gray-box | 592M | Exact | CPU+GPU | 15 s | 54 | 1 | 42 | 2 | 0.6 |
| Gray-box | 592M | Approx. | CPU | 17 s | 96 | 2 | — | 2 | <0.1 |
| Gray-box | 592M | Approx. | CPU+GPU | 1 s | 76 | 6 | — | 17 | 0.1 |

5 Conclusion

This work demonstrates that nonlinear local optimization problems may incorporate neural networks with hundreds of millions of trained parameters, with modest overhead, using a gray-box formulation that exploits efficient automatic differentiation, Hessian approximation, and GPU acceleration. A disadvantage of the gray-box formulation is that it is not suitable for global optimization, as the non-convex neural network constraints cannot be relaxed. Additionally, the relative performance of the formulations may change in different applications. This motivates future research and development to improve the performance of the full-space and reduced-space formulations. The full-space formulation may be improved by decomposing the KKT matrix to exploit the structure of the neural network’s Jacobian, while the reduced-space formulation may be improved by exploiting vector-valued functions and common subexpressions in JuMP.

References

  • [1] Aravena, I., Molzahn, D.K., Zhang, S., Petra, C.G., Curtis, F.E., Tu, S., Wächter, A., Wei, E., Wong, E., Gholami, A., Sun, K., Sun, X.A., Elbert, S.T., Holzer, J.T., Veeramany, A.: Recent developments in security-constrained AC optimal power flow: Overview of challenge 1 in the ARPA-E grid optimization competition. Operations Research 71(6), 1997–2014 (2023). https://doi.org/10.1287/opre.2022.0315
  • [2] Biegler, L.T.: Nonlinear Programming: Concepts, Algorithms, and Applications to Chemical Processes. Society for Industrial and Applied Mathematics, USA (2010)
  • [3] Birchfield, A.: Hawaii synthetic grid – 37 buses (2023), https://electricgrids.engr.tamu.edu/hawaii40/, accessed 2024-12-10
  • [4] Birchfield, A.B., Xu, T., Gegner, K.M., Shetye, K.S., Overbye, T.J.: Grid structural characteristics as validation criteria for synthetic networks. IEEE Transactions on Power Systems 32(4), 3258–3265 (2017). https://doi.org/10.1109/TPWRS.2016.2616385
  • [5] Bugosen, S.I., Laird, C.D., Parker, R.B.: Process flowsheet optimization with surrogate and implicit formulations of a Gibbs reactor. Systems and Control Transactions 3, 113–120 (2024). https://doi.org/10.69997/sct.148498
  • [6] Casas, C.A.E.: Robust NMPC of Large-Scale Systems and Surrogate Embedding Strategies for NMPC. Master’s thesis, University of Waterloo, Waterloo, Ontario, Canada (2024)
  • [7] Ceccon, F., Jalving, J., Haddad, J., Thebelt, A., Tsay, C., Laird, C.D., Misener, R.: OMLT: Optimization & machine learning toolkit. Journal of Machine Learning Research 23(349),  1–8 (2022), http://jmlr.org/papers/v23/22-0277.html
  • [8] Coffrin, C.: PowerModelsSecurityConstrained.jl (2022), https://github.com/lanl-ansi/PowerModelsSecurityConstrained.jl, accessed 2024-12-10
  • [9] Coffrin, C., Bent, R., Sundar, K., Ng, Y., Lubin, M.: PowerModels.jl: An open-source framework for exploring power flow formulations. In: 2018 Power Systems Computation Conference (PSCC). pp. 1–8 (June 2018). https://doi.org/10.23919/PSCC.2018.8442948
  • [10] Duff, I.S., Reid, J.K.: The multifrontal solution of indefinite sparse symmetric linear equations. ACM Transactions on Mathematical Software (TOMS) 9(3), 302–325 (1983)
  • [11] Gan, D., Thomas, R., Zimmerman, R.: Stability-constrained optimal power flow. IEEE Transactions on Power Systems 15(2), 535–540 (2000). https://doi.org/10.1109/59.867137
  • [12] Gurobi Optimization: Gurobi Machine Learning Manual (December 2024), https://gurobi-machinelearning.readthedocs.io/en/stable/
  • [13] Kilwein, Z., Jalving, J., Eydenberg, M., Blakely, L., Skolfield, K., Laird, C., Boukouvala, F.: Optimization with neural network feasibility surrogates: Formulations and application to security-constrained optimal power flow. Energies 16(16) (2023). https://doi.org/10.3390/en16165913, https://www.mdpi.com/1996-1073/16/16/5913
  • [14] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization (2017), https://arxiv.org/abs/1412.6980
  • [15] Lubin, M., Dowson, O., Garcia, J.D., Huchette, J., Legat, B., Vielma, J.P.: JuMP 1.0: Recent improvements to a modeling language for mathematical optimization. Mathematical Programming Computation 15(3), 581–589 (2023). https://doi.org/10.1007/s12532-023-00239-3, https://doi.org/10.1007/s12532-023-00239-3
  • [16] López-Flores, F.J., Ramírez-Márquez, C., Ponce-Ortega, J.M.: Process systems engineering tools for optimization of trained machine learning models: Comparative and perspective. Industrial & Engineering Chemistry Research 63(32), 13966–13979 (2024). https://doi.org/10.1021/acs.iecr.4c00632
  • [17] Nocedal, J.: Updating quasi-Newton matrices with limited storage. Mathematics of Computation 35(151), 773–782 (1980). https://doi.org/10.2307/2006193, http://www.jstor.org/stable/2006193
  • [18] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019)
  • [19] PowerWorld Corporation, Champaign, IL, USA: PowerWorld Simulator Manual, https://www.powerworld.com/WebHelp/, accessed 2024-12-10
  • [20] Schweidtmann, A.M., Mitsos, A.: Deterministic global optimization with artificial neural networks embedded. Journal of Optimization Theory and Applications 180(3), 925–948 (2019)
  • [21] Thebelt, A., Wiebe, J., Kronqvist, J., Tsay, C., Misener, R.: Maximizing information from chemical engineering data sets: Applications to machine learning. Chemical Engineering Science 252, 117469 (2022). https://doi.org/10.1016/j.ces.2022.117469
  • [22] Wächter, A., Biegler, L.T.: On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming 106(1), 25–57 (2006)