
Comparing Inverse Optimization and Machine Learning Methods for Imputing a Convex Objective Function

Elaheh H. Iraj and Daria Terekhov. Email: [email protected], [email protected] (corresponding author). Department of Mechanical, Industrial and Aerospace Engineering, Gina Cody School of Engineering and Computer Science, Concordia University, Montréal, QC, Canada
Abstract

Inverse optimization (IO) aims to determine optimization model parameters from observed decisions. However, IO is not part of a data scientist’s toolkit in practice, especially as many general-purpose machine learning packages are widely available as an alternative. When encountering IO, practitioners face the question of when, or even whether, investing in developing IO methods is worthwhile. Our paper provides a starting point toward answering these questions, focusing on the problem of imputing the objective function of a parametric convex optimization problem. We compare the predictive performance of three standard supervised machine learning (ML) algorithms (random forest, support vector regression and Gaussian process regression) to the performance of the IO model of Keshavarz \BOthers. (\APACyear2011). While the IO literature focuses on the development of methods tailored to particular problem classes, our goal is to evaluate general ‘out-of-the-box’ approaches. Our experiments demonstrate that determining whether to use an ML or IO approach requires considering (i) the training set size, (ii) the dependence of the optimization problem on external parameters, (iii) the level of confidence with regard to the correctness of the optimization prior, and (iv) the number of critical regions in the solution space.

keywords:
inverse optimization; machine learning; data efficiency; parametric optimization; multi-parametric programming

1 Introduction

Inverse optimization (IO) imputes missing optimization model parameters from data that represents minimally sub-optimal solutions of that unknown optimization problem (e.g., past decisions of an optimizing agent). In the academic literature, IO has been shown successful in a variety of problems, including medical decision making (Chan \BOthers., \APACyear2014), electricity demand forecasting (Saez-Gallego \BBA Morales, \APACyear2017), the household activity pattern problem (Chow \BBA Recker, \APACyear2012) and economic lot-sizing (Egri \BOthers., \APACyear2014). However, IO is rarely used by practitioners.

Previous work has established an analogy between IO and regression (Chan \BOthers., \APACyear2019), and provided a statistical inference perspective on IO (Aswani \BOthers., \APACyear2018). More generally, one can notice that imputing parameter values from data can be viewed as a machine learning problem (Tan \BOthers., \APACyear2019), a perspective that we adopt and explore in this paper. Viewing an inverse (parametric) optimization problem as a learning problem raises questions that have previously not been explored in the literature and that are of practical interest. First, how well would classic machine learning methods perform on such problems? Second, what characteristics would make a problem challenging for classic machine learning methods and would require the more specialized methods of IO? Our interactions with data science teams of two large companies have led us to believe that answering these questions is essential if a data scientist wants to consider adding IO to their toolkit.

While most of the IO literature develops specialized approaches, our goal is to compare fairly general methods, as we believe these are more likely to be used by practitioners. After establishing our perspective on inverse parametric optimization as a learning problem, we experimentally compare the performance of a classic IO method with three out-of-the-box machine learning (ML) methods, namely Gaussian process regression, random forest and support vector regression. We use the IO method of Keshavarz \BOthers. (\APACyear2011) as it is well suited and easy to apply to inverse parametric convex optimization problems. We show that the chosen ML methods can indeed perform well on a problem that has been well-studied in the IO literature. To identify the characteristics that would make an IO problem challenging for ML (highlighting the need for sophisticated IO models), we conduct experiments on random parametric optimization problems (POPs) generated using the parametric optimization toolbox of Oberdieck \BOthers. (\APACyear2016). To the best of our knowledge, no previous papers have compared IO and ML methods on this class of problems. Our experiments with randomly-generated POPs demonstrate that the choice of an ML or IO approach should depend on (i) the size of the training set, (ii) the nature of the dependence of the optimization problem on external parameters, (iii) the level of confidence with regard to the correctness of the optimization prior, and (iv) the number of critical regions in the solution space of the POP.

2 Background and Related Work

To define an IO problem, we first need to postulate a forward optimization problem, i.e., the problem whose solutions, perhaps corrupted by noise, we observe. The initial focus of the IO literature was to impute coefficients of forward optimization problems that do not depend on an external parameter (non-parametric). Burton \BBA Toint (\APACyear1992, \APACyear1994) study the inverse shortest path problem, in which the goal is to find a minimal perturbation of the arc costs that makes an observed set of paths optimal. Given an observed value for the decision variables $\mathbf{x}$, Ahuja \BBA Orlin (\APACyear2001) find a vector $\mathbf{c}$ of minimal distance to a known vector $\hat{\mathbf{c}}$ that would make $\mathbf{x}$ optimal. Chan \BOthers. (\APACyear2014) and Chan \BOthers. (\APACyear2019) study the inverse linear programming problem when the observed data is noisy and no prior $\hat{\mathbf{c}}$ is given. Tavaslıoğlu \BOthers. (\APACyear2018) characterize the inverse feasible region: the set of objectives that would make a given set of feasible solutions to a linear program optimal. Shahmoradi \BBA Lee (\APACyear2020) introduce a way to measure how sensitive the set of optimal cost vectors is to changes in a given data set.

A parametric optimization problem (also known as a multi-parametric programming problem) is a family of optimization problems parametrized by an independent parameter; the forward parametric problem involves finding a function that maps values of the independent parameter to optimal solutions (Pistikopoulos \BOthers., \APACyear2011). As an example, consider the following parametric linear program with independent parameters $(u_{1},u_{2})$: $\min\;(c_{1}+c_{2}u_{1})x_{1}+(c_{3}+c_{4}u_{2})x_{2}$ s.t. $a_{1}x_{1}+a_{2}x_{2}\geq b_{1}$, where instantiating $u_{1}$ and $u_{2}$ leads to a standard (non-parametric) optimization problem. There is substantial interest in the inverse parametric optimization problem (Keshavarz \BOthers., \APACyear2011; Chow \BBA Recker, \APACyear2012; Aswani \BOthers., \APACyear2018; Esfahani \BOthers., \APACyear2018; Saez-Gallego \BBA Morales, \APACyear2017; Kovács, \APACyear2019; Tan \BOthers., \APACyear2019, \APACyear2020). In such a problem, pairs $(\mathbf{u}_{k},\mathbf{x}_{k})$ are given, where $\mathbf{u}_{k}$ is the independent parameter for observation $k$ and $\mathbf{x}_{k}$ is the observed value of the decision variables $\mathbf{x}$; the goal is to impute objective function and/or constraint coefficients that would make the observed points optimal or near-optimal solutions of the given parametric forward optimization problem.
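Instantiating the toy parametric linear program above at a fixed $(u_{1},u_{2})$ yields an ordinary LP that any solver can handle. A minimal sketch using scipy.optimize.linprog; the coefficient values, variable bounds and the helper name are illustrative assumptions, not taken from this paper:

```python
from scipy.optimize import linprog

def solve_instantiated_lp(u1, u2, c=(1.0, 0.5, 2.0, -0.5), a=(1.0, 1.0), b1=1.0):
    """Instantiate min (c1+c2*u1)*x1 + (c3+c4*u2)*x2  s.t.  a1*x1 + a2*x2 >= b1."""
    obj = [c[0] + c[1] * u1, c[2] + c[3] * u2]
    # linprog expects A_ub @ x <= b_ub, so negate the >= constraint.
    A_ub = [[-a[0], -a[1]]]
    b_ub = [-b1]
    # Illustrative box bounds keep the LP bounded for any sign of the costs.
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 10), (0, 10)])
    return res.x

# With these illustrative costs both objective coefficients are positive,
# so the optimum sits on the constraint boundary a1*x1 + a2*x2 = b1.
x = solve_instantiated_lp(u1=0.2, u2=0.3)
```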

Classic IO methods (Burton \BBA Toint, \APACyear1992; Ahuja \BBA Orlin, \APACyear2001; Keshavarz \BOthers., \APACyear2011) rely on some form of optimality conditions, such as the Karush-Kuhn-Tucker (KKT) conditions. More recently, methods based on a combination of ML and IO have emerged (Tan \BOthers., \APACyear2019, \APACyear2020; Babier \BOthers., \APACyear2021). Aswani \BOthers. (\APACyear2019) and Fernández-Blanco \BOthers. (\APACyear2019) include comparisons of an IO approach to several ML algorithms in the context of particular applications. Related work in the ML literature includes, most notably, structured prediction (Taskar \BOthers., \APACyear2005). Despite this related ML-based work, a comparison of classic methods on general parametric IO problems has not previously been done and the questions posed in the introduction have not been answered.

3 Problem Definition

We study IO for parametric optimization problems (POPs), which, unlike non-parametric optimization problems, are ‘compatible’ with an ML perspective. POPs are mathematical programs in which the objective function and constraints are functions of one or multiple external parameters (Pistikopoulos \BOthers., \APACyear2011; Oberdieck \BOthers., \APACyear2016). The forward parametric optimization problem (FOP) of interest to us is:

$$\mathbf{FOP}(\mathbf{u},\mathbf{c}):\quad\underset{\mathbf{x}}{\text{minimize}}\quad f(\mathbf{u},\mathbf{x},\mathbf{c})\quad\text{s.t.}\quad g(\mathbf{u},\mathbf{x})\leq\mathbf{0},\qquad(1)$$

where $\mathbf{x}\in\mathbb{R}^{n}$ are the decision variables, $\mathbf{u}\in\mathbb{R}^{v}$ are the independent parameters (‘features’ in ML parlance), and $\mathbf{c}\in\mathbb{R}^{n}$ are the coefficients of the objective function $f$. We assume $f$ and $g$ are differentiable and convex in $\mathbf{x}$ for each value of $\mathbf{u}$. We let $\pazocal{J}$ be the index set of variables with $|\pazocal{J}|=n$, and $\pazocal{I}$ be the index set of constraints with $|\pazocal{I}|=m$; we let $\mathbf{0}$ denote the column vector of zeros of the required dimension.

In an inverse parametric optimization problem, observations of pairs $\pazocal{D}=\{(\hat{\mathbf{x}}_{1},\hat{\mathbf{u}}_{1}),\dots,(\hat{\mathbf{x}}_{K},\hat{\mathbf{u}}_{K})\}$ are given; the goal is to find a $\mathbf{c}$ that would make $\hat{\mathbf{x}}_{k}$ minimally sub-optimal for the corresponding $\mathbf{FOP}(\hat{\mathbf{u}}_{k},\mathbf{c})$, for each $k\in\pazocal{K}=\{1,\dots,K\}$:

$$\underset{\mathbf{c}}{\text{minimize}}\quad\Big\{\sum_{k\in\pazocal{K}}\pazocal{L}(\hat{\mathbf{x}}_{k},\mathbf{x}_{k}^{pred})\;\Big|\;\mathbf{x}_{k}^{pred}\in\underset{\mathbf{x}\in\pazocal{X}_{\hat{\mathbf{u}}_{k}}}{\operatorname{arg\,min}}\;f(\hat{\mathbf{u}}_{k},\mathbf{x},\mathbf{c})\Big\}\qquad(2)$$

where $\pazocal{L}$ is any given loss function, $\mathbf{x}^{pred}_{k}$ is an optimal solution to the fitted forward optimization model instantiated at $\mathbf{u}=\hat{\mathbf{u}}_{k}$, and $\pazocal{X}_{\hat{\mathbf{u}}_{k}}=\{\mathbf{x}\;|\;g(\hat{\mathbf{u}}_{k},\mathbf{x})\leq\mathbf{0}\}$ is the feasible region corresponding to $\hat{\mathbf{u}}_{k}$. There are various ways to reformulate (2); the particular model we use here is the one proposed by Keshavarz \BOthers. (\APACyear2011), which formulates the optimality criterion via the KKT optimality conditions (see appendix).
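To convey the flavor of a KKT-based formulation in our notation, one can penalize, for each observation, the residuals of stationarity and complementary slackness (feasibility of $\hat{\mathbf{x}}_{k}$ holds by construction). The squared-residual choice below is an illustrative sketch, not the exact model of Keshavarz \BOthers. (\APACyear2011), which appears in the appendix:

```latex
\underset{\mathbf{c},\;\bm{\lambda}_{k}\geq\mathbf{0}}{\text{minimize}}\quad
\sum_{k\in\pazocal{K}}\Bigg(
\Big\|\nabla_{\mathbf{x}}f(\hat{\mathbf{u}}_{k},\hat{\mathbf{x}}_{k},\mathbf{c})
+\sum_{i\in\pazocal{I}}\lambda_{k,i}\,\nabla_{\mathbf{x}}g_{i}(\hat{\mathbf{u}}_{k},\hat{\mathbf{x}}_{k})\Big\|_{2}^{2}
+\sum_{i\in\pazocal{I}}\big(\lambda_{k,i}\,g_{i}(\hat{\mathbf{u}}_{k},\hat{\mathbf{x}}_{k})\big)^{2}
\Bigg)
```

When the objective reaches zero, every observation is an exact KKT point of the fitted forward problem; otherwise the imputed $\mathbf{c}$ makes the observations minimally sub-optimal in the sense of the chosen residuals.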

The goal of IO can be explanatory, i.e., we aim to characterize the process that generated the data, or predictive, i.e., we want to fit a model that predicts the actions of an optimizing agent/process. That is, having imputed $\mathbf{c}$ and given $\mathbf{u}^{new}$, we can use $\mathbf{FOP}(\mathbf{u}^{new},\mathbf{c})$ to predict $\mathbf{x}^{new}$. Fitting a model to data can also improve prescriptive performance, i.e., when, instead of predicting what an optimizing agent/process would do, we want to prescribe an optimal decision under new circumstances $\mathbf{u}^{new}$ (this latter perspective being more consistent with classical optimization).

4 Inverse Optimization Problem as a Learning Problem

Learning is the capability to acquire knowledge by extracting patterns from raw data (Goodfellow \BOthers., \APACyear2016); in this sense, IO fits within the broader conceptual framework of learning: we acquire knowledge about an optimization process that generated the data we observe. Applied to the IO problem described in Section 3, supervised ML would aim to find a vector-valued function $\mathbf{h}(\mathbf{u},\bm{\beta})$ from $\mathbf{u}$ (input variable, or feature) to $\mathbf{x}$ (target variable) such that the loss function $\pazocal{L}$ is minimized (Russell \BBA Norvig, \APACyear2016). Once $\mathbf{h}(\mathbf{u},\bm{\beta})$ is imputed, it can be used for prediction given new feature observations $\mathbf{u}^{new}$. Figure 1 illustrates this observation: in the left panel, we show that IO takes pairs $(\hat{\mathbf{u}},\hat{\mathbf{x}})$ (i.e., data) as input, imputes the value of $\mathbf{c}$ (i.e., knowledge) and then, given $\mathbf{u}^{new}$ and an imputed $\mathbf{c}$, makes a prediction $\mathbf{x}^{new}$; in the right panel, we show that the same happens in a typical supervised ML framework, the difference being the actual models and methods used to impute the unknown parameters and, consequently, the learned model used for prediction.

Figure 1: IO and ML perspectives applied to an IO problem.

The fundamental assumption in IO is that the observations are near-optimal or optimal solutions to an optimization problem, whereas general ML methods do not require such an assumption. Based on the description above, an IO problem can be seen as part of the same conceptual framework as machine learning – the main difference being that in IO data is assumed to come from an optimization process. Viewed another way, we can say that IO assumes a strong prior that the input data is coming from an optimization process, and IO methods leverage this prior knowledge.

Figure 2: A data set and the fitted optimization model (points A to D are the training set, point E is the test set).

Consider Figure 2 (a): points A to D are given and our goal is to predict the next observation. Point $E$ is an unseen test point for which we want to evaluate the quality of prediction. Given no additional information, a natural idea would be to fit a curve using simple linear or polynomial regression. The training error for these models may be low but the test error would be high, which would normally signal over-fitting to the training data. However, here the problem is not over-fitting; rather, the issue is that the data does not come from a linear or polynomial relationship between $x_{1}$ and $x_{2}$, nor from a linear or polynomial relationship between a feature $u$ and each of the $x_{i}$. In reality, $A$, $B$, $C$ and $D$ are optimal solutions to the parametric optimization problem $P(u):\max\;(-6+5u)x_{1}-(3+10u)x_{2}$ subject to the convex hull of $(A,B,C,D,E)$, shown in Figure 2 (d), with $P(-0.96\leq u\leq 0.16)=A$, $P(-0.16\leq u\leq 1.25)=B$, $P(1.25\leq u\leq 17.7)=C$, $P(17.7\leq u\leq 20)=D$ and $P(-20\leq u\leq -0.96)=E$. This example was generated using the Wolfram Parametric Linear Programming app (Bunduchi \BBA Mandric, \APACyear2011).

Just like we aim to fit a linear regression model when we conjecture the relationship is linear, or a polynomial one when we conjecture it is polynomial, we can fit an optimization model when we think the data is coming from an optimization problem. Prior knowledge of the process that generated the data can have a substantial impact on the ability to fit a good model; the above example illustrates specifically the effect of incorporating an optimization model prior. Next, we compare the predictive performance of IO methods, which assume an optimization prior, and ML methods, which do not, when data is generated from parametric convex optimization problems.

5 Experiments

5.1 Preliminaries

Our goal is to evaluate ‘out-of-the-box’ approaches that can be easily implemented in practice: a KKT-based IO model (Keshavarz \BOthers., \APACyear2011), random forest (Breiman, \APACyear2001), support vector regression (Drucker \BOthers., \APACyear1997) and Gaussian process regression (Rasmussen, \APACyear2004). We use the scikit-learn (Pedregosa \BOthers., \APACyear2011) implementations of the ML methods. Keshavarz \BOthers. (\APACyear2011)’s KKT-based IO model is solved using Concert Technology of IBM ILOG CPLEX v.12.7.0. All experiments are run on an HP server with 20 Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz processors and 512 GB RAM under a Linux environment.

To show the value of prior knowledge for IO, we consider two types of IO models. In IO-perfect, we have perfect information about the form of the unknown objective function (but do not know its parameters). In IO-imperfect, the objective function is misspecified, and so differs from the true one due to modeling error.

Algorithm 1 Data Generation
for $k\leftarrow 1$ to $K$ do
  generate $\mathbf{u}_{k}$ from a uniform distribution
  solve $\mathbf{FOP}_{true}(\mathbf{u}_{k},\mathbf{c}^{true})$ to obtain $\mathbf{x}_{k}^{true}$
end for
$\pazocal{D}\leftarrow\{(\hat{\mathbf{x}}_{1},\hat{\mathbf{u}}_{1}),(\hat{\mathbf{x}}_{2},\hat{\mathbf{u}}_{2}),\dots,(\hat{\mathbf{x}}_{K},\hat{\mathbf{u}}_{K})\}$

Pairs $(\hat{\mathbf{u}}_{k},\hat{\mathbf{x}}_{k})$ are generated using Algorithm 1 and divided into a training set and a test set. We use leave-one-out cross-validation on the training set for the ML methods: we pick one point as our validation set, fit a model to the remaining data and evaluate the error on the single held-out point. A validation error estimate is obtained by repeating this procedure for each of the training points and averaging the results. Doing so with several hyperparameter settings enables us to choose the hyperparameter setting with the lowest average cross-validation error. All of the ML experimental results presented below show the error on the test set for the best hyperparameter setting found during the leave-one-out cross-validation procedure.
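This leave-one-out selection maps directly onto standard scikit-learn tooling. A minimal sketch for one output coordinate; the toy data and the hyperparameter grid are illustrative assumptions, not the grid used in the paper:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.svm import SVR

rng = np.random.default_rng(0)
U = rng.uniform(0, 1, size=(30, 2))                    # features u_k
x = U[:, 0] - 2 * U[:, 1] + rng.normal(0, 0.01, 30)    # one coordinate of x_k

# Leave-one-out CV: each training point serves once as the validation set;
# GridSearchCV averages the per-point errors for every hyperparameter setting
# and keeps the setting with the lowest average error.
search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [0.1, 1.0, 10.0], "gamma": ["auto", "scale"]},
    cv=LeaveOneOut(),
    scoring="neg_mean_squared_error",
)
search.fit(U, x)
best = search.best_params_
```

For vector-valued $\mathbf{x}$, the same procedure is repeated per coordinate (or a multi-output regressor is used).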

We measure the deviation between the predicted $\mathbf{x}$ and $\mathbf{x}^{true}$ in the test set (of size $M$) using mean relative error (MRE):

$$MRE=\frac{1}{M}\sum_{m=1}^{M}\frac{\|\mathbf{x}_{m}-\mathbf{x}_{m}^{true}\|_{2}}{\|\mathbf{x}_{m}^{true}\|_{2}}.\qquad(3)$$
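The metric is a per-observation relative Euclidean error averaged over the test set; a direct NumPy transcription (function name is ours):

```python
import numpy as np

def mean_relative_error(X_pred, X_true):
    """MRE: average of ||x_pred - x_true||_2 / ||x_true||_2 over test rows."""
    X_pred, X_true = np.atleast_2d(X_pred), np.atleast_2d(X_true)
    num = np.linalg.norm(X_pred - X_true, axis=1)
    den = np.linalg.norm(X_true, axis=1)
    return float(np.mean(num / den))

# First row matches exactly (error 0); second row is off by 1 against a
# true vector of norm 1 (error 1), so the mean relative error is 0.5.
mre = mean_relative_error([[1.0, 0.0], [0.0, 2.0]], [[1.0, 0.0], [0.0, 1.0]])
```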

5.2 Experiment 1: Utility Function Estimation

Consider imputing a customer’s unknown utility function given observations of their past purchases (Keshavarz \BOthers., \APACyear2011; Bärmann \BOthers., \APACyear2017). We assume that customers, in the course of purchasing goods, solve an internal optimization problem that maximizes their utility net of the purchase cost:

$$UFOP(\mathbf{p}):\quad\underset{\mathbf{x}}{\text{minimize}}\quad\mathbf{p}^{T}\mathbf{x}-U(\mathbf{x})\quad\text{subject to}\quad\mathbf{x}\geq\mathbf{0},\qquad(4)$$

where $\mathbf{x}\in\mathbb{R}^{n}$ and $\mathbf{p}\in\mathbb{R}^{n}$ are the vectors of consumer demand and the associated prices, respectively. Similarly to Keshavarz \BOthers. (\APACyear2011), we assume that the function $U(\mathbf{x})$ is a concave utility function which is non-decreasing over $[0,x^{\max}]$, $x^{\max}$ being the maximum demand level found from past observations. Since $U(\mathbf{x})$ is concave, the objective function of (4) is convex.

We generate $p_{i}\sim U[5,20]$. The true utility function is $U_{true}(\mathbf{x})=\sum_{i=1}^{n}\sqrt{x_{i}}$. We use $U_{perfect}(\mathbf{x})=\sum_{i=1}^{n}c_{i}\sqrt{x_{i}}$ in implementing IO-perfect (i.e., the model is correctly specified so that all observations can be made optimal solutions for their corresponding $p_{i}$). For IO-imperfect, we use $U_{imperfect}(\mathbf{x})=\sum_{i=1}^{n}qx_{i}^{2}+2rx_{i}$.
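For this utility family the forward problem separates by coordinate and admits a closed form: stationarity of $p_{i}x_{i}-c_{i}\sqrt{x_{i}}$ gives $x_{i}^{*}=(c_{i}/2p_{i})^{2}$. A sketch of the data-generating step (Algorithm 1) under the true utility ($c_{i}=1$), assuming noiseless observations; function names are ours:

```python
import numpy as np

def solve_utility_fop(p, c):
    """Minimize p^T x - sum_i c_i*sqrt(x_i) over x >= 0.
    Stationarity p_i = c_i / (2*sqrt(x_i)) yields x_i = (c_i / (2*p_i))**2."""
    p, c = np.asarray(p, float), np.asarray(c, float)
    return (c / (2.0 * p)) ** 2

def generate_dataset(K, n, rng):
    """Algorithm 1 with prices p as the independent parameter and c_true = 1."""
    P = rng.uniform(5, 20, size=(K, n))
    X = np.array([solve_utility_fop(p, np.ones(n)) for p in P])
    return P, X

rng = np.random.default_rng(0)
P, X = generate_dataset(K=100, n=3, rng=rng)
```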

Based on the results of our hyperparameter search during leave-one-out cross-validation, we use the scikit-learn implementations of random forest with n_estimators=50, max_depth=None, max_features=None; support vector regression with C=0.1, gamma="auto", kernel="rbf"; and Gaussian process regression with kernel=RBF, alpha=1e-10, normalize_y=False, optimizer="fmin_l_bfgs_b" and n_restarts_optimizer=10.
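These settings map directly onto scikit-learn constructors. A minimal sketch showing the calls; the toy one-feature data below (mimicking the utility example's closed-form solution) is an illustrative assumption, not the paper's experimental data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.svm import SVR

rf = RandomForestRegressor(n_estimators=50, max_depth=None, max_features=None)
svr = SVR(C=0.1, gamma="auto", kernel="rbf")
gp = GaussianProcessRegressor(kernel=RBF(), alpha=1e-10, normalize_y=False,
                              optimizer="fmin_l_bfgs_b",
                              n_restarts_optimizer=10)

rng = np.random.default_rng(0)
U = rng.uniform(5, 20, size=(40, 1))      # toy feature (a single price p_i)
x = 1.0 / (4.0 * U[:, 0] ** 2)            # toy target: closed-form x_i*(p_i)
for model in (rf, svr, gp):
    model.fit(U, x)
preds = gp.predict(rng.uniform(5, 20, size=(5, 1)))
```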

Figure 3 shows how the relative test error of IO-perfect, IO-imperfect, RF, SVR and GP changes when we increase the training size. Whereas RF, SVR and GP performance improves, the performance of IO-imperfect and IO-perfect is not affected. GP achieves performance similar to that of IO-imperfect with only 20 observations, while RF and SVR get close to IO-imperfect at 60 and 100 points, respectively. The ML methods perform well because the underlying optimization process (resulting from model (4)) is constrained by only non-negativity constraints and is parametric only in the objective function – the relationship between the input and output of this process can be captured (learned) without knowledge of the underlying optimization model. In the next experiment, we aim to identify IO problems which would be more difficult for these ML methods and would require knowledge of the underlying optimization structure to make good predictions.

Figure 3: Comparing IO-imperfect, IO-perfect, RF, SVR, and GP for utility function estimation; each point is an average over 15 randomly-generated problems, $p_{i}\sim U[5,20]$.

5.3 Experiment 2: Randomly-Generated Parametric Optimization Problems

Using the POP toolbox (Oberdieck \BOthers., \APACyear2016), we generate POPs of the following form:

$$TPOP(\mathbf{u}):\quad\underset{\mathbf{x}}{\text{min}}\quad(\mathbf{Q}\mathbf{x}+\mathbf{H}\mathbf{u}+\mathbf{c})^{T}\mathbf{x}\quad\text{s.t.}\quad\mathbf{A}\mathbf{x}\leq\mathbf{b}+\mathbf{F}\mathbf{u},\quad\mathbf{x}\in\mathbb{R}^{n},\quad\mathbf{u}\in\mathbb{U}:=\{\mathbf{u}\in\mathbb{R}^{q}\;|\;\mathbf{CR}_{A}\mathbf{u}\leq\mathbf{CR}_{b}\},\qquad(5)$$

where $\mathbf{Q}\in\mathbb{R}^{n\times n}$ with $\mathbf{Q}\succ 0$, $\mathbf{H}\in\mathbb{R}^{n\times q}$ and $\mathbf{c}\in\mathbb{R}^{n\times 1}$. In the inequality constraints, $\mathbf{A}\in\mathbb{R}^{m\times n}$, $\mathbf{b}\in\mathbb{R}^{m\times 1}$ and $\mathbf{F}\in\mathbb{R}^{m\times q}$. The last constraint of (5) restricts the parameter space, i.e., the admissible values of the external parameter (feature) $\mathbf{u}$, where $\mathbf{CR}_{A}\in\mathbb{R}^{2q\times q}$ and $\mathbf{CR}_{b}\in\mathbb{R}^{2q\times 1}$ allow for representations of constraints on $\mathbf{u}$ (in particular, lower and upper bounds, hence the number of rows is $2q$), and $q$ denotes the dimensionality of $\mathbf{u}$. We assume $q=2$ in our experiments.
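Solving one instance of (5) at a fixed $\mathbf{u}$ is a convex QP (the objective expands to $\mathbf{x}^{T}\mathbf{Q}\mathbf{x}+(\mathbf{H}\mathbf{u}+\mathbf{c})^{T}\mathbf{x}$). A minimal sketch with scipy's SLSQP; the tiny instance data is illustrative, whereas the paper's instances come from the POP toolbox:

```python
import numpy as np
from scipy.optimize import minimize

def solve_tpop(Q, H, c, A, b, F, u):
    """Solve min_x (Qx + Hu + c)^T x  s.t.  Ax <= b + Fu  for one fixed u."""
    lin = H @ u + c
    obj = lambda x: x @ Q @ x + lin @ x
    grad = lambda x: (Q + Q.T) @ x + lin
    cons = {"type": "ineq",                     # SLSQP wants fun(x) >= 0
            "fun": lambda x: b + F @ u - A @ x,
            "jac": lambda x: -A}
    res = minimize(obj, np.zeros(Q.shape[0]), jac=grad,
                   constraints=[cons], method="SLSQP")
    return res.x

# Tiny 2-variable instance with Q positive definite (illustrative numbers).
Q = np.array([[2.0, 0.0], [0.0, 1.0]])
H = np.eye(2)
c = np.array([-1.0, -1.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
F = np.zeros((1, 2))
x_opt = solve_tpop(Q, H, c, A, b, F, u=np.zeros(2))
```

At $\mathbf{u}=\mathbf{0}$ the unconstrained minimizer $(0.25,0.5)$ already satisfies $x_{1}+x_{2}\leq 1$, so it is optimal here.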

To study (forward) POPs, it is standard to view the space of feasible parameters $\mathbf{u}$ as being partitioned into polyhedra called critical regions, each of which is uniquely defined by a set of optimal active constraints (Ahmadi-Moshkenani \BOthers., \APACyear2018; Oberdieck \BOthers., \APACyear2016). It is known that if $\mathbf{Q}$ is positive definite and $\mathbb{U}$ is convex, the optimal objective function $z(\mathbf{u})$ is continuous and piecewise quadratic, while if the problem reduces to a multi-parametric linear program, then the optimal objective function $z(\mathbf{u})$ is continuous, convex, and piecewise affine (see Oberdieck \BOthers. (\APACyear2016) and the references therein). Figure 4 shows an example of one of the instances we generate using the POP toolbox of Oberdieck \BOthers. (\APACyear2016) with 21 critical regions and a two-dimensional $\mathbf{u}$; the critical regions are shown in the left panel while the corresponding optimal value function $z(\mathbf{u})$ is shown on the right.

To control the number of critical regions in our experiments, we manually pre-process the instances generated by the toolbox to define problems over smaller regions of the parameter space with a particular number of critical regions. For instance, from the example in Figure 4, we can generate a one-region, a three-region and a five-region problem by selecting the regions indicated by the black rectangle borders. Throughout this process, care is taken to select intervals in which each critical region has close to a “fair share” of the area; in preliminary experiments, we noted that the difficulty of learning over a parameter space interval that covers, say, five regions but is dominated by one particular region is similar to that of one critical region (more discussion of the effect of critical regions on learning performance appears later in this section).

Figure 4: A POP instance defined over two parameters $u_{1}$ and $u_{2}$ generated using the POP toolbox of Oberdieck \BOthers. (\APACyear2016). The left panel shows 21 critical regions of the parameter space, while the right panel shows the piecewise quadratic optimal objective function $z(\mathbf{u})$. Each critical region on the left corresponds to a quadratic ‘piece’ of $z(\mathbf{u})$ on the right.

Varying the Training Set Size (Data Efficiency)

We compare the predictive performance of IO-perfect, IO-imperfect, SVR, RF and GP, varying the training set size from 1 to 500 on 20 randomly generated POPs of type (5) with two parameters $u_{1}$ and $u_{2}$, with 1 and 3 critical regions. For IO-imperfect, instead of the parametric term $(\mathbf{H}\mathbf{u})^{T}\mathbf{x}$ in the objective function, we use $(\mathbf{H}\bar{\mathbf{u}})^{T}\mathbf{x}$, where $\bar{\mathbf{u}}$ is the mean of the selected intervals of $\mathbf{u}$.

Figures 5(a) and 5(b) illustrate the predictive (i.e., test set) performance of these methods (averaged over 20 instances) given data sets from one critical region and three critical regions, respectively. As expected, ML performance improves with increasing training set size in both cases. In contrast, the IO methods are unaffected: IO-perfect has nearly zero error while IO-imperfect incurs around 8% error, regardless of the training set size. Since IO is able to achieve low prediction error with a much smaller training size, we see that IO methods with correctly specified priors can be more data efficient than ML algorithms for problems where data is generated by an optimization process.

Because the IO approach we chose solves an optimization model, no matter how many points are given, IO-perfect is able to recover the parameters that make the observations optimal while IO-imperfect will maintain the same level of mis-specification. Insensitivity to the training set size is consistent with Aswani \BOthers. (\APACyear2018)’s results, which showed that Keshavarz \BOthers. (\APACyear2011)’s method is in general not risk consistent, i.e., it is not guaranteed to asymptotically provide the best possible predictions. However, we emphasize that despite being risk inconsistent, given the right prior, the Keshavarz \BOthers. (\APACyear2011) approach performs well, and can be the method of choice due to its data efficiency.

(a) One Critical Region
(b) Three Critical Regions
Figure 5: Comparing IO-perfect, IO-imperfect, SVR, RF and GP varying training sample size and critical regions.

The test error for ML methods increases for all training set sizes as we move from one critical region (Figure 5(a)) to three critical regions (Figure 5(b)). Comparing Figures 5(a) and 5(b), the curves for GP and RF in 5(a) intersect IO-imperfect after 200 points, but in 5(b) they require more points to be comparable to IO-imperfect. This observation suggests that the number of critical regions in the parameter space may be a factor that makes these problems more difficult for standard ML approaches, motivating the next experimental comparison.

Increasing the Number of Critical Regions

Figure 6 shows the average predictive performance of IO-imperfect and the three ML methods over 20 randomly generated POPs given a training set of 200 observations. For each of the 20 problem instances, we choose one interval of $\mathbf{u}$ with one critical region, one with three critical regions and one with five, keeping the chosen range of both $u_{1}$ and $u_{2}$ equal to two, as in Figure 4. We can see that IO-imperfect is not sensitive to the number of critical regions, while for all employed ML methods the test error increases with the number of critical regions. As mentioned above, every critical region is a polyhedron of $\mathbf{u}$ values over which the optimal objective function $z(\mathbf{u})$ is quadratic (a quadratic ‘piece’ of the overall piecewise quadratic function) while the optimizer function $\mathbf{x}^{*}(\mathbf{u})$ is piecewise affine (Oberdieck \BOthers., \APACyear2016). It is not surprising that some of the ML methods can perform well with 200 points over one critical region, as they only have to learn one quadratic function. When the number of critical regions increases, the task for the chosen ML methods becomes more challenging: they have to learn multiple quadratic functions without any prior knowledge of where the transition from one region to another occurs. We do not rule out the possibility of creating new, custom learning algorithms to overcome this issue, but doing so is beyond the scope of this paper.

Figure 6: Comparing IO-imperfect, SVR, RF and GP as the number of critical regions increases; training set size = 200, test set size = 200.

Varying the Functional Dependence of the Objective Function on u

Next, we investigate how the relationship between the TPOP objective function and the feature u affects the performance of ML and IO. We investigate this question in the context of one randomly generated POP (see Figure 7):

$$z(\mathbf{u}):\quad\underset{\mathbf{x}}{\text{minimize}}\quad 1.3040x_{1}^{2}+(1+u_{1})x_{1}+19.4545x_{2}^{2}+(u_{2}-u_{1}+1)x_{2}\qquad(6)$$
$$\text{s.t.}\quad 0.2294x_{1}\leq 2.5237-0.9733u_{2}$$
$$\quad 0.1890x_{2}\leq 4.2679-0.8658u_{1}-0.4634u_{2}$$
$$\quad 0.5436x_{1}-0.5889x_{2}\leq 2.8088-0.4757u_{1}-0.3624u_{2}$$
$$\quad 0.2210x_{1}\leq 3.2535-0.9753u_{1}.$$
Figure 7: Problem (6) critical regions (left) and parametric objective function (right) generated using the POP Toolbox of Oberdieck \BOthers. (\APACyear2016).

We consider three objective functions for problem (6), defined through functions $\Psi_{1}(\mathbf{u})$ and $\Psi_{2}(\mathbf{u})$, which set the coefficients of $x_{1}$ and $x_{2}$:

$$f=1.3040x_{1}^{2}+\Psi_{1}(\mathbf{u})x_{1}+19.4545x_{2}^{2}+\Psi_{2}(\mathbf{u})x_{2}.\qquad(7)$$

We start with a linear $\Psi(\mathbf{u})$, then change it to a multi-variate rational function, and subsequently increase the degree of the polynomials forming the rational function to create a relationship that leads to a more challenging learning problem.

As shown in Table 1, the test error of GP, SVR and RF increases as we move from the ‘simple’ (linear) dependence on $\mathbf{u}$ to making the coefficients of $x_{1}$ and $x_{2}$ multi-variate rational functions of $\mathbf{u}$. Since ML methods aim to find a mapping from $\mathbf{u}$ to $\mathbf{x}$, a more complex $\Psi(\mathbf{u})$ indeed makes learning more difficult. On the other hand, IO is not sensitive to the nature of $\Psi(\mathbf{u})$, as this function is just a coefficient of $\mathbf{x}$ in the IO mathematical model.

Table 1: Comparing the predictive performance of ML and IO while varying the form of the true objective function, with $\hat{u}_{1}\sim U(4,6)$ and $\hat{u}_{2}\sim U(-6,-4)$. Recall that $K$ is the training set size. The test set size is set equal to the training set size.
True Objective Function | Method | MRE % ($K=100$) | MRE % ($K=200$)
$1.3040x_{1}^{2}+(1+u_{1})x_{1}+19.4545x_{2}^{2}+(u_{2}-u_{1}+1)x_{2}$ | IO-perfect | $\sim 0$ | $\sim 0$
 | IO-imperfect | 3.85 | 3.79
 | GP | 9.75 | 6.03
 | RF | 6.75 | 5.74
 | SVR | 19.03 | 18.04
$1.3040x_{1}^{2}+\frac{1+u_{1}}{1-u_{1}}x_{1}+19.4545x_{2}^{2}+\frac{(u_{2}-u_{1}+1)^{3}}{(u_{2}-1)^{2}}x_{2}$ | IO-perfect | $\sim 0$ | $\sim 0$
 | IO-imperfect | 7.80 | 7.11
 | GP | 10.85 | 8.59
 | RF | 9.77 | 7.07
 | SVR | 24.94 | 23.30
$1.3040x_{1}^{2}+\frac{u_{1}}{(1-2u_{1})^{4}}x_{1}+19.4545x_{2}^{2}+\frac{(u_{2}-u_{1})^{3}}{(3u_{2}-5)^{5}}x_{2}$ | IO-perfect | $\sim 0$ | $\sim 0$
 | IO-imperfect | 7.78 | 7.62
 | GP | 15.28 | 10.29
 | RF | 22.24 | 17.75
 | SVR | 38.76 | 34.80
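The qualitative trend in Table 1 can be reproduced in miniature. The sketch below trains a random forest to map u to x for the linear and rational objective forms from the table's first two rows; to keep the example self-contained we drop the constraints of (6), so the minimizer of (7) is available in closed form. This unconstrained simplification and the default scikit-learn hyperparameters are our own choices, so the numbers will not match Table 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

def argmin_x(U, psi1, psi2):
    # Closed-form minimizer of the unconstrained version of (7):
    # x_j = -Psi_j(u) / (2 * quadratic coefficient of x_j).
    return np.column_stack([-psi1(U) / (2 * 1.3040),
                            -psi2(U) / (2 * 19.4545)])

U_tr = np.column_stack([rng.uniform(4, 6, 100), rng.uniform(-6, -4, 100)])
U_te = np.column_stack([rng.uniform(4, 6, 100), rng.uniform(-6, -4, 100)])

# Psi pairs from the first two rows of Table 1 (linear vs. rational in u).
forms = {
    "linear": (lambda U: 1 + U[:, 0],
               lambda U: U[:, 1] - U[:, 0] + 1),
    "rational": (lambda U: (1 + U[:, 0]) / (1 - U[:, 0]),
                 lambda U: (U[:, 1] - U[:, 0] + 1) ** 3 / (U[:, 1] - 1) ** 2),
}

errors = {}
for name, (p1, p2) in forms.items():
    X_tr, X_te = argmin_x(U_tr, p1, p2), argmin_x(U_te, p1, p2)
    model = RandomForestRegressor(random_state=0).fit(U_tr, X_tr)
    pred = model.predict(U_te)
    errors[name] = np.mean(np.abs(pred - X_te) / np.abs(X_te))
    print(f"{name}: MRE = {100 * errors[name]:.1f}%")
```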

Investigating the Impact of Correctness of Prior Knowledge

Above, we showed that IO-perfect and IO-imperfect are insensitive to problem characteristics that significantly impact the performance of ML (training set size, number of critical regions, and the nature of the dependence on $\mathbf{u}$). However, a difference between IO-perfect and IO-imperfect was observable. To investigate the impact of the correctness of the objective-function prior, we consider five objective functions for problem (6) that encode different degrees of correctness (or mis-specification) of a prior.

In Table 2, given $\mathcal{D}$, we aim to impute $c_{1}$ and $c_{2}$. The first objective function is the exact objective function of problem (6), i.e., the IO-perfect method. The subsequent functions deviate from the true function in the coefficients of $x_{1}$ and $x_{2}$ (but the goal is still to impute $c_{1}$ and $c_{2}$ from the data generated by the true objective function of problem (6)). Besides parametric objective functions (i.e., objective functions defined over $\mathbf{u}$), we also consider two non-parametric choices. In the first non-parametric objective, we replace the coefficients $1+\hat{u}_{1}$ and $\hat{u}_{2}-\hat{u}_{1}+1$ with their means over the given training set; in the second, we eliminate the linear terms completely.

We see that even slight deviations in the function increase the IO test error, although for the parametric functions the increase is small. The non-parametric mean-based function also does not lead to a significant increase, though this could change if a larger range of $\hat{\mathbf{u}}$ were considered. Removing any encoding of the parametric parts of the objective function drastically increases the prediction error, from 3.88% to 52.28%. These observations confirm the sensitivity of IO to the correctness of the objective-function prior.

Table 2: Different parametrizations of the objective function of problem (6) and the corresponding test error, given $\hat{u}_{1}\sim U(4,6)$ and $\hat{u}_{2}\sim U(-6,-4)$.
Category | Objective Function | MRE % ($K=100$)
Parametric | $c_{1}x_{1}^{2}+(1+u_{1})x_{1}+c_{2}x_{2}^{2}+(u_{2}-u_{1}+1)x_{2}$ | $\sim 0$
 | $c_{1}x_{1}^{2}+x_{1}+c_{2}x_{2}^{2}+(u_{2}-u_{1})x_{2}$ | 0.38
 | $c_{1}x_{1}^{2}+x_{1}+c_{2}x_{2}^{2}-u_{1}x_{2}$ | 3.56
Non-Parametric | $c_{1}x_{1}^{2}+6x_{1}+c_{2}x_{2}^{2}-9x_{2}$ | 3.88
 | $c_{1}x_{1}^{2}+c_{2}x_{2}^{2}$ | 52.28
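The gap between a correct prior and the mean-based non-parametric prior can be illustrated with a small self-contained experiment. The sketch below works on an unconstrained surrogate of problem (6), where minimizers have a closed form, and imputes the quadratic coefficients by least-squares on the stationarity residual; the simplifications (no constraints, squared-norm residual, noiseless data) are ours, so the numbers will not match Table 2.

```python
import numpy as np

rng = np.random.default_rng(3)
c_true = np.array([1.3040, 19.4545])  # true quadratic coefficients

def psi(U):
    # True parametric linear coefficients of problem (6)'s objective.
    return np.column_stack([1 + U[:, 0], U[:, 1] - U[:, 0] + 1])

U_tr = np.column_stack([rng.uniform(4, 6, 100), rng.uniform(-6, -4, 100)])
U_te = np.column_stack([rng.uniform(4, 6, 100), rng.uniform(-6, -4, 100)])
X_tr = -psi(U_tr) / (2 * c_true)   # closed-form unconstrained minimizers
X_te = -psi(U_te) / (2 * c_true)

def impute_and_predict(lin_tr, lin_te):
    # Fit c_j by least squares on the stationarity residual
    # 2 c_j x_j + lin_j(u), then predict test decisions from the imputed model.
    c_hat = np.array([np.linalg.lstsq(2 * X_tr[:, [j]], -lin_tr[:, j],
                                      rcond=None)[0][0] for j in range(2)])
    return -lin_te / (2 * c_hat)

mre = {}
mean_coef = np.tile(psi(U_tr).mean(axis=0), (100, 1))  # mean-based prior
for name, pred in [("perfect prior", impute_and_predict(psi(U_tr), psi(U_te))),
                   ("mean-based prior", impute_and_predict(mean_coef, mean_coef))]:
    mre[name] = np.mean(np.abs(pred - X_te) / np.abs(X_te))
    print(f"{name}: MRE = {100 * mre[name]:.2f}%")
```

With the correct parametric prior the coefficients, and hence the test decisions, are recovered essentially exactly; replacing the parametric terms with their training-set means leaves a persistent prediction error, mirroring the pattern in Table 2.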

5.4 Insights

Our experiments with randomly generated POPs demonstrate the importance of four characteristics as determinants of the difficulty of objective-function learning: (i) the size of the training set, (ii) the nature of the dependence of the optimization problem on the external parameters, (iii) the level of confidence in the correctness of the optimization prior, and (iv) the number of critical regions in the parameter space of the POP.

Increasing the size of the training set favours ML methods but not necessarily the IO methods. The contrast in performance between the problem in experiment 1 and the randomly-generated POPs suggests that ML methods are more likely to be successful on problems with no or few (parametric) constraints. Increasing the complexity of the relationship between 𝐮\mathbf{u} and the objective function also makes the problem more difficult for ML. On the other hand, the performance of IO is strongly dependent on the correctness of the objective function prior.

Perhaps the most important characteristic determining the difficulty of the objective-function learning task is the number of critical regions. Over 20 randomly generated POPs, we found a substantial increase in the relative test error for all three ML methods as the number of critical regions increased. In fact, we conjecture that the underlying reason adding parametric constraints and/or increasing the complexity of the dependence on $\mathbf{u}$ made learning more difficult is that doing so also increased the number of critical regions.

Given these observations, we summarize our recommendations on when to use ML or IO for learning the coefficients of a convex POP in Table 3. For each combination of a criterion and its value (low/small or high/large), we state whether we expect the classic ML methods or an IO method to be the better choice (or at least the better starting point) for solving the learning problem, holding the other criteria fixed. For example, when the training set is small, we expect IO to be the better option due to its data efficiency.

Table 3: Problem Characteristics and Suggested Methods.
Criterion | Low/Small | High/Large
Size of Training Set | IO | ML
Dependence on External Parameter | ML | IO
Confidence in Correctness of Prior | ML | IO
Number of Critical Regions | ML | IO

The relative performance of the methods considered may not carry over to other types of learning problems where the data was generated by an optimization process (e.g., learning constraints, or learning in discrete optimization problems). However, we conjecture that analysis of the underlying structure of the value function or the optimizer function in terms of the external parameters $\mathbf{u}$ (i.e., analysis of the problem in terms of its critical regions) will be important both in gaining additional understanding of the challenges of learning from optimization data and in developing more sophisticated learning methods for such problems.

6 Conclusion

In this paper, we view inverse optimization as a problem of learning from decisions that are made through an unknown optimization process. We specifically focus on the problem of learning a convex objective function of a parametric optimization problem. We experimentally compare the predictive performance of an inverse optimization method with perfect and imperfect priors to that of three well-known machine learning algorithms: support vector regression, random forest and Gaussian processes. While we show that some inverse optimization problems can be tackled through classic machine learning approaches, we highlight the need for sophisticated inverse optimization models for problems where at least one of the following characteristics holds: (i) the size of the training set is small, (ii) both the constraints and the objective function of the problem in question depend on an external parameter (feature), particularly when that dependence is non-linear, (iii) we have high confidence that our knowledge of the parametric nature of the constraints and objective is correct, and (iv) the number of critical regions of the POP is large, with no one region dominating. We believe that these observations provide practitioners with guidance on when to consider employing inverse optimization instead of, or in addition to, classical machine learning.

Appendix: KKT-based Inverse Optimization Model

The Keshavarz et al. (2011) IO model for (1) is

\begin{align}
\mathbf{IO}(\mathcal{D}):\ \underset{\bm{\lambda},\mathbf{c}}{\text{minimize}} \quad & \sum_{k\in\mathcal{K}}\phi(\mathbf{r}_{k}^{stat},\mathbf{r}_{k}^{comp}) \\
\text{s.t.} \quad & r_{k}^{stat,j}=\frac{\partial f(\mathbf{u},\mathbf{x},\mathbf{c})}{\partial x_{j}}+\sum_{i=1}^{m}\lambda_{k}^{i}\frac{\partial g_{i}(\mathbf{u},\mathbf{x})}{\partial x_{j}} \quad \forall j\in\mathcal{J},\ k\in\mathcal{K} \\
& r_{k}^{comp,i}=-\lambda_{k}^{i}\,g_{i}(\mathbf{u},\mathbf{x}) \quad\quad\quad\quad\ \ \forall i\in\mathcal{I},\ k\in\mathcal{K} \\
& \bm{\lambda}\geq\mathbf{0},
\end{align}

where $\phi$ is some norm, and $\mathbf{r}^{stat}$ and $\mathbf{r}^{comp}$ represent the stationarity and complementary slackness residuals, respectively, of the KKT conditions; $\bm{\lambda}$ is the vector of dual variables. The missing objective function coefficient vector $\mathbf{c}$ is a decision variable, and the pairs $(\hat{\mathbf{x}}_{k},\hat{\mathbf{u}}_{k})$ are the inputs at which the residuals are evaluated.
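For intuition, the sketch below applies the stationarity-residual idea to an unconstrained instance of (7), where the KKT system reduces to $\nabla f = \mathbf{0}$ and the residuals are linear in $\mathbf{c}$, so imputing each $c_j$ with $\phi$ taken as the squared Euclidean norm is a least-squares fit. This is a deliberate simplification of the model above: the multipliers $\bm{\lambda}$ and the complementarity residuals drop out, and the noise level and seed are our own choices.

```python
import numpy as np

rng = np.random.default_rng(2)
c_true = np.array([1.3040, 19.4545])  # coefficients to be imputed

U = np.column_stack([rng.uniform(4, 6, 50), rng.uniform(-6, -4, 50)])
psi = np.column_stack([1 + U[:, 0], U[:, 1] - U[:, 0] + 1])

# Exact minimizers of the unconstrained objective, perturbed slightly to
# mimic minimally sub-optimal observed decisions.
X = -psi / (2 * c_true) + 0.01 * rng.standard_normal((50, 2))

# Stationarity residual for coordinate j: 2 c_j x_j + Psi_j(u).
# Minimizing its squared norm over c_j is an ordinary least-squares problem.
c_hat = np.array([np.linalg.lstsq(2 * X[:, [j]], -psi[:, j], rcond=None)[0][0]
                  for j in range(2)])
print(c_hat)  # close to c_true
```

In the full model, constraints reintroduce the multipliers, but for a convex quadratic objective with linear constraints the joint problem in $(\mathbf{c},\bm{\lambda})$ remains a convex (nonnegative) least-squares problem.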

Acknowledgement(s)

The authors are supported by the Natural Sciences and Engineering Research Council of Canada.

References

  • Ahmadi-Moshkenani, P., Johansen, T. A., & Olaru, S. (2018). Combinatorial approach toward multiparametric quadratic programming based on characterizing adjacent critical regions. IEEE Transactions on Automatic Control, 63(10), 3221–3231.
  • Ahuja, R. K., & Orlin, J. B. (2001). Inverse optimization. Operations Research, 49(5), 771–783.
  • Aswani, A., Kaminsky, P., Mintz, Y., Flowers, E., & Fukuoka, Y. (2019). Behavioral modeling in weight loss interventions. European Journal of Operational Research, 272(3), 1058–1072.
  • Aswani, A., Shen, Z.-J., & Siddiq, A. (2018). Inverse optimization with noisy data. Operations Research, 66(3), 870–892.
  • Babier, A., Chan, T. C. Y., Lee, T., Mahmood, R., & Terekhov, D. (2021). An ensemble learning framework for model fitting and evaluation in inverse linear optimization. INFORMS Journal on Optimization, Articles in Advance, 1–20. https://doi.org/10.1287/ijoo.2019.0045
  • Bärmann, A., Pokutta, S., & Schneider, O. (2017). Emulating the expert: Inverse optimization through online learning. In International Conference on Machine Learning (pp. 400–410).
  • Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
  • Bunduchi, E., & Mandric, I. (2011). Parametric linear programming. http://demonstrations.wolfram.com/ParametricLinearProgramming/
  • Burton, D., & Toint, P. L. (1992). On an instance of the inverse shortest paths problem. Mathematical Programming, 53(1–3), 45–61.
  • Burton, D., & Toint, P. L. (1994). On the use of an inverse shortest paths algorithm for recovering linearly correlated costs. Mathematical Programming, 63(1–3), 1–22.
  • Chan, T. C., Craig, T., Lee, T., & Sharpe, M. B. (2014). Generalized inverse multiobjective optimization with application to cancer therapy. Operations Research, 62(3), 680–695.
  • Chan, T. C., Lee, T., & Terekhov, D. (2019). Inverse optimization: Closed-form solutions, geometry, and goodness of fit. Management Science, 65(3), 1115–1135.
  • Chow, J., & Recker, W. (2012). Inverse optimization with endogenous arrival time constraints to calibrate the household activity pattern problem. Transportation Research Part B: Methodological, 46(3), 463–479.
  • Drucker, H., Burges, C., Kaufman, L., Smola, A., & Vapnik, V. (1997). Support vector regression machines. In Advances in Neural Information Processing Systems (pp. 155–161).
  • Egri, P., Kis, T., Kovács, A., & Váncza, J. (2014). An inverse economic lot-sizing approach to eliciting supplier cost parameters. International Journal of Production Economics, 149, 80–88.
  • Esfahani, P., Shafieezadeh-Abadeh, S., Hanasusanto, G., & Kuhn, D. (2018). Data-driven inverse optimization with imperfect information. Mathematical Programming, 167(1), 191–234.
  • Fernández-Blanco, R., Morales, J. M., Pineda, S., & Porras, Á. (2019). EV-fleet power forecasting via kernel-based inverse optimization. arXiv preprint arXiv:1908.00399.
  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org
  • Keshavarz, A., Wang, Y., & Boyd, S. (2011). Imputing a convex objective function. In 2011 IEEE International Symposium on Intelligent Control (ISIC) (pp. 613–619).
  • Kovács, A. (2019). Parameter elicitation for consumer models in demand response management. In 2019 1st Global Power, Energy and Communication Conference (GPECOM) (pp. 456–460).
  • Oberdieck, R., Diangelakis, N., Papathanasiou, M., Nascu, I., & Pistikopoulos, E. (2016). POP – Parametric Optimization Toolbox. Industrial & Engineering Chemistry Research, 55(33), 8979–8991. https://doi.org/10.1021/acs.iecr.6b01913
  • Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., … Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
  • Pistikopoulos, E. N., Georgiadis, M. C., & Dua, V. (2011). Multi-Parametric Programming (Vol. 1).
  • Rasmussen, C. (2004). Gaussian processes in machine learning. In Advanced Lectures on Machine Learning (pp. 63–71). Springer.
  • Russell, S. J., & Norvig, P. (2016). Artificial Intelligence: A Modern Approach. Pearson Education.
  • Saez-Gallego, J., & Morales, J. M. (2017). Short-term forecasting of price-responsive loads using inverse optimization. IEEE Transactions on Smart Grid, 9(5), 4805–4814.
  • Shahmoradi, Z., & Lee, T. (2020). Quantile inverse optimization: Improving stability in inverse linear programming.
  • Tan, Y., Delong, A., & Terekhov, D. (2019). Deep inverse optimization. In International Conference on Integration of Constraint Programming, Artificial Intelligence, and Operations Research (pp. 540–556).
  • Tan, Y., Terekhov, D., & Delong, A. (2020). Learning linear programs from optimal decisions. In Advances in Neural Information Processing Systems (Vol. 33).
  • Taskar, B., Chatalbashev, V., Koller, D., & Guestrin, C. (2005). Learning structured prediction models: A large margin approach. In Proceedings of the 22nd International Conference on Machine Learning (pp. 896–903).
  • Tavaslıoğlu, O., Lee, T., Valeva, S., & Schaefer, A. J. (2018). On the structure of the inverse-feasible region of a linear program. Operations Research Letters, 46(1), 147–152.