
Shrinkage Strategies for Structure Selection and Identification of Piecewise Affine Models

Valentina Breschi and Manas Mejari

This work has been accepted to the 59th IEEE Conference on Decision and Control. V. Breschi is with the Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milano, Italy ([email protected]). M. Mejari is with IDSIA Dalle Molle Institute for Artificial Intelligence, SUPSI-USI, Manno, Switzerland ([email protected]).
Abstract

We propose two optimization-based heuristics for structure selection and identification of PieceWise Affine (PWA) models with exogenous inputs. The first method determines the number of affine sub-models assuming that the model order of the sub-models is known, while the second approach estimates the model order for a given number of affine sub-models. Both approaches rely on regularization-based shrinkage strategies, which are exploited within a coordinate-descent algorithm. This allows us to estimate the structure of the PWA model along with its parameters. Starting from an over-parameterized model, the key idea is to alternate between an identification step and a structure refinement step, based on the sparse estimates of the model parameters. The performance of the presented strategies is assessed over two benchmark examples.

I Introduction

Many real-world systems and phenomena are characterized by different operating conditions, among which they can switch smoothly and/or abruptly. In such scenarios, linear models are usually insufficient to accurately describe the behavior of the underlying system, and one must resort to more complex model structures to attain satisfactory approximations. However, many learning frameworks lead to models that are rather hard to analyze and use, especially for control purposes.

A good trade-off between complexity, approximation power and intelligibility can be attained by considering piecewise affine (PWA) maps. Indeed, it is well known that PWA functions have universal approximation properties [13, 7], while featuring a rather simple structure, since they are characterized by a finite collection of affine sub-models defined over a polyhedral partition of the regressor space. Thus, PWA models can be analyzed and exploited for control design with state-of-the-art techniques (see e.g., [5]).

Although appealing because of the aforementioned properties, PWA models are challenging to learn from data. Indeed, to identify the local models and estimate the associated regions of the regressor space, one should know the operating condition of the real system at each time step, which is usually not available a priori. The learning problem thus entails not only the identification of the sub-model parameters and the computation of a polyhedral partition of the regressor space, but it also requires one to solve an unsupervised clustering problem, since each sample has to be assigned to one of the local models.

Several heuristics have been proposed over the years to deal with this demanding learning task, among which we mention the bounded-error method proposed in [4], the clustering-based procedures in [2, 8, 9, 15], the mixed-integer programming based approaches in [14, 17], the Bayesian method in [11] and the sparse optimization techniques proposed in [1, 16]. However, most of these methods rely on prior knowledge of the number of operating conditions of the true system and of its order, which might not be available in practice. In [4], the number of sub-models is chosen by solving a minimum number of feasible subsystems (MIN PFS) problem, which is, in turn, formulated by relying on a bound on the estimation error. This bound is unlikely to be available in practice, thus becoming a tuning knob for the algorithm. A refinement strategy is also proposed therein, where the number of sub-models is further reduced by merging local models that share the same parameters and discarding the ones that are associated to "few" data-points. By exploiting sparse optimization, the methods presented in [1, 16] both allow for the selection of the number of modes. Specifically, the approach proposed in [16] considers an over-parameterized model and then exploits L_{1}-regularization to enforce the local parameter estimates to be similar. Instead, in [1], L_{1}-regularization is used to iteratively search for the sparsest vector of fitting errors, so as to cluster the data-points, estimate the local models and decide on their number. To the best of our knowledge, only [1] provides a possible solution for the selection of the sub-model structure, which relies on the introduction of an additional L_{1}-regularization term on the parameters.

As an alternative, in this paper we present two regularization-based strategies that allow for the partial selection of the PWA model structure along with parameter estimation. These techniques exploit the flexibility of the general-purpose approach for jump model fitting presented in [3], which can easily incorporate regularization-based shrinkage strategies. In particular, we propose a heuristic to estimate the number of local models by using L_{\infty} regularization on the sub-model parameters, while accounting for the cardinality of the associated clusters to refine the optimization-driven choice. In the second approach, the local model structure is estimated via the elastic net. We remark that the latter strategy is inspired by the one briefly discussed in [1], but it inherits the benefits of the elastic net over simple lasso regularization. The overall identification procedure with partial structure selection requires the alternation between a learning stage and a step in which the model structure is refined. Once the model structure is determined, the final model is re-estimated by neglecting the shrinkage regularization terms. Although the proposed strategies are quite general and can be straightforwardly extended to other classes of PWA models, in this work we focus on piecewise affine autoregressive models with exogenous inputs (PWARX).

The paper is organized as follows. The problem of learning a PWARX model with partially unknown structure is introduced in Section II and the core algorithm used to address this learning task is summarized in Section III. The proposed regularization-based strategies are then discussed in Section IV, and their effectiveness is shown in Section V on a numerical example and an experimental case study. Conclusions and directions for future work are finally discussed in Section VI.

II Problem formulation

Consider the following dynamical data-generating system, whose instantaneous response y_{t}\in\mathcal{Y}\subseteq\mathbb{R} to an input u_{t}\in\mathcal{U}\subseteq\mathbb{R} is given by

y_{t}=f_{\mathrm{o}}(x_{t})+e_{t}^{\mathrm{o}}, (1a)

where e_{t}^{\mathrm{o}}\in\mathbb{R} is a zero-mean additive white noise independent of the regressor x_{t}\in\mathcal{X}\subseteq\mathbb{R}^{n_{x}}. Let the regressor be given by the following collection of past input/output samples

x_{t}=\begin{bmatrix}y_{t-1}&\cdots&y_{t-n_{a}}&u_{t-1}&\cdots&u_{t-n_{b}}\end{bmatrix}^{\prime}, (1b)

with n_{a},n_{b}\in\mathbb{N} denoting the possibly unknown order of the system. The unknown function f_{\mathrm{o}}:\mathcal{X}\rightarrow\mathbb{R} characterizing the underlying regressor/output relationship is assumed to be nonlinear and possibly discontinuous, but not necessarily piecewise affine.
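For concreteness, a minimal numpy sketch of the regressor construction in (1b) is reported below; the helper build_regressors and its interface are our own illustrative choices, not part of any existing toolbox.

```python
import numpy as np

def build_regressors(y, u, na, nb):
    """Stack x_t = [y_{t-1},...,y_{t-na}, u_{t-1},...,u_{t-nb}]' as in (1b).

    Returns the regressor matrix X (one regressor per row) and the aligned
    targets y_t, starting at t = max(na, nb) so that all lags exist.
    """
    y, u = np.asarray(y, dtype=float), np.asarray(u, dtype=float)
    t0 = max(na, nb)
    X, targets = [], []
    for t in range(t0, len(y)):
        past_y = [y[t - i] for i in range(1, na + 1)]
        past_u = [u[t - i] for i in range(1, nb + 1)]
        X.append(past_y + past_u)
        targets.append(y[t])
    return np.array(X), np.array(targets)
```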

We aim at finding a PWA approximation f of the unknown function f_{\mathrm{o}}, based on a set of inputs U=\{u_{t}\}_{t=1}^{T} and output samples Y=\{y_{t}\}_{t=1}^{T} generated by (1a)-(1b). We remark that, due to the universal approximation properties of PWA maps [13, 7], a PWA function f can approximate f_{\mathrm{o}} with arbitrary accuracy.

The approximating PWA function f:\mathcal{X}\rightarrow\mathbb{R} is defined as

f(x_{t})=\begin{cases}\tilde{x}_{t}^{\prime}\theta_{y,1},\quad\mbox{if }x_{t}\in\mathcal{X}_{1},\\ \vdots\\ \tilde{x}_{t}^{\prime}\theta_{y,K},\quad\mbox{if }x_{t}\in\mathcal{X}_{K},\end{cases} (2)

with \tilde{x}_{t}=\left[\begin{smallmatrix}x_{t}^{\prime}&1\end{smallmatrix}\right]^{\prime}. This piecewise affine autoregressive model with exogenous inputs (PWARX) is a collection of K\in\mathbb{N} local affine models, characterized by the parameters \Theta_{y}=(\theta_{y,1},\ldots,\theta_{y,K}) and defined over a complete polyhedral partition¹ \{\mathcal{X}_{k}\}_{k=1}^{K} of the regressor space \mathcal{X}. We stress that all the local models share the same structure by definition.

¹ \{\mathcal{X}_{k}\}_{k=1}^{K} is a complete polyhedral partition of \mathcal{X} if \bigcup_{k=1}^{K}\mathcal{X}_{k}=\mathcal{X} and \overset{\circ}{\mathcal{X}_{i}}\cap\overset{\circ}{\mathcal{X}_{j}}=\emptyset for i\neq j, i,j=1,\ldots,K, with \overset{\circ}{\mathcal{X}_{i}} denoting the interior of the i-th polyhedron \mathcal{X}_{i}.

We characterize the partition through a piecewise affine separator function \phi:\mathbb{R}^{n_{x}}\rightarrow\mathbb{R}, defined as

\phi(x)=\max_{k=1,\ldots,K}\left\{\theta_{x,k}^{\prime}\tilde{x}\right\}, (3)

where \theta_{x,k}\in\mathbb{R}^{n_{x}+1}. Thus, a polyhedron \mathcal{X}_{i} can be described by the following linear inequalities

\mathcal{X}_{i}=\{x\in\mathbb{R}^{n_{x}}:(\theta_{x,i}-\theta_{x,j})^{\prime}\tilde{x}\geq 0,\ j=1,\ldots,K,\ j\neq i\}. (4)

Therefore, learning a PWARX model from data amounts to: (i) choosing the number of affine sub-models K; (ii) estimating the parameters \Theta_{y} of the local models; (iii) finding the parameters \Theta_{x}=(\theta_{x,1},\ldots,\theta_{x,K}) characterizing the PWA separator in (3); and, eventually, (iv) selecting the order of the sub-models, namely n_{a} and n_{b}, when it is unknown.

In this work, we split model structure selection into two separate tasks by assuming that either the number of local models K or the sub-model order is fixed a priori. In the first task, we estimate (\Theta_{y},\Theta_{x},K) for fixed n_{a},n_{b}, while in the second one we fix K and retrieve (\Theta_{y},\Theta_{x},n_{a},n_{b}) from data.

III Learning PWARX models

It is well known that learning a PWA approximating map is quite challenging, since this problem is NP-hard [12] even for an a-priori fixed model structure. Indeed, it requires the concurrent solution of local linear regressions, a classification problem and a clustering problem. To cope with this, we recall the modular optimization problem presented in [3, Section 2.2.1] for PWA regression and we summarize the steps of the coordinate-descent approach proposed therein to solve the problem at hand.

Let X=\{x_{t}\}_{t=1}^{T} be the sequence of regressors constructed from the available input/output dataset \{U,Y\}. Let s_{t}\in\{1,\ldots,K\} be the latent variable that indicates which local model is associated with the t-th regressor/output pair \{\tilde{x}_{t},y_{t}\}, such that the sequence \mathcal{S}=\{s_{t}\}_{t=1}^{T} comprises the label associated to each data point.

Within the optimization-based framework of [3], learning a generic jump model entails the solution of the following optimization problem:

\min_{\Theta_{y},\Theta_{x},\mathcal{S}}~\sum_{t=1}^{T}\left[\ell_{y}(y_{t},x_{t},\theta_{y,s_{t}})+\rho\ell_{x}(x_{t},\theta_{x,s_{t}})\right]+\sum_{k=1}^{K}\left[r_{y}(\theta_{y,k})+\rho r_{x}(\theta_{x,k})\right]. (5a)
In particular, for the PWARX model class (2)-(3), the loss function \ell_{y}:\mathcal{X}\times\mathcal{Y}\times\mathbb{R}^{n_{x}+1}\rightarrow\mathbb{R}\cup\{+\infty\} is chosen as the squared local fitting error, i.e.,

\ell_{y}(y_{t},x_{t},\theta_{y,s_{t}})=(y_{t}-\tilde{x}_{t}^{\prime}\theta_{y,s_{t}})^{2}, (5b)

and the loss function \ell_{x}:\mathcal{X}\times\mathbb{R}^{n_{x}+1}\rightarrow\mathbb{R}\cup\{+\infty\} is

\ell_{x}(x_{t},\theta_{x,s_{t}})=\sum_{\begin{subarray}{c}j=1\\ j\neq s_{t}\end{subarray}}^{K}\max\left\{0,(\theta_{x,j}-\theta_{x,s_{t}})^{\prime}\tilde{x}_{t}+1\right\}^{2}, (5c)

so as to account for the violations of the inequalities in (4). The regularizers r_{y}:\mathbb{R}^{n_{x}+1}\rightarrow\mathbb{R}\cup\{+\infty\} and r_{x}:\mathbb{R}^{n_{x}+1}\rightarrow\mathbb{R}\cup\{+\infty\} act on the parameters \Theta_{y} and \Theta_{x}, characterizing the local models and the PWA separator, respectively. The positive hyper-parameter \rho balances the effect of each term on the overall cost.
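To make the role of the two loss terms concrete, the sketch below evaluates the fitting loss (5b) and the separation loss (5c) for a candidate mode sequence; it is an illustrative numpy rendering under our own naming conventions, not code taken from [3].

```python
import numpy as np

def pwarx_cost_terms(y, Xtil, s, Theta_y, Theta_x):
    """Evaluate the total fitting loss (5b) and separation loss (5c)
    for a given mode sequence s (integer labels in 0,...,K-1).

    Xtil    : T x (nx+1) matrix of extended regressors [x_t' 1]'
    Theta_y : K x (nx+1) local-model parameters (one row per mode)
    Theta_x : K x (nx+1) separator parameters (one row per mode)
    """
    K = Theta_y.shape[0]
    l_y = l_x = 0.0
    for t in range(len(y)):
        k = int(s[t])
        l_y += (y[t] - Xtil[t] @ Theta_y[k]) ** 2                 # (5b)
        for j in range(K):                                        # (5c): squared hinge on
            if j != k:                                            # violations of (4)
                l_x += max(0.0, (Theta_x[j] - Theta_x[k]) @ Xtil[t] + 1.0) ** 2
    return l_y, l_x
```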

From (5a), it is clear that the discrete sequence \mathcal{S} couples the regularized linear regressions and the multi-category discrimination problems that have to be solved to find \Theta_{y} and \Theta_{x}, respectively. Nonetheless, once \mathcal{S} is fixed, the parameters \Theta_{y} and \Theta_{x} can be found independently from one another. On the other hand, for given \Theta_{y} and \Theta_{x}, the discrete sequence \mathcal{S} can be computed in a computationally efficient manner via dynamic programming (DP) [6].

Based on the above considerations, we exploit the coordinate-descent approach proposed in [3], alternately solving problem (5a) with respect to \Theta_{y} and \Theta_{x} for a fixed sequence \mathcal{S}, and optimizing with respect to \mathcal{S} for fixed \Theta_{y} and \Theta_{x}. The three-step procedure tailored to the solution of problem (5a) is summarized in Algorithm 1 where, with a slight abuse of notation, we denote

\ell_{y}(Y,X,\Theta_{y},\mathcal{S})=\sum_{t=1}^{T}\ell_{y}(y_{t},x_{t},\theta_{y,s_{t}}), (6a)
\ell_{x}(X,\Theta_{x},\mathcal{S})=\sum_{t=1}^{T}\ell_{x}(x_{t},\theta_{x,s_{t}}), (6b)
r_{y}(\Theta_{y})=\sum_{k=1}^{K}r_{y}(\theta_{y,k}),\quad r_{x}(\Theta_{x})=\sum_{k=1}^{K}r_{x}(\theta_{x,k}). (6c)

Given the structure of the local models, the problem to be solved at step 1.1 can be split into K distinct regularized linear regression problems, one for each local model. We remark that each sub-model is updated by using only the data points that are associated to it, according to the estimated mode sequence \mathcal{S}^{i-1}. The estimation of the PWA separator at step 1.2 entails the solution of a multi-category discrimination problem, which can be efficiently carried out with the Newton-like method proposed in [8]. The mode sequence is finally updated at step 1.3 via DP [6], as detailed in [3], by neglecting the regularization terms as they are independent of \mathcal{S}. Note that the tunable parameter \rho\in\mathbb{R}^{+} is not considered when optimizing with respect to \Theta_{y} and \Theta_{x} (steps 1.1-1.2), but it is exploited when updating the mode sequence, to trade off between the fitting error and the cost of choosing a certain polyhedral region.
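As a complement to the description above, the sketch below is our own simplified rendering of step 1.3: since the PWARX cost (5a) contains no explicit mode-transition penalty, the DP recursion of [3] reduces here to an independent per-sample assignment.

```python
import numpy as np

def update_mode_sequence(y, Xtil, Theta_y, Theta_x, rho):
    """Step 1.3 of Algorithm 1 for the cost (5a): assign to each sample the
    mode minimizing the fitting loss (5b) plus rho times the separation
    loss (5c). Without a transition penalty, this pointwise rule coincides
    with the DP solution."""
    T, K = Xtil.shape[0], Theta_y.shape[0]
    s = np.zeros(T, dtype=int)
    for t in range(T):
        costs = np.empty(K)
        for k in range(K):
            fit = (y[t] - Xtil[t] @ Theta_y[k]) ** 2
            sep = sum(max(0.0, (Theta_x[j] - Theta_x[k]) @ Xtil[t] + 1.0) ** 2
                      for j in range(K) if j != k)
            costs[k] = fit + rho * sep
        s[t] = int(np.argmin(costs))
    return s
```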

Algorithm 1 [Coordinate descent for learning PWARX models [3]]

Input: Data \{Y,X\}; initial mode sequence \mathcal{S}^{0}; \rho\in\mathbb{R}^{+}.

1. for i=1,\ldots do
   1.1. estimate the local functions:
        \Theta_{y}^{i}\leftarrow\mathop{\rm arg\ min}\nolimits_{\Theta_{y}}\ell_{y}(Y,X,\Theta_{y},\mathcal{S}^{i-1})+r_{y}(\Theta_{y});
   1.2. estimate the linear separators:
        \Theta_{x}^{i}\leftarrow\mathop{\rm arg\ min}\nolimits_{\Theta_{x}}\ell_{x}(X,\Theta_{x},\mathcal{S}^{i-1})+r_{x}(\Theta_{x});
   1.3. estimate the mode sequence:
        \mathcal{S}^{i}\leftarrow\mathop{\rm arg\ min}\nolimits_{\mathcal{S}}\ell_{y}(Y,X,\Theta_{y}^{i},\mathcal{S})+\rho\ell_{x}(X,\Theta_{x}^{i},\mathcal{S});
2. until \mathcal{S}^{i}=\mathcal{S}^{i-1}

Output: Local models \Theta_{y}^{\star}=\Theta_{y}^{i}; PWA separator \Theta_{x}^{\star}=\Theta_{x}^{i}; mode sequence \mathcal{S}^{\star}=\mathcal{S}^{i}.

Remark 1 (Inference hints)

Once a PWARX model is retrieved by running Algorithm 1, for each new regressor x_{t} the active local model can be computed as

\hat{s}_{t}=\arg\max_{k=1,\ldots,K}\left\{(\theta_{x,k}^{\star})^{\prime}\tilde{x}_{t}\right\}, (7a)

and, accordingly, the corresponding output can be obtained as follows:

\hat{y}_{t}=\tilde{x}_{t}^{\prime}\theta_{y,\hat{s}_{t}}^{\star}. (7b)

\blacksquare
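A minimal sketch of the inference rule (7a)-(7b) is given below; the function name and the row-wise storage of the parameters are our own illustrative choices.

```python
import numpy as np

def pwarx_predict(x, Theta_y, Theta_x):
    """One-step-ahead prediction as in Remark 1: select the active mode via
    the PWA separator (7a) and evaluate the corresponding local model (7b)."""
    xtil = np.append(x, 1.0)                   # extended regressor [x' 1]'
    s_hat = int(np.argmax(Theta_x @ xtil))     # (7a)
    y_hat = xtil @ Theta_y[s_hat]              # (7b)
    return y_hat, s_hat
```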

IV Regularization-based strategies for model structure selection

Suppose that either the number of sub-models characterizing the PWA map in (2) or their local structure is unknown and assume that bounds on the values of these unknowns are given by the maximum tolerable complexity of the estimated model.

To determine the unknown PWARX model structure (i.e., either the number of sub-models K or the model orders n_{a}, n_{b}), the regularization terms in the objective function of (5a) can be shaped so as to exploit well-known regularization-based shrinkage techniques and choose the structure of the estimated PWARX model in (2) directly from data. This allows one to avoid grid searches, which usually require an exhaustive exploration of the unknown parameter space.

In this work, starting from an over-parameterized model, we aim at devising iterative structure selection strategies that rely on runs of Algorithm 1 with tailored regularization terms. In particular, we focus on the selection of the group regularization r_{y}(\Theta_{y}), which acts on the parameters of the local models.

Two different choices of r_{y}(\Theta_{y}) are made to tackle the problems of estimating the number of sub-models K and the model orders n_{a},n_{b}, respectively. For both problems, we fix the regularizer r_{x}(\Theta_{x}) acting on the parameters \Theta_{x} of the PWA separator as follows

r_{x}(\Theta_{x})=\lambda\sum_{k=1}^{K}\|\theta_{x,k}\|_{2}^{2}, (8)

with \lambda\in\mathbb{R}^{+} being a hyper-parameter to be tuned. This choice makes the multi-category discrimination problem solved at step 1.2 of Algorithm 1 strictly convex, thus reducing the complexity of the resulting model. In the following, we introduce two different strategies based on the choice of r_{y}(\Theta_{y}), respectively tailored to handle the selection of K, for a fixed local model structure, and the choice of n_{a} and n_{b}, for a given number of local models. In the latter scenario, we assume that n_{a},n_{b}\geq 1, namely that the response of the system at a given instant always depends at least on the previous input/output pair. In both cases, we assume that the dataset is relatively balanced, namely we expect each local model to be associated with a reasonable number of samples, so that the overall model is identifiable with sufficient accuracy.
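For illustration, the separator update at step 1.2 with the ridge regularizer (8) can be posed as the convex program sketched below (a plain CVXPY formulation under our own naming); the case studies of Section V instead rely on the computationally efficient Newton-like method of [8].

```python
import numpy as np
import cvxpy as cp

def fit_pwa_separator(Xtil, s, K, lam):
    """Sketch of step 1.2 of Algorithm 1: fit Theta_x by minimizing the
    squared-hinge loss (5c) over all samples plus the ridge term (8)."""
    T = Xtil.shape[0]
    Theta = cp.Variable((K, Xtil.shape[1]))
    M = Xtil @ Theta.T                         # M[t, k] = theta_{x,k}' xtil_t
    loss = 0
    for t in range(T):
        k = int(s[t])
        for j in range(K):
            if j != k:                         # penalize violations of (4)
                loss += cp.square(cp.pos(M[t, j] - M[t, k] + 1))
    reg = lam * cp.sum_squares(Theta)          # regularizer (8)
    cp.Problem(cp.Minimize(loss + reg)).solve()
    return Theta.value
```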

IV-A Selection of the number of local models via L_{\infty} regularization

In this subsection, we outline the strategy to estimate the number of local models K for fixed model orders n_{a},n_{b}. Let us assume that at most K_{max} local models in (2) are sufficient to accurately describe f_{\mathrm{o}} in (1a) and that K_{max} is known. As a first step towards the definition of a strategy to select the number of sub-models, we introduce a regularization term r_{y}(\Theta_{y}) that shrinks the entire parameter vectors \theta_{y,k} towards zero when the corresponding sub-models are redundant. To this end, we choose r_{y}(\Theta_{y}) to be a convex combination of a group Tikhonov regularization and a term penalizing the mixed L_{1-\infty}-norm of the parameter vectors, i.e.,

r_{y}(\Theta_{y})=\mu\sum_{k=1}^{K}\|\theta_{y,k}\|_{2}^{2}+\nu\sum_{k=1}^{K}\|\theta_{y,k}\|_{\infty}, (9)

where \mu,\nu\in\mathbb{R}^{+} are hyper-parameters to be tuned. The term penalizing the sum of the infinity norms (i.e., the mixed L_{1-\infty}-norm) ensures that, at the solution, each vector \theta_{y,k} is either full or has all of its elements close to zero. Indeed, only the component of \theta_{y,k} with the largest absolute value affects this regularization term. The additional L_{2} regularization term is introduced for a better numerical conditioning of the problem when the parameter vector is not shrunk towards zero.

Since the shrinkage strategy might lead to parameter vectors with rather small elements that are still not identically zero, we label a local model as redundant if the associated parameter vector \theta_{y,k} satisfies the following inequality:

\|\theta_{y,k}\|_{\infty}\leq\delta, (10)

where \delta\in\mathbb{R}^{+} is a relatively small hyper-parameter to be tuned.

Accordingly, starting from an over-parameterized model with K=K_{max}, the procedure for the automatic selection of the model structure is summarized in Algorithm 2. At the beginning of each iteration, Algorithm 1 is run with the regularization term in (9). Therefore, the estimation of the local models for a fixed mode sequence \mathcal{S} (step 1.1 of Algorithm 1) entails the solution of the following problem:

\min_{\Theta_{y}}\sum_{t=1}^{T}(y_{t}-\tilde{x}_{t}^{\prime}\theta_{y,s_{t}})^{2}+\mu\sum_{k=1}^{K}\|\theta_{y,k}\|_{2}^{2}+\nu\sum_{k=1}^{K}\|\theta_{y,k}\|_{\infty}. (11)
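A possible CVXPY rendering of (11) is sketched below; since, for a fixed mode sequence, the fitting term decouples across modes, each \theta_{y,k} can be estimated from the samples currently assigned to mode k. Function and variable names are ours.

```python
import numpy as np
import cvxpy as cp

def fit_local_models_linf(y, Xtil, s, K, mu, nu):
    """Local-model update (11): squared fitting error on each cluster plus
    the group Tikhonov and L-infinity terms of the regularizer (9)."""
    y = np.asarray(y, dtype=float)
    Theta = np.zeros((K, Xtil.shape[1]))
    for k in range(K):
        idx = np.where(np.asarray(s) == k)[0]
        theta = cp.Variable(Xtil.shape[1])
        fit = cp.sum_squares(y[idx] - Xtil[idx] @ theta) if idx.size else 0
        reg = mu * cp.sum_squares(theta) + nu * cp.norm(theta, "inf")
        cp.Problem(cp.Minimize(fit + reg)).solve()
        Theta[k] = theta.value
    return Theta
```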

Once the new PWARX model is retrieved (i.e., once Algorithm 1 has stopped), redundant sub-models are detected by checking the condition in (10) on the estimated parameters \Theta_{y}^{\star}, and the number of local models is updated accordingly (see steps 1.2 and 1.4.1 of Algorithm 2). Specifically, at step 1.2 the set of indexes \mathcal{R}^{j} corresponding to the redundant sub-models is updated based on the criterion (10) and, at step 1.4.1, the number of local models K^{j} is updated by subtracting the number of redundant sub-models (i.e., the cardinality \#\mathcal{R}^{j} of the set \mathcal{R}^{j}) from the number of sub-models K^{j-1} estimated at the previous iteration. If no redundant model is detected (\#\mathcal{R}^{j}=0), at steps 1.3.1-1.3.2 we further check the clusters \{\mathcal{C}_{k}\}_{k=1}^{K} induced by \mathcal{S}^{\star}, i.e.,

\mathcal{C}_{k}=\{x_{t}\in X:s_{t}^{\star}=k\},\quad k=1,\ldots,K, (12)

and we exploit the assumption that the dataset is balanced to remove as many local models as the number of almost empty clusters, namely sets with cardinality \#\mathcal{C}_{k} smaller than or equal to 1\% of the overall dataset length T. This procedure is iterated until K does not vary over two consecutive iterations.
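The refinement logic at steps 1.2-1.4 of Algorithm 2 can be summarized by the following sketch, which applies the redundancy test (10) and, if no sub-model is flagged, the nearly-empty-cluster check based on (12); names and interface are our own.

```python
import numpy as np

def refine_number_of_modes(Theta_y, s, K, T, delta):
    """One refinement step of Algorithm 2: drop the sub-models flagged by
    (10); if none is redundant, drop as many modes as there are nearly
    empty clusters (12), i.e. clusters with #C_k <= 0.01*T."""
    redundant = [k for k in range(K)
                 if np.max(np.abs(Theta_y[k])) <= delta]     # test (10)
    if redundant:
        return K - len(redundant)
    sizes = np.array([np.sum(np.asarray(s) == k) for k in range(K)])
    return K - int(np.sum(sizes <= 0.01 * T))                # clusters (12)
```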

Once Algorithm 2 has terminated, a new instance of Algorithm 1 is run with r_{y}(\Theta_{y}) defined as in (9) with \nu=0, so as to identify the final PWARX model with K^{\star} local models.

Algorithm 2 [Selection of the number of sub-models K]

Input: Dataset \{Y,X\} of length T; initial mode sequence \mathcal{S}^{0}; \rho,\mu,\nu,\lambda,\delta\in\mathbb{R}^{+}; maximum number of local models K^{0}=K_{max}.

1. for j=1,\ldots do
   1.1. run Algorithm 1 with K=K^{j-1} and r_{y}(\Theta_{y}) as in (9);
   1.2. \mathcal{R}^{j}\leftarrow\{k\in\{1,\ldots,K^{j-1}\}:\|\theta_{y,k}^{\star}\|_{\infty}\leq\delta\};
   1.3. if \mathcal{R}^{j}=\emptyset then
        1.3.1. \mathcal{C}_{\emptyset}^{j}\leftarrow\{k\in\{1,\ldots,K^{j-1}\}:\#\mathcal{C}_{k}\leq 0.01\cdot T\};
        1.3.2. K^{j}\leftarrow K^{j-1}-K_{\emptyset}^{j}, with K_{\emptyset}^{j}=\#\mathcal{C}_{\emptyset}^{j};
   1.4. else
        1.4.1. K^{j}\leftarrow K^{j-1}-K_{R}^{j}, with K_{R}^{j}=\#\mathcal{R}^{j};
   1.5. end if
2. until K^{j}=K^{j-1}

Output: Number of sub-models K^{\star}=K^{j}.

IV-B Model order selection via the Elastic Net

In this section, we present the algorithm to estimate the model orders n_{a} and n_{b} of the local affine sub-models, assuming that the number of modes K and upper bounds on n_{a} and n_{b} are known.

To this end, the regularization term r_{y}(\Theta_{y}) needs to be selected in order not to shrink the entire parameter vector to zero (as done in the previous subsection), but only its redundant components. Accordingly, we exploit elastic net regularization by setting r_{y}(\Theta_{y}) as:

r_{y}(\Theta_{y})=\mu\sum_{k=1}^{K}\|\theta_{y,k}\|_{2}^{2}+\nu\sum_{k=1}^{K}\|\theta_{y,k}\|_{1}, (13)

with \mu,\nu\in\mathbb{R}^{+} being hyper-parameters to be tuned to balance the effect of the two regularization terms. This choice still allows us to enforce parameter selection, thanks to the shrinkage property of the lasso term, which drives to zero the parameters associated to unnecessary terms in the regressor, while the presence of the quadratic penalty leads to a more stable and efficient regularization path [18].
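With respect to the sketch given for (11), only the regularizer changes; a possible CVXPY objective for a single cluster with the elastic net (13) is reported below for illustration.

```python
import cvxpy as cp

def elastic_net_objective(y_k, Xtil_k, theta, mu, nu):
    """Squared fitting error on one cluster plus the elastic net term (13);
    theta is a CVXPY variable, y_k and Xtil_k collect the samples currently
    assigned to mode k (illustrative naming)."""
    return (cp.sum_squares(y_k - Xtil_k @ theta)
            + mu * cp.sum_squares(theta)
            + nu * cp.norm(theta, 1))
```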

As the regularization strategy might not lead to parameters that are exactly zero, we remove all terms in the regressor, except the affine one, whose coefficients satisfy

|[\theta_{y,k}]_{i}|\leq\delta,\quad i=1,\ldots,n_{a}+n_{b}, (14)

where [\theta_{y,k}]_{i} denotes the i-th component of \theta_{y,k} and \delta\in\mathbb{R}^{+} is a hyper-parameter to be tuned. By excluding the last component from this test, we imply that the affine term should never be neglected.

Note that, by testing the condition in (14) independently for each local model, we do not account for the fact that the local models share the same structure, and thus we cannot directly enforce this property on the PWA map in (2). Nonetheless, a possible heuristic to enforce it is to select n_{a} and n_{b} as follows:

\hat{n}_{a}=\max_{k=1,\ldots,K}\#\{i\in\{1,\ldots,n_{a}\}:|[\theta_{y,k}]_{i}|\geq\delta\}, (15a)
\hat{n}_{b}=\max_{k=1,\ldots,K}\#\{i\in\{1,\ldots,n_{b}\}:|[\theta_{y,k}]_{n_{a}+i}|\geq\delta\}, (15b)

which evaluates the condition in (14) for all local models and fixes the model order of all local models to that of the sub-model with the least shrinkage. We stress that this approach might be conservative, since model structure selection relies on the characteristics of at most 2 sub-models.
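A minimal numpy sketch of the selection rule (15) is given below, assuming that the parameter vectors are stacked row-wise with the affine term in the last column; the clamping to one reflects the assumption n_{a},n_{b}\geq 1 made at the beginning of this section.

```python
import numpy as np

def select_orders(Theta_y, na, nb, delta):
    """Order selection (15): for each sub-model, count the output and input
    lags whose coefficients exceed delta in magnitude, then take the maximum
    over the K sub-models so that all local models share the same structure."""
    A = np.abs(Theta_y)                 # K x (na+nb+1), last column = affine term
    na_hat = int(np.max(np.sum(A[:, :na] >= delta, axis=1)))         # (15a)
    nb_hat = int(np.max(np.sum(A[:, na:na + nb] >= delta, axis=1)))  # (15b)
    return max(na_hat, 1), max(nb_hat, 1)
```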

Algorithm 3 summarizes the proposed strategy for estimating the model orders of the local sub-models. Similarly to Algorithm 2, this strategy relies on repeated runs of Algorithm 1 with r_{y}(\Theta_{y}) defined as in (13). We start from an over-parameterized model with orders n_{a,max} and n_{b,max} and iterate Algorithm 3 until the model order is invariant over two consecutive iterations. The model order is updated at steps 1.3 and 1.4, according to the conditions in (15). Note that, since the model order is iteratively refined, at the beginning of every new iteration the dataset has to be updated along with the model structure, by redefining \{x_{t}\}_{t=1}^{T} according to the definition in (1b) (see step 1.1 of Algorithm 3).

Once the model structure has been retrieved, Algorithm 1 is run once more for n_{a}^{\star} and n_{b}^{\star} to obtain the PWARX model, imposing \nu=0 in (13).

Algorithm 3 [Selection of the sub-models structure]

Input: Dataset \{Y,U\} of length T; initial mode sequence \mathcal{S}^{0}; \rho,\mu,\nu,\lambda,\delta\in\mathbb{R}^{+}; maximum model complexity n_{a}^{0}=n_{a,max} and n_{b}^{0}=n_{b,max}.

1. for j=1,\ldots do
   1.1. construct X=\{x_{t}\}_{t=1}^{T} for n_{a}=n_{a}^{j-1} and n_{b}=n_{b}^{j-1};
   1.2. run Algorithm 1 with r_{y}(\Theta_{y}) as in (13);
   1.3. n_{a}^{j}\leftarrow\max_{k}\#\{i\in\{1,\ldots,n_{a}^{j-1}\}:|[\theta_{y,k}^{\star}]_{i}|\geq\delta\};
   1.4. n_{b}^{j}\leftarrow\max_{k}\#\{i\in\{1,\ldots,n_{b}^{j-1}\}:|[\theta_{y,k}^{\star}]_{i+n_{a}^{j-1}}|\geq\delta\};
2. until n_{a}^{j}=n_{a}^{j-1} and n_{b}^{j}=n_{b}^{j-1}

Output: Order of the sub-models n_{a}^{\star}=n_{a}^{j} and n_{b}^{\star}=n_{b}^{j}.

Remark 2

For a proper choice of the hyper-parameters \mu,\nu, the number of iterations required in Algorithms 2 and 3 might ideally be reduced to 1. Nevertheless, the iterative nature of Algorithms 2 and 3 is expected to result in a good selection of the model structure even when the hyper-parameters are not tuned to optimality. In this way, the burden of cross-validation/grid search for tuning the hyper-parameters can be reduced. \blacksquare

V Case studies

In this section, the performance of the proposed model selection strategies is assessed on two benchmark examples taken from [4]. The first academic example involves the identification of the PWARX system considered in [4], while the second case study involves modeling the dynamics of the placement process in a pick-and-place machine using experimental data [11]. In both case studies, the model structure selection and identification procedures are performed starting from over-parameterized models with K_{max}=10 or n_{a,max}=n_{b,max}=10, respectively. The tuning parameters \mu and \nu in the regularization terms (9) and (13) are selected so that \mu\in(0,1) and

\nu=1-\mu, (16)

respectively. In this way, we reduce the number of tunable parameters, by considering a convex combination of the L_{2} term and the shrinkage-inducing (L_{\infty} or L_{1}) regularization.

All computations are carried out on an Intel Core i7 2.8 GHz processor running MATLAB 2019a. The local regression problems in Algorithm 1 are solved by using the CVX package [10], while the PWA separator is estimated via the computationally efficient batch approach proposed in [8], by exploiting the auxiliary parameters used in the example reported therein. To improve the quality of the solution and reduce its dependence on the initial conditions, Algorithm 1 is always executed for N=20 different initial mode sequences \mathcal{S}^{0} and the PWARX model is selected as the one leading to the minimum cost.

V-A Numerical example: identification of a PWARX system

Consider the following PWARX data-generating system with K=3 operating conditions

y_{t}=\begin{cases}\left[\begin{smallmatrix}-0.4&1&1.5\end{smallmatrix}\right]\tilde{x}_{t}+e_{t}^{\mathrm{o}},&\mbox{if }\left[\begin{smallmatrix}4&-1&10\end{smallmatrix}\right]\tilde{x}_{t}<0,\\ \left[\begin{smallmatrix}0.5&-1&-0.5\end{smallmatrix}\right]\tilde{x}_{t}+e_{t}^{\mathrm{o}},&\mbox{if }\left[\begin{smallmatrix}4&-1&10\end{smallmatrix}\right]\tilde{x}_{t}\geq 0\ \&\ \left[\begin{smallmatrix}5&1&-6\end{smallmatrix}\right]\tilde{x}_{t}\leq 0,\\ \left[\begin{smallmatrix}-0.3&0.5&-1.7\end{smallmatrix}\right]\tilde{x}_{t}+e_{t}^{\mathrm{o}},&\mbox{if }\left[\begin{smallmatrix}5&1&-6\end{smallmatrix}\right]\tilde{x}_{t}>0,\end{cases} (17)

where

\tilde{x}_{t}=\begin{bmatrix}y_{t-1}&u_{t-1}&1\end{bmatrix}^{\prime}.

The system is excited with a sequence of random i.i.d. inputs uniformly distributed in the interval [-4,\ 4] and the corresponding outputs are corrupted by the zero-mean additive noise sequence e^{\mathrm{o}}, uniformly distributed in the interval [-0.8,\ 0.8]. To assess the performance of the proposed structure selection strategies over different realizations of the training set, we perform a Monte Carlo simulation with 80 runs. At each Monte Carlo iteration, a new dataset of length T=2000 is generated with different realizations of the input and measurement noise sequences. The effect of the noise e^{\mathrm{o}} on the measured output yields an average signal-to-noise ratio (SNR)

\mbox{SNR}=10\log\frac{\sum_{t=1}^{T}(y_{t}-e_{t}^{\mathrm{o}})^{2}}{\sum_{t=1}^{T}(e_{t}^{\mathrm{o}})^{2}}=16~\mbox{dB}. (18)
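For reproducibility, a minimal numpy sketch of the data-generation step (17)-(18) is reported below; the random seed is arbitrary and the resulting SNR only approximately matches the 16 dB average in (18).

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000
u = rng.uniform(-4.0, 4.0, T)            # i.i.d. input in [-4, 4]
e = rng.uniform(-0.8, 0.8, T)            # zero-mean uniform noise
y = np.zeros(T)
for t in range(1, T):
    xt = np.array([y[t - 1], u[t - 1], 1.0])          # extended regressor
    if np.array([4.0, -1.0, 10.0]) @ xt < 0:          # first region of (17)
        y[t] = np.array([-0.4, 1.0, 1.5]) @ xt + e[t]
    elif np.array([5.0, 1.0, -6.0]) @ xt <= 0:        # second region of (17)
        y[t] = np.array([0.5, -1.0, -0.5]) @ xt + e[t]
    else:                                             # third region of (17)
        y[t] = np.array([-0.3, 0.5, -1.7]) @ xt + e[t]

snr_db = 10 * np.log10(np.sum((y - e) ** 2) / np.sum(e ** 2))   # SNR as in (18)
```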

The hyper-parameters are set to \lambda=10^{-3}, \rho=1 and \mu=0.1, while \nu is selected according to (16).

V-A1 Unknown K

TABLE I: Sub-model parameters (unknown K): true vs estimated (mean ± standard deviation) over 78 Monte Carlo runs.
                                   k=1                     k=2                     k=3
                          true   mean±std        true   mean±std        true   mean±std
[\theta_{y,k}^{\star}]_{1}   -0.40  -0.40±0.02       0.50   0.50±0.01      -0.30  -0.30±0.02
[\theta_{y,k}^{\star}]_{2}    1.00   1.00±0.01      -1.00  -1.00±0.01       0.50   0.50±0.01
[\theta_{y,k}^{\star}]_{3}    1.50   1.48±0.07      -0.50  -0.50±0.02      -1.70  -1.70±0.04
TABLE II: Polyhedral partition parameters (unknown K): true vs estimated (mean ± standard deviation) over 78 Monte Carlo runs.
        first separator           second separator
   true       mean±std        true       mean±std
   4.00       4.00±0.06       5.00       5.01±0.08
  -1.00      -1.00±0.00       1.00       1.00±0.00
  10.00      10.00±0.13      -6.00      -6.01±0.11

Let us assume that the model order of the system is known, namely that n_{a}=n_{b}=1, but that the number of local models K is unknown. We run Algorithm 2 to estimate the appropriate number of modes, with \delta in (10) set to 0.01. Throughout the Monte Carlo campaign, the true number of modes K^{\star}=3 is not retrieved only twice; in those cases the number of sub-models is estimated to be either 4 or 5. The average values and standard deviations of the estimated sub-model parameters obtained over the remaining 78 runs (for which the true number of modes K^{\star}=3 has been retrieved) are reported in TABLE I. These results clearly show that the proposed approach allows us to accurately reconstruct the local models. In turn, this implies an accurate reconstruction of the polyhedral partition, as proven by the comparison in TABLE II between the true parameters characterizing the partition and the mean and standard deviation of the ones obtained from the estimated PWA separator. Overall, the structure of the underlying PWARX system has been reconstructed accurately.

V-A2 Unknown n_{a} and n_{b}

Figure 1: Numerical example (fixed number of modes): frequency of the selected orders n_{a}^{\star} and n_{b}^{\star}.
TABLE III: Sub-model parameters (unknown order): true vs estimated (mean ± standard deviation) over 80 Monte Carlo runs.
                                     k=1                        k=2                        k=3
                           true    mean±std         true    mean±std         true    mean±std
[\theta_{y,k}^{\star}]_{1}   -0.400  -0.406±0.022       0.500   0.503±0.016      -0.300  -0.299±0.014
[\theta_{y,k}^{\star}]_{2}    0.000   0.001±0.007       0.000   0.001±0.006       0.000   0.001±0.009
[\theta_{y,k}^{\star}]_{3}    0.000   0.000±0.004       0.000   0.000±0.002       0.000   0.001±0.005
[\theta_{y,k}^{\star}]_{4}    0.000   0.000±0.000       0.000   0.000±0.000       0.000   0.000±0.003
[\theta_{y,k}^{\star}]_{5}    1.000   1.000±0.009      -1.000  -1.000±0.007       0.500   0.501±0.006
[\theta_{y,k}^{\star}]_{6}    0.000   0.002±0.011       0.000  -0.001±0.010       0.000   0.000±0.010
[\theta_{y,k}^{\star}]_{7}    0.000  -0.001±0.004       0.000   0.000±0.003       0.000   0.000±0.002
[\theta_{y,k}^{\star}]_{8}    1.500   1.475±0.079      -0.500  -0.497±0.018      -1.700  -1.701±0.059
TABLE IV: Polyhedral partition parameters (unknown order): true vs estimated (mean ± standard deviation) over 80 Monte Carlo runs.
          first separator              second separator
    true        mean±std          true        mean±std
    4.000       4.001±0.055       5.000       5.020±0.092
    0.000       0.000±0.010       0.000       0.000±0.016
    0.000      -0.001±0.006       0.000       0.000±0.006
    0.000       0.000±0.000       0.000       0.000±0.000
   -1.000      -1.000±0.000       1.000       1.000±0.000
    0.000      -0.002±0.019       0.000      -0.004±0.028
    0.000       0.000±0.004       0.000      -0.001±0.005
   10.000      10.000±0.132      -6.00       -6.024±0.131

We now assume that the number of sub-models K=3 is known, while the model orders n_{a},n_{b} of the local sub-models are unknown. Algorithm 3 is run over the 80 Monte Carlo realizations to estimate n_{a} and n_{b} along with the parameters of the PWARX model, with \delta=0.01. As shown in the histograms summarized in Figure 1, the model orders of the initial over-parameterized model (with n_{a,max}=n_{b,max}=10) are shrunk and the final model is close in structure to the true one. From Figure 1, it can be seen that the proposed approach detects the true model orders n_{a}^{\star}=1 and n_{b}^{\star}=1 in about 60\% and 40\% of the Monte Carlo runs, respectively. Note that, in about 58\% of the runs, a higher input delay order n_{b}=2 is estimated. However, in these cases, as shown in TABLE III, the values of the corresponding model parameters are close to zero and negligible with respect to the others. Therefore, they barely influence the performance of the identified model. We stress that the results shown in TABLE III are obtained after a final iteration of Algorithm 1 with \nu in (13) set to zero and with the model orders selected by running Algorithm 3. The average and standard deviation of the parameters characterizing the regressor-space partition are reported in TABLE IV, and they are also accurately estimated. Therefore, despite the slight error in the reconstruction of the model orders in some of the Monte Carlo runs, the overall model structure as well as the estimated parameters match closely those of the true system.

V-B Modeling the dynamics of a pick-and-place process

As a second case study, we test the proposed algorithms in modeling the electronic component placement process in a pick-and-place machine. The set-up consists of a mounting head carrying an electronic component, which is pushed towards a printed circuit board and released as soon as the two come in contact [11]. The process is characterized by two main operating modes: one associated with the behavior of the pick-and-place machine when it operates in an unconstrained environment, and the other with the case in which the mounting head is in contact with the circuit board.

A set of experimental data collected over an interval of 15 s with a sampling frequency of 400 Hz is available to model this process (see [11] for further details). This dataset comprises samples of the voltage applied to the motor driving the mounting head of the machine and the vertical position of the mounting head, which represent the input u and output y of the process, respectively. The overall data record is then split into two disjoint sets comprising the T=4800 samples associated to the first 12 s of the experiment and the remaining \tilde{T}=1200 samples, that are respectively used to train and validate a PWARX model for the process of interest.

The regularization parameters used in training are selected as \rho=2.15\cdot 10^{-4} and \mu=10^{-5} (see [3]), with \nu chosen according to (16), while \lambda=10^{-8}. In validation, the performance attained by simulating the learned PWARX model in open loop is quantitatively assessed via the best fit rate (BFR) index, i.e.,

\mbox{BFR}=100\cdot\max\left\{0,1-\sqrt{\frac{\sum_{t=1}^{\tilde{T}}(y_{t}-\hat{y}_{t})^{2}}{\sum_{t=1}^{\tilde{T}}(y_{t}-\bar{y})^{2}}}\right\}\% (19)

with \bar{y} and \hat{y}_{t} denoting the average output over the validation set and the reconstructed output, respectively.
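A direct numpy implementation of the BFR index (19) is sketched below for reference; the function name is our own.

```python
import numpy as np

def bfr(y_true, y_sim):
    """Best fit rate (19), in percent, computed over the validation set."""
    y_true, y_sim = np.asarray(y_true, float), np.asarray(y_sim, float)
    num = np.sqrt(np.sum((y_true - y_sim) ** 2))
    den = np.sqrt(np.sum((y_true - np.mean(y_true)) ** 2))
    return 100.0 * max(0.0, 1.0 - num / den)
```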

(a) K^{\star}=3 and n_{a}=n_{b}=2
(b) K=2 and n_{a}^{\star}=n_{b}^{\star}=2
Figure 2: Pick-and-place machine: simulated vs actual output (top) and reconstructed mode sequence, for fixed local model structure (left) and fixed number of sub-models (right).

V-B1 Unknown K

Let n_{a} and n_{b} be fixed according to the order of the PWARX model estimated in [4], namely n_{a}=n_{b}=2. We run Algorithm 2 with \delta=10^{-3} in (10), which terminates after 6 iterations and leads to the selection of K^{\star}=3 modes. The proposed structure selection strategy thus introduces an additional mode with respect to the two main operating conditions of the underlying process. Nonetheless, we drastically reduce the complexity of the model with respect to the initial over-parameterized model, characterized by K_{max}=10 modes. The comparison between the output reconstructed in validation with the learned three-mode PWARX model and the actual one is reported in Figure 2(a), along with the sequence of active local models. This result corresponds to a BFR of 81.2\%.

V-B2 Unknown n_{a} and n_{b}

We now assume that the number of local models is fixed to K=2 and we run Algorithm 3 to estimate the sub-model orders, setting the threshold for selection in (14) to \delta=10^{-3}. Algorithm 3 terminates after 5 iterations and the resulting PWARX model is of second order, namely n_{a}^{\star}=n_{b}^{\star}=2. Thus, the complexity of the initial over-parameterized model with n_{a,max}=n_{b,max}=10 is shrunk, leading to a much simpler model, which is the same as the one used in [4, 3]. Figure 2(b) shows the comparison between the actual and the simulated output in validation, along with the obtained active mode sequence, which corresponds to a BFR of 79.8\%. It can be observed that the estimated PWARX model with this simpler structure is still able to accurately describe the behavior of the placement process.

VI Conclusions

The paper has presented a preliminary study on the application of regularization-based strategies to select the model structure of a PWARX model. The proposed approaches combine the selective power of existing shrinkage techniques with heuristics, thus providing a tool for data-driven model structure selection. The results obtained in the considered examples highlight the potential of these strategies.

Future research will be devoted to finding less conservative criteria for the selection of the local model structure and to addressing the complete structure selection problem. The sensitivity of the presented algorithms w.r.t. the set of hyper-parameters will be analyzed and auto-tuning procedures for these parameters will be investigated. Thanks to the generality of the coordinate-descent approach on which our strategies rely, future work will also include their extension to other classes of switching models.

References

  • [1] L. Bako. Identification of switched linear systems via sparse optimization. Automatica, 47(4):668 – 677, 2011.
  • [2] L. Bako, K. Boukharouba, E. Duviella, and S. Lecoeuche. A recursive identification algorithm for switched linear/affine models. Nonlinear Analysis: Hybrid Systems, 5(2):242 – 253, 2011.
  • [3] A. Bemporad, V. Breschi, D. Piga, and S. Boyd. Fitting jump models. Automatica, 96:11 – 21, 2018.
  • [4] A. Bemporad, A. Garulli, S. Paoletti, and A. Vicino. A bounded-error approach to piecewise affine system identification. IEEE Transactions on Automatic Control, 50(10):1567–1580, Oct 2005.
  • [5] A. Bemporad and M. Morari. Control of systems integrating logic, dynamics, and constraints. Automatica, 35(3):407 – 427, 1999.
  • [6] D. P. Bertsekas. Nonlinear programming. Athena scientific Belmont, 1999.
  • [7] L. Breiman. Hinging hyperplanes for regression, classification, and function approximation. IEEE Transactions on Information Theory, 39(3):999–1013, 1993.
  • [8] V. Breschi, D. Piga, and A. Bemporad. Piecewise affine regression via recursive multiple least squares and multicategory discrimination. Automatica, 73:155–162, 2016.
  • [9] G. Ferrari-Trecate, M. Muselli, D. Liberati, and M. Morari. A clustering technique for the identification of piecewise affine systems. Automatica, 39(2):205 – 217, 2003.
  • [10] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx, March 2014.
  • [11] A. Juloski, W.P.M.H. Heemels, G. Ferrari-Trecate, R. Vidal, S. Paoletti, and J.H.G. Niessen. Comparison of four procedures for the identification of hybrid systems. In HSCC, volume 3414 of Lecture Notes in Computer Science, pages 354–369. Springer, 2005.
  • [12] F. Lauer. On the complexity of piecewise affine system identification. Automatica, 62:148 – 153, 2015.
  • [13] J.-N. Lin and R. Unbehauen. Canonical piecewise-linear approximations. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 39(8):697–699, 1992.
  • [14] V. V. Naik, M. Mejari, D. Piga, and A. Bemporad. Regularized moving-horizon piecewise affine regression using mixed-integer quadratic programming. In 2017 25th Mediterranean Conference on Control and Automation (MED), pages 1349–1354, 2017.
  • [15] H. Nakada, K. Takaba, and T. Katayama. Identification of piecewise affine systems based on statistical clustering technique. IFAC Proceedings Volumes, 37(12):179 – 184, 2004.
  • [16] H. Ohlsson and L. Ljung. Identification of switched linear regression models using sum-of-norms regularization. Automatica, 49(4):1045 – 1050, 2013.
  • [17] J. Roll, A. Bemporad, and L. Ljung. Identification of piecewise affine systems via mixed-integer programming. Automatica, 40(1):37 – 50, 2004.
  • [18] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.