
Proximal Linearized Method for Sparse Equity Portfolio Optimization with Minimum Transaction Cost

Hong Seng Sim
Centre for Mathematical Sciences, Universiti Tunku Abdul Rahman, 43000 Kajang, Selangor, MALAYSIA.

Wendy Shin Yie Ling
Department of Mathematics and Statistics, Faculty of Science, Universiti Putra Malaysia, 43400 UPM Serdang, Selangor, MALAYSIA.

Wah June Leong
Department of Mathematics and Statistics, Faculty of Science, Universiti Putra Malaysia, 43400 UPM Serdang, Selangor, MALAYSIA. Institute for Mathematical Research, Universiti Putra Malaysia, 43400 UPM Serdang, Selangor, MALAYSIA.

Chuei Yee Chen (Corresponding author: [email protected])
Department of Mathematics and Statistics, Faculty of Science, Universiti Putra Malaysia, 43400 UPM Serdang, Selangor, MALAYSIA.
Abstract

In this paper, we propose a sparse equity portfolio optimization (SEPO) model based on the mean-variance portfolio selection model. Aimed at minimizing transaction cost by avoiding small investments, this new model includes $\ell_{0}$-norm regularization of the asset weights to promote sparsity, hence the acronym SEPO-$\ell_{0}$. The selection model is also subject to a minimum expected return. The complexity of the model calls for a proximal method, which allows us to handle the objective terms separately via the corresponding proximal operators. We develop an efficient ADMM-like algorithm to find the optimal portfolio and prove its global convergence. The efficiency of the algorithm is demonstrated using real stock data, and the model is promising in portfolio selection in terms of generating higher expected return while maintaining a good level of sparsity, thus minimizing transaction cost.

Keywords: Portfolio optimization, sparse portfolio, minimum transaction cost, mean-variance model, proximal method

1 Introduction

Introduced by Markowitz [20] in 1952, mean-variance optimization (MVO) has been widely used in the selection of optimal investment portfolios. The success of MVO is attributed to the simplicity of its quadratic objective function, which in turn can be solved by quadratic programming (QP) solvers that are widely available. However, MVO has its own flaws, and its implementation in portfolio optimization has been heavily criticized by academics and professionals [22]. One of these flaws, as pointed out by Michaud [21], is its sensitivity towards input parameters, which maximizes the errors associated with these inputs. This was proven theoretically and computationally by Best and Grauer [3], where a slight change in the assets' expected returns or correlations results in large changes in portfolio weights. Despite that, MVO remains one of the most successful frameworks, owing to the absence of alternative models that are simple enough to be cast as a QP problem.

Over the past decade or so, the success of robust optimization techniques has allowed researchers to consider non-quadratic objective functions and regularization for portfolio optimization. Consequently, the work by Daubechies et al. [12] showed that the usual quadratic regularizing penalties can be replaced by weighted $\ell_{p}$-norm penalties with $p\in[1,2]$. Two specific cases in portfolio optimization, namely the lasso when $p=1$ and ridge regression when $p=2$, were considered by Brodie et al. [9] and DeMiguel et al. [14], respectively. While ridge regression regularization minimizes the sample variance subject to a constraint that leads to diversification, lasso regularization encourages sparse portfolios, which in turn leads to the minimization of transaction cost. Such regularizations have been studied notably by Chen et al. [10], De Mol [13] and Fastrich et al. [15].

In reality, financial institutions charge their customers transaction fees for trading over the stock market. The two most common charging schemes are a fixed transaction fee and/or a proportion of the investment amount, whichever is higher. In general, a large number of transactions results in higher transaction cost, often driven by small investments that incur the fixed transaction fee. Transaction cost, in this sense, affects both the portfolio optimization and the frequency of portfolio rebalancing. On the other hand, diversification is the practice of spreading investments around so that exposure to any one type of asset is limited. This practice can help mitigate the risk and volatility in the portfolio, but it potentially increases the number of investment components and thus the number of transactions. Therefore, a more realistic model is needed to strike a balance between diversification and minimizing transaction cost for optimal portfolio selection.

Due to the complexity of the objective function and the regularization involved, much of the existing literature employs the alternating direction method of multipliers (ADMM), first introduced by Gabay and Mercier [17] in 1976. It was not until the recent decade that ADMM received much attention in machine learning problems. The essence of ADMM is that it allows one to handle the objective terms separately when they can only be approximated using proximal operators. Its appealing features in large-scale convex optimization problems include ease of implementation and relatively good performance (see, for instance, Boyd et al. [8], Fazel et al. [16] and Perrin and Roncalli [22]). Examples of ADMM-like algorithms in portfolio optimization can be found in Chen et al. [10], Dai and Wen [11] and Lai et al. [18], where they are used to solve $\ell_{p}$-regularized problems with $p\in[1,2]$. Though the $\ell_{0}$-norm is ideal for sparsity problems, its regularization results in a discontinuous and nonconvex problem, whose computation turns out to be complicated.

In this paper, we propose a new algorithmic framework to maximize the sparsity of the portfolio while promoting diversification, i.e. to minimize the $\ell_{0}$-norm and $\ell_{2}$-norm of the asset weights, respectively, subject to a minimum expected return via MVO. We first transform the constrained problem into an unconstrained one, whose objective contains non-smooth and non-convex terms. The technique of ADMM allows us to handle these terms separately, while the resulting algorithm nevertheless converges to its optimal solution. Numerical results using real data are also provided to illustrate the reliability of the proposed model and its efficiency in generating higher expected return while minimizing transaction cost when compared to the standard MVO.

This paper is organized as follows: In Section 2, we present a model for sparse equity portfolio optimization with minimum transaction cost and establish the proximal linearized method for $\ell_{0}$-norm minimization. Subsequently, in Section 3, we present an ADMM algorithm to find the optimal portfolio of the proposed model, together with its convergence analysis. To illustrate the reliability and efficiency of our method, we present the numerical results using real stock data in Section 4. Finally, the conclusion of the paper is presented in Section 5.

2 Proximal Linearized Method for $\ell_{0}$-norm Minimization

We begin with a universe of $n$ assets under consideration, with mean return vector $\mu\in\mathbb{R}^{n}$ and covariance matrix $V\in\mathbb{R}^{n\times n}$. Let $x\in\mathbb{R}^{n}$ be the vector of asset weights in the portfolio. Our objective is to maximize the portfolio return $\mu^{T}x$ and minimize the variance of portfolio return $x^{T}Vx$, while maintaining a certain level of diversification $\|x\|^{2}_{2}$ and minimizing transaction cost $\|x\|_{0}$. The variance of the portfolio return is the measure of risk inherent in investing in a portfolio, and we shall refer to it as variance risk throughout this paper. The portfolio is said to be purely concentrated if there exists $i$ such that $x_{i}=1$, and equally weighted if $x_{i}=\frac{1}{n}$ for all $i$. Assume that the capital is fully invested, so that $e^{T}x=1$ where $e\in\mathbb{R}^{n}$ is the all-one vector. The sparse equity portfolio optimization model with minimum transaction cost (SEPO-$\ell_{0}$) is as follows:

$$\begin{aligned}
\min_{x\in\mathbb{R}^{n}}\quad & \frac{\beta_{1}}{2}x^{T}Vx-\mu^{T}x+\frac{\beta_{2}}{2}\|x\|^{2}_{2}+\|x\|_{0} \quad (2.1)\\
\text{s.t.}\quad & \mu^{T}x\geq r,\\
& e^{T}x=1,\\
& x\geq 0,
\end{aligned}$$

where $\beta_{1}>0$ is a parameter for leveraging the portfolio variance risk, $\beta_{2}>0$ is a parameter for leveraging portfolio diversification, and $r\geq 0$ is the minimum guaranteed return ratio with $r\leq\max\{\mu_{i}\}$.
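Before turning to the algorithm, it may help to see the objective of (2.1) evaluated numerically. The following is a minimal sketch in Python; the function name and inputs are illustrative, and the constraints of (2.1) are handled separately later in this section:

```python
import numpy as np

# A small sketch evaluating the SEPO-l0 objective in (2.1) for a candidate
# weight vector x; V, mu, beta1, beta2 follow the notation above.
def sepo_objective(x, V, mu, beta1, beta2):
    variance_risk = 0.5 * beta1 * (x @ V @ x)   # (beta1/2) x^T V x
    diversification = 0.5 * beta2 * (x @ x)     # (beta2/2) ||x||_2^2
    sparsity = np.count_nonzero(x)              # ||x||_0
    return variance_risk - mu @ x + diversification + sparsity
```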

In a standard MVO, diversification is of general importance to reduce portfolio risk without necessarily reducing portfolio return. While diversification does not mean that we add more money into our investment, it certainly does reduce our investment value, as the investment in each equity incurs transaction cost. Our proposed method takes diversified investments into consideration while avoiding the small investments that diversification might otherwise induce, which incur unnecessary transaction costs. Note that the sparsity measure of the vector $x\in\mathbb{R}^{n}$ is given by

$$\|x\|_{0}:=\text{number of nonzero components of }x.$$

Minimizing the $\ell_{0}$-norm in (2.1) promotes sparsity within the portfolio, since the values of $x_{i}$ are forced to be zero except for the large ones, thus minimizing the transaction cost.

Our model (2.1) poses computational difficulties due to the non-convexity and discontinuity of the $\ell_{0}$-norm and the inequality constraint $\mu^{T}x\geq r$. Instead of dealing with the problem in its entirety, we employ the alternating direction method of multipliers (ADMM) so that the smooth and non-smooth terms can be handled separately. This calls for a brief introduction of proximal operators and the Moreau envelope [23]:

Definition 2.1.

Let $\psi\colon\mathbb{R}^{n}\rightarrow\mathbb{R}\cup\{+\infty\}$ be a proper and lower semicontinuous function and $\sigma>0$ be a parameter. The proximal operator of $\psi$ is defined as

$$\text{prox}_{\sigma\psi}(x)=\mathop{\arg\min}\limits_{y\in\mathbb{R}^{n}}\left\{\psi(y)+\frac{1}{2\sigma}\|y-x\|^{2}_{2}\right\}. \quad (2.2)$$

Its Moreau envelope (or Moreau-Yosida regularization) is defined by

$$\text{env}_{\sigma\psi}(x)=\inf_{y\in\mathbb{R}^{n}}\left\{\psi(y)+\frac{1}{2\sigma}\|y-x\|^{2}_{2}\right\}. \quad (2.3)$$

The parameter $\sigma$ can be interpreted as a trade-off between minimizing $\psi$ and staying close to $x$. The Moreau envelope, specifically, is a way to smooth a non-smooth function, and it can be shown that the infimum defining $\text{env}_{\sigma\psi}(x)$ is attained at the points of $\text{prox}_{\sigma\psi}(x)$.

Suppose now we are given a problem

$$\min\quad\psi(x)+\phi(x)$$

where $\psi,\phi\colon\mathbb{R}^{n}\rightarrow\mathbb{R}\cup\{+\infty\}$ are closed proper functions, both of which may be nonsmooth. Under the ADMM algorithm, each iteration $k$ takes on an alternating nature, with the proximal operators of $\psi$ and $\phi$ evaluated separately:

$$\begin{aligned}
x^{k+1} &\in\text{prox}_{\sigma\psi}(z^{k}-u^{k});\\
z^{k+1} &\in\text{prox}_{\sigma\phi}(x^{k+1}+u^{k});\\
u^{k+1} &:=u^{k}+x^{k+1}-z^{k+1}.
\end{aligned}$$

Viewing the above as a fixed-point iteration, the ADMM scheme results in $x=z$ such that

$$\begin{aligned}
x &=\text{prox}_{\sigma\psi}(x-u),\\
x &=\text{prox}_{\sigma\phi}(x+u).
\end{aligned}$$
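For illustration, the scheme above takes only a few lines of code. This is a generic sketch, assuming the two proximal operators are supplied as functions of a point and the parameter $\sigma$ (the operators specific to our model are derived below); the names are illustrative, not from the paper:

```python
import numpy as np

# A minimal sketch of the ADMM iteration above, with the proximal operators
# prox_psi and prox_phi passed in as functions.
def admm(prox_psi, prox_phi, x0, sigma=1.0, max_iter=500, tol=1e-8):
    x = z = x0.copy()
    u = np.zeros_like(x0)              # scaled dual variable
    for _ in range(max_iter):
        x = prox_psi(z - u, sigma)     # x-update: prox of psi
        z = prox_phi(x + u, sigma)     # z-update: prox of phi
        u = u + x - z                  # dual update
        if np.linalg.norm(x - z) < tol:  # x = z at the fixed point
            break
    return x
```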

Turning our attention back to our problem (2.1), we first denote the set $R$ associated with the inequality constraint in (2.1) by

$$R=\left\{x\in\mathbb{R}^{n}\colon\mu^{T}x\geq r\right\}, \quad (2.4)$$

and the indicator function of $R$ by

$$I_{R}(x)=\left\{\begin{array}{ll}0,&x\in R,\\ \infty,&x\notin R.\end{array}\right. \quad (2.5)$$

We now define the augmented Lagrangian corresponding to problem (2.1):

$$\begin{aligned}\mathcal{L}(x,\lambda,\rho)=&\ \frac{\beta_{1}}{2}x^{T}Vx-\mu^{T}x+\frac{\beta_{2}}{2}\|x\|^{2}_{2}+\|x\|_{0}+I_{R}(x)\\
&+\lambda\left(e^{T}x-1\right)+\frac{\rho}{2}\left(e^{T}x-1\right)^{2},\end{aligned} \quad (2.6)$$

where $\lambda$ is the usual Lagrange multiplier and $\rho>0$ is the penalty parameter for the equality constraint $e^{T}x=1$. We may set $\rho$ to be a constant with value greater than 4 [2], so that problem (2.6) can be rewritten as

$$\begin{aligned}\mathcal{L}(x,\lambda)=&\ \frac{\beta_{1}}{2}x^{T}Vx-\mu^{T}x+\frac{\beta_{2}}{2}\|x\|^{2}_{2}+\|x\|_{0}+I_{R}(x)\\
&+\lambda\left(e^{T}x-1\right)+\frac{\rho}{2}\left(e^{T}x-1\right)^{2},\end{aligned} \quad (2.7)$$

where $x$ and $\lambda$ are updated via

$$\begin{aligned}
x^{k+1} &=\mathop{\arg\min}\limits_{x}\mathcal{L}\left(x,\lambda^{k}\right),\\
\lambda^{k+1} &=\lambda^{k}+\rho(e^{T}x^{k+1}-1).
\end{aligned}$$

Problem (2.7) can now be viewed as the following minimization problem:

$$\min_{x,\lambda}\quad P(x,\lambda)+Q(x), \quad (2.8)$$

where $P(x,\lambda)$ consists of the smooth terms given by

$$P(x,\lambda)=\frac{\beta_{1}}{2}x^{T}Vx-\mu^{T}x+\frac{\beta_{2}}{2}\|x\|^{2}_{2}+\lambda\left(e^{T}x-1\right)+\frac{\rho}{2}\left(e^{T}x-1\right)^{2}, \quad (2.9)$$

and $Q(x)$ the non-smooth terms given by

$$Q(x)=\|x\|_{0}+I_{R}(x). \quad (2.10)$$

For the purpose of our discussion on the proximal method, we let $\lambda$ be a fixed value, say $\hat{\lambda}$, and deal with the following minimization problem:

$$\min_{x}\quad P(x,\hat{\lambda})+Q(x). \quad (2.11)$$

Our proximal method, inspired by Beck and Teboulle [1], for minimizing the objective function in (2.11) can be viewed as the proximal regularization of $P$ linearized at a given point $x^{k}$:

$$x^{k+1}\in\mathop{\arg\min}\limits_{x}\left\{Q(x)+(x-x^{k})^{T}\nabla P(x^{k})+\frac{1}{2t}\|x-x^{k}\|^{2}\right\}, \quad (2.12)$$

where $t>0$ and $\nabla$ denotes the derivative operator. Invoking simple algebra and ignoring the constant terms, (2.12) can be written as

$$x^{k+1}\in\mathop{\arg\min}\limits_{x}\left\{Q(x)+\frac{1}{2t^{k}}\left\|x-(x^{k}-t^{k}\nabla P(x^{k}))\right\|^{2}\right\}. \quad (2.13)$$

Using Definition 2.1, the iterative scheme consists of a proximal step at a resulting gradient point, which gives us the proximal gradient method:

$$x^{k+1}\in\text{prox}_{\alpha^{k}Q}\left(x^{k}-\alpha^{k}\nabla P\left(x^{k}\right)\right), \quad (2.14)$$

where $\alpha^{k}>0$ is a suitable step size. Note that if $\nabla P$ is Lipschitz continuous with constant $L_{c}$, then the proximal gradient method is known to converge at a rate of $\mathcal{O}(1/k)$ with fixed step size $\alpha\in(0,1/L_{c}]$ (Boyd et al. [8]). When $L_{c}$ is not known, the step sizes can be chosen via line search methods (see, for example, Beck and Teboulle [1]). In the context of line search methods, the largest possible step size $\alpha=1$ is more desirable. Therefore, proximal gradient methods usually take the fixed step size $\alpha=\min\{1,1/L_{c}\}$. In our case, the Lipschitz continuity of $\nabla P$ gives

$$\begin{aligned}\left\|\nabla P(x)-\nabla P(y)\right\|_{2} &=\left\|\beta_{1}V(x-y)+\beta_{2}(x-y)+\rho ee^{T}(x-y)\right\|_{2}\\
&\leq\|\beta_{1}V+\beta_{2}I+\rho ee^{T}\|_{F}\,\|x-y\|_{2}\end{aligned} \quad (2.15)$$

for all $x,y\in\mathbb{R}^{n}$, where $I$ denotes the identity matrix and $\|\cdot\|_{F}$ denotes the Frobenius norm. Since the Lipschitz constant in (2.15) is not easily accessible, we can estimate it in the following way:

$$L_{c}\leq\beta_{1}\|V\|_{F}+\beta_{2}\|I\|_{F}+\rho\|ee^{T}\|_{F}=\beta_{1}\sqrt{\text{tr}\left(VV^{T}\right)}+\beta_{2}\sqrt{n}+\rho n=:\tilde{L}_{c}, \quad (2.16)$$

where tr denotes the matrix trace. Since $\tilde{L}_{c}>1$, it is clear that $\min\{1,1/\tilde{L}_{c}\}$ will always return the value $1/\tilde{L}_{c}$. We shall henceforth fix our step size $\alpha=1/\tilde{L}_{c}$. Our choice of step size follows from the well-known descent property below:
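The bound (2.16) is straightforward to compute. A minimal sketch in Python, with $V$, $\beta_{1}$, $\beta_{2}$ and $\rho$ following the notation of problem (2.1):

```python
import numpy as np

# Step-size rule from (2.16): an upper bound L_tilde on the Lipschitz
# constant of grad P, used to fix alpha = 1 / L_tilde.
def step_size(V, beta1, beta2, rho):
    n = V.shape[0]
    L_tilde = (beta1 * np.sqrt(np.trace(V @ V.T))   # beta1 * ||V||_F
               + beta2 * np.sqrt(n)                 # beta2 * ||I||_F
               + rho * n)                           # rho * ||e e^T||_F
    return 1.0 / L_tilde
```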

Lemma 2.1 (Descent property [1]).

Let $\psi:\mathbb{R}^{n}\to\mathbb{R}$ be a continuously differentiable function with gradient $\nabla\psi$ assumed to be $L_{c}$-Lipschitz continuous. Then, for any $\tilde{L}_{c}\geq L_{c}$,

$$\psi(x)\leq\psi(y)+(x-y)^{T}\nabla\psi(y)+\frac{\tilde{L}_{c}}{2}\|x-y\|^{2},\quad\forall x,y\in\mathbb{R}^{n}. \quad (2.17)$$

Using the proximal operator defined in Definition 2.1, the minimization of (2.12) is equivalent to the following step:

$$x^{k+1}\in\text{prox}_{\alpha Q}\left(x^{k}-\alpha\nabla P\left(x^{k}\right)\right), \quad (2.18)$$

where $\alpha=\frac{1}{\tilde{L}_{c}}$. The choice of $\tilde{L}_{c}$ also guarantees the sufficient decrease of our objective function under the proximal method:

Lemma 2.2 (Sufficient decrease property [7]).

Let $\psi:\mathbb{R}^{n}\to\mathbb{R}$ be a $C^{1}$ function whose gradient $\nabla\psi$ is Lipschitz continuous with modulus $L_{c}$. Let $\phi:\mathbb{R}^{n}\to(-\infty,+\infty]$ be a proper and lower semicontinuous function with $\inf_{\mathbb{R}^{n}}\phi>-\infty$. Suppose $\tilde{L}_{c}$ is chosen such that $\tilde{L}_{c}>L_{c}$. Then, for any $x\in\text{dom }\phi$ and any $\hat{x}\in\mathbb{R}^{n}$ defined by

$$\hat{x}\in\text{prox}_{\alpha\phi}\left(x-\alpha\nabla\psi\left(x\right)\right),\quad\alpha=\frac{1}{\tilde{L}_{c}}, \quad (2.19)$$

we have

$$\psi\left(\hat{x}\right)+\phi\left(\hat{x}\right)\leq\psi(x)+\phi(x)-\frac{1}{2}\left(\tilde{L}_{c}-L_{c}\right)\|\hat{x}-x\|^{2}. \quad (2.20)$$

Note that $\text{dom }\phi$ in Lemma 2.2 denotes the set of points at which the proper and lower semicontinuous function $\phi:\mathbb{R}^{n}\rightarrow\mathbb{R}\cup\{+\infty\}$ takes a finite value:

$$\text{dom }\phi=\{x\in\mathbb{R}^{n}\colon\phi(x)<+\infty\}.$$

In view of Lemma 2.2, we turn to our non-smooth term $Q(x)$, which can be written as the following unconstrained problem:

$$\min_{x\in\mathbb{R}^{n}}\ \ \|x\|_{0}+I_{R}(x). \quad (2.21)$$

It follows from the definition of the Moreau envelope that our unconstrained optimization problem (2.21) becomes

$$\min_{x,y\in\mathbb{R}^{n}}\ \ \|y\|_{0}+\frac{1}{2\sigma}\|y-x\|_{2}^{2}+I_{R}(x), \quad (2.22)$$

for some $\sigma>0$. It is known that if $(x^{*},y^{*})$ is a solution of (2.22) for any $\sigma>0$, then $y^{*}\in\text{prox}_{\sigma\|\cdot\|_{0}}(x^{*})$. In addition, $x^{*}$ is a solution of problem (2.21) if and only if $(x^{*},y^{*})$ is a solution to (2.22). The proximal problem (2.13) now becomes

$$\left\{\begin{aligned} &y^{k+1}\in\text{prox}_{\sigma\|\cdot\|_{0}}\left(y^{k}-\frac{1}{\tilde{L}_{c}}\nabla P\left(y^{k}\right)\right),\\ &x^{k+1}=\text{prox}_{I_{R}}\left(y^{k+1}\right).\end{aligned}\right. \quad (2.23)$$

In particular, the proximal operator of the indicator function $I_{R}$ reduces to a projection onto $R$:

$$\text{prox}_{I_{R}}(x)=\left\{\begin{array}{ll}x,&\text{if }\mu^{T}x\geq r,\\ \frac{r}{\mu^{T}x}x,&\text{if }\mu^{T}x<r.\end{array}\right. \quad (2.24)$$

Meanwhile, the proximal operator of the $\ell_{0}$-norm can be expressed in its component-wise form:

$$\left(\text{prox}_{\sigma\|\cdot\|_{0}}(x)\right)_{i}=\left\{\begin{array}{cl}\{0\},&\text{if }x_{i}<\sqrt{2\sigma},\\ \{0,x_{i}\},&\text{if }x_{i}=\sqrt{2\sigma},\\ \{x_{i}\},&\text{if }x_{i}>\sqrt{2\sigma}.\end{array}\right. \quad (2.25)$$

Note that $\text{prox}_{\sigma\|\cdot\|_{0}}(x)$ is known as a hard-thresholding operator, since it forces all components $x_{i}$ except the large ones to be zero [23]. In other words, a larger $\sigma$ results in higher sparsity and less penalization for moving away from $x$. Doing so ensures that our portfolio selection avoids small investments.
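The threshold $\sqrt{2\sigma}$ in (2.25) can be checked directly from Definition 2.1 in the scalar case: for a component $x_{i}\geq 0$, any nonzero $y$ incurs cost $1+(y-x_{i})^{2}/(2\sigma)$, minimized at $y=x_{i}$ with value $1$, while $y=0$ costs $x_{i}^{2}/(2\sigma)$; hence

$$y=0\ \text{ is optimal }\iff\ \frac{x_{i}^{2}}{2\sigma}<1\iff x_{i}<\sqrt{2\sigma},$$

with a tie at $x_{i}=\sqrt{2\sigma}$. Both proximal maps (2.24) and (2.25) are cheap to evaluate. Below is a minimal sketch in Python, assuming nonnegative weights as in (2.1) and breaking the tie in (2.25) by keeping $x_{i}$:

```python
import numpy as np

def prox_l0(x, sigma):
    """Hard thresholding (2.25): zero out entries below sqrt(2*sigma)."""
    tau = np.sqrt(2.0 * sigma)
    return np.where(x >= tau, x, 0.0)

def prox_indicator_R(x, mu, r):
    """Map onto R = {x : mu^T x >= r} via the scaling in (2.24)."""
    m = mu @ x
    return x if m >= r else (r / m) * x  # assumes mu^T x > 0 when scaling
```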

In the next section, we will see how the proximal operators are evaluated alternately to give us the optimal solution for problem (2.1).

3 Alternating proximal algorithm and its convergence

In this section, we present an ADMM algorithm to find the optimal portfolio of the proposed SEPO-$\ell_{0}$ model (2.1) and establish its global convergence.

SEPO-$\ell_{0}$ Algorithm

Step 0. Given $\beta_{1},\beta_{2},\sigma,r,V,\mu,\rho,\alpha$, an initial point $(x^{0},\lambda^{0})$ and a convergence tolerance $\varepsilon$. Set $k:=0$.

Step 1. Compute $\hat{x}^{k+1}\in\text{prox}_{\sigma\|\cdot\|_{0}}\left(x^{k}-\alpha\nabla P(x^{k},\lambda^{k})\right)$.

Step 2. Compute $x^{k+1}=\text{prox}_{I_{R}}\left(\hat{x}^{k+1}\right)$.

Step 3. Compute $\lambda^{k+1}=\lambda^{k}+\rho(e^{T}x^{k+1}-1)$.

Step 4. If $\left\|\nabla P\left(x^{k+1},\lambda^{k+1}\right)\right\|<\varepsilon$ or $k>10000$, stop. Otherwise, set $k:=k+1$ and go to Step 1.
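Putting the pieces together, a compact sketch of the algorithm in Python is given below. It reuses the helper functions step_size, prox_l0 and prox_indicator_R sketched earlier, starts from the equally-weighted portfolio used in Section 4, and assumes $\lambda^{0}=0$; it is an illustration of the iteration, not the authors' reference implementation:

```python
import numpy as np

def grad_P(x, lam, V, mu, beta1, beta2, rho):
    """Gradient of the smooth part P(x, lambda) in (2.9)."""
    e = np.ones_like(x)
    return beta1 * (V @ x) - mu + beta2 * x + lam * e + rho * (e @ x - 1.0) * e

def sepo_l0(V, mu, beta1, beta2, sigma, r, rho=5.0, eps=1e-7, max_iter=10000):
    n = len(mu)
    x, lam = np.full(n, 1.0 / n), 0.0        # x^0 equally weighted; lambda^0 = 0 (assumed)
    alpha = step_size(V, beta1, beta2, rho)  # alpha = 1 / L_tilde from (2.16)
    for _ in range(max_iter):
        g = grad_P(x, lam, V, mu, beta1, beta2, rho)
        x_hat = prox_l0(x - alpha * g, sigma)     # Step 1: hard thresholding
        x = prox_indicator_R(x_hat, mu, r)        # Step 2: map onto R
        lam = lam + rho * (x.sum() - 1.0)         # Step 3: multiplier update
        if np.linalg.norm(grad_P(x, lam, V, mu, beta1, beta2, rho)) < eps:
            break                                 # Step 4: stopping rule
    return x
```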

We have seen in Section 2 how the proposed proximal method guarantees the descent of the solution. To proceed with the convergence of the SEPO-$\ell_{0}$ algorithm, we begin with Assumption A for any objective function $\mathcal{L}\colon\mathbb{R}^{n}\rightarrow\mathbb{R}\cup\{+\infty\}$ where $\mathcal{L}=\psi+\phi$:

Assumption A

(i) $\psi\colon\mathbb{R}^{n}\rightarrow\mathbb{R}$ is a continuously differentiable function whose gradient $\nabla\psi$ is Lipschitz continuous with modulus $L_{c}$.

(ii) $\phi\colon\mathbb{R}^{n}\rightarrow\mathbb{R}\cup\{+\infty\}$ is a proper and lower semicontinuous function.

(iii) $\inf_{\mathbb{R}^{n}}\psi>-\infty$ and $\inf_{\mathbb{R}^{n}}\phi>-\infty$.

The SEPO-$\ell_{0}$ algorithm also enjoys the following convergence properties for (2.7):

Lemma 3.1 (Convergence properties [7]).

Suppose that Assumption A holds. Let $\{x^{k}\}_{k\in\mathbb{N}}$ be a sequence generated by the SEPO-$\ell_{0}$ algorithm. Then, the sequence $\{\mathcal{L}(x^{k},\lambda^{k}):k\in\mathbb{N}\}$ is nonincreasing and in particular

$$\mathcal{L}\left(x^{k}\right)-\mathcal{L}\left(x^{k+1}\right)\geq\frac{1}{2}\left(\tilde{L}_{c}-L_{c}\right)\|x^{k+1}-x^{k}\|^{2}. \quad (3.1)$$

Moreover,

$$\sum^{\infty}_{k=1}\left\|x^{k+1}-x^{k}\right\|^{2}<\infty, \quad (3.2)$$

and hence

$$\lim_{k\rightarrow\infty}\ \left\|x^{k+1}-x^{k}\right\|=0. \quad (3.3)$$
Proof.

Without loss of generality, we let $\lambda$ be a fixed constant and work with $\mathcal{L}(x)=P(x)+Q(x)$ in place of $\mathcal{L}(x,\lambda)$, where $P(x)$ is given by (2.9) and $Q(x)$ is given by (2.10). Note that $P(x)$ is differentiable and its gradient is Lipschitz continuous with modulus $L_{c}$. Invoking the SEPO-$\ell_{0}$ algorithm and Lemma 2.2, we have

$$P\left(x^{k+1}\right)+Q\left(x^{k+1}\right)\leq P\left(x^{k}\right)+Q\left(x^{k}\right)-\frac{1}{2}\left(\tilde{L}_{c}-L_{c}\right)\left\|x^{k+1}-x^{k}\right\|^{2}, \quad (3.4)$$

where $\tilde{L}_{c}$ is given by (2.16). Writing $\mathcal{L}(x^{k})=P(x^{k})+Q(x^{k})$ in (3.4) and rearranging leads to (3.1), which asserts that the sequence $\{\mathcal{L}(x^{k},\lambda^{k}):k\in\mathbb{N}\}$ is nonincreasing.

Note that $P$ and $Q$ are bounded below (see Assumption A), and hence $\mathcal{L}$ converges to some $\underline{\mathcal{L}}$. Let $N\in\mathbb{N}_{+}$. We sum up (3.1) from $k=0$ to $k=N-1$ to get

$$\begin{aligned}\sum^{N-1}_{k=0}\left\|x^{k+1}-x^{k}\right\|^{2} &\leq\frac{2}{\tilde{L}_{c}-L_{c}}\sum^{N-1}_{k=0}\left(\mathcal{L}\left(x^{k}\right)-\mathcal{L}\left(x^{k+1}\right)\right)\\
&=\frac{2}{\tilde{L}_{c}-L_{c}}\left(\mathcal{L}\left(x^{0}\right)-\mathcal{L}\left(x^{N}\right)\right)\\
&\leq\frac{2}{\tilde{L}_{c}-L_{c}}\left(\mathcal{L}\left(x^{0}\right)-\underline{\mathcal{L}}\right).\end{aligned}$$

It follows that (3.2) and (3.3) hold when we take the limit as $N\rightarrow\infty$. ∎

Before we present the result that sums up the properties of the sequence $\{x^{k}\}_{k\in\mathbb{N}}$ generated by the SEPO-$\ell_{0}$ algorithm starting from the initial point $x^{0}$, we first give some basic notations. We denote by $\text{crit }\mathcal{L}$ the set of critical points of $\mathcal{L}$ and by $\omega\left(x^{0}\right)$ the set of all limit points, where

$$\omega\left(x^{0}\right)=\left\{\overline{x}\in\mathbb{R}^{n}:\exists\text{ an increasing sequence of integers }\{k_{j}\}_{j\in\mathbb{N}}\text{ such that }x^{k_{j}}\rightarrow\overline{x}\text{ as }j\rightarrow\infty\right\}.$$

Given any set $\Omega\subset\mathbb{R}^{n}$ and any point $x\in\mathbb{R}^{n}$, the distance from $x$ to $\Omega$ is denoted and defined by

$$\text{dist}(x,\Omega):=\inf\left\{\|y-x\|\colon y\in\Omega\right\}.$$

When $\Omega=\emptyset$, we invoke the usual convention that $\inf\emptyset=\infty$ and hence $\text{dist}(x,\Omega)=\infty$ for all $x$.

Lemma 3.2 (Properties of limit points [7]).

Suppose that Assumption A holds. Let $\{x^{k}\}_{k\in\mathbb{N}}$ be a bounded sequence generated by the SEPO-$\ell_{0}$ algorithm. Then, the following hold:

(a) $\omega\left(x^{0}\right)$ is a nonempty, compact and connected set.

(b) $\omega\left(x^{0}\right)\subset\text{crit }\mathcal{L}$.

(c) $\lim_{k\rightarrow\infty}\text{dist}\left(x^{k},\omega\left(x^{0}\right)\right)=0$.

(d) The objective function $\mathcal{L}$ is finite and constant on $\omega\left(x^{0}\right)$.

Proof.

See Bolte et al. [7]. ∎

What remains is its global convergence, which we shall establish by means of the Kurdyka-Łojasiewicz (KL) property [7], an extension of the Łojasiewicz gradient inequality [19] to non-smooth functions. We first show that the objective function (2.7) is semi-algebraic and therefore a KL function. This, in turn, is crucial in giving us the convergence property of the sequences generated via the SEPO-$\ell_{0}$ algorithm. We begin by recalling notations and definitions concerning the subdifferential (see, for instance, [7, 23]) and the KL property.

Definition 3.1.

Let $\phi\colon\mathbb{R}^{n}\rightarrow\mathbb{R}\cup\{+\infty\}$ be a proper and lower semicontinuous function. The (limiting) subdifferential of $\phi$ at $x\in\text{dom }\phi$ is denoted and defined by

$$\partial\phi(x)=\left\{u\in\mathbb{R}^{n}\colon\exists\,x^{k}\rightarrow x,\ \phi(x^{k})\rightarrow\phi(x),\ u^{k}\rightarrow u,\ \liminf_{y\rightarrow x^{k}}\ \frac{\phi(y)-\phi(x^{k})-\langle u^{k},y-x^{k}\rangle}{\|y-x^{k}\|}\geq 0\right\}. \quad (3.5)$$

The point $x$ is called a (limiting) critical point of $\phi$ if $0\in\partial\phi(x)$.

It follows that $0\in\partial\phi(x)$ if $x\in\mathbb{R}^{n}$ is a local minimizer of $\phi$. If $\phi$ is continuously differentiable, then $\partial\phi(x)=\{\nabla\phi(x)\}$ and hence we have the usual gradient mapping $\nabla\phi$ from $x\in\text{dom }\phi$ to $\nabla\phi(x)$. If $\phi$ is convex, the subdifferential (3.5) turns out to be the classical Fréchet subdifferential (see [23]).

Let $\eta\in(0,\infty]$ and denote by $\Phi_{\eta}$ the class of all concave and continuous functions $\varphi\colon[0,\eta)\rightarrow\mathbb{R}_{+}$ that are continuously differentiable on $(0,\eta)$ and continuous at $0$, with $\varphi(0)=0$ and $\varphi^{\prime}(s)>0$ for all $s\in(0,\eta)$.

Definition 3.2 (Kurdyka-Łojasiewicz (KL) property).

Let $\phi\colon\mathbb{R}^{n}\rightarrow\mathbb{R}\cup\{+\infty\}$ be a proper and lower semicontinuous function. The function $\phi$ is said to have the Kurdyka-Łojasiewicz (KL) property at $\bar{u}\in\text{dom }\partial\phi:=\left\{u\in\mathbb{R}^{n}\colon\partial\phi(u)\neq\emptyset\right\}$ if there exist $\eta\in(0,+\infty]$, a neighbourhood $U$ of $\bar{u}$ and a function $\varphi\in\Phi_{\eta}$ such that for all $u\in U\cap\left[\phi(\bar{u})<\phi<\phi(\bar{u})+\eta\right]$, the following inequality holds:

$$\varphi^{\prime}\left(\phi(u)-\phi(\bar{u})\right)\ \text{dist}\left(0,\partial\phi\left(u\right)\right)\geq 1. \quad (3.6)$$

Moreover, $\phi$ is called a KL function if it satisfies the KL property at each point of $\text{dom }\phi$.

The definition above uses sublevel sets: given $a,b\in\mathbb{R}$, the sublevel sets of a function $\phi$ are denoted and defined by

$$[a\leq\phi\leq b]:=\left\{x\in\mathbb{R}^{n}\colon a\leq\phi(x)\leq b\right\}.$$

A similar definition holds for $[a<\phi<b]$. The level sets of $\phi$ are denoted and defined by

$$[\phi=a]:=\left\{x\in\mathbb{R}^{n}\colon\phi(x)=a\right\}.$$

Closely related to the KL function is the semi-algebraic function, which is crucial in the proof of the convergence property of our proposed method.

Definition 3.3 (Semi-algebraic sets and functions).
(i) A subset $\Omega\subset\mathbb{R}^{n}$ is called a semi-algebraic set if there exists a finite number of real polynomial functions $p_{ij}$ and $q_{ij}$ such that

$$\Omega=\bigcup^{p}_{j=1}\bigcap^{q}_{i=1}\left\{u\in\mathbb{R}^{n}\colon p_{ij}(u)=0\text{ and }q_{ij}(u)<0\right\}. \quad (3.7)$$

(ii) A function $\phi\colon\mathbb{R}^{n}\rightarrow\mathbb{R}\cup\{+\infty\}$ is called a semi-algebraic function if its graph

$$\left\{(u,t)\in\mathbb{R}^{n+1}\colon\phi(u)=t\right\} \quad (3.8)$$

is a semi-algebraic subset of $\mathbb{R}^{n+1}$.

It follows that semi-algebraic functions are indeed KL functions, as asserted by the result below, which is a non-smooth version of the Łojasiewicz gradient inequality.

Theorem 3.4 ([5, 6]).

Let $\phi\colon\mathbb{R}^{n}\rightarrow\mathbb{R}\cup\{+\infty\}$ be a proper and lower semicontinuous function. If $\phi$ is semi-algebraic, then it is a KL function.

Theorem 3.4 allows us to avoid the technicalities of proving the KL property directly, owing to the broad range of functions and sets that are indeed semi-algebraic (see, for instance, [4, 7]). Examples of semi-algebraic functions include real polynomial functions and indicator functions of semi-algebraic sets. Moreover, finite sums and products of semi-algebraic functions, as well as their scalar products, are all semi-algebraic.

We are now ready to give the global convergence result of the proposed model (2.1).

Theorem 3.5 (Global convergence).

Suppose the objective function $\mathcal{L}\colon\mathbb{R}^{n}\rightarrow\mathbb{R}\cup\{+\infty\}$ is a KL function such that Assumption A holds. Then the sequence $\{x^{k}\}_{k\in\mathbb{N}}$ generated by the SEPO-$\ell_{0}$ algorithm converges to a critical point $x^{*}$.

Proof.

See Bolte et al. [7]. ∎

In view of Theorem 3.5, it remains to show that (2.7) is a KL function; by Theorem 3.4, it suffices to show that each term in (2.7) is semi-algebraic, since a finite sum of semi-algebraic functions is also semi-algebraic. The function (2.7) is a sum of a smooth function $P(x)$, the $\ell_{0}$-norm and an indicator function. The function $P(x)$ given by (2.9) is a linear combination of linear and quadratic functions, and hence $P(x)$ is a real polynomial function, which in turn is semi-algebraic.

As a specific example given by Bolte et al. [7], the $\ell_{0}$-norm is nothing but the sparsity measure of the vector $x\in\mathbb{R}^{n}$, which is indeed semi-algebraic. In particular, the graph of $\|\cdot\|_{0}$ is given by a finite union of product sets:

$$\text{graph }\|\cdot\|_{0}=\bigcup_{I\subset\{1,\dots,n\}}\left(\prod^{n}_{i=1}J^{I}_{i}\right)\times\{n-|I|\}, \quad (3.9)$$

where for any given $I\subset\{1,\dots,n\}$, $|I|$ denotes the cardinality of $I$ and

$$J^{I}_{i}=\left\{\begin{array}{ll}\{0\},&\text{if }i\in I,\\ \mathbb{R}\setminus\{0\},&\text{otherwise}.\end{array}\right.$$

It is obvious that (3.9) is a piecewise linear set, hence the claim. Lastly, the indicator function $I_{R}(x)$ defined by (2.5) is also semi-algebraic, since the feasible set (2.4) is a convex set described by a linear (polynomial) inequality and is therefore semi-algebraic.

4 Numerical experiments and results

In this section, we study the efficiency of the proposed portfolio optimization model, SEPO-$\ell_{0}$, in maximizing portfolio return and minimizing transaction cost. We test our algorithm on real data of stock prices and returns of 100 companies across 10 different sectors in China, collected from January 2019 to June 2019. These data are in turn used to generate the covariance matrix, which gives us the portfolio variance in our problem (2.1). We start with the equally-weighted portfolio, i.e. $x_{i}^{0}=\frac{1}{n}$ for all $i$. We set $\varepsilon=10^{-7}$ and stop the algorithm when $\left\|\nabla P\left(x^{k+1},\lambda^{k+1}\right)\right\|<\varepsilon$ or $k>10000$. All computational results are obtained by running Matlab R2021a on Windows 10 (Intel Core i7-1065G7 CPU @ 1.30 GHz–1.50 GHz, 16 GB RAM).
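For readers who wish to reproduce this setup, the preparation step amounts to estimating $\mu$ and $V$ from a matrix of observed returns. A minimal sketch in Python (the paper's experiments use Matlab; the file name and return matrix here are illustrative assumptions):

```python
import numpy as np

# Assumes `returns` is a T x n matrix of observed asset returns
# (T periods, n = 100 stocks).
returns = np.loadtxt("stock_returns.csv", delimiter=",")  # hypothetical input file
mu = returns.mean(axis=0)           # mean return vector
V = np.cov(returns, rowvar=False)   # n x n covariance matrix
n = len(mu)
x0 = np.full(n, 1.0 / n)            # equally-weighted starting portfolio
```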

For testing purposes, we set the penalty parameter $\rho=5$ and the tuning parameter $\beta_{2}=1$. The latter means that we keep the weight on portfolio diversification constant. Meanwhile, the value of $\beta_{1}$ is chosen to be relatively smaller than $\beta_{2}$. For illustration, we present our results for minimum guaranteed return ratios $r=0.1$ and $r=0.2$.

In Table 4.1, we present the computational results for the expected return, variance risk and sparsity under the proposed SEPO-$\ell_{0}$ model and the standard MVO model for different values of $\beta_{1}$, with the minimum guaranteed return ratio set to 0.1 and 0.2, respectively. Note that though we leverage on the variance risk when $\beta_{1}=1$, the portfolio selection under SEPO-$\ell_{0}$ manages to generate expected returns of 0.3455 and 0.4014 for $r=0.1$ and $r=0.2$, respectively. Meanwhile, the standard MVO is only able to generate an expected return of 0.1560 when we set $r=0.1$. The variance risks, however, are higher under the proposed model due to sparsity, as compared to the maximum diversification of the standard MVO. From the table, we can see that our model offers a good level of sparsity, between 30% and 61% when $r=0.1$ and between 52% and 72% when $r=0.2$. This simply means that out of the 100 stocks considered under minimum expected return ratio $r=0.1$, one only needs to invest in the selected 39 to 70 stocks for which the algorithm returns nonzero $x_{i}$'s. Despite the sparse portfolio selection and increased risk, the proposed model is more promising in terms of higher expected return.

Columns 2–4: SEPO-$\ell_{0}$ with $r=0.1$; columns 5–7: standard MVO with $r=0.1$; columns 8–10: SEPO-$\ell_{0}$ with $r=0.2$.

| $\beta_{1}$ | E.R. | V.R. | Spar | E.R. | V.R. | Spar | E.R. | V.R. | Spar |
|---|---|---|---|---|---|---|---|---|---|
| 0.1 | 0.6355 | 3.2835 | 58% | 0.6889 | 2.3603 | 0% | 0.7441 | 4.3108 | 72% |
| 0.2 | 0.6279 | 2.9577 | 61% | 0.5735 | 1.5138 | 0% | 0.6732 | 3.2822 | 66% |
| 0.3 | 0.5050 | 2.1304 | 47% | 0.4555 | 1.0320 | 0% | 0.5829 | 2.4655 | 58% |
| 0.4 | 0.5180 | 2.0288 | 53% | 0.3760 | 0.8114 | 0% | 0.5796 | 2.2333 | 64% |
| 0.5 | 0.4865 | 1.7976 | 51% | 0.3003 | 0.6689 | 0% | 0.5374 | 1.9056 | 64% |
| 0.6 | 0.4237 | 1.5684 | 39% | 0.2646 | 0.5785 | 0% | 0.4675 | 1.6193 | 52% |
| 0.7 | 0.3677 | 1.4070 | 30% | 0.2223 | 0.5324 | 0% | 0.4655 | 1.4800 | 59% |
| 0.8 | 0.3581 | 1.3248 | 31% | 0.2057 | 0.4930 | 0% | 0.4521 | 1.3289 | 63% |
| 0.9 | 0.3787 | 1.2635 | 44% | 0.1750 | 0.4704 | 0% | 0.4182 | 1.2149 | 56% |
| 1.0 | 0.3455 | 1.1802 | 40% | 0.1560 | 0.4501 | 0% | 0.4014 | 1.1204 | 56% |

Note: E.R. = Expected return, V.R. = Variance risk, Spar = Sparsity.

Table 4.1: Portfolio expected return, risk and sparsity for different $\beta_{1}$ with minimum guaranteed return ratios $r=0.1$ and $r=0.2$ under SEPO-$\ell_{0}$ and the standard MVO.
[Figure 4.1: SEPO-$\ell_{0}$ vs standard MVO with minimum guaranteed return ratio $r=0.1$; panel (a) shows expected return, panel (b) shows variance risk.]

We also compare the expected return and variance risk of SEPO-$\ell_{0}$ and the standard MVO for $r=0.1$ using a scatter plot, as seen in Figure 4.1. The downward trends of the portfolio expected return and risk mimic the standard MVO as $\beta_{1}\rightarrow 1$. Note that a higher value of $\beta_{1}$ reflects our leverage on the variance risk over expected return. At the same time, higher expected return comes with higher risk, as shown in Table 4.1. In general, the standard MVO model gives a lower measure of risk due to maximum diversification, as we can see from Table 4.1 and Figure 4.1. The proposed SEPO-$\ell_{0}$, on the other hand, can lead to higher expected return and lower total transaction cost due to a sparse portfolio. This shows that the SEPO-$\ell_{0}$ model is able to provide a good combination of portfolio selection under sparsity.

To illustrate the reliability of our model, we present the output of the proposed model for $r=0.1$ using a scatter plot of the variables, as shown in Figure 4.2, with $\beta_{1}$ as the independent variable on the $x$-axis, expected return and sparsity (in decimal) on the left $y$-axis, and risk on the right $y$-axis. We can observe a similar trend for the three lines, which clearly reflects the consistency of our model in obtaining optimal portfolio selections.

[Figure 4.2: Portfolio expected return, risk and sparsity subject to minimum guaranteed return ratio $r=0.1$ under SEPO-$\ell_{0}$, for different values of $\beta_{1}$.]

The relationship between the independent variable $\beta_{1}$ and the response variables is further examined using a multivariate linear regression model, as presented in Table 4.2. As we can see from the table, the slope estimates for the response variables $y$ are all negative, which means their values decrease as $\beta_{1}$ increases. Since the $p$-values of all response variables are approximately zero, it is clear that these three variables are significant. In particular, $\beta_{1}$ has a significant negative relationship with expected return, risk and sparsity.

| Response variable | Intercept estimate | Slope estimate | Standard error | $p$-value | R-squared |
|---|---|---|---|---|---|
| Expected return, $y_{1}$ | 0.6514 | -0.3396 | 0.0383 | 2.0737e-05 | 0.9076 |
| Risk, $y_{2}$ | 3.1246 | -2.2371 | 0.2992 | 7.0894e-05 | 0.8748 |
| Sparsity, $y_{3}$ | 0.6013 | -0.2679 | 0.0796 | 9.8657e-05 | 0.5859 |

Table 4.2: Relationship between the independent variable $\beta_{1}$ and the response variables using a multivariate linear regression model for SEPO-$\ell_{0}$ with $r=0.1$. The slope estimate, standard error and $p$-value refer to the coefficient on $\beta_{1}$ in each regression.

The significance of $\beta_{1}$ for these three dependent variables is supported by the R-squared values of the univariate regressions, standing at 90.76%, 87.48% and 58.59% for expected return, risk and sparsity, respectively. Since R-squared is the percentage of total variation explained by the predictor variable, the high R-squared values of greater than 80% for expected return and risk mean that $\beta_{1}$ explains a high percentage of the variance in these two response variables. It is slightly lower for sparsity; however, any R-squared value greater than 50% can be considered moderately high.
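As a quick check, the expected-return row of Table 4.2 can be reproduced from the $r=0.1$ SEPO-$\ell_{0}$ columns of Table 4.1 with an ordinary least-squares fit; the sketch below uses Python and SciPy for illustration (risk and sparsity are analogous):

```python
import numpy as np
from scipy import stats

# Univariate regression of expected return on beta1, using the r = 0.1
# SEPO-l0 column of Table 4.1.
beta1 = np.arange(0.1, 1.01, 0.1)
exp_ret = np.array([0.6355, 0.6279, 0.5050, 0.5180, 0.4865,
                    0.4237, 0.3677, 0.3581, 0.3787, 0.3455])
fit = stats.linregress(beta1, exp_ret)
print(fit.intercept, fit.slope, fit.pvalue, fit.rvalue**2)
# roughly 0.6514, -0.3396, 2.07e-05, 0.9076, matching Table 4.2
```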

5 Conclusion

The classical Markowitz portfolio scheme, or mean-variance optimization (MVO), is one of the most successful frameworks due to its simplicity of implementation; in particular, it can be solved by quadratic programming, which is widely available. However, it is very sensitive to input parameters, and obtaining acceptable solutions requires the right weight constraints. Over the past decade, there has been renewed attention on non-quadratic portfolio selection models, due to advancements in optimization algorithms for solving more general classes of functions. Here we proposed a new algorithmic framework that allows portfolio managers to strike a balance between diversifying investments and minimizing transaction cost, the latter achieved by minimizing the $\ell_{0}$-norm. This simply means that the model maximizes sparsity within the portfolio, since the weights $x_{i}$ are forced to be zero except for large ones. In practice, $\ell_{0}$ regularization results in a discontinuous and nonconvex problem, and hence is often approximated via the $\ell_{1}$-norm. In this study, we employed proximal methods so that the function can be 'smoothed', by linearizing part of the objective function at some given point and regularizing by a quadratic proximal term that acts as a measure of the 'local error' in the approximation. Writing our problem in the form of an augmented Lagrangian, the unconstrained problem can be divided into two parts, namely the smooth and non-smooth terms. These terms are then handled separately through their proximal operators via the ADMM method. The global convergence of the proposed SEPO-$\ell_{0}$ algorithm for sparse equity portfolios has been established. The efficiency of our model in maximizing portfolio expected return while striking a balance between minimizing transaction cost and diversification has been analyzed using actual data of 100 companies. Empirically, the implementation of our model leads to higher expected return and lower transaction cost. This shows that, despite its higher risk compared to the standard MVO, the SEPO-$\ell_{0}$ model is promising in generating a good combination for an optimal investment portfolio.

References

  • [1] A. Beck and M. Teboulle. Convex Optimization in Signal Processing and Communications, chapter Gradient-based algorithms with applications to signal recovery problems, pages 42–88. Cambridge University Press, 2009.
  • [2] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 3rd edition, 2016.
  • [3] M. J. Best and R. R. Grauer. On the sensitivity of mean-variance-efficient portfolios to changes in asset means: some analytical and computational results. The Review of Financial Studies, 4(2):315–342, 1991.
  • [4] J. Bochnak, M. Coste, and M. F. Roy. Real algebraic geometry, volume 36. Springer Science & Business Media, 2013.
  • [5] J. Bolte, A. Daniilidis, and A. Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205–1223, 2007.
  • [6] J. Bolte, A. Daniilidis, A. Lewis, and M. Shiota. Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2):556–572, 2007.
  • [7] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1):459–494, 2014.
  • [8] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
  • [9] J. Brodie, I. Daubechies, C. De Mol, D. Giannone, and I. Loris. Sparse and stable Markowitz portfolios. Proceedings of the National Academy of Sciences, 106(30):12267–12272, 2009.
  • [10] J. Chen, G. Dai, and N. Zhang. An application of sparse-group lasso regularization to equity portfolio optimization and sector selection. Annals of Operations Research, 284(1):243–262, 2020.
  • [11] Z. Dai and F. Wen. A generalized approach to sparse and stable portfolio optimization problem. Journal of Industrial & Management Optimization, 14(4):1651–1666, 2018.
  • [12] I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 57(11):1413–1457, 2004.
  • [13] C. De Mol. Financial Signal Processing and Machine Learning, chapter Sparse Markowitz Portfolios, pages 11–22. Wiley Online Library, 2016.
  • [14] V. DeMiguel, L. Garlappi, F. J. Nogales, and R. Uppal. A generalized approach to portfolio optimization: Improving performance by constraining portfolio norms. Management Science, 55(5):798–812, 2009.
  • [15] B. Fastrich, S. Paterlini, and P. Winker. Constructing optimal sparse portfolios using regularization methods. Computational Management Science, 12(3):417–434, 2015.
  • [16] M. Fazel, T. K. Pong, D. Sun, and P. Tseng. Hankel matrix rank minimization with applications to system identification and realization. SIAM Journal on Matrix Analysis and Applications, 34(3):946–977, 2013.
  • [17] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.
  • [18] Z. R. Lai, P. Y. Yang, L. Fang, and X. Wu. Short-term sparse portfolio optimization based on alternating direction method of multipliers. The Journal of Machine Learning Research, 19:1–28, 2018.
  • [19] S. Łojasiewicz. Une propriété topologique des sous-ensembles analytiques réels. Les équations aux dérivées partielles, 117:87–89, 1963.
  • [20] H. Markowitz. Portfolio selection. Journal of Finance, 7(1):77–91, 1952.
  • [21] R. O. Michaud. The Markowitz optimization enigma: Is ‘optimized’ optimal? Financial Analysts Journal, 45(1):31–42, 1989.
  • [22] S. Perrin and T. Roncalli. Machine learning optimization algorithms & portfolio allocation. Machine Learning for Asset Management: New Developments and Financial Applications, pages 261–328, 2020.
  • [23] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis. Springer-Verlag, New York, 1998.