
Optimal Portfolio Using Factor Graphical Lasso*

*The authors would like to thank the editor Fabio Trojani and three anonymous referees for helpful and constructive comments on the paper.

Tae-Hwy Lee (Department of Economics, University of California, Riverside. Email: [email protected]) and Ekaterina Seregina (Department of Economics, Colby College. Email: [email protected])
Abstract

Graphical models are a powerful tool for estimating a high-dimensional inverse covariance (precision) matrix and have been applied to the portfolio allocation problem. These models rely on the assumption that the precision matrix is sparse. However, when stock returns are driven by common factors, this assumption does not hold. We address this limitation and develop a framework, Factor Graphical Lasso (FGL), which integrates graphical models with the factor structure in the context of portfolio allocation by decomposing the precision matrix into low-rank and sparse components. Our theoretical results and simulations show that FGL consistently estimates the portfolio weights and risk exposure, and that FGL is robust to heavy-tailed distributions, which makes our method suitable for financial applications. In an empirical application to the S&P500 constituents, FGL-based portfolios are shown to exhibit superior performance over several prominent competitors, including the equal-weighted and Index portfolios.

Keywords: High-dimensionality, Portfolio optimization, Graphical Lasso, Approximate Factor Model, Sharpe Ratio, Elliptical distributions

JEL Classifications: C13, C55, C58, G11, G17

1 Introduction

Estimating the inverse covariance matrix, or precision matrix, of excess stock returns is crucial for constructing the weights of financial assets in a portfolio and for estimating the out-of-sample Sharpe Ratio. In a high-dimensional setting, when the number of assets, $p$, is greater than or equal to the sample size, $T$, using an estimator of the covariance matrix to obtain portfolio weights leads to unstable investment allocations. This is known as Markowitz's curse: a higher number of assets increases correlation between the investments, which calls for a more diversified portfolio, and yet unstable corner solutions for the weights become more likely. The reason behind this curse is the need to invert a high-dimensional covariance matrix to obtain the optimal weights from the quadratic optimization problem: when $p\geq T$, the condition number of the covariance matrix (i.e., the absolute value of the ratio between the maximal and minimal eigenvalues of the covariance matrix) is high. Hence, inverting the estimated covariance matrix yields an unstable estimator of the precision matrix. To circumvent this issue one can estimate the precision matrix directly, rather than inverting an estimated covariance matrix.

Graphical models were shown to provide consistent estimates of the precision matrix (Friedman et al., (2008); Meinshausen and Bühlmann, (2006); Cai et al., (2011)). Goto and Xu, (2015) estimated a sparse precision matrix for portfolio hedging using graphical models. They found that their portfolio achieves significant out-of-sample risk reduction and higher return compared to portfolios based on equal weights, a shrunk covariance matrix, industry factor models, and no-short-sale constraints. Awoye, (2016) used the Graphical Lasso (Friedman et al., (2008)) to estimate a sparse covariance matrix for the Markowitz mean-variance portfolio problem and reduce the realized portfolio risk. Millington and Niranjan, (2017) conducted an empirical study that applies the Graphical Lasso to the estimation of the covariance matrix for portfolio allocation. Their empirical findings suggest that portfolios using the Graphical Lasso enjoy lower risk and higher returns compared to those using the empirical covariance matrix. Millington and Niranjan, (2017) also construct a financial network using the estimated precision matrix to explore the relationships between companies and show how the constructed network helps to make investment decisions. Callot et al., (2021) use the nodewise-regression method of Meinshausen and Bühlmann, (2006) to establish consistency of the estimated covariance matrix, weights, and risk of a high-dimensional financial portfolio. Their empirical application demonstrates that the precision matrix estimator based on nodewise regression outperforms the principal orthogonal complement thresholding estimator (POET) (Fan et al., (2013)) and linear shrinkage (Ledoit and Wolf, (2004)). Cai et al., (2020) use constrained $\ell_{1}$-minimization for inverse matrix estimation (Clime) of the precision matrix (Cai et al., (2011)) to develop a consistent estimator of the minimum variance of a high-dimensional global minimum-variance portfolio. It is important to note that all the aforementioned methods impose some sparsity assumption on the precision matrix of excess returns.

An alternative strategy for handling the high-dimensional setting uses factor models to acknowledge the common variation in stock prices, which has been documented in many empirical studies (see Campbell et al., (1997) among many others). A common approach decomposes the covariance matrix of excess returns into low-rank and sparse parts; the latter is further regularized since, after the common factors are accounted for, the remaining covariance matrix of the idiosyncratic components is still high-dimensional (Fan et al., (2013, 2011, 2018)). This stream of literature, however, focuses on the estimation of a covariance matrix. The accuracy of precision matrices obtained by inverting a factor-based covariance matrix was investigated by Ait-Sahalia and Xiu, (2017), but they did not study the high-dimensional case. Factor models are generally treated as competitors to graphical models: for example, Callot et al., (2021) find evidence of superior performance of the nodewise-regression estimator of the precision matrix over the factor-based estimator POET (Fan et al., (2013)) in terms of the out-of-sample Sharpe Ratio and risk of a financial portfolio. The root cause of factor models and graphical models being treated separately is the sparsity assumption on the precision matrix made in the latter. Specifically, as pointed out in Koike, (2020), when asset returns have common factors, the precision matrix cannot be sparse because all pairs of assets are partially correlated, conditional on the other assets, through the common factors. One attempt to integrate factor modeling and high-dimensional precision matrix estimation was made by Fan et al., (2018) (Section 5.2): the authors referred to this class of models as “conditional graphical models”. However, this was not the main focus of their paper, which concentrated on covariance estimation through elliptical factor models. As Fan et al., (2018) pointed out, “though substantial amount of efforts have been made to understand the graphical model, little has been done for estimating conditional graphical model, which is more general and realistic”. Concretely, to the best of our knowledge there are no studies that examine the theoretical and empirical performance of graphical models integrated with the factor structure in the context of portfolio allocation.

In this paper we fill this gap and develop a new conditional precision matrix estimator for the excess returns under the approximate factor model that combines the benefits of graphical models and the factor structure. We call our algorithm the Factor Graphical Lasso (FGL). We use a factor model to remove the co-movements induced by the factors, and then apply the Weighted Graphical Lasso to estimate the precision matrix of the idiosyncratic terms. We prove consistency of FGL in the spectral and $\ell_{1}$ matrix norms. In addition, we prove consistency of the estimated portfolio weights and risk exposure for three formulations of the optimal portfolio allocation.

Our empirical application uses daily and monthly data for the constituents of the S&P500: we demonstrate that FGL outperforms the equal-weighted portfolio, the index portfolio, and portfolios based on other estimators of the precision matrix (Clime, Cai et al., (2011)) and the covariance matrix, including POET (Fan et al., (2013)) and shrinkage estimators adjusted to allow for the factor structure (Ledoit and Wolf, (2004); Ledoit and Wolf, (2017)), in terms of the out-of-sample Sharpe Ratio. Furthermore, we find strong empirical evidence that relaxing the constraint that portfolio weights sum up to one leads to a large increase in the out-of-sample Sharpe Ratio, which, to the best of our knowledge, has not been previously well-studied in the empirical finance literature.

From the theoretical perspective, our paper makes several important contributions to the existing literature on graphical models and factor models. First, to the best of our knowledge, there are no equivalent theoretical results that establish consistency of the portfolio weights and risk exposure in a high-dimensional setting without assuming sparsity of the covariance or precision matrix of stock returns. Second, we extend the theoretical results of POET (Fan et al., (2013)) to allow the number of factors to grow with the number of assets. Concretely, we establish uniform consistency for the factors and factor loadings estimated using PCA. Third, we are not aware of any other papers that provide convergence results for estimating a high-dimensional precision matrix using the Weighted Graphical Lasso under the approximate factor model with unobserved factors. Furthermore, all theoretical results established in this paper hold for a wide range of distributions: the sub-Gaussian family (including the Gaussian) and the elliptical family. Our simulations demonstrate that FGL is robust to very heavy-tailed distributions, which makes our method suitable for financial applications. Finally, we demonstrate that, in contrast to POET, the success of the proposed method does not heavily depend on the factor pervasiveness assumption: FGL is robust to scenarios in which the gap between the diverging and bounded eigenvalues decreases.

This paper is organized as follows: Section 2 reviews the basics of the Markowitz mean-variance portfolio theory. Section 3 provides a brief summary of graphical models and introduces the Factor Graphical Lasso. Section 4 contains the theoretical results, and Section 5 validates these results using simulations. Section 6 provides the empirical application. Section 7 concludes.

Notation

For the convenience of the reader, we summarize the notation used throughout the paper. Let $\mathcal{S}_{p}$ denote the set of all $p\times p$ symmetric matrices, and $\mathcal{S}_{p}^{++}$ the set of all $p\times p$ positive definite matrices. For any matrix ${\mathbf{C}}$, its $(i,j)$-th element is denoted $c_{ij}$. Given a vector ${\mathbf{u}}\in\mathbb{R}^{d}$ and a parameter $a\in[1,\infty)$, let $\lVert{\mathbf{u}}\rVert_{a}$ denote its $\ell_{a}$-norm. Given a matrix ${\mathbf{U}}\in\mathcal{S}_{p}$, let $\Lambda_{\max}({\mathbf{U}})\equiv\Lambda_{1}({\mathbf{U}})\geq\Lambda_{2}({\mathbf{U}})\geq\ldots\geq\Lambda_{\min}({\mathbf{U}})\equiv\Lambda_{p}({\mathbf{U}})$ denote the eigenvalues of ${\mathbf{U}}$, and let $\text{eig}_{K}({\mathbf{U}})\in\mathbb{R}^{K\times p}$ denote the first $K\leq p$ normalized eigenvectors corresponding to $\Lambda_{1}({\mathbf{U}}),\ldots,\Lambda_{K}({\mathbf{U}})$. Given parameters $a,b\in[1,\infty)$, let $|||{\mathbf{U}}|||_{a,b}\equiv\max_{\lVert{\mathbf{y}}\rVert_{a}=1}\lVert{\mathbf{U}}{\mathbf{y}}\rVert_{b}$ denote the induced matrix-operator norm. The special cases are $|||{\mathbf{U}}|||_{1}\equiv\max_{1\leq j\leq p}\sum_{i=1}^{p}\lvert u_{i,j}\rvert$ for the $\ell_{1}/\ell_{1}$-operator norm; the operator norm ($\ell_{2}$-matrix norm) $|||{\mathbf{U}}|||_{2}^{2}\equiv\Lambda_{\max}({\mathbf{U}}{\mathbf{U}}^{\prime})$, so that $|||{\mathbf{U}}|||_{2}$ equals the maximal singular value of ${\mathbf{U}}$; and $|||{\mathbf{U}}|||_{\infty}\equiv\max_{1\leq j\leq p}\sum_{i=1}^{p}\lvert u_{j,i}\rvert$ for the $\ell_{\infty}/\ell_{\infty}$-operator norm. Finally, $\lVert{\mathbf{U}}\rVert_{\max}\equiv\max_{i,j}\lvert u_{i,j}\rvert$ denotes the element-wise maximum, and $|||{\mathbf{U}}|||_{F}^{2}\equiv\sum_{i,j}u_{i,j}^{2}$ defines the Frobenius matrix norm.

2 Optimal Portfolio Allocation

Suppose we observe $p$ assets (indexed by $i$) over $T$ periods of time (indexed by $t$). Let $\widetilde{{\mathbf{r}}}_{t}=(\widetilde{r}_{1t},\widetilde{r}_{2t},\ldots,\widetilde{r}_{pt})^{\prime}\sim\mathcal{D}({\mathbf{m}},{\bm{\Sigma}})$ be a $p\times 1$ vector of excess returns drawn from a distribution $\mathcal{D}$, where ${\mathbf{m}}$ and ${\bm{\Sigma}}$ are the unconditional mean and covariance matrix of the returns. The goal of the Markowitz theory is to choose asset weights in a portfolio optimally. We study two optimization problems: the well-known Markowitz weight-constrained (MWC) optimization problem, and the Markowitz risk-constrained (MRC) optimization problem that relaxes the constraint on portfolio weights.

The first optimization problem searches for asset weights such that the portfolio achieves a desired expected rate of return with minimum risk, under the restriction that all weights sum up to one. This can be formulated as the following quadratic optimization problem:

$$\min_{{\mathbf{w}}}\ \frac{1}{2}{\mathbf{w}}^{\prime}{\bm{\Sigma}}{\mathbf{w}},\quad\text{s.t.}\quad{\mathbf{w}}^{\prime}{\bm{\iota}}_{p}=1\ \text{and}\ {\mathbf{m}}^{\prime}{\mathbf{w}}\geq\mu \qquad (2.1)$$

where ${\mathbf{w}}$ is a $p\times 1$ vector of asset weights in the portfolio, ${\bm{\iota}}_{p}$ is a $p\times 1$ vector of ones, and $\mu$ is the desired expected rate of portfolio return. Let ${\bm{\Theta}}\equiv{\bm{\Sigma}}^{-1}$ denote the precision matrix.

If ${\mathbf{m}}^{\prime}{\mathbf{w}}>\mu$, then the solution to (2.1) yields the global minimum-variance (GMV) portfolio weights ${\mathbf{w}}_{GMV}$:

$${\mathbf{w}}_{GMV}=({\bm{\iota}}_{p}^{\prime}{\bm{\Theta}}{\bm{\iota}}_{p})^{-1}{\bm{\Theta}}{\bm{\iota}}_{p}. \qquad (2.2)$$

If ${\mathbf{m}}^{\prime}{\mathbf{w}}=\mu$, the solution to (2.1) is given by the well-known two-fund separation theorem introduced by Tobin, (1958):

$${\mathbf{w}}_{MWC}=(1-a_{1}){\mathbf{w}}_{GMV}+a_{1}{\mathbf{w}}_{M}, \qquad (2.3)$$

where ${\mathbf{w}}_{MWC}$ denotes the portfolio allocation under the constraint that the weights sum up to one, ${\mathbf{w}}_{M}=({\bm{\iota}}_{p}^{\prime}{\bm{\Theta}}{\mathbf{m}})^{-1}{\bm{\Theta}}{\mathbf{m}}$, and $a_{1}=[\mu({\mathbf{m}}^{\prime}{\bm{\Theta}}{\bm{\iota}}_{p})({\bm{\iota}}_{p}^{\prime}{\bm{\Theta}}{\bm{\iota}}_{p})-({\mathbf{m}}^{\prime}{\bm{\Theta}}{\bm{\iota}}_{p})^{2}]/[({\mathbf{m}}^{\prime}{\bm{\Theta}}{\mathbf{m}})({\bm{\iota}}_{p}^{\prime}{\bm{\Theta}}{\bm{\iota}}_{p})-({\mathbf{m}}^{\prime}{\bm{\Theta}}{\bm{\iota}}_{p})^{2}]$.

The MRC problem maximizes the Sharpe Ratio (SR) subject to either a target return or a target risk constraint, while the portfolio weights are not required to sum up to one:

$$\max_{{\mathbf{w}}}\ \frac{{\mathbf{m}}^{\prime}{\mathbf{w}}}{\sqrt{{\mathbf{w}}^{\prime}{\bm{\Sigma}}{\mathbf{w}}}}\quad\text{s.t.}\quad\text{(i)}\ {\mathbf{m}}^{\prime}{\mathbf{w}}\geq\mu\quad\text{or}\quad\text{(ii)}\ {\mathbf{w}}^{\prime}{\bm{\Sigma}}{\mathbf{w}}\leq\sigma^{2}. \qquad (2.4)$$

When $\mu=\sigma\sqrt{{\mathbf{m}}^{\prime}{\bm{\Theta}}{\mathbf{m}}}$, the solution under either constraint is given by

$${\mathbf{w}}_{MRC}=\frac{\sigma}{\sqrt{{\mathbf{m}}^{\prime}{\bm{\Theta}}{\mathbf{m}}}}\,{\bm{\Theta}}{\mathbf{m}}. \qquad (2.5)$$

Equation (2.4) tells us that once an investor specifies the desired return, $\mu$, and the maximum risk-tolerance level, $\sigma$, the MRC weights maximize the Sharpe Ratio of the portfolio.

Therefore, we have three alternative portfolio allocations commonly used in the existing literature: GMV in (2.2), MWC in (2.3), and MRC in (2.5). All three formulations require an estimate of the precision matrix ${\bm{\Theta}}$.
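These three allocations can be computed directly from estimates of ${\bm{\Theta}}$ and ${\mathbf{m}}$. The following is a minimal sketch in Python (ours, not part of the paper; the function names and inputs are assumptions), taking a precision matrix `Theta`, a mean vector `m`, a target return `mu`, and a risk budget `sigma` as given:

```python
import numpy as np

def gmv_weights(Theta):
    """Global minimum-variance weights, eq. (2.2)."""
    iota = np.ones(Theta.shape[0])
    return Theta @ iota / (iota @ Theta @ iota)

def mwc_weights(Theta, m, mu):
    """Weight-constrained mean-variance weights via two-fund separation, eq. (2.3)."""
    iota = np.ones(Theta.shape[0])
    a, b, d = iota @ Theta @ iota, m @ Theta @ iota, m @ Theta @ m
    a1 = (mu * b * a - b**2) / (d * a - b**2)
    return (1 - a1) * gmv_weights(Theta) + a1 * (Theta @ m) / b

def mrc_weights(Theta, m, sigma):
    """Risk-constrained (maximum Sharpe Ratio) weights, eq. (2.5)."""
    return sigma / np.sqrt(m @ Theta @ m) * (Theta @ m)
```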

3 Factor Graphical Lasso

In this section we introduce a framework for estimating the precision matrix for the aforementioned financial portfolios that accounts for the fact that the returns follow an approximate factor structure. We examine how to solve the Markowitz mean-variance portfolio allocation problems using the factor structure in the returns. We also develop the Factor Graphical Lasso algorithm, which uses the estimated common factors to obtain a sparse precision matrix of the idiosyncratic component. The resulting estimator is used to obtain the precision matrix of the asset returns necessary to form portfolio weights.

The arbitrage pricing theory (APT), developed by Ross, (1976), postulates that the expected returns on securities should be related to their covariance with the common components, or factors. The goal of the APT is to model the tendency of asset returns to move together via a factor decomposition. Assume that the return-generating process $\widetilde{{\mathbf{r}}}_{t}$ follows a $K$-factor model:

$$\underbrace{\widetilde{{\mathbf{r}}}_{t}}_{p\times 1}={\mathbf{m}}+{\mathbf{B}}\underbrace{{\mathbf{f}}_{t}}_{K\times 1}+{\bm{\varepsilon}}_{t},\quad t=1,\ldots,T \qquad (3.1)$$

where ${\mathbf{f}}_{t}=(f_{1t},\ldots,f_{Kt})^{\prime}$ is the vector of common factors, ${\mathbf{B}}$ is a $p\times K$ matrix of factor loadings, and ${\bm{\varepsilon}}_{t}$ is the idiosyncratic component that cannot be explained by the common factors. Without loss of generality, we assume throughout the paper that the unconditional means of the factors and the idiosyncratic component are zero. The factors in (3.1) can be either observable, as in Fama and French, (1993, 2015), or estimated using statistical factor models. Unobservable factors and loadings are usually estimated by principal component analysis (PCA), as studied in Bai, (2003), Bai and Ng, (2002), Connor and Korajczyk, (1988), and Stock and Watson, (2002).

In this paper our main interest lies in establishing the asymptotic properties of the estimators of the precision matrix, portfolio weights, and risk exposure in the high-dimensional case. We allow the number of common factors, $K=K_{p,T}$, to grow: $K\rightarrow\infty$ as $p\rightarrow\infty$, or $T\rightarrow\infty$, or both $p,T\rightarrow\infty$, but we require that $\max\{K/p,K/T\}\rightarrow 0$ as $p,T\rightarrow\infty$.

Our setup is similar to the one studied in Fan et al., (2013): we consider a spiked covariance model in which the first $K$ principal eigenvalues of ${\bm{\Sigma}}$ grow with $p$, while the remaining $p-K$ eigenvalues are bounded.

Rewrite equation (3.1) in matrix form:

$$\underbrace{\widetilde{{\mathbf{R}}}}_{p\times T}={\mathbf{m}}{\bm{\iota}}^{\prime}_{T}+\underbrace{{\mathbf{B}}}_{p\times K}{\mathbf{F}}+{\mathbf{E}}, \qquad (3.2)$$

where ${\bm{\iota}}_{T}$ is a $T\times 1$ vector of ones. We further demean the returns using the sample mean, $\widehat{{\mathbf{m}}}$, to obtain ${\mathbf{R}}\equiv\widetilde{{\mathbf{R}}}-\widehat{{\mathbf{m}}}{\bm{\iota}}^{\prime}_{T}$. We assume that $\lVert\widehat{{\mathbf{m}}}-{\mathbf{m}}\rVert_{\max}=\mathcal{O}_{P}(\sqrt{\log p/T})$, which was proven to hold in Chang et al., (2018) (see their Lemma 1).

Let ${\bm{\Sigma}}_{\varepsilon}=T^{-1}{\mathbf{E}}{\mathbf{E}}^{\prime}$ and ${\bm{\Sigma}}_{f}=T^{-1}{\mathbf{F}}{\mathbf{F}}^{\prime}$ be the covariance matrices of the idiosyncratic components and the factors, and let ${\bm{\Theta}}_{\varepsilon}={\bm{\Sigma}}_{\varepsilon}^{-1}$ and ${\bm{\Theta}}_{f}={\bm{\Sigma}}_{f}^{-1}$ be their inverses. The factors and loadings in (3.2) are estimated by solving the following minimization problem: $(\widehat{{\mathbf{B}}},\widehat{{\mathbf{F}}})=\arg\min_{{\mathbf{B}},{\mathbf{F}}}\lVert{\mathbf{R}}-{\mathbf{B}}{\mathbf{F}}\rVert^{2}_{F}$ s.t. $\frac{1}{T}{\mathbf{F}}{\mathbf{F}}^{\prime}={\mathbf{I}}_{K}$ and ${\mathbf{B}}^{\prime}{\mathbf{B}}$ is diagonal. The constraints are needed to identify the factors (Fan et al., (2018)). It was shown in Stock and Watson, (2002) that $\widehat{{\mathbf{F}}}=\sqrt{T}\,\text{eig}_{K}({\mathbf{R}}^{\prime}{\mathbf{R}})$ and $\widehat{{\mathbf{B}}}=T^{-1}{\mathbf{R}}\widehat{{\mathbf{F}}}^{\prime}$. Given $\widehat{{\mathbf{F}}}$ and $\widehat{{\mathbf{B}}}$, define $\widehat{{\mathbf{E}}}={\mathbf{R}}-\widehat{{\mathbf{B}}}\widehat{{\mathbf{F}}}$. Given a sample of the estimated residuals $\{\widehat{{\bm{\varepsilon}}}_{t}={\mathbf{r}}_{t}-\widehat{{\mathbf{B}}}\widehat{{\mathbf{f}}}_{t}\}_{t=1}^{T}$ and the estimated factors $\{\widehat{{\mathbf{f}}}_{t}\}_{t=1}^{T}$, let $\widehat{{\bm{\Sigma}}}_{\varepsilon}=(1/T)\sum_{t=1}^{T}\widehat{{\bm{\varepsilon}}}_{t}\widehat{{\bm{\varepsilon}}}_{t}^{\prime}$ and $\widehat{{\bm{\Sigma}}}_{f}=(1/T)\sum_{t=1}^{T}\widehat{{\mathbf{f}}}_{t}\widehat{{\mathbf{f}}}_{t}^{\prime}$ be the sample counterparts of these covariance matrices. Since our interest is in constructing portfolio weights, our goal is to estimate the precision matrix of the excess returns, ${\bm{\Theta}}$.
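As an illustration of this step, the following sketch (ours, not the authors' code) computes $\widehat{{\mathbf{F}}}$, $\widehat{{\mathbf{B}}}$, the residuals, and the sample covariance matrices from a demeaned $p\times T$ return matrix `R`, treating the number of factors `K` as given:

```python
import numpy as np

def estimate_factor_model(R, K):
    """PCA estimation of a K-factor model from demeaned p x T returns R,
    under the normalization F F' / T = I_K."""
    p, T = R.shape
    eigval, eigvec = np.linalg.eigh(R.T @ R)        # eigen-decomposition of the T x T matrix R'R
    top = np.argsort(eigval)[::-1][:K]              # indices of the K largest eigenvalues
    F_hat = np.sqrt(T) * eigvec[:, top].T           # K x T estimated factors
    B_hat = R @ F_hat.T / T                         # p x K estimated loadings
    E_hat = R - B_hat @ F_hat                       # p x T estimated residuals
    Sigma_eps_hat = E_hat @ E_hat.T / T             # sample covariance of residuals
    Sigma_f_hat = F_hat @ F_hat.T / T               # sample covariance of factors (= I_K here)
    return F_hat, B_hat, E_hat, Sigma_eps_hat, Sigma_f_hat
```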

We impose a sparsity assumption on the precision matrix of the idiosyncratic errors, ${\bm{\Theta}}_{\varepsilon}$, which is obtained using the estimated residuals after removing the co-movements induced by the factors (see Barigozzi et al., (2018); Brownlees et al., (2018); Koike, (2020)).

Let us elaborate on three reasons justifying the assumption of sparsity on the precision matrix of the residuals. First, from a technical viewpoint, this assumption is widely used in high-dimensional settings when $p>T$. Second, a more intuitive rationale for the sparsity assumption on ${\bm{\Theta}}_{\varepsilon}$ stems from its implication for the structure of the corresponding optimal portfolios. Let $r_{t}^{\text{portf}}\equiv\widetilde{{\mathbf{r}}}_{t}^{\prime}{\mathbf{w}}_{t}$ be the optimal portfolio return. Plugging in the definition of $\widetilde{{\mathbf{r}}}_{t}$ from (3.1), we get $r_{t}^{\text{portf}}=({\mathbf{m}}+{\bm{\varepsilon}}_{t})^{\prime}{\mathbf{w}}_{t}+{\mathbf{f}}_{t}^{\prime}{\mathbf{B}}^{\prime}{\mathbf{w}}_{t}$. Hence, after hedging the factor risk, we can isolate the excess return component that loads only on non-factor risk. In this context, since ${\mathbf{w}}_{t}$ is a function of ${\bm{\Theta}}_{\varepsilon}$, imposing sparsity on ${\bm{\Theta}}_{\varepsilon}$ reduces the contribution of the more volatile non-factor risk to the optimal portfolio and thus leads to less sensitive (more robust) investment strategies.

Third, another rationale comes from the relatively high “concentration” of the S&P 500 Composite Index: as evidenced by the S&P Global index methodology and financial data on the S&P 500 constituents by weight, 15 large companies (the top 3%) comprise 30% of the total index weight (starting with Apple, which has the highest weight of nearly 7%). As the number of firms, $p$, increases, a reasonable assumption is that the number of large firms increases at a rate slower than $p$ (Chudik et al., (2011); Gabaix, (2011)). This suggests that one could divide the firms into dominant firms and followers. After the effect of the common factors is accounted for, dominant firms still have significant idiosyncratic movements that influence other firms and must be taken into account when constructing a portfolio. For fringe firms (or market followers), idiosyncratic movements are smaller in magnitude and might be less relevant for portfolio allocation purposes. Hence, the network of the idiosyncratic returns is sparse, and the sparsity increases with $p$. By imposing sparsity, we keep only the relatively large partial correlations among the idiosyncratic components: as illustrated in Supplemental Appendix D.2, in our empirical application the estimated share of zero off-diagonal elements of ${\bm{\Theta}}_{\varepsilon}$ varies over time between 74.5% and 98.8%.

Having established the need for a sparse precision matrix of the errors, we now search for a tool that helps us recover its entries. This brings us to the family of graphical models, which have evolved from the connection between partial correlations and the entries of an adjacency matrix. The adjacency matrix has zero or one in its entries, with a zero entry indicating that the two corresponding variables are conditionally independent given the rest. The adjacency matrix is sometimes referred to as a “graph”. The Graphical Lasso procedure (Friedman et al., (2008)), described in Supplemental Appendix A, is a representative member of the graphical models family: its theoretical and empirical properties have been thoroughly examined in the standard sparse setting (Friedman et al., (2008); Mazumder and Hastie, (2012); Janková and van de Geer, (2018)). One of the goals of our paper is to extend graphical models to non-sparse settings by integrating them with factor modeling. By doing so, graphical models become adequate for applications in economics and finance.

A common way to induce sparsity is to use a Lasso-type penalty. This strategy is used in the Graphical Lasso (GL), together with an objective function based on the Bregman divergence, for estimating the inverse covariance. The discussion of GL is presented in Supplemental Appendix A. We now elaborate on the Bregman divergence class, which unifies many commonly used loss functions, including the quasi-likelihood function. Let ${\mathbf{W}}_{\varepsilon}$ be an estimate of ${\bm{\Sigma}}_{\varepsilon}$. Ravikumar et al., (2011) showed that the Bregman divergence of the form $\text{trace}({\mathbf{W}}_{\varepsilon}{\bm{\Theta}}_{\varepsilon})-\log\det({\bm{\Theta}}_{\varepsilon})$, known as the log-determinant Bregman function, is suitable as a measure of the quality of constructed sparse approximations of signals such as precision matrices. As pointed out by Ravikumar et al., (2011), in principle one could use other Bregman divergences, including the von Neumann entropy or the Frobenius divergence, which would lead to alternative divergence minimizations for estimating the precision matrix. We proceed with the log-determinant Bregman function since (i) it ensures a positive definite estimator of the precision matrix; (ii) the population optimization problem involves only the population covariance and not its inverse; and (iii) the log-determinant divergence gives rise to the likelihood function in the multivariate Gaussian case. At the same time, despite its resemblance to the Gaussian log-likelihood, the Bregman divergence was shown to be applicable to non-Gaussian distributions (Ravikumar et al., (2011)). Let $\widehat{{\mathbf{D}}}_{\varepsilon}^{2}\equiv\textup{diag}({\mathbf{W}}_{\varepsilon})$. To sparsify the entries of the precision matrix of the idiosyncratic errors, ${\bm{\Theta}}_{\varepsilon}$, we minimize the following penalized Bregman divergence with the Weighted Graphical Lasso penalty:

$$\widehat{{\bm{\Theta}}}_{\varepsilon,\lambda}=\arg\min_{{\bm{\Theta}}_{\varepsilon}\in\mathcal{S}_{p}^{++}}\ \text{trace}({\mathbf{W}}_{\varepsilon}{\bm{\Theta}}_{\varepsilon})-\log\det({\bm{\Theta}}_{\varepsilon})+\lambda\sum_{i\neq j}\widehat{d}_{\varepsilon,ii}\widehat{d}_{\varepsilon,jj}\lvert\theta_{\varepsilon,ij}\rvert. \qquad (3.3)$$

The subscript $\lambda$ in $\widehat{{\bm{\Theta}}}_{\varepsilon,\lambda}$ indicates that the solution of the optimization problem in (3.3) depends on the choice of the tuning parameter, which is discussed below. Section 4 establishes the sparsity requirements that guarantee convergence of (3.3). To simplify notation, we will omit the subscript $\lambda$.

The objective function in (3.3) extends the family of linear shrinkage estimators of the first moment to linear shrinkage estimators of the inverse of the second moment. Instead of restricting the number of regressors for estimating a conditional mean, equation (3.3) restricts the number of edges in a graph by shrinking some off-diagonal entries of the precision matrix to zero. Note that the shrinkage occurs adaptively with respect to the partial covariances.
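To illustrate how (3.3) can be computed, the sketch below uses the standard Graphical Lasso solver from scikit-learn as a stand-in for the authors' implementation, taking ${\mathbf{W}}_{\varepsilon}=\widehat{{\bm{\Sigma}}}_{\varepsilon}$. It relies on the fact that the weighted penalty $\lambda\widehat{d}_{\varepsilon,ii}\widehat{d}_{\varepsilon,jj}\lvert\theta_{\varepsilon,ij}\rvert$ is equivalent to applying an unweighted Graphical Lasso to the correlation matrix and rescaling the solution back; the function name and this reformulation are ours:

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def weighted_glasso(Sigma_eps_hat, lam):
    """Weighted Graphical Lasso of eq. (3.3) with penalty lam * d_ii * d_jj * |theta_ij|."""
    d = np.sqrt(np.diag(Sigma_eps_hat))               # d_ii: square roots of the diagonal of W_eps
    corr = Sigma_eps_hat / np.outer(d, d)             # correlation matrix
    _, prec_corr = graphical_lasso(corr, alpha=lam)   # unweighted GL on the correlation scale
    return prec_corr / np.outer(d, d)                 # rescale back: Theta = D^{-1} K D^{-1}
```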

Let us discuss the choice of the tuning parameter $\lambda$ in (3.3). Let $\widehat{{\bm{\Theta}}}_{\varepsilon,\lambda}$ be the solution to (3.3) for a fixed $\lambda$. Following Koike, (2020), we minimize the following Bayesian Information Criterion (BIC) using a grid search:

$$\text{BIC}(\lambda)\equiv T\Big[\text{trace}(\widehat{{\bm{\Theta}}}_{\varepsilon,\lambda}\widehat{{\bm{\Sigma}}}_{\varepsilon})-\log\det(\widehat{{\bm{\Theta}}}_{\varepsilon,\lambda})\Big]+(\log T)\sum_{i\leq j}\mathds{1}\big[\widehat{\theta}_{\varepsilon,\lambda,ij}\neq 0\big]. \qquad (3.4)$$

The grid $\mathcal{G}\equiv\{\lambda_{1},\ldots,\lambda_{M}\}$ is constructed as follows: the maximum value in the grid, $\lambda_{M}$, is set to the smallest value for which all the off-diagonal entries of $\widehat{{\bm{\Theta}}}_{\varepsilon,\lambda_{M}}$ are zero. The smallest value in the grid, $\lambda_{1}\in\mathcal{G}$, is defined as $\lambda_{1}\equiv\vartheta\lambda_{M}$ for a constant $0<\vartheta<1$. The remaining grid values $\lambda_{2},\ldots,\lambda_{M-1}$ are constructed in ascending order from $\lambda_{1}$ to $\lambda_{M}$ on the log scale:

$$\lambda_{i}=\exp\Big(\log(\lambda_{1})+\frac{i-1}{M-1}\log(\lambda_{M}/\lambda_{1})\Big),\quad i=2,\ldots,M-1.$$

We use $\vartheta=\omega_{3T}$, which is defined in Theorem 2 of the next section, and $M=10$ in the simulations and the empirical exercise.
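A sketch of this tuning step is given below (ours, not the authors' code). It takes the grid endpoint $\lambda_{M}$ and a generic fitting routine `fit` returning $\widehat{{\bm{\Theta}}}_{\varepsilon,\lambda}$ for a given $\lambda$ (for instance, the weighted Graphical Lasso sketch above with $\widehat{{\bm{\Sigma}}}_{\varepsilon}$ held fixed), and selects $\lambda$ by minimizing (3.4):

```python
import numpy as np

def bic_select(Sigma_eps_hat, fit, T, lambda_max, vartheta=0.1, M=10):
    """Grid search for lambda via the BIC in eq. (3.4); `fit(lam)` returns Theta_eps_hat."""
    grid = np.exp(np.linspace(np.log(vartheta * lambda_max), np.log(lambda_max), M))
    best = (np.inf, None, None)
    for lam in grid:
        Theta = fit(lam)
        _, logdet = np.linalg.slogdet(Theta)
        bic = T * (np.trace(Theta @ Sigma_eps_hat) - logdet) \
              + np.log(T) * np.count_nonzero(np.triu(Theta))   # nonzero entries with i <= j
        if bic < best[0]:
            best = (bic, lam, Theta)
    return best[1], best[2]    # selected lambda and the corresponding estimate
```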

Having estimated the factors, the factor loadings, and the precision matrix of the idiosyncratic components, we combine them using the Sherman-Morrison-Woodbury formula to estimate the precision matrix of excess returns:

$$\widehat{{\bm{\Theta}}}=\widehat{{\bm{\Theta}}}_{\varepsilon}-\widehat{{\bm{\Theta}}}_{\varepsilon}\widehat{{\mathbf{B}}}\big[\widehat{{\bm{\Theta}}}_{f}+\widehat{{\mathbf{B}}}^{\prime}\widehat{{\bm{\Theta}}}_{\varepsilon}\widehat{{\mathbf{B}}}\big]^{-1}\widehat{{\mathbf{B}}}^{\prime}\widehat{{\bm{\Theta}}}_{\varepsilon}. \qquad (3.5)$$
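A minimal sketch of this combination step (ours; the variable names are assumptions) is:

```python
import numpy as np

def combine_precision(Theta_eps_hat, Theta_f_hat, B_hat):
    """Sherman-Morrison-Woodbury combination of eq. (3.5)."""
    middle = Theta_f_hat + B_hat.T @ Theta_eps_hat @ B_hat        # K x K matrix in brackets
    correction = Theta_eps_hat @ B_hat @ np.linalg.solve(middle, B_hat.T @ Theta_eps_hat)
    return Theta_eps_hat - correction                             # p x p precision of returns
```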

To solve (3.3) we use a procedure based on the GL. However, the original algorithm developed by Friedman et al., (2008) is not suitable under the factor structure. Our procedure, called the Factor Graphical Lasso (FGL) and summarized in Procedure 1, augments the standard GL: it starts by estimating the factors, loadings (the low-rank part), and error terms (the sparse part); it then recovers the sparse precision matrix of the errors using GL; finally, the low-rank and sparse components are combined through the Sherman-Morrison-Woodbury formula in (3.5).

Procedure 1: Factor Graphical Lasso
1: (Factor Model) Estimate $\widehat{{\mathbf{f}}}_{t}$ and $\widehat{{\mathbf{b}}}_{i}$ (Theorem 1). Get $\widehat{{\bm{\varepsilon}}}_{t}={\mathbf{r}}_{t}-\widehat{{\mathbf{B}}}\widehat{{\mathbf{f}}}_{t}$, $\widehat{{\bm{\Sigma}}}_{\varepsilon}$, $\widehat{{\bm{\Sigma}}}_{f}$ and $\widehat{{\bm{\Theta}}}_{f}=\widehat{{\bm{\Sigma}}}_{f}^{-1}$.
2: (GL) Use the GL of Friedman et al., (2008) (see Supplemental Appendix A for more details) to get $\widehat{{\bm{\Theta}}}_{\varepsilon}$. (Theorem 2)
3: (FGL) Use $\widehat{{\bm{\Theta}}}_{\varepsilon}$, $\widehat{{\bm{\Theta}}}_{f}$ and $\widehat{{\mathbf{b}}}_{i}$ from Steps 1-2 to get $\widehat{{\bm{\Theta}}}$ in Equation (3.5). (Theorem 3)
4: Use $\widehat{{\bm{\Theta}}}$ to get $\widehat{{\mathbf{w}}}_{\xi}$, $\xi\in\{\text{GMV, MWC, MRC}\}$. (Theorem 4)
5: Use $\widehat{{\bm{\Sigma}}}=\widehat{{\bm{\Theta}}}^{-1}$ and $\widehat{{\mathbf{w}}}_{\xi}$ to get the portfolio risk exposure $\widehat{{\mathbf{w}}}_{\xi}^{\prime}\widehat{{\bm{\Sigma}}}\widehat{{\mathbf{w}}}_{\xi}$. (Theorem 5)

The estimator produced by GL in general, and by FGL in particular, is guaranteed to be positive definite; we have verified this in the simulations (Section 5) and in the empirical application (Section 6). In Section 4, consistency is established for the estimators of the factors and loadings (Theorem 1), the precision matrix of ${\bm{\varepsilon}}$ (Theorem 2), the precision matrix ${\bm{\Theta}}$ (Theorem 3), the portfolio weights (Theorem 4), and the portfolio risk exposure (Theorem 5). We can then use $\widehat{{\bm{\Theta}}}$ obtained from (3.5) to estimate the portfolio weights in (2.2), (2.3) and (2.5), as in Step 4 of Procedure 1.

4 Asymptotic Properties

In this section we first provide a brief review of the terminology used in the literature on graphical models and of the approaches to estimating a precision matrix. After that, we establish consistency of the Factor Graphical Lasso in Procedure 1. We also study consistency of the estimators of the weights in (2.2), (2.3) and (2.5) and the implications for the out-of-sample Sharpe Ratio. Throughout the main text we assume that the errors and factors have exponential-type tails (Assumption (A.3)(c)). Supplemental Appendix B.10 proves that the conclusions of all theorems in Section 4 continue to hold when this assumption is relaxed.

The review of Gaussian graphical models is based on Hastie et al., (2001) and Bishop, (2006). A graph consists of a set of vertices (nodes) and a set of edges (arcs) that join some pairs of the vertices. In graphical models, each vertex represents a random variable, and the graph visualizes the joint distribution of the entire set of random variables. The edges in a graph are parameterized by potentials (values) that encode the strength of the conditional dependence between the random variables at the corresponding vertices. Sparse graphs have a relatively small number of edges. Among the main challenges in working with graphical models are choosing the structure of the graph (model selection) and estimating the edge parameters from the data.

Let $A\in\mathcal{S}_{p}$. Define the following quantities for $j=1,\ldots,p$:

$$D_{j}(A)\equiv\{i:A_{ij}\neq 0,\ i\neq j\},\qquad d_{j}(A)\equiv\text{card}(D_{j}(A)),\qquad d(A)\equiv\max_{j=1,\ldots,p}d_{j}(A), \qquad (4.1)$$

where $d_{j}(A)$ is the number of edges adjacent to vertex $j$ (i.e., the degree of vertex $j$), and $d(A)$ measures the maximum vertex degree. Define $S(A)\equiv\bigcup_{j=1}^{p}D_{j}(A)$ to be the overall off-diagonal sparsity pattern, and $s(A)\equiv\sum_{j=1}^{p}d_{j}(A)$ to be the overall number of edges contained in the graph. Note that $\text{card}(S(A))\leq s(A)$: when $s(A)=p(p-1)/2$ this would give a fully connected graph.
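For concreteness, the sparsity measures in (4.1) can be read off the support of a symmetric matrix as in the following sketch (ours; the tolerance argument is our addition for numerical zeros):

```python
import numpy as np

def graph_degrees(A, tol=0.0):
    """Vertex degrees d_j(A), maximum degree d(A), and total degree s(A) of eq. (4.1)."""
    off_diag = (np.abs(A) > tol) & ~np.eye(A.shape[0], dtype=bool)
    d_j = off_diag.sum(axis=0)        # degree of each vertex j
    return d_j, d_j.max(), d_j.sum()  # d_j(A), d(A), s(A)
```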

4.1 Assumptions

We now list the assumptions on the model (3.1):

(A.1) (Spiked covariance model) As $p\rightarrow\infty$, $\Lambda_{1}({\bm{\Sigma}})>\Lambda_{2}({\bm{\Sigma}})>\ldots>\Lambda_{K}({\bm{\Sigma}})\gg\Lambda_{K+1}({\bm{\Sigma}})\geq\ldots\geq\Lambda_{p}({\bm{\Sigma}})\geq 0$, where $\Lambda_{j}({\bm{\Sigma}})=\mathcal{O}(p)$ for $j\leq K$, while the non-spiked eigenvalues are bounded, that is, $c_{0}\leq\Lambda_{j}({\bm{\Sigma}})\leq C_{0}$ for $j>K$, for constants $c_{0},C_{0}>0$.

(A.2) (Pervasive factors) There exists a positive definite $K\times K$ matrix $\breve{{\mathbf{B}}}$ such that $|||p^{-1}{\mathbf{B}}^{\prime}{\mathbf{B}}-\breve{{\mathbf{B}}}|||_{2}\rightarrow 0$ and $\Lambda_{\min}(\breve{{\mathbf{B}}})^{-1}=\mathcal{O}(1)$ as $p\rightarrow\infty$.

(A.3)

(a) $\{{\bm{\varepsilon}}_{t},{\mathbf{f}}_{t}\}_{t\geq 1}$ is strictly stationary. Also, $\mathbb{E}[\varepsilon_{it}]=\mathbb{E}[\varepsilon_{it}f_{jt}]=0$ for all $i\leq p$, $j\leq K$ and $t\leq T$.

(b) There are constants $c_{1},c_{2}>0$ such that $\Lambda_{\min}({\bm{\Sigma}}_{\varepsilon})>c_{1}$, $|||{\bm{\Sigma}}_{\varepsilon}|||_{1}<c_{2}$ and $\min_{i\leq p,j\leq p}\text{var}(\varepsilon_{it}\varepsilon_{jt})>c_{1}$.

(c) There are $r_{1},r_{2}>0$ and $b_{1},b_{2}>0$ such that for any $s>0$, $i\leq p$ and $j\leq K$,

$$\Pr(\lvert\varepsilon_{it}\rvert>s)\leq\exp\{-(s/b_{1})^{r_{1}}\},\qquad \Pr(\lvert f_{jt}\rvert>s)\leq\exp\{-(s/b_{2})^{r_{2}}\}.$$

We also impose a strong mixing condition. Let $\mathcal{F}_{-\infty}^{0}$ and $\mathcal{F}_{T}^{\infty}$ denote the $\sigma$-algebras generated by $\{({\mathbf{f}}_{t},{\bm{\varepsilon}}_{t}):t\leq 0\}$ and $\{({\mathbf{f}}_{t},{\bm{\varepsilon}}_{t}):t\geq T\}$, respectively. Define the mixing coefficient

$$\alpha(T)=\sup_{A\in\mathcal{F}_{-\infty}^{0},\,B\in\mathcal{F}_{T}^{\infty}}\lvert\Pr(A)\Pr(B)-\Pr(AB)\rvert. \qquad (4.2)$$
(A.4) (Strong mixing) There exists $r_{3}>0$ such that $3r_{1}^{-1}+1.5r_{2}^{-1}+3r_{3}^{-1}>1$, and there exists $C>0$ satisfying, for all $T\in\mathbb{Z}^{+}$, $\alpha(T)\leq\exp(-CT^{r_{3}})$.

(A.5) (Regularity conditions) There exists $M>0$ such that, for all $i\leq p$, $t\leq T$ and $s\leq T$:

(a) $\lVert{\mathbf{b}}_{i}\rVert_{\max}<M$;

(b) $\mathbb{E}\big[p^{-1/2}\{{\bm{\varepsilon}}^{\prime}_{s}{\bm{\varepsilon}}_{t}-\mathbb{E}[{\bm{\varepsilon}}^{\prime}_{s}{\bm{\varepsilon}}_{t}]\}\big]^{4}<M$; and

(c) $\mathbb{E}\big[\lVert p^{-1/2}\sum_{i=1}^{p}{\mathbf{b}}_{i}\varepsilon_{it}\rVert^{4}\big]<K^{2}M$.

Some comments regarding the aforementioned assumptions are in order. Assumptions (A.1)-(A.4) are essentially the same as in Fan et al., (2013), and Assumption (A.5) is modified to account for the increasing number of factors. Assumption (A.1) divides the eigenvalues into diverging and bounded ones. Without loss of generality, we assume that the $K$ largest eigenvalues each have multiplicity 1. The assumption of a spiked covariance model is common in the literature on approximate factor models. However, we note that the model studied in this paper can be characterized as a “very spiked model”: the gap between the first $K$ eigenvalues and the rest increases with $p$. As pointed out by Fan et al., (2018), (A.1) is typically satisfied by factor models with pervasive factors, which brings us to Assumption (A.2): the factors impact a non-vanishing proportion of the individual time series. Supplemental Appendix C.4 explores the sensitivity of portfolios constructed using FGL when the pervasiveness assumption is relaxed, that is, when the gap between the diverging and bounded eigenvalues decreases. Assumption (A.3)(a) is slightly stronger than in Bai, (2003), since it requires strict stationarity and non-correlation between $\{{\bm{\varepsilon}}_{t}\}$ and $\{{\mathbf{f}}_{t}\}$ to simplify technical calculations. In (A.3)(b) we require $|||{\bm{\Sigma}}_{\varepsilon}|||_{1}<c_{2}$ instead of $\Lambda_{\max}({\bm{\Sigma}}_{\varepsilon})=\mathcal{O}(1)$ in order to estimate $K$ consistently. When $K$ is known, as in Koike, (2020) and Fan et al., (2011), this condition can be relaxed. (A.3)(c) requires exponential-type tails, which allows us to apply large deviation theory to $(1/T)\sum_{t=1}^{T}\varepsilon_{it}\varepsilon_{jt}-\sigma_{\varepsilon,ij}$ and $(1/T)\sum_{t=1}^{T}f_{jt}\varepsilon_{it}$. However, in Supplemental Appendix B.10 we discuss the extension of our results to the elliptical distribution family, which is more appropriate for financial applications. Specifically, we discuss the appropriate modifications to the initial estimator of the covariance matrix of returns such that the bounds derived in this paper continue to hold. Assumptions (A.4)-(A.5) are technical conditions needed to consistently estimate the common factors and loadings. The conditions in (A.5)(a-b) are weaker than those in Bai, (2003), since our goal is to estimate a precision matrix, and (A.5)(c) differs from Bai, (2003) and Bai and Ng, (2006) in that the number of factors is allowed to grow slowly with $p$.

In addition, the following structural assumption on the population quantities is imposed:

(B.1) $\lVert{\bm{\Sigma}}\rVert_{\max}=\mathcal{O}(1)$, $\lVert{\mathbf{B}}\rVert_{\max}=\mathcal{O}(1)$, and $\lVert{\mathbf{m}}\rVert_{\infty}=\mathcal{O}(1)$.

The sparsity of ${\bm{\Theta}}_{\varepsilon}$ is controlled by deterministic sequences $s_{T}$ and $d_{T}$: $s({\bm{\Theta}}_{\varepsilon})=\mathcal{O}_{P}(s_{T})$ for some sequence $s_{T}\in(0,\infty)$, $T=1,2,\ldots$, and $d({\bm{\Theta}}_{\varepsilon})=\mathcal{O}_{P}(d_{T})$ for some sequence $d_{T}\in(0,\infty)$, $T=1,2,\ldots$. We will impose restrictions on the growth rates of $s_{T}$ and $d_{T}$. Note that the assumptions on $d_{T}$ are weaker, since they are always satisfied when $s_{T}=d_{T}$; however, $d_{T}$ can generally be smaller than $s_{T}$. In contrast to Fan et al., (2013), we do not impose sparsity on the covariance matrix of the idiosyncratic component. Instead, it is more realistic and relevant for error quantification in portfolio analysis to impose conditional sparsity on the precision matrix after the common factors are accounted for.

4.2 The FGL Procedure

Recall the definition of the Weighted Graphical Lasso estimator in (3.3) for the precision matrix of the idiosyncratic components, and recall that to estimate ${\bm{\Theta}}$ we use equation (3.5). Therefore, in order to obtain the FGL estimator $\widehat{{\bm{\Theta}}}$ we take the following steps: (1) estimate the unknown factors and factor loadings to get an estimator of ${\bm{\Sigma}}_{\varepsilon}$; (2) use $\widehat{{\bm{\Sigma}}}_{\varepsilon}$ to get the estimator of ${\bm{\Theta}}_{\varepsilon}$ in (3.3); (3) use $\widehat{{\bm{\Theta}}}_{\varepsilon}$ together with the estimators of the factors and factor loadings from Step 1 to obtain the final precision matrix estimator $\widehat{{\bm{\Theta}}}$, the portfolio weight estimator $\widehat{{\mathbf{w}}}_{\xi}$, and the risk exposure estimator $\widehat{\Phi}_{\xi}=\widehat{{\mathbf{w}}}^{\prime}_{\xi}\widehat{{\bm{\Theta}}}^{-1}\widehat{{\mathbf{w}}}_{\xi}$, where $\xi\in\{\text{GMV, MWC, MRC}\}$.

Subsection 4.3 examines the theoretical foundations of the first step, and Subsections 4.4-4.5 are devoted to Steps 2 and 3.

4.3 Convergence in Estimation of Factors and Loadings

As pointed out in Bai, (2003) and Fan et al., (2013), the $K\times 1$-dimensional factor loadings $\{{\mathbf{b}}_{i}\}_{i=1}^{p}$, which are the rows of the factor loadings matrix ${\mathbf{B}}$, and the $K\times 1$-dimensional common factors $\{{\mathbf{f}}_{t}\}_{t=1}^{T}$, which are the columns of ${\mathbf{F}}$, are not separately identifiable. Concretely, for any $K\times K$ matrix ${\mathbf{H}}$ such that ${\mathbf{H}}^{\prime}{\mathbf{H}}={\mathbf{I}}_{K}$, we have ${\mathbf{B}}{\mathbf{f}}_{t}={\mathbf{B}}{\mathbf{H}}^{\prime}{\mathbf{H}}{\mathbf{f}}_{t}$; therefore, we cannot identify the tuple $({\mathbf{B}},{\mathbf{f}}_{t})$ from $({\mathbf{B}}{\mathbf{H}}^{\prime},{\mathbf{H}}{\mathbf{f}}_{t})$. Let $\widehat{K}\in\{1,\ldots,K_{\max}\}$ denote the estimated number of factors, where $K_{\max}$ is allowed to increase at a slower speed than $\min\{p,T\}$, such that $K_{\max}=o(\min\{p^{1/3},T\})$ (see Li et al., (2017) for a discussion of this rate).

Define ${\mathbf{V}}$ to be the $\widehat{K}\times\widehat{K}$ diagonal matrix of the first $\widehat{K}$ largest eigenvalues of the sample covariance matrix, in decreasing order. Further, define the $\widehat{K}\times\widehat{K}$ matrix ${\mathbf{H}}=(1/T){\mathbf{V}}^{-1}\widehat{{\mathbf{F}}}^{\prime}{\mathbf{F}}{\mathbf{B}}^{\prime}{\mathbf{B}}$. For $t\leq T$, ${\mathbf{H}}{\mathbf{f}}_{t}=T^{-1}{\mathbf{V}}^{-1}\widehat{{\mathbf{F}}}^{\prime}({\mathbf{B}}{\mathbf{f}}_{1},\ldots,{\mathbf{B}}{\mathbf{f}}_{T})^{\prime}{\mathbf{B}}{\mathbf{f}}_{t}$, which depends only on the data ${\mathbf{V}}^{-1}\widehat{{\mathbf{F}}}^{\prime}$ and an identifiable part of the parameters $\{{\mathbf{B}}{\mathbf{f}}_{t}\}_{t=1}^{T}$. Hence, ${\mathbf{H}}{\mathbf{f}}_{t}$ does not have an identifiability problem, regardless of the imposed identifiability condition.

Let $\gamma^{-1}=3r_{1}^{-1}+1.5r_{2}^{-1}+r_{3}^{-1}+1$. The following theorem is an extension of the results in Fan et al., (2013) to the case when the number of factors is unknown and allowed to grow. Proofs of all theorems are given in Supplemental Appendix B.

Theorem 1.

Suppose that $K_{\max}=o(\min\{p^{1/3},T\})$, $K^{3}\log p=o(T^{\gamma/6})$, $KT=o(p^{2})$, and Assumptions (A.1)-(A.5) and (B.1) hold. Let $\omega_{1T}\equiv K^{3/2}\sqrt{\log p/T}+K/\sqrt{p}$ and $\omega_{2T}\equiv K/\sqrt{T}+KT^{1/4}/\sqrt{p}$. Then $\max_{i\leq p}\lVert\widehat{{\mathbf{b}}}_{i}-{\mathbf{H}}{\mathbf{b}}_{i}\rVert=\mathcal{O}_{P}(\omega_{1T})$ and $\max_{t\leq T}\lVert\widehat{{\mathbf{f}}}_{t}-{\mathbf{H}}{\mathbf{f}}_{t}\rVert=\mathcal{O}_{P}(\omega_{2T})$.

The conditions $K^{3}\log p=o(T^{\gamma/6})$ and $KT=o(p^{2})$ are similar to those in Fan et al., (2013); the difference arises from the fact that we do not fix $K$, and hence, in addition to the factor loadings, there are $KT$ factors to estimate. Therefore, the number of parameters introduced by the unknown growing factors should not be “too large”, so that we can consistently estimate them uniformly. The growth rate of the number of factors is controlled by $K_{\max}=o(\min\{p^{1/3},T\})$.

The bounds derived in Theorem 1 help us establish the convergence properties of the estimated idiosyncratic covariance matrix, $\widehat{{\bm{\Sigma}}}_{\varepsilon}$, and the precision matrix $\widehat{{\bm{\Theta}}}_{\varepsilon}$, which are presented in the next theorem:

Theorem 2.

Let $\omega_{3T}\equiv K^{2}\sqrt{\log p/T}+K^{3}/\sqrt{p}$. Under the assumptions of Theorem 1 and with $\lambda\asymp\omega_{3T}$ (where $\lambda$ is the tuning parameter in (3.3)), the estimator $\widehat{{\bm{\Sigma}}}_{\varepsilon}$ obtained by estimating the factor model in (3.2) satisfies $\lVert\widehat{{\bm{\Sigma}}}_{\varepsilon}-{\bm{\Sigma}}_{\varepsilon}\rVert_{\max}=\mathcal{O}_{P}(\omega_{3T})$. Let $\varrho_{T}$ be a sequence of positive-valued random variables such that $\varrho_{T}^{-1}\omega_{3T}\xrightarrow{p}0$. If $s_{T}\varrho_{T}\xrightarrow{p}0$, then $|||\widehat{{\bm{\Theta}}}_{\varepsilon}-{\bm{\Theta}}_{\varepsilon}|||_{l}=\mathcal{O}_{P}(\varrho_{T}s_{T})$ as $T\rightarrow\infty$, for any $l\in[1,\infty]$.

Note that the term containing $K^{3}/\sqrt{p}$ arises due to the need to estimate the unknown factors. Fan et al., (2011) obtained a similar rate for the case when the factors are observable (in their work, $\omega_{3T}=K^{1/2}\sqrt{\log p/T}$). The second part of Theorem 2 is based on the relationship between the convergence rates of the estimated covariance and precision matrices established in Janková and van de Geer, (2018) (Theorem 14.1.3). Koike, (2020) obtained the convergence rate when the factors are observable: the rate obtained in our paper is slower because the factors need to be estimated (concretely, under observable factors the rate would satisfy $\varrho_{T}^{-1}\sqrt{K\log p/T}\xrightarrow{p}0$). We now comment on the optimality of the rate in Theorem 2: as pointed out in Koike, (2020), in the standard Gaussian setting without factor structure, the minimax optimal rate is $d({\bm{\Theta}}_{\varepsilon})\sqrt{\log p/T}$, which can be faster than the rate obtained in Theorem 2 if $d({\bm{\Theta}}_{\varepsilon})<s_{T}$. Using penalized nodewise regression could help achieve this faster rate. However, our empirical application to monthly stock returns demonstrated superior performance of the Weighted Graphical Lasso compared to the nodewise regression in terms of the out-of-sample Sharpe Ratio and portfolio risk. Hence, in order not to divert the focus of this paper, we leave the theoretical properties of the nodewise regression for future research.

4.4 Convergence in Estimation of Precision Matrix and Portfolio Weights

Having established the convergence properties of $\widehat{{\bm{\Sigma}}}_{\varepsilon}$ and $\widehat{{\bm{\Theta}}}_{\varepsilon}$, we now move to the estimation of the precision matrix of the factor-adjusted returns in equation (3.5).

Theorem 3.

Under the assumptions of Theorem 2, if $d_{T}s_{T}\varrho_{T}\xrightarrow{p}0$, then $|||\widehat{{\bm{\Theta}}}-{\bm{\Theta}}|||_{2}=\mathcal{O}_{P}(\varrho_{T}s_{T})$ and $|||\widehat{{\bm{\Theta}}}-{\bm{\Theta}}|||_{1}=\mathcal{O}_{P}(\varrho_{T}d_{T}K^{3/2}s_{T})$.

Note that since, by construction, the precision matrix obtained using the Factor Graphical Lasso is symmetric, the corresponding bound on $|||\widehat{{\bm{\Theta}}}-{\bm{\Theta}}|||_{\infty}$ follows trivially from the above theorem.

Using Theorem 3, we can then establish the consistency of the estimated weights of portfolios based on the Factor Graphical Lasso.

Theorem 4.

Under the assumptions of Theorem 3, assume additionally that $|||{\bm{\Theta}}|||_{2}=\mathcal{O}(1)$ (this additional requirement essentially imposes $\Lambda_{p}({\bm{\Sigma}})>0$ in (A.1)) and $\varrho_{T}d_{T}^{2}s_{T}=o(1)$. Then Procedure 1 consistently estimates the portfolio weights in (2.2), (2.3) and (2.5):

$$\lVert\widehat{{\mathbf{w}}}_{\text{GMV}}-{\mathbf{w}}_{\text{GMV}}\rVert_{1}=\mathcal{O}_{P}\big(\varrho_{T}d_{T}^{2}K^{3}s_{T}\big)=o_{P}(1),\qquad \lVert\widehat{{\mathbf{w}}}_{\text{MWC}}-{\mathbf{w}}_{\text{MWC}}\rVert_{1}=\mathcal{O}_{P}\big(\varrho_{T}d_{T}^{2}K^{3}s_{T}\big)=o_{P}(1),$$
$$\lVert\widehat{{\mathbf{w}}}_{\text{MRC}}-{\mathbf{w}}_{\text{MRC}}\rVert_{1}=\mathcal{O}_{P}\big(d_{T}^{3/2}K^{3}\,[\varrho_{T}s_{T}]^{1/2}\big)=o_{P}(1).$$

We now comment on the rates in Theorem 4. First, the rates obtained by Callot et al., (2021) for the GMV and MWC formulations, when no factor structure of stock returns is assumed, require $s({\bm{\Theta}})^{3/2}\sqrt{\log p/T}=o_{P}(1)$, where the authors imposed sparsity on the precision matrix of stock returns, ${\bm{\Theta}}$. Therefore, if the precision matrix of stock returns is not sparse, portfolio weights can be consistently estimated only if $p$ is less than $T^{1/3}$ (since $(p-1)^{3/2}\sqrt{\log p/T}=o(1)$ is required to ensure consistent estimation of portfolio weights). Our result in Theorem 4 improves upon this rate and shows that as long as $d_{T}^{2}s_{T}K^{3}\sqrt{\log p/T}=o_{P}(1)$ we can consistently estimate the weights of the financial portfolio. Specifically, when the precision matrix of the factor-adjusted returns is sparse, we can consistently estimate portfolio weights even when $p>T$, without assuming sparsity of ${\bm{\Sigma}}$ or ${\bm{\Theta}}$. Second, note that the GMV and MWC weights converge slightly slower than the MRC weights. This result is further supported by our simulations presented in the next section.

4.5 Implications on Portfolio Risk Exposure

Having examined the properties of the portfolio weights, it is natural to comment on the estimation error of the portfolio variance, which is determined by the errors in two components: the estimated covariance matrix and the estimated portfolio weights. Define $a={\bm{\iota}}^{\prime}_{p}{\bm{\Theta}}{\bm{\iota}}_{p}/p$, $b={\bm{\iota}}^{\prime}_{p}{\bm{\Theta}}{\mathbf{m}}/p$, $d={\mathbf{m}}^{\prime}{\bm{\Theta}}{\mathbf{m}}/p$, $g=\sqrt{{\mathbf{m}}^{\prime}{\bm{\Theta}}{\mathbf{m}}}/p$, and $\widehat{a}={\bm{\iota}}^{\prime}_{p}\widehat{{\bm{\Theta}}}{\bm{\iota}}_{p}/p$, $\widehat{b}={\bm{\iota}}^{\prime}_{p}\widehat{{\bm{\Theta}}}\widehat{{\mathbf{m}}}/p$, $\widehat{d}=\widehat{{\mathbf{m}}}^{\prime}\widehat{{\bm{\Theta}}}\widehat{{\mathbf{m}}}/p$, $\widehat{g}=\sqrt{\widehat{{\mathbf{m}}}^{\prime}\widehat{{\bm{\Theta}}}\widehat{{\mathbf{m}}}}/p$. Define $\Phi_{\text{GMV}}={\mathbf{w}}_{GMV}^{\prime}{\bm{\Sigma}}{\mathbf{w}}_{GMV}=(pa)^{-1}$ to be the global minimum variance, $\Phi_{\text{MWC}}={\mathbf{w}}_{MWC}^{\prime}{\bm{\Sigma}}{\mathbf{w}}_{MWC}=p^{-1}\big[\frac{a\mu^{2}-2b\mu+d}{ad-b^{2}}\big]$ the MWC portfolio variance, and $\Phi_{\text{MRC}}={\mathbf{w}}_{MRC}^{\prime}{\bm{\Sigma}}{\mathbf{w}}_{MRC}=\sigma^{2}(pg)$ the MRC portfolio variance. We use the terms variance and risk exposure interchangeably. Let $\widehat{\Phi}_{\text{GMV}}$, $\widehat{\Phi}_{\text{MWC}}$, and $\widehat{\Phi}_{\text{MRC}}$ be the sample counterparts of the respective portfolio variances. The expressions for $\Phi_{\text{GMV}}$ and $\Phi_{\text{MWC}}$ were derived in Fan et al., (2008) and Callot et al., (2021). Theorem 5 establishes the consistency of a large portfolio's variance estimator.

Theorem 5.

Under the assumptions of Theorem 3, FGL consistently estimates the GMV, MWC, and MRC portfolio variances:

$$\lvert\widehat{\Phi}_{\text{GMV}}/\Phi_{\text{GMV}}-1\rvert=\mathcal{O}_{P}(\varrho_{T}d_{T}s_{T}K^{3/2})=o_{P}(1),\qquad \lvert\widehat{\Phi}_{\text{MWC}}/\Phi_{\text{MWC}}-1\rvert=\mathcal{O}_{P}(\varrho_{T}d_{T}s_{T}K^{3/2})=o_{P}(1),$$
$$\lvert\widehat{\Phi}_{\text{MRC}}/\Phi_{\text{MRC}}-1\rvert=\mathcal{O}_{P}\big([\varrho_{T}d_{T}s_{T}K^{3/2}]^{1/2}\big)=o_{P}(1).$$

Callot et al., (2021) derived a similar result for $\Phi_{\text{GMV}}$ and $\Phi_{\text{MWC}}$ under the assumption that the precision matrix of stock returns is sparse. Also, Ding et al., (2021) derived bounds for $\Phi_{\text{GMV}}$ under the factor structure, assuming a sparse covariance matrix of the idiosyncratic components and a gross-exposure constraint on portfolio weights that limits negative positions.

The empirical application in Section 6 reveals that portfolios constructed using the MRC formulation have higher risk than the GMV and MWC alternatives: using monthly and daily returns of the constituents of the S&P500 index, MRC portfolios exhibit higher out-of-sample risk and return compared to the alternative formulations. Furthermore, the empirical exercise demonstrates that for the monthly data the higher return of the MRC portfolios outweighs their higher risk, as evidenced by the increased out-of-sample Sharpe Ratio.

5 Monte Carlo

In order to validate our theoretical results, we perform several simulation studies, which are divided into four parts. The first set of results computes the empirical convergence rates and compares them with the theoretical rates derived in Theorems 3-5. The second set of results compares the performance of FGL with several alternative models for estimating the covariance and precision matrices. To highlight the benefit of using the information about the factor structure, as opposed to standard graphical models, we include the Graphical Lasso of Friedman et al., (2008) (GL), which does not account for the factor structure. To explore the benefits of using FGL for error quantification in (3.5), we consider several alternative estimators of the covariance/precision matrix of the idiosyncratic component in (3.5): (1) the linear shrinkage estimator of covariance developed by Ledoit and Wolf, (2004), further referred to as Factor LW or FLW; (2) the nonlinear shrinkage estimator of covariance by Ledoit and Wolf, (2017) (Factor NLW or FNLW); (3) POET (Fan et al., (2013)); (4) the constrained $\ell_{1}$-minimization for inverse matrix estimator, Clime (Cai et al., (2011)) (Factor Clime or FClime). Furthermore, we discovered that in certain setups the covariance estimator produced by POET is not positive definite. In such cases we use the matrix symmetrization procedure as in Fan et al., (2018) and then apply eigenvalue cleaning as in Callot et al., (2017) and Hautsch et al., (2012). This estimator is referred to as Projected POET; it coincides with POET when the covariance estimator produced by the latter is positive definite. The third set of results examines the performance of FGL and Robust FGL (described in Supplemental Appendix B.10) when the dependent variable follows an elliptical distribution. The fourth set of results explores the sensitivity of portfolios constructed using different covariance and precision estimators of interest when the pervasiveness assumption 1 is relaxed, that is, when the gap between the diverging and bounded eigenvalues decreases. All exercises in this section use 100 Monte Carlo simulations.

We consider the following setup: let $p=T^{\delta}$, $\delta=0.85$, $K=2(\log T)^{0.5}$ and $T=[2^{h}]$ for $h=7,7.5,8,\ldots,9.5$. A sparse precision matrix of the idiosyncratic components is constructed as follows: we first generate the adjacency matrix using a random graph structure. Define a $p\times p$ adjacency matrix ${\mathbf{A}}_{\varepsilon}$, which is used to represent the structure of the graph:

$$a_{\varepsilon,ij}=\begin{cases}1,&\text{for}\ i\neq j\ \text{with probability}\ q,\\ 0,&\text{otherwise,}\end{cases} \qquad (5.1)$$

where $a_{\varepsilon,ij}$ denotes the $(i,j)$-th element of ${\mathbf{A}}_{\varepsilon}$ and we set $a_{\varepsilon,ij}=a_{\varepsilon,ji}$. This structure results in $s_{T}=p(p-1)q/2$ edges in the graph. To control sparsity, we set $q=1/(pT^{0.8})$, which makes $s_{T}=\mathcal{O}(T^{0.05})$. The adjacency matrix has all diagonal elements equal to zero. Hence, to obtain a positive definite precision matrix we apply the procedure described in Zhao et al., (2012): using their notation, ${\bm{\Theta}}_{\varepsilon}={\mathbf{A}}_{\varepsilon}\cdot v+{\mathbf{I}}(\left\lvert\tau\right\rvert+0.1+u)$, where $u>0$ is a positive constant added to the diagonal of the precision matrix, $v$ controls the magnitude of the partial correlations (jointly with $u$), and $\tau$ is the smallest eigenvalue of ${\mathbf{A}}_{\varepsilon}\cdot v$. In our simulations we use $u=0.1$ and $v=0.3$.
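A minimal sketch of this construction is given below; it follows the recipe described above (Zhao et al., (2012)), with $q=1/(pT^{0.8})$ supplied by the caller, and all names are ours.

```python
import numpy as np

def random_graph_precision(p, q, u=0.1, v=0.3, seed=0):
    """Sparse precision matrix of the idiosyncratic components generated from a
    random graph, as described in the text; names are ours."""
    rng = np.random.default_rng(seed)
    # Each off-diagonal pair (i, j) is connected with probability q.
    upper = np.triu(rng.random((p, p)) < q, k=1)
    A = (upper | upper.T).astype(float)            # symmetric, zero diagonal
    tau = np.linalg.eigvalsh(A * v).min()          # smallest eigenvalue of A*v
    Theta_eps = A * v + np.eye(p) * (abs(tau) + 0.1 + u)
    return Theta_eps

# Example: T = 128, p = round(T**0.85), q = 1 / (p * T**0.8) as in the setup above.
```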

Factors are assumed to have the following structure:

$${\mathbf{f}}_{t}=\phi_{f}{\mathbf{f}}_{t-1}+{\bm{\zeta}}_{t}, \qquad (5.2)$$
$$\underbrace{{\mathbf{r}}_{t}}_{p\times 1}={\mathbf{m}}+{\mathbf{B}}\underbrace{{\mathbf{f}}_{t}}_{K\times 1}+{\bm{\varepsilon}}_{t},\quad t=1,\ldots,T, \qquad (5.3)$$

where $m_{i}\sim\mathcal{N}(1,1)$ independently for each $i=1,\ldots,p$; ${\bm{\varepsilon}}_{t}$ is a $p\times 1$ random vector of idiosyncratic errors following $\mathcal{N}(\bm{0},{\bm{\Sigma}}_{\varepsilon})$, with sparse ${\bm{\Theta}}_{\varepsilon}$ that has the random graph structure described above; ${\mathbf{f}}_{t}$ is a $K\times 1$ vector of factors; $\phi_{f}$ is the autoregressive parameter of the factors, taken to be a scalar for simplicity; ${\mathbf{B}}$ is a $p\times K$ matrix of factor loadings; and ${\bm{\zeta}}_{t}$ is a $K\times 1$ random vector with each component independently following $\mathcal{N}(0,\sigma^{2}_{\zeta})$. To create ${\mathbf{B}}$ in (5.3) we take the first $K$ rows of the upper triangular matrix from the Cholesky decomposition of a $p\times p$ Toeplitz matrix parameterized by $\rho$. For the first set of results we set $\rho=0.2$, $\phi_{f}=0.2$ and $\sigma^{2}_{\zeta}=1$. The specification in (5.3) leads to the low-rank plus sparse decomposition of the covariance matrix of stock returns ${\mathbf{r}}_{t}$.
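The sketch below simulates one sample from (5.2)-(5.3) under the stated parameter values. We read "Toeplitz matrix parameterized by $\rho$" as the matrix with entries $\rho^{|i-j|}$ and transpose the first $K$ rows of its Cholesky factor to obtain a $p\times K$ loading matrix, which is one interpretation of the description above; all names are ours.

```python
import numpy as np
from scipy.linalg import cholesky, toeplitz

def simulate_factor_returns(T, p, K, Theta_eps, phi_f=0.2, rho=0.2,
                            sigma_zeta=1.0, seed=0):
    """One draw from the DGP in (5.2)-(5.3); names and the Toeplitz
    parameterization rho**|i-j| are our reading of the text."""
    rng = np.random.default_rng(seed)
    Sigma_eps = np.linalg.inv(Theta_eps)
    # Loadings: first K rows of the upper-triangular Cholesky factor of the
    # Toeplitz matrix, transposed to give a p x K matrix B.
    B = cholesky(toeplitz(rho ** np.arange(p)))[:K, :].T
    m = rng.normal(1.0, 1.0, size=p)                          # m_i ~ N(1, 1)
    f = np.zeros(K)
    R = np.empty((T, p))
    for t in range(T):
        f = phi_f * f + rng.normal(0.0, sigma_zeta, size=K)   # AR(1) factors
        eps = rng.multivariate_normal(np.zeros(p), Sigma_eps)
        R[t] = m + B @ f + eps
    return R
```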

As a first exercise, we compare the empirical and theoretical convergence rates of the precision matrix, portfolio weights and risk exposure. A detailed description of the procedure and the simulation results is provided in Supplemental Appendix C.1. We confirm that the empirical rates match the theoretical rates derived in Theorems 3-5.

As a second exercise, we compare the performance of FGL with the alternative models listed at the beginning of this section. We consider two cases: Case 1 uses the same setting as the first set of simulations ($p<T$): $p=T^{\delta}$, $\delta=0.85$, $K=2(\log T)^{0.5}$, $s_{T}=\mathcal{O}(T^{0.05})$. Case 2 captures the setting with $p>T$: $p=3\cdot T^{\delta}$, $\delta=0.85$, all else equal. The results for Case 2 are reported in Figures 1-3, and those for Case 1 are located in Supplemental Appendix C.2. FGL demonstrates superior performance for estimating the precision matrix and portfolio weights in both cases, exhibiting consistency in both the Case 1 and Case 2 settings. FGL also outperforms GL for estimating the portfolio risk exposure and consistently estimates the latter; however, depending on the case under consideration, some alternative models produce a lower averaged error.

As a third exercise, we examine the performance of FGL and Robust FGL (described in Supplemental Appendix B.10) when the dependent variable follows an elliptical distribution. A detailed description of the data generating process (DGP) and the simulation results is provided in Supplemental Appendix C.3. We find that the performance of FGL for estimating the precision matrix is comparable to that of Robust FGL: this suggests that our FGL algorithm is robust to heavy-tailed distributions even without additional modifications.

As a final exercise, we explore the sensitivity of portfolios constructed using different covariance and precision estimators of interest when the pervasiveness assumption 1 is relaxed. A detailed description of the data generating process (DGP) and the simulation results is provided in Supplemental Appendix C.4. We verify that FGL exhibits robust performance when the gap between the diverging and bounded eigenvalues decreases. In contrast, POET and Projected POET are the most sensitive to relaxing the pervasiveness assumption, which is consistent with our empirical findings and also with the simulation results of Onatski, (2013).

6 Empirical Application

In this section we examine the performance of the Factor Graphical Lasso for constructing a financial portfolio using daily data. The description and empirical results for monthly data can be found in Supplemental Appendix D. We first describe the data and the estimation methodology, then we list four metrics commonly reported in the finance literature, and, finally, we present the results.

6.1 Data

We use daily returns of the components of the S&P500 index. The data on historical S&P500 constituents and stock returns are fetched from CRSP and Compustat using the SAS interface. For the daily data, the full sample has 5040 observations on 420 stocks from January 20, 2000 to January 31, 2020. We use January 20, 2000 - January 24, 2002 (504 obs) as the first training (estimation) period and January 25, 2002 - January 31, 2020 (4536 obs) as the out-of-sample (OOS) test period. Supplemental Appendix D.3 examines the performance of the different competing methods for longer training periods. We roll the estimation window (training periods) over the test sample to rebalance the portfolios monthly. At the end of each month, prior to portfolio construction, we remove stocks with less than 2 years of historical stock return data. The performance of the competing models is compared with the Index – the composite S&P500 index listed as ^GSPC. We take the risk-free rate and Fama/French factors from Kenneth R. French's data library.
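A bare-bones sketch of this rolling exercise is given below; the month-grouping, the two-year history filter, and all names are ours, and `estimate_weights` stands in for any of the estimators compared later in Table 1.

```python
import pandas as pd

def rolling_backtest(returns, estimate_weights, window=504):
    """Skeleton of the monthly-rebalanced rolling-window exercise described
    above. `returns` is a T x p DataFrame of daily excess returns indexed by
    date; `estimate_weights` maps a window of returns to a weight vector."""
    out = []
    months = returns.index.to_period("M").unique()
    for month in months:
        in_month = returns.loc[returns.index.to_period("M") == month]
        past = returns.loc[returns.index < in_month.index[0]].iloc[-window:]
        if len(past) < window:
            continue                      # still inside the initial training period
        # Drop stocks with fewer than two years (window obs) of history.
        valid = past.columns[past.notna().sum() >= window]
        w = estimate_weights(past[valid].fillna(0.0))
        out.append(in_month[valid].fillna(0.0) @ w)   # hold weights within the month
    return pd.concat(out)
```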

6.2 Performance Measures

Similarly to Callot et al., (2021), we consider four metrics commonly reported in the finance literature: the Sharpe Ratio, the portfolio turnover, the average return and the risk of a portfolio (defined as the square root of the out-of-sample variance of the portfolio). We consider two scenarios: with and without transaction costs. Let $T$ denote the total number of observations; the training sample consists of $m=504$ observations, and the test sample is $n=T-m$.

When transaction costs are not taken into account, the out-of-sample average portfolio return, variance and SR are

$$\hat{\mu}_{\text{test}}=\frac{1}{n}\sum_{t=m}^{T-1}\widehat{{\mathbf{w}}}^{\prime}_{t}{\mathbf{r}}_{t+1},\quad \hat{\sigma}_{\text{test}}^{2}=\frac{1}{n-1}\sum_{t=m}^{T-1}(\widehat{{\mathbf{w}}}^{\prime}_{t}{\mathbf{r}}_{t+1}-\hat{\mu}_{\text{test}})^{2},\quad \text{SR}=\hat{\mu}_{\text{test}}/\hat{\sigma}_{\text{test}}. \qquad (6.1)$$

When transaction costs are considered, we follow Ban et al., (2018), Callot et al., (2021), DeMiguel et al., (2009), and Li, (2015) to account for the transaction costs, further denoted as tc. In line with the aforementioned papers, we set $\textup{tc}=10\text{bps}$. Define the excess portfolio return at time $t+1$ with transaction costs (tc) as

$$r_{t+1,\text{portfolio}}=\widehat{{\mathbf{w}}}^{\prime}_{t}{\mathbf{r}}_{t+1}-\textup{tc}\,(1+\widehat{{\mathbf{w}}}^{\prime}_{t}{\mathbf{r}}_{t+1})\sum_{j=1}^{p}\left\lvert\hat{w}_{t+1,j}-\hat{w}_{t,j}^{+}\right\rvert, \qquad (6.2)$$

where

$$\hat{w}_{t,j}^{+}=\hat{w}_{t,j}\frac{1+r_{t+1,j}+r^{f}_{t+1}}{1+r_{t+1,\text{portfolio}}+r^{f}_{t+1}}, \qquad (6.3)$$

$r_{t+1,j}+r^{f}_{t+1}$ is the sum of the excess return of the $j$-th asset and the risk-free rate, and $r_{t+1,\text{portfolio}}+r^{f}_{t+1}$ is the sum of the excess return of the portfolio and the risk-free rate. The out-of-sample average portfolio return, variance, Sharpe Ratio and turnover are defined accordingly:

$$\hat{\mu}_{\text{test,tc}}=\frac{1}{n}\sum_{t=m}^{T-1}r_{t,\text{portfolio}},\quad \hat{\sigma}_{\text{test,tc}}^{2}=\frac{1}{n-1}\sum_{t=m}^{T-1}(r_{t,\text{portfolio}}-\hat{\mu}_{\text{test,tc}})^{2},\quad \text{SR}_{\text{tc}}=\hat{\mu}_{\text{test,tc}}/\hat{\sigma}_{\text{test,tc}}, \qquad (6.4)$$
$$\text{Turnover}=\frac{1}{n}\sum_{t=m}^{T-1}\sum_{j=1}^{p}\left\lvert\hat{w}_{t+1,j}-\hat{w}_{t,j}^{+}\right\rvert. \qquad (6.5)$$
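The following sketch computes the four metrics under one common reading of (6.1)-(6.5). The timing of the turnover term and the use of the gross portfolio return when drifting the weights (cf. (6.3)) are simplifications, and all names are ours.

```python
import numpy as np

def oos_metrics(weights, returns, rf, tc=0.001):
    """Out-of-sample return, risk, Sharpe Ratio and turnover with proportional
    transaction costs tc. Row t of `weights` holds the weights chosen at
    rebalancing time t; row t of `returns` (and rf[t]) is the corresponding
    next-period excess return (and risk-free rate)."""
    n = returns.shape[0]
    port_ret = np.empty(n)
    turnover = np.zeros(n)
    w_plus = weights[0]                   # drifted weights carried into period t
    for t in range(n):
        gross = weights[t] @ returns[t]
        trade = np.abs(weights[t] - w_plus).sum()
        port_ret[t] = gross - tc * (1.0 + gross) * trade       # cf. (6.2)
        turnover[t] = trade
        # Weights drift with realized returns before the next rebalancing, cf. (6.3).
        w_plus = weights[t] * (1.0 + returns[t] + rf[t]) / (1.0 + gross + rf[t])
    mu, sig = port_ret.mean(), port_ret.std(ddof=1)
    return mu, sig, mu / sig, turnover.mean()
```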

6.3 Description of Empirical Design

In the empirical application for constructing the financial portfolio we consider two scenarios: when the factors are unknown and estimated using standard PCA (statistical factors), and when the factors are known. The number of statistical factors, $\hat{K}$, is estimated in accordance with Remark 1 in Supplemental Appendix D.1. For the scenario with known factors we include up to 5 Fama-French factors: FF1 includes the excess return on the market, FF3 includes FF1 plus the size factor (Small Minus Big, SMB) and the value factor (High Minus Low, HML), and FF5 includes FF3 plus the profitability factor (Robust Minus Weak, RMW) and the investment factor (Conservative Minus Aggressive, CMA).
We examine the performance of the Factor Graphical Lasso for three alternative portfolio allocations, (2.2), (2.3) and (2.5), and compare it with the equal-weighted portfolio (EW), the index portfolio (Index), FClime, FLW, FNLW (as in the simulations, these alternative covariance and precision estimators incorporate the factor structure through the Sherman-Morrison inversion formula), POET, Projected POET, and factor models without a sparsity restriction on the residual risk (FF1, FF3, and FF5).
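For intuition on how a factor component and a sparse idiosyncratic precision matrix can be combined, the sketch below uses the Sherman-Morrison-Woodbury identity, which avoids inverting a $p\times p$ covariance matrix. This is a generic illustration rather than the exact FGL estimator of Section 3, and the names are ours.

```python
import numpy as np

def combine_factor_and_sparse(Theta_eps_hat, B_hat, Sigma_f_hat):
    """Precision matrix of returns implied by Sigma = B Sigma_f B' + Sigma_eps,
    via the Sherman-Morrison-Woodbury identity:
    Theta = Theta_eps - Theta_eps B (Sigma_f^{-1} + B' Theta_eps B)^{-1} B' Theta_eps."""
    TB = Theta_eps_hat @ B_hat                                         # p x K
    middle = np.linalg.inv(np.linalg.inv(Sigma_f_hat) + B_hat.T @ TB)  # K x K
    return Theta_eps_hat - TB @ middle @ TB.T
```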
In Table 1 and Supplemental Appendix D, we report the daily and monthly portfolio performance for the three alternative portfolio allocations in (2.2), (2.3) and (2.5). We consider a relatively risk-averse investor, in the sense that they are willing to tolerate no more risk than that incurred by holding the S&P500 Index: the target level of risk for the weight-constrained and risk-constrained Markowitz portfolios (MWC and MRC) is set at $\sigma=0.013$, which is the standard deviation of the daily excess returns of the S&P500 index in the first training set. The return target is $\mu=0.0378\%$, which is equivalent to a $10\%$ yearly return when compounded. Transaction costs for each individual stock are set to a constant $0.1\%$. Supplemental Appendix D.3 provides the results for less risk-averse investors with higher target levels of risk and return for both monthly and daily data.

To compare the relative performance of the investment strategies induced by different precision matrix estimators, we use the stepwise multiple testing procedure developed in Romano and Wolf, (2005) and further covered in Romano and Wolf, (2016). Let $\text{SR}^{P}=\mu_{\text{test}}/\sigma_{\text{test}}$ be the population counterpart of the sample Sharpe Ratio defined in (6.1). We compare each strategy $s$, $1\leq s\leq S$, with the benchmark (Index) strategy, indexed as $S+1$. Define $\chi_{s}\equiv\text{SR}_{s}^{P}-\text{SR}_{S+1}^{P}$; the test statistic is $\hat{\chi}_{s}\equiv\text{SR}_{s}-\text{SR}_{S+1}$. For a given strategy $s$, we consider the individual testing problem $\mathbb{H}_{0}:\chi_{s}\leq 0$ vs. $\mathbb{H}_{A}:\chi_{s}>0$. Using the stepwise multiple testing procedure we aim to identify as many strategies as possible for which $\chi_{s}>0$: we relabel the strategies according to the size of the individual test statistics, from largest to smallest, and make the individual decisions in a stepdown manner, starting with the null hypothesis that corresponds to the largest test statistic. P-values for the competing methods are reported in the tables with the empirical results. We note that, by construction of the stepwise multiple testing procedure, the resulting p-values are relatively conservative, consistent with Remark 3.1 of Romano and Wolf, (2005).
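For illustration, the sketch below implements a heavily simplified version of this comparison: an i.i.d. bootstrap of the maximal centred statistic with a stepdown pass over the ordered hypotheses, rather than the studentized block bootstrap of Romano and Wolf, (2005); all names are ours.

```python
import numpy as np

def stepdown_sr_pvalues(strategy_rets, benchmark_ret, n_boot=1000, seed=0):
    """Simplified stepdown p-values for SR_s - SR_benchmark > 0.
    strategy_rets: n x S array of strategy returns; benchmark_ret: length-n array."""
    rng = np.random.default_rng(seed)
    sr = lambda x: x.mean(axis=0) / x.std(axis=0, ddof=1)
    n, S = strategy_rets.shape
    stat = sr(strategy_rets) - sr(benchmark_ret[:, None])      # chi_hat_s
    order = np.argsort(stat)[::-1]                             # largest statistic first
    # Bootstrap the centred statistics jointly across strategies.
    boot = np.empty((n_boot, S))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        boot[b] = sr(strategy_rets[idx]) - sr(benchmark_ret[idx, None]) - stat
    pvals = np.empty(S)
    for k, s in enumerate(order):
        # Stepdown: compare against the bootstrap max over not-yet-rejected hypotheses.
        pvals[s] = (boot[:, order[k:]].max(axis=1) >= stat[s]).mean()
    pvals[order] = np.maximum.accumulate(pvals[order])         # enforce monotonicity
    return pvals
```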

6.4 Empirical Results

This section explores the performance of the Factor Graphical Lasso for the financial portfolio using daily data.

Let us summarize the results for daily data in Table 1: (1) MRC portfolios produce higher return and higher risk than MWC and GMV. However, the out-of-sample Sharpe Ratio of MRC is lower than that of MWC and GMV, which implies that the higher risk of MRC portfolios is not fully compensated by the higher return. (2) FGL outperforms all the competitors, including EW and Index. Specifically, our method has the lowest risk and turnover (compared to FClime, FLW, FNLW and POET), and the highest out-of-sample Sharpe Ratio among all alternative methods. (3) The implementation of POET for MRC resulted in erratic behavior of this method for estimating portfolio weights; many entries in the weight matrix were "NaN". We elaborate on the reasons behind such performance below. (4) Using the observable Fama-French factors in FGL, in general, produces portfolios with higher return and higher out-of-sample Sharpe Ratio than the portfolios based on statistical factors. Interestingly, this increase in return is not accompanied by higher risk. (5) FGL strongly dominates all factor models that do not impose sparsity on the precision matrix of the idiosyncratic component. The results for monthly data are provided in Supplemental Appendix D: all conclusions are similar to those for daily data.

We now examine possible reasons behind the observed puzzling behavior of POET and Projected POET. The erratic behavior of the former is caused by the fact that the POET estimator of the covariance matrix was not positive definite, which produced poor estimates of the GMV and MWC weights and made it infeasible to compute the MRC weights (recall that, by construction, the MRC weight in (2.5) requires taking a square root). To explore the deteriorated behavior of Projected POET, let us highlight two findings from the closely related literature. First, Bailey et al., (2021) examined the "pervasiveness" degree, or strength, of 146 factors commonly used in the empirical finance literature, and found that only the market factor was strong, while all other factors were semi-strong. This indicates that the factor pervasiveness assumption 1 might be unrealistic in practice. Second, as pointed out by Onatski, (2013), "the quality of POET dramatically deteriorates as the systematic-idiosyncratic eigenvalue gap becomes small". Guided by these two findings, we attribute the deteriorated performance of POET and Projected POET to the decreased gap between the diverging and bounded eigenvalues documented in past studies of financial returns. The high sensitivity of these two covariance estimators in such settings is further supported by our additional simulation study (Supplemental Appendix C.4) examining the robustness of portfolios constructed using different covariance and precision estimators.

Table 2 compares the performance of MRC portfolios for the daily data over several periods of interest in terms of the cumulative excess return (CER), risk, and SR. To demonstrate the performance of all methods during periods of recession and expansion, we chose four periods and recorded the CER for the whole year in each period of interest. Two years, 2002 and 2008, correspond to recession periods, which is why we refer to them as "Downturns". We note that the references to the Argentine Great Depression and the Financial Crisis are not intended to limit these economic downturns to only one year; they merely provide context for the recessions. The other two years, 2017 and 2019, correspond to years that were relatively favorable to the stock market ("Booms"). Overall, it is easier to beat the Index in Downturns than in Booms. In most cases FGL shows superior performance in terms of CER and SR for Downturn #1, Boom #1 and Boom #2. For Downturn #2, even though FGL has the highest CER, its SR is smaller than the SR of some other competing methods. One explanation is the following: as evidenced by the high risk of the competing methods during Downturn #2, there were both high positive and high negative returns during the period, with the high returns driving up the average used in computing the SR. However, if one were to follow the alternative strategies, ignoring the CER statistics, the return on money deposited at the beginning of 2008 would either be negative (e.g., FClime, Projected POET) or smaller than the CER of the FGL-based strategies. This exercise demonstrates that the SR statistic alone, especially during recession periods characterized by higher volatility, can be misleading. Another interesting finding from this exercise is that FGL exhibits smaller risk than most competing methods even during the periods of recession, which holds for all portfolio formulations. This allows FGL to minimize cumulative losses during economic downturns. Subperiod analyses for the MWC and GMV portfolio formulations are presented in Supplemental Appendix D.5.

7 Conclusion

In this paper, we propose a new conditional precision matrix estimator for the excess returns under the approximate factor model with unobserved factors, which combines the benefits of graphical models and the factor structure. We establish consistency of FGL in the spectral and $\ell_{1}$ matrix norms. In addition, we prove consistency of the portfolio weights and risk exposure for three formulations of the optimal portfolio allocation without assuming sparsity of the covariance or precision matrix of stock returns. All theoretical results established in this paper hold for a wide range of distributions: the sub-Gaussian family (including Gaussian) and the elliptical family. Our simulations demonstrate that FGL is robust to very heavy-tailed distributions, which makes our method suitable for financial applications. Furthermore, we demonstrate that, in contrast to POET and Projected POET, the success of the proposed method does not heavily depend on the factor pervasiveness assumption: FGL is robust to scenarios in which the gap between the diverging and bounded eigenvalues decreases.

The empirical exercise uses the constituents of the S&P500 index and demonstrates superior performance of FGL relative to several alternative models for estimating the precision (FClime) and covariance (FLW, FNLW, POET) matrices, the Equal-Weighted (EW) portfolio and the Index portfolio in terms of the OOS SR and risk. This result is robust to using monthly and daily data. We examine three portfolio formulations and discover that the only portfolios that produce a positive CER during recessions are the ones that relax the constraint requiring the portfolio weights to sum to one.

References

  • Ait-Sahalia and Xiu, (2017) Ait-Sahalia, Y. and Xiu, D. (2017). Using principal component analysis to estimate a high dimensional factor model with high-frequency data. Journal of Econometrics, 201(2):384–399.
  • Awoye, (2016) Awoye, O. A. (2016). Markowitz Minimum Variance Portfolio Optimization Using New Machine Learning Methods. PhD thesis, University College London.
  • Bai, (2003) Bai, J. (2003). Inferential theory for factor models of large dimensions. Econometrica, 71(1):135–171.
  • Bai and Ng, (2002) Bai, J. and Ng, S. (2002). Determining the number of factors in approximate factor models. Econometrica, 70(1):191–221.
  • Bai and Ng, (2006) Bai, J. and Ng, S. (2006). Confidence intervals for diffusion index forecasts and inference for factor-augmented regressions. Econometrica, 74(4):1133–1150.
  • Bailey et al., (2021) Bailey, N., Kapetanios, G., and Pesaran, M. H. (2021). Measurement of factor strength: Theory and practice. Journal of Applied Econometrics, 36(5):587–613.
  • Ban et al., (2018) Ban, G.-Y., El Karoui, N., and Lim, A. E. (2018). Machine learning and portfolio optimization. Management Science, 64(3):1136–1154.
  • Barigozzi et al., (2018) Barigozzi, M., Brownlees, C., and Lugosi, G. (2018). Power-law partial correlation network models. Electronic Journal of Statistics, 12(2):2905–2929.
  • Bishop, (2006) Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg.
  • Brownlees et al., (2018) Brownlees, C., Nualart, E., and Sun, Y. (2018). Realized networks. Journal of Applied Econometrics, 33(7):986–1006.
  • Cai et al., (2011) Cai, T., Liu, W., and Luo, X. (2011). A constrained l1-minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607.
  • Cai et al., (2020) Cai, T. T., Hu, J., Li, Y., and Zheng, X. (2020). High-dimensional minimum variance portfolio estimation based on high-frequency data. Journal of Econometrics, 214(2):482–494.
  • Callot et al., (2021) Callot, L., Caner, M., Önder, A. O., and Ulaşan, E. (2021). A nodewise regression approach to estimating large portfolios. Journal of Business & Economic Statistics, 39(2):520–531.
  • Callot et al., (2017) Callot, L. A. F., Kock, A. B., and Medeiros, M. C. (2017). Modeling and forecasting large realized covariance matrices and portfolio choice. Journal of Applied Econometrics, 32(1):140–158.
  • Campbell et al., (1997) Campbell, J. Y., Lo, A. W., and MacKinlay, A. C. (1997). The Econometrics of Financial Markets. Princeton University Press.
  • Chang et al., (2018) Chang, J., Qiu, Y., Yao, Q., and Zou, T. (2018). Confidence regions for entries of a large precision matrix. Journal of Econometrics, 206(1):57–82.
  • Chudik et al., (2011) Chudik, A., Pesaran, M. H., and Tosetti, E. (2011). Weak and strong cross-section dependence and estimation of large panels. The Econometrics Journal, 14(1):C45–C90.
  • Connor and Korajczyk, (1988) Connor, G. and Korajczyk, R. A. (1988). Risk and return in an equilibrium APT: Application of a new test methodology. Journal of Financial Economics, 21(2):255–289.
  • DeMiguel et al., (2009) DeMiguel, V., Garlappi, L., and Uppal, R. (2009). Optimal versus naive diversification: How inefficient is the 1/n portfolio strategy? The Review of Financial Studies, 22(5):1915–1953.
  • Ding et al., (2021) Ding, Y., Li, Y., and Zheng, X. (2021). High dimensional minimum variance portfolio estimation under statistical factor models. Journal of Econometrics, 222(1, Part B):502–515.
  • Fama and French, (1993) Fama, E. F. and French, K. R. (1993). Common risk factors in the returns on stocks and bonds. Journal of Financial Economics, 33(1):3–56.
  • Fama and French, (2015) Fama, E. F. and French, K. R. (2015). A five-factor asset pricing model. Journal of Financial Economics, 116(1):1–22.
  • Fan et al., (2008) Fan, J., Fan, Y., and Lv, J. (2008). High dimensional covariance matrix estimation using a factor model. Journal of Econometrics, 147(1):186 – 197.
  • Fan et al., (2011) Fan, J., Liao, Y., and Mincheva, M. (2011). High-dimensional covariance matrix estimation in approximate factor models. The Annals of Statistics, 39(6):3320–3356.
  • Fan et al., (2013) Fan, J., Liao, Y., and Mincheva, M. (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B, 75(4):603–680.
  • Fan et al., (2018) Fan, J., Liu, H., and Wang, W. (2018). Large covariance estimation through elliptical factor models. The Annals of Statistics, 46(4):1383–1414.
  • Friedman et al., (2008) Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the Graphical Lasso. Biostatistics, 9(3):432–441.
  • Gabaix, (2011) Gabaix, X. (2011). The granular origins of aggregate fluctuations. Econometrica, 79(3):733–772.
  • Goto and Xu, (2015) Goto, S. and Xu, Y. (2015). Improving mean variance optimization through sparse hedging restrictions. Journal of Financial and Quantitative Analysis, 50(6):1415–1441.
  • Hastie et al., (2001) Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA.
  • Hautsch et al., (2012) Hautsch, N., Kyj, L. M., and Oomen, R. C. A. (2012). A blocking and regularization approach to high-dimensional realized covariance estimation. Journal of Applied Econometrics, 27(4):625–645.
  • Janková and van de Geer, (2018) Janková, J. and van de Geer, S. (2018). Inference in high-dimensional graphical models. Handbook of Graphical Models, Chapter 14, pages 325–351. CRC Press.
  • Kapetanios, (2010) Kapetanios, G. (2010). A testing procedure for determining the number of factors in approximate factor models with large datasets. Journal of Business & Economic Statistics, 28(3):397–409.
  • Koike, (2020) Koike, Y. (2020). De-biased graphical lasso for high-frequency data. Entropy, 22(4):456.
  • Ledoit and Wolf, (2004) Ledoit, O. and Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365–411.
  • Ledoit and Wolf, (2017) Ledoit, O. and Wolf, M. (2017). Nonlinear shrinkage of the covariance matrix for portfolio selection: Markowitz meets goldilocks. The Review of Financial Studies, 30(12):4349–4388.
  • Li et al., (2017) Li, H., Li, Q., and Shi, Y. (2017). Determining the number of factors when the number of factors can increase with sample size. Journal of Econometrics, 197(1):76–86.
  • Li, (2015) Li, J. (2015). Sparse and stable portfolio selection with parameter uncertainty. Journal of Business & Economic Statistics, 33(3):381–392.
  • Mazumder and Hastie, (2012) Mazumder, R. and Hastie, T. (2012). The Graphical Lasso: new insights and alternatives. Electronic Journal of Statistics, 6:2125–2149.
  • Meinshausen and Bühlmann, (2006) Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462.
  • Millington and Niranjan, (2017) Millington, T. and Niranjan, M. (2017). Robust portfolio risk minimization using the graphical lasso. In Neural Information Processing, pages 863–872, Cham. Springer International Publishing.
  • Onatski, (2013) Onatski, A. (2013). Discussion on the paper by Fan J., Liao Y., and Mincheva M. Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B, 75(4):650–652.
  • Pourahmadi, (2013) Pourahmadi, M. (2013). High-Dimensional Covariance Estimation: With High-Dimensional Data. Wiley Series in Probability and Statistics. John Wiley and Sons, 2013.
  • Ravikumar et al., (2011) Ravikumar, P., Wainwright, M. J., Raskutti, G., and Yu, B. (2011). High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980.
  • Romano and Wolf, (2005) Romano, J. P. and Wolf, M. (2005). Stepwise multiple testing as formalized data snooping. Econometrica, 73(4):1237–1282.
  • Romano and Wolf, (2016) Romano, J. P. and Wolf, M. (2016). Efficient computation of adjusted p-values for resampling-based stepdown multiple testing. Statistics and Probability Letters, 113:38–40.
  • Ross, (1976) Ross, S. A. (1976). The arbitrage theory of capital asset pricing. Journal of Economic Theory, 13(3):341–360.
  • Stock and Watson, (2002) Stock, J. H. and Watson, M. W. (2002). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association, 97(460):1167–1179.
  • Tobin, (1958) Tobin, J. (1958). Liquidity preference as behavior towards risk. The Review of Economic Studies, 25(2):65–86.
  • Zhao et al., (2012) Zhao, T., Liu, H., Roeder, K., Lafferty, J., and Wasserman, L. (2012). The HUGE package for high-dimensional undirected graph estimation in R. Journal of Machine Learning Research, 13(1):1059–1062.
Figure 1: Averaged errors of the estimators of ${\bm{\Theta}}$ for Case 2 on logarithmic scale: $p=3\cdot T^{0.85}$, $K=2(\log T)^{0.5}$, $s_{T}=\mathcal{O}(T^{0.05})$.
Figure 2: Averaged errors of the estimators of ${\mathbf{w}}_{\text{GMV}}$ (left) and ${\mathbf{w}}_{\text{MRC}}$ (right) for Case 2 on logarithmic scale: $p=3\cdot T^{0.85}$, $K=2(\log T)^{0.5}$, $s_{T}=\mathcal{O}(T^{0.05})$.
Figure 3: Averaged errors of the estimators of $\Phi_{\text{GMV}}$ (left) and $\Phi_{\text{MRC}}$ (right) for Case 2 on logarithmic scale: $p=3\cdot T^{0.85}$, $K=2(\log T)^{0.5}$, $s_{T}=\mathcal{O}(T^{0.05})$.
Table 1: Daily portfolio returns, risk, SR and turnover. In the upper part, corresponding to the results without transaction costs, p-values are in parentheses. In the lower part, corresponding to the results with transaction costs, *** indicates p-value < 0.01, ** indicates p-value < 0.05, and * indicates p-value < 0.10. In-sample: January 20, 2000 - January 24, 2002 (504 obs); Out-of-sample: January 17, 2002 - January 31, 2020 (4536 obs).
Without transaction costs — each cell reports Return / Risk / SR (p-value); turnover is not reported.

Method          | Markowitz Risk-Constrained (MRC)        | Markowitz Weight-Constrained (MWC)        | Global Minimum-Variance (GMV)
EW              | 2.33E-04 / 1.90E-02 / 0.0123            | 2.33E-04 / 1.90E-02 / 0.0123              | 2.33E-04 / 1.90E-02 / 0.0123
Index           | 1.86E-04 / 1.17E-02 / 0.0159            | 1.86E-04 / 1.17E-02 / 0.0159              | 1.86E-04 / 1.17E-02 / 0.0159
FGL             | 8.12E-04 / 2.66E-02 / 0.0305 (0.0579)   | 2.95E-04 / 8.21E-03 / 0.0360 (0.024)      | 2.94E-04 / 7.51E-03 / 0.0392 (0.0279)
FClime          | 2.15E-03 / 8.46E-02 / 0.0254 (0.0758)   | 2.02E-04 / 9.85E-03 / 0.0205 (0.0299)     | 2.73E-04 / 1.07E-02 / 0.0255 (0.0419)
FLW             | 4.34E-04 / 2.65E-02 / 0.0164 (0.1782)   | 3.12E-04 / 9.96E-03 / 0.0313 (0.024)      | 3.10E-04 / 9.38E-03 / 0.0330 (0.0279)
FNLW            | 4.91E-04 / 6.66E-02 / 0.0074 (0.5515)   | 2.98E-04 / 1.24E-02 / 0.0241 (0.0419)     | 3.06E-04 / 1.32E-02 / 0.0231 (0.0419)
POET            | NaN / NaN / NaN                         | -7.06E-04 / 2.74E-01 / -0.0026 (0.9137)   | 1.07E-03 / 2.71E-01 / 0.0039 (0.7912)
Projected POET  | 1.20E-03 / 1.71E-01 / 0.0070 (0.5515)   | -8.06E-05 / 1.61E-02 / -0.0050 (0.9337)   | -7.57E-05 / 1.93E-02 / -0.0039 (0.9482)
FGL (FF1)       | 7.96E-04 / 2.80E-02 / 0.0285 (0.0758)   | 3.73E-04 / 8.73E-03 / 0.0427 (0.024)      | 3.52E-04 / 8.62E-03 / 0.0408 (0.0259)
FGL (FF3)       | 6.51E-04 / 2.74E-02 / 0.0238 (0.0758)   | 3.52E-04 / 8.96E-03 / 0.0393 (0.024)      | 3.39E-04 / 8.94E-03 / 0.0379 (0.022)
FGL (FF5)       | 5.87E-04 / 2.70E-02 / 0.0217 (0.0758)   | 3.47E-04 / 9.38E-03 / 0.0370 (0.024)      | 3.36E-04 / 9.29E-03 / 0.0362 (0.022)
FF1             | 7.38E-04 / 1.11E-01 / 0.0067 (0.5821)   | 3.30E-05 / 1.62E-02 / 0.0020 (0.7139)     | 2.49E-05 / 1.61E-02 / 0.0015 (0.8430)
FF3             | 7.52E-04 / 1.11E-01 / 0.0068 (0.5821)   | 2.68E-05 / 1.62E-02 / 0.0017 (0.7139)     | 2.06E-05 / 1.61E-02 / 0.0013 (0.8430)
FF5             | 7.59E-04 / 1.11E-01 / 0.0069 (0.5821)   | 2.01E-05 / 1.62E-02 / 0.0012 (0.7139)     | 1.38E-05 / 1.61E-02 / 0.0009 (0.8430)

With transaction costs — each cell reports Return / Risk / SR / Turnover.

Method          | Markowitz Risk-Constrained (MRC)          | Markowitz Weight-Constrained (MWC)        | Global Minimum-Variance (GMV)
EW              | 2.01E-04 / 1.90E-02 / 0.0106 / 0.0292     | 2.01E-04 / 1.90E-02 / 0.0106 / 0.0292     | 2.01E-04 / 1.90E-02 / 0.0106 / 0.0292
FGL             | 4.47E-04 / 2.66E-02 / 0.0168 / 0.3655     | 2.30E-04 / 8.22E-03 / 0.0280* / 0.0666    | 2.32E-04 / 7.52E-03 / 0.0309* / 0.0633
FClime          | 1.18E-03 / 8.48E-02 / 0.0139 / 1.0005     | 1.67E-04 / 9.86E-03 / 0.0170 / 0.0369     | 2.46E-04 / 1.07E-02 / 0.0230* / 0.0290
FLW             | -5.54E-05 / 2.65E-02 / -0.0021 / 0.4874   | 1.92E-04 / 9.98E-03 / 0.0193 / 0.1207     | 1.92E-04 / 9.39E-03 / 0.0204* / 0.1194
FNLW            | -2.39E-03 / 7.03E-02 / -0.0340 / 3.6370   | 5.50E-05 / 1.25E-02 / 0.0044 / 0.2441     | 6.08E-05 / 1.33E-02 / 0.0046 / 0.2457
POET            | NaN / NaN / NaN / NaN                     | -2.28E-02 / 5.55E-01 / -0.0411 / 113.3848 | -2.81E-02 / 4.21E-01 / -0.0666 / 132.8215
Projected POET  | -1.59E-02 / 3.64E-01 / -0.0437 / 35.9692  | -1.03E-03 / 1.68E-02 / -0.0616 / 0.9544   | -1.37E-03 / 2.06E-02 / -0.0666 / 1.2946
FGL (FF1)       | 3.86E-04 / 2.80E-02 / 0.0138 / 0.4068     | 2.82E-04 / 8.74E-03 / 0.0323** / 0.0903   | 2.63E-04 / 8.63E-03 / 0.0305* / 0.0887
FGL (FF3)       | 2.47E-04 / 2.74E-02 / 0.0090 / 0.4043     | 2.60E-04 / 8.98E-03 / 0.0290** / 0.0928   | 2.49E-04 / 8.96E-03 / 0.0278* / 0.0911
FGL (FF5)       | 1.83E-04 / 2.71E-02 / 0.0068 / 0.4032     | 2.53E-04 / 9.40E-03 / 0.0269* / 0.0952    | 2.43E-04 / 9.30E-03 / 0.0262* / 0.0937
FF1             | -6.69E-03 / 1.28E-01 / -0.0639 / 8.5721   | -5.27E-04 / 1.65E-02 / -0.0319 / 0.5704   | -5.30E-04 / 1.64E-02 / -0.0323 / 0.5641
FF3             | -6.65E-03 / 1.28E-01 / -0.0635 / 8.5411   | -5.33E-04 / 1.65E-02 / -0.0323 / 0.5701   | -5.34E-04 / 1.64E-02 / -0.0326 / 0.5638
FF5             | -6.63E-03 / 1.28E-01 / -0.0634 / 8.5262   | -5.40E-04 / 1.65E-02 / -0.0327 / 0.5703   | -5.41E-04 / 1.64E-02 / -0.0330 / 0.5646
Table 2: Cumulative excess return (CER) and risk of MRC portfolios using daily data. Targeted risk is set at $\sigma=0.013$; the daily targeted return is $0.0378\%$. P-values are in parentheses. In-sample: January 20, 2000 - January 24, 2002 (504 obs); Out-of-sample: January 17, 2002 - January 31, 2020 (4536 obs).
Downturn #1: Argentine Great Depression (2002)
Method     | CER      | Risk    | SR (p-value)
EW         | -0.1633  | 0.0160  | -0.0393
Index      | -0.2418  | 0.0168  | -0.0615
FGL        |  0.2909  | 0.0206  |  0.0629 (0.0619)
FClime     | -0.0079  | 0.0348  |  0.0164 (0.0759)
FLW        |  0.0308  | 0.0231  |  0.0171 (0.0759)
FNLW       |  0.0728  | 0.0213  |  0.0246 (0.0759)
ProjPOET   | -0.6178  | 0.0545  | -0.0467 (0.4852)
FGL(FF1)   |  0.3375  | 0.0211  |  0.0689 (0.0619)
FGL(FF3)   |  0.3423  | 0.0211  |  0.0696 (0.0619)
FGL(FF5)   |  0.3401  | 0.0212  |  0.0692 (0.0619)
FF1        | -0.0860  | 0.0495  |  0.0169 (0.0759)
FF3        | -0.0860  | 0.0495  |  0.0169 (0.0759)
FF5        | -0.0860  | 0.0495  |  0.0169 (0.0759)

Downturn #2: Financial Crisis (2008)
Method     | CER      | Risk    | SR (p-value)
EW         | -0.5622  | 0.0310  | -0.0857
Index      | -0.4746  | 0.0258  | -0.0857
FGL        |  0.2938  | 0.0282  |  0.0315 (0.0889)
FClime     | -0.8912  | 0.1484  |  0.1045 (0.1079)
FLW        |  0.2885  | 0.0315  |  0.0282 (0.1079)
FNLW       |  0.2075  | 0.0392  |  0.0392 (0.1079)
ProjPOET   | -0.9999  | 0.1963  |  0.1963 (0.1079)
FGL(FF1)   |  0.2665  | 0.0320  |  0.0320 (0.0889)
FGL(FF3)   |  0.2650  | 0.0319  |  0.0319 (0.0889)
FGL(FF5)   |  0.2560  | 0.0319  |  0.0319 (0.0889)
FF1        |  0.0404  | 0.0986  |  0.0006 (0.1079)
FF3        |  0.0404  | 0.0986  |  0.0006 (0.1079)
FF5        |  0.0404  | 0.0986  |  0.0006 (0.1079)

Boom #1 (2017)
Method     | CER      | Risk    | SR (p-value)
EW         |  0.0627  | 0.0218  |  0.0220
Index      |  0.1752  | 0.0042  |  0.1536
FGL        |  0.7267  | 0.0142  |  0.1606 (0.5544)
FClime     |  0.5331  | 0.0383  |  0.1231 (0.6264)
FLW        |  0.3164  | 0.0118  |  0.0987 (0.6264)
FNLW       |  0.5796  | 0.0497  |  0.1008 (0.5465)
ProjPOET   | -0.7599  | 0.1197  |  0.0151 (0.9815)
FGL(FF1)   |  0.6568  | 0.0135  |  0.1563 (0.5455)
FGL(FF3)   |  0.6607  | 0.0134  |  0.1581 (0.5455)
FGL(FF5)   |  0.6486  | 0.0132  |  0.1575 (0.5455)
FF1        | -0.5070  | 0.0720  | -0.0022 (0.9985)
FF3        | -0.5294  | 0.0721  | -0.0046 (0.9985)
FF5        | -0.4755  | 0.0710  |  0.0002 (0.9815)

Boom #2 (2019)
Method     | CER      | Risk    | SR (p-value)
EW         |  0.1642  | 0.0185  |  0.0418
Index      |  0.2934  | 0.0086  |  0.1228
FGL        |  0.6872  | 0.0263  |  0.0919 (0.1738)
FClime     |  0.2346  | 0.0557  |  0.0436 (0.2298)
FLW        |  0.5520  | 0.0287  |  0.0753 (0.2298)
FNLW       |  0.9315  | 0.0355  |  0.0905 (0.2298)
ProjPOET   |  1.8592  | 0.1177  |  0.0898 (0.2298)
FGL(FF1)   |  0.5166  | 0.0247  |  0.0793 (0.1728)
FGL(FF3)   |  0.5168  | 0.0248  |  0.0792 (0.1728)
FGL(FF5)   |  0.5037  | 0.0248  |  0.0779 (0.1728)
FF1        |  0.2690  | 0.1094  |  0.0798 (0.2298)
FF3        |  0.2682  | 0.1094  |  0.0798 (0.2298)
FF5        |  0.2730  | 0.1094  |  0.0799 (0.2298)

This Online Supplemental Appendix is structured as follows: Appendix A summarizes the Graphical Lasso algorithm; Appendix B contains the proofs of the theorems, accompanying lemmas, and an extension of the theorems to elliptical distributions; Appendix C provides additional simulations for Section 5; and Appendix D contains additional empirical results for Section 6.

Appendix A Graphical Lasso Algorithm

To solve (3.3) we use a procedure based on the weighted Graphical Lasso, which was first proposed in Friedman et al., (2008) and further studied in Mazumder and Hastie, (2012) and Janková and van de Geer, (2018), among others. Define the following partitions of ${\mathbf{W}}_{\varepsilon}$, $\widehat{{\bm{\Sigma}}}_{\varepsilon}$ and ${\bm{\Theta}}_{\varepsilon}$:

$${\mathbf{W}}_{\varepsilon}=\begin{pmatrix}{\mathbf{W}}_{\varepsilon,11}&{\mathbf{w}}_{\varepsilon,12}\\ {\mathbf{w}}_{\varepsilon,12}^{\prime}&w_{\varepsilon,22}\end{pmatrix},\quad\widehat{{\bm{\Sigma}}}_{\varepsilon}=\begin{pmatrix}\widehat{{\bm{\Sigma}}}_{\varepsilon,11}&\widehat{{\bm{\sigma}}}_{\varepsilon,12}\\ \widehat{{\bm{\sigma}}}_{\varepsilon,12}^{\prime}&\widehat{\sigma}_{\varepsilon,22}\end{pmatrix},\quad{\bm{\Theta}}_{\varepsilon}=\begin{pmatrix}{\bm{\Theta}}_{\varepsilon,11}&{\bm{\theta}}_{\varepsilon,12}\\ {\bm{\theta}}_{\varepsilon,12}^{\prime}&\theta_{\varepsilon,22}\end{pmatrix}, \qquad (A.1)$$

where ${\mathbf{W}}_{\varepsilon,11}$, $\widehat{{\bm{\Sigma}}}_{\varepsilon,11}$ and ${\bm{\Theta}}_{\varepsilon,11}$ are $(p-1)\times(p-1)$ blocks and ${\mathbf{w}}_{\varepsilon,12}$, $\widehat{{\bm{\sigma}}}_{\varepsilon,12}$ and ${\bm{\theta}}_{\varepsilon,12}$ are $(p-1)\times 1$ vectors.

Let ${\bm{\beta}}\equiv-{\bm{\theta}}_{\varepsilon,12}/\theta_{\varepsilon,22}$. The idea of GLASSO is to set ${\mathbf{W}}_{\varepsilon}=\widehat{{\bm{\Sigma}}}_{\varepsilon}+\lambda{\mathbf{I}}$ in (3.3) and combine the gradient of (3.3) with the formula for partitioned inverses to obtain the following $\ell_{1}$-regularized quadratic program:

$$\widehat{{\bm{\beta}}}=\arg\!\min_{{\bm{\beta}}\in\mathbb{R}^{p-1}}\Bigl\{\frac{1}{2}{\bm{\beta}}^{\prime}{\mathbf{W}}_{\varepsilon,11}{\bm{\beta}}-{\bm{\beta}}^{\prime}\widehat{{\bm{\sigma}}}_{\varepsilon,12}+\lambda\left\lVert{\bm{\beta}}\right\rVert_{1}\Bigr\}. \qquad (A.2)$$

As shown by Friedman et al., (2008), (A.2) can be viewed as a LASSO regression, where the LASSO estimates are functions of the inner products of ${\mathbf{W}}_{\varepsilon,11}$ and $\widehat{{\bm{\sigma}}}_{\varepsilon,12}$. Hence, (3.3) is equivalent to $p$ coupled LASSO problems. Once we obtain $\widehat{{\bm{\beta}}}$, we can recover the entries of ${\bm{\Theta}}_{\varepsilon}$ using the formula for partitioned inverses. The procedure to obtain a sparse ${\bm{\Theta}}_{\varepsilon}$ is summarized in Algorithm 1.

Algorithm 1 Graphical Lasso (Friedman et al., (2008))
1:  Initialize ${\mathbf{W}}_{\varepsilon}=\widehat{{\bm{\Sigma}}}_{\varepsilon}+\lambda{\mathbf{I}}$. The diagonal of ${\mathbf{W}}_{\varepsilon}$ remains unchanged in what follows.
2:  Repeat for $j=1,\ldots,p,1,\ldots,p,\ldots$ until convergence:
  • Partition ${\mathbf{W}}_{\varepsilon}$ into part 1: all but the $j$-th row and column, and part 2: the $j$-th row and column.

  • Solve the score equations using cyclical coordinate descent: ${\mathbf{W}}_{\varepsilon,11}{\bm{\beta}}-\widehat{{\bm{\sigma}}}_{\varepsilon,12}+\lambda\cdot\text{Sign}({\bm{\beta}})=\mathbf{0}$. This gives a $(p-1)\times 1$ solution vector $\widehat{{\bm{\beta}}}$.

  • Update $\widehat{{\mathbf{w}}}_{\varepsilon,12}={\mathbf{W}}_{\varepsilon,11}\widehat{{\bm{\beta}}}$.

3:  In the final cycle (for $j=1,\ldots,p$) solve for $\frac{1}{\widehat{\theta}_{22}}=w_{\varepsilon,22}-\widehat{{\bm{\beta}}}^{\prime}\widehat{{\mathbf{w}}}_{\varepsilon,12}$ and $\widehat{{\bm{\theta}}}_{12}=-\widehat{\theta}_{22}\widehat{{\bm{\beta}}}$.

As was shown in Friedman et al., (2008) and the follow-up paper by Mazumder and Hastie, (2012), the estimator produced by Graphical Lasso is guaranteed to be positive definite.
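In practice one rarely needs to code Algorithm 1 from scratch: scikit-learn ships a coordinate-descent implementation of the Friedman et al., (2008) estimator. The sketch below applies it to factor-model residuals to obtain a sparse estimate of ${\bm{\Theta}}_{\varepsilon}$; the residuals and the choice of $\lambda$ (here `alpha`) are inputs, and the names are ours.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def sparse_idiosyncratic_precision(residuals, alpha):
    """Sparse precision matrix of the idiosyncratic component estimated by the
    Graphical Lasso; `residuals` is a T x p array of factor-model residuals and
    `alpha` the l1 penalty (lambda in the notation above)."""
    S = np.cov(residuals, rowvar=False)          # sample covariance of residuals
    _, Theta_eps_hat = graphical_lasso(S, alpha=alpha)
    return Theta_eps_hat
```

When the penalty is not fixed in advance, `sklearn.covariance.GraphicalLassoCV` selects it by cross-validation; this is one common tuning choice and not necessarily the rule used in the paper.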

Appendix B Proofs of the Theorems

B.1 Lemmas for Theorem 1

Lemma 1.

Under the assumptions of Theorem 1,

  (a) $\max_{i,j\leq K}\left\lvert(1/T)\sum_{t=1}^{T}f_{it}f_{jt}-\mathbb{E}\left[f_{it}f_{jt}\right]\right\rvert=\mathcal{O}_{P}(\sqrt{1/T})$,

  (b) $\max_{i,j\leq p}\left\lvert(1/T)\sum_{t=1}^{T}\varepsilon_{it}\varepsilon_{jt}-\mathbb{E}\left[\varepsilon_{it}\varepsilon_{jt}\right]\right\rvert=\mathcal{O}_{P}(\sqrt{\log p/T})$,

  (c) $\max_{i\leq K,j\leq p}\left\lvert(1/T)\sum_{t=1}^{T}f_{it}\varepsilon_{jt}\right\rvert=\mathcal{O}_{P}(\sqrt{\log p/T})$.

Proof.

The proof of Lemma 1 can be found in Fan et al. (2011) (Lemma B.1). ∎

Lemma 2.

Under Assumption (A.4), $\max_{t\leq T}\sum_{s=1}^{T}\left\lvert\mathbb{E}\left[{\bm{\varepsilon}}^{\prime}_{s}{\bm{\varepsilon}}_{t}\right]\right\rvert/p=\mathcal{O}(1)$.

Proof.

The proof of Lemma 2 can be found in Fan et al. (2013) (Lemma A.6). ∎

Lemma 3.

For $\widehat{K}$ defined in expression (3.6),

$$\Pr\Big(\widehat{K}=K\Big)\rightarrow 1.$$
Proof.

The proof of Lemma 3 can be found in Li et al. (2017) (Theorem 1 and Corollary 1). ∎

Using the expressions (A.1) in Bai (2003) and (C.2) in Fan et al. (2013), we have the following identity:

$$\widehat{{\mathbf{f}}}_{t}-{\mathbf{H}}{\mathbf{f}}_{t}=\Big(\frac{{\mathbf{V}}}{p}\Big)^{-1}\Bigg[\frac{1}{T}\sum_{s=1}^{T}\widehat{{\mathbf{f}}}_{s}\frac{\mathbb{E}\left[{\bm{\varepsilon}}^{\prime}_{s}{\bm{\varepsilon}}_{t}\right]}{p}+\frac{1}{T}\sum_{s=1}^{T}\widehat{{\mathbf{f}}}_{s}\zeta_{st}+\frac{1}{T}\sum_{s=1}^{T}\widehat{{\mathbf{f}}}_{s}\eta_{st}+\frac{1}{T}\sum_{s=1}^{T}\widehat{{\mathbf{f}}}_{s}\xi_{st}\Bigg], \qquad (B.1)$$

where $\zeta_{st}={\bm{\varepsilon}}^{\prime}_{s}{\bm{\varepsilon}}_{t}/p-\mathbb{E}\left[{\bm{\varepsilon}}^{\prime}_{s}{\bm{\varepsilon}}_{t}\right]/p$, $\eta_{st}={\mathbf{f}}^{\prime}_{s}\sum_{i=1}^{p}{\mathbf{b}}_{i}\varepsilon_{it}/p$ and $\xi_{st}={\mathbf{f}}^{\prime}_{t}\sum_{i=1}^{p}{\mathbf{b}}_{i}\varepsilon_{is}/p$.

Lemma 4.

For all $i\leq\widehat{K}$,

  (a) $(1/T)\sum_{t=1}^{T}\big[(1/T)\sum_{s=1}^{T}\hat{f}_{is}\mathbb{E}\left[{\bm{\varepsilon}}^{\prime}_{s}{\bm{\varepsilon}}_{t}\right]/p\big]^{2}=\mathcal{O}_{P}(T^{-1})$,

  (b) $(1/T)\sum_{t=1}^{T}\big[(1/T)\sum_{s=1}^{T}\hat{f}_{is}\zeta_{st}\big]^{2}=\mathcal{O}_{P}(p^{-1})$,

  (c) $(1/T)\sum_{t=1}^{T}\big[(1/T)\sum_{s=1}^{T}\hat{f}_{is}\eta_{st}\big]^{2}=\mathcal{O}_{P}(K^{2}/p)$,

  (d) $(1/T)\sum_{t=1}^{T}\big[(1/T)\sum_{s=1}^{T}\hat{f}_{is}\xi_{st}\big]^{2}=\mathcal{O}_{P}(K^{2}/p)$.

Proof.

We only prove (c) and (d); the proof of (a) and (b) can be found in Fan et al. (2013) (Lemma 8).

  (c) Recall that $\eta_{st}={\mathbf{f}}^{\prime}_{s}\sum_{i=1}^{p}{\mathbf{b}}_{i}\varepsilon_{it}/p$. Using Assumption (A.5), we get $\mathbb{E}\big[(1/T)\sum_{t=1}^{T}\lVert\sum_{i=1}^{p}{\mathbf{b}}_{i}\varepsilon_{it}\rVert^{2}\big]=\mathbb{E}\big[\lVert\sum_{i=1}^{p}{\mathbf{b}}_{i}\varepsilon_{it}\rVert^{2}\big]=\mathcal{O}(pK)$. Therefore, by the Cauchy-Schwarz inequality and the facts that $(1/T)\sum_{t=1}^{T}\lVert{\mathbf{f}}_{t}\rVert^{2}=\mathcal{O}(K)$ and, for all $i$, $\sum_{s=1}^{T}\hat{f}_{is}^{2}=T$,

$$\frac{1}{T}\sum_{t=1}^{T}\Big(\frac{1}{T}\sum_{s=1}^{T}\hat{f}_{is}\eta_{st}\Big)^{2}\leq\frac{1}{Tp^{2}}\sum_{t=1}^{T}\Big\lVert\sum_{j=1}^{p}{\mathbf{b}}_{j}\varepsilon_{jt}\Big\rVert^{2}\Bigg(\frac{1}{T}\sum_{s=1}^{T}\hat{f}_{is}^{2}\,\frac{1}{T}\sum_{s=1}^{T}\left\lVert{\mathbf{f}}_{s}\right\rVert^{2}\Bigg)=\mathcal{O}_{P}\Big(\frac{K}{p}\cdot K\Big)=\mathcal{O}_{P}\Big(\frac{K^{2}}{p}\Big).$$

  (d) Using a similar approach as in part (c):

$$\frac{1}{T}\sum_{t=1}^{T}\Big(\frac{1}{T}\sum_{s=1}^{T}\hat{f}_{is}\xi_{st}\Big)^{2}\leq\Big(\frac{1}{T}\sum_{t=1}^{T}\left\lVert{\mathbf{f}}_{t}\right\rVert^{2}\Big)\frac{1}{T}\sum_{s=1}^{T}\Big\lVert\frac{1}{p}\sum_{j=1}^{p}{\mathbf{b}}_{j}\varepsilon_{js}\Big\rVert^{2}\Big(\frac{1}{T}\sum_{s=1}^{T}\hat{f}_{is}^{2}\Big)=\mathcal{O}_{P}\Big(K\cdot\frac{pK}{p^{2}}\cdot 1\Big)=\mathcal{O}_{P}\Big(\frac{K^{2}}{p}\Big).$$

Lemma 5.
  (a) $\max_{t\leq T}\left\lVert(1/(Tp))\sum_{s=1}^{T}\widehat{{\mathbf{f}}}^{\prime}_{s}\mathbb{E}\left[{\bm{\varepsilon}}^{\prime}_{s}{\bm{\varepsilon}}_{t}\right]\right\rVert=\mathcal{O}_{P}(K/\sqrt{T})$,

  (b) $\max_{t\leq T}\left\lVert(1/T)\sum_{s=1}^{T}\widehat{{\mathbf{f}}}^{\prime}_{s}\zeta_{st}\right\rVert=\mathcal{O}_{P}(\sqrt{K}T^{1/4}/\sqrt{p})$,

  (c) $\max_{t\leq T}\left\lVert(1/T)\sum_{s=1}^{T}\widehat{{\mathbf{f}}}^{\prime}_{s}\eta_{st}\right\rVert=\mathcal{O}_{P}(KT^{1/4}/\sqrt{p})$,

  (d) $\max_{t\leq T}\left\lVert(1/T)\sum_{s=1}^{T}\widehat{{\mathbf{f}}}^{\prime}_{s}\xi_{st}\right\rVert=\mathcal{O}_{P}(KT^{1/4}/\sqrt{p})$.

Proof.

Our proof is similar to that in Fan et al. (2013); however, we relax the assumption of a fixed $K$.

  (a) Using the Cauchy-Schwarz inequality, Lemma 2, and the fact that $(1/T)\sum_{t=1}^{T}\lVert\widehat{{\mathbf{f}}}_{t}\rVert^{2}=\mathcal{O}_{P}(K)$, we get

$$\max_{t\leq T}\left\lVert\frac{1}{Tp}\sum_{s=1}^{T}\widehat{{\mathbf{f}}}^{\prime}_{s}\mathbb{E}\left[{\bm{\varepsilon}}^{\prime}_{s}{\bm{\varepsilon}}_{t}\right]\right\rVert\leq\max_{t\leq T}\Bigg[\frac{1}{T}\sum_{s=1}^{T}\left\lVert\widehat{{\mathbf{f}}}_{s}\right\rVert\frac{1}{T}\sum_{s=1}^{T}\Bigg(\frac{\mathbb{E}\left[{\bm{\varepsilon}}^{\prime}_{s}{\bm{\varepsilon}}_{t}\right]}{p}\Bigg)^{2}\Bigg]^{1/2}\leq\mathcal{O}_{P}(K)\max_{s,t}\sqrt{\left\lvert\frac{\mathbb{E}\left[{\bm{\varepsilon}}^{\prime}_{s}{\bm{\varepsilon}}_{t}\right]}{p}\right\rvert}\,\max_{t\leq T}\Bigg[\frac{1}{T}\sum_{s=1}^{T}\left\lvert\frac{\mathbb{E}\left[{\bm{\varepsilon}}^{\prime}_{s}{\bm{\varepsilon}}_{t}\right]}{p}\right\rvert\Bigg]^{1/2}=\mathcal{O}_{P}\Big(K\cdot 1\cdot\frac{1}{\sqrt{T}}\Big)=\mathcal{O}_{P}\Big(\frac{K}{\sqrt{T}}\Big).$$

  (b) Using the Cauchy-Schwarz inequality,

$$\max_{t\leq T}\left\lVert\frac{1}{T}\sum_{s=1}^{T}\widehat{{\mathbf{f}}}^{\prime}_{s}\zeta_{st}\right\rVert\leq\max_{t\leq T}\frac{1}{T}\Bigg(\sum_{s=1}^{T}\left\lVert\widehat{{\mathbf{f}}}_{s}\right\rVert^{2}\sum_{s=1}^{T}\zeta_{st}^{2}\Bigg)^{1/2}\leq\Bigg(\mathcal{O}_{P}(K)\max_{t}\frac{1}{T}\sum_{s=1}^{T}\zeta_{st}^{2}\Bigg)^{1/2}=\mathcal{O}_{P}\Big(\sqrt{K}\,T^{1/4}/\sqrt{p}\Big).$$

To obtain the last inequality we used Assumption (A.5)(b) to get $\mathbb{E}\big[(1/T)\sum_{s=1}^{T}\zeta_{st}^{2}\big]^{2}\leq\max_{s,t\leq T}\mathbb{E}\big[\zeta_{st}^{4}\big]=\mathcal{O}(1/p^{2})$, and then applied the Chebyshev inequality and Bonferroni's method, which yield $\max_{t}(1/T)\sum_{s=1}^{T}\zeta_{st}^{2}=\mathcal{O}_{P}\big(\sqrt{T}/p\big)$.

  (c) Using the definition of $\eta_{st}$ we get

$$\max_{t\leq T}\left\lVert\frac{1}{T}\sum_{s=1}^{T}\widehat{{\mathbf{f}}}^{\prime}_{s}\eta_{st}\right\rVert\leq\left\lVert\frac{1}{T}\sum_{s=1}^{T}\widehat{{\mathbf{f}}}_{s}{\mathbf{f}}^{\prime}_{s}\right\rVert\max_{t}\left\lVert\frac{1}{p}\sum_{i=1}^{p}{\mathbf{b}}_{i}\varepsilon_{it}\right\rVert=\mathcal{O}_{P}\Big(K\cdot T^{1/4}/\sqrt{p}\Big).$$

To obtain the last rate we used Assumption (A.5)(c) together with the Chebyshev inequality and Bonferroni's method to get $\max_{t\leq T}\left\lVert\sum_{i=1}^{p}{\mathbf{b}}_{i}\varepsilon_{it}\right\rVert=\mathcal{O}_{P}\big(T^{1/4}\sqrt{p}\big)$.

  (d) In the proof of Lemma 4 we showed that $\lVert(1/T)\sum_{t=1}^{T}\sum_{i=1}^{p}{\mathbf{b}}_{i}\varepsilon_{it}(1/p)\widehat{{\mathbf{f}}}_{s}\rVert^{2}=\mathcal{O}\big(\sqrt{K/p}\big)$. Furthermore, Assumption (A.3) implies $\mathbb{E}\left[K^{-2}{\mathbf{f}}_{t}\right]^{4}<M$; therefore, $\max_{t\leq T}\left\lVert{\mathbf{f}}_{t}\right\rVert=\mathcal{O}_{P}\big(T^{1/4}\sqrt{K}\big)$. Using these bounds we get

$$\max_{t\leq T}\left\lVert\frac{1}{T}\sum_{s=1}^{T}\widehat{{\mathbf{f}}}^{\prime}_{s}\xi_{st}\right\rVert\leq\max_{t\leq T}\left\lVert{\mathbf{f}}_{t}\right\rVert\cdot\left\lVert\sum_{s=1}^{T}\sum_{i=1}^{p}{\mathbf{b}}_{i}\varepsilon_{it}\frac{1}{p}\widehat{{\mathbf{f}}}_{s}\right\rVert=\mathcal{O}_{P}\Big(T^{1/4}\sqrt{K}\cdot\sqrt{K/p}\Big)=\mathcal{O}_{P}\Big(T^{1/4}K/\sqrt{p}\Big).$$

Lemma 6.
  (a) $\max_{i\leq K}(1/T)\sum_{t=1}^{T}(\widehat{{\mathbf{f}}}_{t}-{\mathbf{H}}{\mathbf{f}}_{t})_{i}^{2}=\mathcal{O}_{P}(1/T+K^{2}/p)$,

  (b) $(1/T)\sum_{t=1}^{T}\lVert\widehat{{\mathbf{f}}}_{t}-{\mathbf{H}}{\mathbf{f}}_{t}\rVert^{2}=\mathcal{O}_{P}(K/T+K^{3}/p)$,

  (c) $\max_{t\leq T}\lVert\widehat{{\mathbf{f}}}_{t}-{\mathbf{H}}{\mathbf{f}}_{t}\rVert=\mathcal{O}_{P}(K/\sqrt{T}+KT^{1/4}/\sqrt{p})$.

Proof.

Similarly to Fan et al. (2013), we prove this lemma conditioning on the event $\hat{K}=K$. Since $\Pr(\hat{K}\neq K)=o(1)$, the unconditional statements follow.

  (a) Using (B.1), for some constant $C>0$,

$$\max_{i\leq K}\frac{1}{T}\sum_{t=1}^{T}(\widehat{{\mathbf{f}}}_{t}-{\mathbf{H}}{\mathbf{f}}_{t})_{i}^{2}\leq C\max_{i\leq K}\frac{1}{T}\sum_{t=1}^{T}\Bigg(\frac{1}{T}\sum_{s=1}^{T}\hat{f}_{is}\frac{\mathbb{E}\left[{\bm{\varepsilon}}^{\prime}_{s}{\bm{\varepsilon}}_{t}\right]}{p}\Bigg)^{2}+C\max_{i\leq K}\frac{1}{T}\sum_{t=1}^{T}\Bigg(\frac{1}{T}\sum_{s=1}^{T}\hat{f}_{is}\zeta_{st}\Bigg)^{2}+C\max_{i\leq K}\frac{1}{T}\sum_{t=1}^{T}\Bigg(\frac{1}{T}\sum_{s=1}^{T}\hat{f}_{is}\eta_{st}\Bigg)^{2}+C\max_{i\leq K}\frac{1}{T}\sum_{t=1}^{T}\Bigg(\frac{1}{T}\sum_{s=1}^{T}\hat{f}_{is}\xi_{st}\Bigg)^{2}=\mathcal{O}_{P}\Bigg(\frac{1}{T}+\frac{1}{p}+\frac{K^{2}}{p}+\frac{K^{2}}{p}\Bigg)=\mathcal{O}_{P}(1/T+K^{2}/p).$$

  (b) Part (b) follows from part (a) and

$$\frac{1}{T}\sum_{t=1}^{T}\lVert\widehat{{\mathbf{f}}}_{t}-{\mathbf{H}}{\mathbf{f}}_{t}\rVert^{2}\leq K\max_{i\leq K}\frac{1}{T}\sum_{t=1}^{T}(\widehat{{\mathbf{f}}}_{t}-{\mathbf{H}}{\mathbf{f}}_{t})_{i}^{2}.$$

  (c) Part (c) is a direct consequence of (B.1) and Lemma 5.

Lemma 7.
  (a) ${\mathbf{H}}{\mathbf{H}}^{\prime}={\mathbf{I}}_{\hat{K}}+\mathcal{O}_{P}(K^{5/2}/\sqrt{T}+K^{5/2}/\sqrt{p})$,

  (b) ${\mathbf{H}}^{\prime}{\mathbf{H}}={\mathbf{I}}_{K}+\mathcal{O}_{P}(K^{5/2}/\sqrt{T}+K^{5/2}/\sqrt{p})$.

Proof.

Similarly to Lemma 6, we first condition on K^=K\hat{K}=K.

  1. (a)

    The key observation here is that, according to the definition of 𝐇{\mathbf{H}}, its rank grows with KK, that is, 𝐇=𝒪P(K)\left\lVert{\mathbf{H}}\right\rVert=\mathcal{O}_{P}(K). Let cov^(𝐇𝐟t)=(1/T)t=1T𝐇𝐟t(𝐇𝐟t)\widehat{\text{cov}}({\mathbf{H}}{\mathbf{f}}_{t})=(1/T)\sum_{t=1}^{T}{\mathbf{H}}{\mathbf{f}}_{t}({\mathbf{H}}{\mathbf{f}}_{t})^{\prime}. Using the triangular inequality we get

    𝐇𝐇𝐈K^F𝐇𝐇cov^(𝐇𝐟t)F+cov^(𝐇𝐟t)𝐈K^F.\left\lVert{\mathbf{H}}{\mathbf{H}}^{\prime}-{\mathbf{I}}_{\hat{K}}\right\rVert_{F}\leq\left\lVert{\mathbf{H}}{\mathbf{H}}^{\prime}-\widehat{\text{cov}}({\mathbf{H}}{\mathbf{f}}_{t})\right\rVert_{F}+\left\lVert\widehat{\text{cov}}({\mathbf{H}}{\mathbf{f}}_{t})-{\mathbf{I}}_{\hat{K}}\right\rVert_{F}. (B.2)

    To bound the first term in (B.2), we use Lemma 1: 𝐇𝐇cov^(𝐇𝐟t)F𝐇2𝐈Kcov^(𝐇𝐟t)F=𝒪P(K5/2/T)\left\lVert{\mathbf{H}}{\mathbf{H}}^{\prime}-\widehat{\text{cov}}({\mathbf{H}}{\mathbf{f}}_{t})\right\rVert_{F}\leq\left\lVert{\mathbf{H}}\right\rVert^{2}\left\lVert{\mathbf{I}}_{K}-\widehat{\text{cov}}({\mathbf{H}}{\mathbf{f}}_{t})\right\rVert_{F}=\mathcal{O}_{P}(K^{5/2}/\sqrt{T}).
    To bound the second term in (B.2), we use the Cauchy-Schwarz inequality and Lemma 6:

    1Tt=1T𝐇𝐟t(𝐇𝐟t)1Tt=1T𝐟^t𝐟^tF1Tt=1T(𝐇𝐟t𝐟^t)(𝐇𝐟t)F+1Tt𝐟^t(𝐟^t(𝐇𝐟t))F\displaystyle\left\lVert\frac{1}{T}\sum_{t=1}^{T}{\mathbf{H}}{\mathbf{f}}_{t}({\mathbf{H}}{\mathbf{f}}_{t})^{\prime}-\frac{1}{T}\sum_{t=1}^{T}\widehat{{\mathbf{f}}}_{t}\widehat{{\mathbf{f}}}^{\prime}_{t}\right\rVert_{F}\leq\left\lVert\frac{1}{T}\sum_{t=1}^{T}({\mathbf{H}}{\mathbf{f}}_{t}-\widehat{{\mathbf{f}}}_{t})({\mathbf{H}}{\mathbf{f}}_{t})^{\prime}\right\rVert_{F}+\left\lVert\frac{1}{T}\sum_{t}\widehat{{\mathbf{f}}}_{t}(\widehat{{\mathbf{f}}}^{\prime}_{t}-({\mathbf{H}}{\mathbf{f}}_{t})^{\prime})\right\rVert_{F}
    (1Tt=1T𝐇𝐟t𝐟^t21Tt=1T𝐇𝐟t2)1/2+(1Tt=1T𝐇𝐟t𝐟^t21Tt=1T𝐟^t2)1/2\displaystyle\leq\Bigg{(}\frac{1}{T}\sum_{t=1}^{T}\left\lVert{\mathbf{H}}{\mathbf{f}}_{t}-\widehat{{\mathbf{f}}}_{t}\right\rVert^{2}\frac{1}{T}\sum_{t=1}^{T}\left\lVert{\mathbf{H}}{\mathbf{f}}_{t}\right\rVert^{2}\Bigg{)}^{1/2}+\Bigg{(}\frac{1}{T}\sum_{t=1}^{T}\left\lVert{\mathbf{H}}{\mathbf{f}}_{t}-\widehat{{\mathbf{f}}}_{t}\right\rVert^{2}\frac{1}{T}\sum_{t=1}^{T}\left\lVert\widehat{{\mathbf{f}}}_{t}\right\rVert^{2}\Bigg{)}^{1/2}
    =𝒪P((KT+K3pK)1/2+(KT+K3pK2)1/2)=𝒪P(K3/2T+K5/2p).\displaystyle=\mathcal{O}_{P}\Bigg{(}\Big{(}\frac{K}{T}+\frac{K^{3}}{p}\cdot K\Big{)}^{1/2}+\Big{(}\frac{K}{T}+\frac{K^{3}}{p}\cdot K^{2}\Big{)}^{1/2}\Bigg{)}=\mathcal{O}_{P}\Bigg{(}\frac{K^{3/2}}{\sqrt{T}}+\frac{K^{5/2}}{\sqrt{p}}\Bigg{)}.
  2. (b)

    The proof of (b) follows from Pr(K^=K)1\Pr(\hat{K}=K)\rightarrow 1 and the arguments made in Fan et al. (2013) (Lemma 11) for fixed KK.

B.2 Proof of Theorem 1

The second part of Theorem 1 was proved in Lemma 6. We now derive the convergence rate in the first part. Using the definition 𝐛^i=(1/T)t=1Trit𝐟^t\widehat{{\mathbf{b}}}_{i}=(1/T)\sum_{t=1}^{T}r_{it}\widehat{{\mathbf{f}}}_{t} and the normalization (1/T)t=1T𝐟^t𝐟^t=𝐈K(1/T)\sum_{t=1}^{T}\widehat{{\mathbf{f}}}_{t}\widehat{{\mathbf{f}}}^{\prime}_{t}={\mathbf{I}}_{K}, we obtain

𝐛^i𝐇𝐛i=1Tt=1T𝐇𝐟tεit+1Tt=1Trit(𝐟^t𝐇𝐟t)+𝐇(1Tt=1T𝐟t𝐟t𝐈K)𝐛i.\widehat{{\mathbf{b}}}_{i}-{\mathbf{H}}{\mathbf{b}}_{i}=\frac{1}{T}\sum_{t=1}^{T}{\mathbf{H}}{\mathbf{f}}_{t}\varepsilon_{it}+\frac{1}{T}\sum_{t=1}^{T}r_{it}(\widehat{{\mathbf{f}}}_{t}-{\mathbf{H}}{\mathbf{f}}_{t})+{\mathbf{H}}\Big{(}\frac{1}{T}\sum_{t=1}^{T}{\mathbf{f}}_{t}{\mathbf{f}}^{\prime}_{t}-{\mathbf{I}}_{K}\Big{)}{\mathbf{b}}_{i}. (B.3)

Let us bound each term on the right-hand side of (B.3). The first term is

maxip1Tt=1T𝐇𝐟tεit𝐇maxik=1K(1Tt=1Tfktεit)2\displaystyle\max_{i\leq p}\left\lVert\frac{1}{T}\sum_{t=1}^{T}{\mathbf{H}}{\mathbf{f}}_{t}\varepsilon_{it}\right\rVert\leq\left\lVert{\mathbf{H}}\right\rVert\max_{i}\sqrt{\sum_{k=1}^{K}\Bigg{(}\frac{1}{T}\sum_{t=1}^{T}f_{kt}\varepsilon_{it}\Bigg{)}^{2}} 𝐇Kmaxip,jK|1Tt=1Tfjtεit|\displaystyle\leq\left\lVert{\mathbf{H}}\right\rVert\sqrt{K}\max_{i\leq p,j\leq K}\left\lvert\frac{1}{T}\sum_{t=1}^{T}f_{jt}\varepsilon_{it}\right\rvert
=𝒪P(KK1/2logp/T),\displaystyle=\mathcal{O}_{P}\Big{(}K\cdot K^{1/2}\cdot\sqrt{\log p/T}\Big{)},

where we used Lemmas 1 and 7 together with Bonferroni’s method. For the second term,

maxi1Tt=1Trit(𝐟^t𝐇𝐟t)maxi(1Tt=1Trit21Tt=1T𝐟^t𝐇𝐟t2)1/2=𝒪P(1T+K2p)1/2,\displaystyle\max_{i}\left\lVert\frac{1}{T}\sum_{t=1}^{T}r_{it}\Big{(}\widehat{{\mathbf{f}}}_{t}-{\mathbf{H}}{\mathbf{f}}_{t}\Big{)}\right\rVert\leq\max_{i}\Bigg{(}\frac{1}{T}\sum_{t=1}^{T}r_{it}^{2}\frac{1}{T}\sum_{t=1}^{T}\left\lVert\widehat{{\mathbf{f}}}_{t}-{\mathbf{H}}{\mathbf{f}}_{t}\right\rVert^{2}\Bigg{)}^{1/2}=\mathcal{O}_{P}\Bigg{(}\frac{1}{T}+\frac{K^{2}}{p}\Bigg{)}^{1/2},

where we used Lemma 6 and the fact that maxiT1t=1Trit2=𝒪P(1)\max_{i}T^{-1}\sum_{t=1}^{T}r_{it}^{2}=\mathcal{O}_{P}(1) since 𝔼[rit2]=𝒪(1)\mathbb{E}_{\,\!\!}\left[r_{it}^{2}\right]=\mathcal{O}(1).
Finally, the third term is 𝒪P(K2T1/2)\mathcal{O}_{P}(K^{2}T^{-1/2}) since (1/T)t=1T𝐟t𝐟t𝐈K=𝒪P(KT1/2)\lVert(1/T)\sum_{t=1}^{T}{\mathbf{f}}_{t}{\mathbf{f}}^{\prime}_{t}-{\mathbf{I}}_{K}\rVert=\mathcal{O}_{P}\Big{(}KT^{-1/2}\Big{)}, 𝐇=𝒪P(K)\left\lVert{\mathbf{H}}\right\rVert=\mathcal{O}_{P}(K) and maxi𝐛i=𝒪(1)\max_{i}\left\lVert{\mathbf{b}}_{i}\right\rVert=\mathcal{O}(1) by Assumption (B.1).

B.3 Corollary 1

As a consequence of Theorem 1, we get the following corollary:

Corollary 1.

Under the assumptions of Theorem 1,

maxip,tT𝐛^i𝐟^t𝐛i𝐟t=𝒪P(logT1/r2K2logp/T+K2T1/4/p).\max_{i\leq p,t\leq T}\left\lVert\widehat{{\mathbf{b}}}_{i}^{{}^{\prime}}\widehat{{\mathbf{f}}}_{t}-{\mathbf{b}}^{\prime}_{i}{\mathbf{f}}_{t}\right\rVert=\mathcal{O}_{P}(\log T^{1/r_{2}}K^{2}\sqrt{\log p/T}+K^{2}T^{1/4}/\sqrt{p}).
Proof.

Using Assumption (A.4) and Bonferroni’s method, we have maxtT𝐟t=𝒪P(KlogT1/r2)\max_{t\leq T}\lVert{\mathbf{f}}_{t}\rVert=\mathcal{O}_{P}(\sqrt{K}\log T^{1/r_{2}}). By Theorem 1, uniformly in ii and tt:

𝐛^i𝐟^t𝐛i𝐟t\displaystyle\left\lVert\widehat{{\mathbf{b}}}^{\prime}_{i}\widehat{{\mathbf{f}}}_{t}-{\mathbf{b}}^{\prime}_{i}{\mathbf{f}}_{t}\right\rVert 𝐛^i𝐇𝐛i𝐟^t𝐇𝐟t+𝐇𝐛i𝐟^t𝐇𝐟t\displaystyle\leq\left\lVert\widehat{{\mathbf{b}}}_{i}-{\mathbf{H}}{\mathbf{b}}_{i}\right\rVert\left\lVert\widehat{{\mathbf{f}}}_{t}-{\mathbf{H}}{\mathbf{f}}_{t}\right\rVert+\left\lVert{\mathbf{H}}{\mathbf{b}}_{i}\right\rVert\left\lVert\widehat{{\mathbf{f}}}_{t}-{\mathbf{H}}{\mathbf{f}}_{t}\right\rVert
+𝐛^i𝐇𝐛i𝐇𝐟t+𝐛i𝐟t𝐇𝐇𝐈K\displaystyle+\left\lVert\widehat{{\mathbf{b}}}_{i}-{\mathbf{H}}{\mathbf{b}}_{i}\right\rVert\left\lVert{\mathbf{H}}{\mathbf{f}}_{t}\right\rVert+\left\lVert{\mathbf{b}}_{i}\right\rVert\left\lVert{\mathbf{f}}_{t}\right\rVert\left\lVert{\mathbf{H}}^{\prime}{\mathbf{H}}-{\mathbf{I}}_{K}\right\rVert
=𝒪P((K3/2logpT+Kp)(KT+KT1/4p))+𝒪P(K(KT+KT1/4p))\displaystyle=\mathcal{O}_{P}\Bigg{(}\Big{(}K^{3/2}\sqrt{\frac{\log p}{T}}+\frac{K}{\sqrt{p}}\Big{)}\cdot\Big{(}\frac{K}{\sqrt{T}}+\frac{KT^{1/4}}{\sqrt{p}}\Big{)}\Bigg{)}+\mathcal{O}_{P}\Bigg{(}K\cdot\Big{(}\frac{K}{\sqrt{T}}+\frac{KT^{1/4}}{\sqrt{p}}\Big{)}\Bigg{)}
+𝒪P((K3/2logpT+Kp)logT1/r2K1/2)+𝒪P(logT1/r2K1/2(K5/2T+K5/2p))\displaystyle+\mathcal{O}_{P}\Bigg{(}\Big{(}K^{3/2}\sqrt{\frac{\log p}{T}}+\frac{K}{\sqrt{p}}\Big{)}\cdot\log T^{1/r_{2}}K^{1/2}\Bigg{)}+\mathcal{O}_{P}\Bigg{(}\log T^{1/r_{2}}K^{1/2}\Big{(}\frac{K^{5/2}}{\sqrt{T}}+\frac{K^{5/2}}{\sqrt{p}}\Big{)}\Bigg{)}
=𝒪P(logT1/r2K2logp/T+K2T1/4/p).\displaystyle=\mathcal{O}_{P}\Big{(}\log T^{1/r_{2}}K^{2}\sqrt{\log p/T}+K^{2}T^{1/4}/\sqrt{p}\Big{)}.

B.4 Proof of Theorem 2

Using the definition of the idiosyncratic components we have εitε^it=𝐛i𝐇(𝐟^t𝐇𝐟t)+(𝐛^i𝐛i𝐇)𝐟^t+𝐛i(𝐇𝐇𝐈K)𝐟t\varepsilon_{it}-\hat{\varepsilon}_{it}={\mathbf{b}}^{\prime}_{i}{\mathbf{H}}^{\prime}(\widehat{{\mathbf{f}}}_{t}-{\mathbf{H}}{\mathbf{f}}_{t})+(\widehat{{\mathbf{b}}}^{\prime}_{i}-{\mathbf{b}}^{\prime}_{i}{\mathbf{H}}^{\prime})\widehat{{\mathbf{f}}}_{t}+{\mathbf{b}}^{\prime}_{i}({\mathbf{H}}^{\prime}{\mathbf{H}}-{\mathbf{I}}_{K}){\mathbf{f}}_{t}. We bound the maximum element-wise difference as follows:

maxip1Tt=1T(εitε^it)2\displaystyle\max_{i\leq p}\frac{1}{T}\sum_{t=1}^{T}(\varepsilon_{it}-\hat{\varepsilon}_{it})^{2} 4maxi𝐛i𝐇21Tt=1T𝐟^t𝐇𝐟t2+4maxi𝐛^i𝐛i𝐇21Tt=1T𝐟^t2\displaystyle\leq 4\max_{i}\left\lVert{\mathbf{b}}^{\prime}_{i}{\mathbf{H}}^{\prime}\right\rVert^{2}\frac{1}{T}\sum_{t=1}^{T}\left\lVert\widehat{{\mathbf{f}}}_{t}-{\mathbf{H}}{\mathbf{f}}_{t}\right\rVert^{2}+4\max_{i}\left\lVert\widehat{{\mathbf{b}}}^{\prime}_{i}-{\mathbf{b}}^{\prime}_{i}{\mathbf{H}}^{\prime}\right\rVert^{2}\frac{1}{T}\sum_{t=1}^{T}\left\lVert\widehat{{\mathbf{f}}}_{t}\right\rVert^{2}
+4maxi𝐛i1Tt=1T𝐟t2𝐇𝐇𝐈KF2\displaystyle+4\max_{i}\left\lVert{\mathbf{b}}^{\prime}_{i}\right\rVert\frac{1}{T}\sum_{t=1}^{T}\left\lVert{\mathbf{f}}_{t}\right\rVert^{2}\left\lVert{\mathbf{H}}^{\prime}{\mathbf{H}}-{\mathbf{I}}_{K}\right\rVert_{F}^{2}
=𝒪(K2(KT+K3p))+𝒪((K3logpT+K2p)K)+𝒪(K(K5T+K5p))\displaystyle=\mathcal{O}\Bigg{(}K^{2}\cdot\Big{(}\frac{K}{T}+\frac{K^{3}}{p}\Big{)}\Bigg{)}+\mathcal{O}\Bigg{(}\Big{(}\frac{K^{3}\log p}{T}+\frac{K^{2}}{p}\Big{)}\cdot K\Bigg{)}+\mathcal{O}\Bigg{(}K\cdot\Big{(}\frac{K^{5}}{T}+\frac{K^{5}}{p}\Big{)}\Bigg{)}
=𝒪(K4logpT+K6p).\displaystyle=\mathcal{O}\Bigg{(}\frac{K^{4}\log p}{T}+\frac{K^{6}}{p}\Bigg{)}.

Let ω3TK2logp/T+K3/p\omega_{3T}\equiv K^{2}\sqrt{\log p/T}+K^{3}/\sqrt{p}, so that maxip(1/T)t=1T(εitε^it)2=𝒪P(ω3T2)\max_{i\leq p}(1/T)\sum_{t=1}^{T}(\varepsilon_{it}-\hat{\varepsilon}_{it})^{2}=\mathcal{O}_{P}(\omega_{3T}^{2}). Then, maxi,t|εitε^it|=𝒪P(ω3T)=oP(1)\max_{i,t}\left\lvert\varepsilon_{it}-\hat{\varepsilon}_{it}\right\rvert=\mathcal{O}_{P}(\omega_{3T})=o_{P}(1), where the last equality is implied by Corollary 1.
As pointed out in the main text, the second part of Theorem 2 is based on the relationship between the convergence rates of the estimated covariance and precision matrices established in Janková and van de Geer (2018) (Theorem 14.1.3).

B.5 Lemmas for Theorem 3

Lemma 8.

Under the assumptions of Theorem 1, we have the following results:

  1. (a)

    𝐁=𝐁𝐇=𝒪(p)\left\lVert{\mathbf{B}}\right\rVert=\left\lVert{\mathbf{B}}{\mathbf{H}}^{\prime}\right\rVert=\mathcal{O}(\sqrt{p}).

  2. (b)

    ϱT1max1ip𝐛^i𝐇𝐛i=oP(1/K)\varrho_{T}^{-1}\max_{1\leq i\leq p}\left\lVert\widehat{{\mathbf{b}}}_{i}-{\mathbf{H}}^{\prime}{\mathbf{b}}_{i}\right\rVert=o_{P}(1/\sqrt{K}) and max1ip𝐛^i=𝒪P(K)\max_{1\leq i\leq p}\left\lVert\widehat{{\mathbf{b}}}_{i}\right\rVert=\mathcal{O}_{P}(\sqrt{K}).

  3. (c)

    ϱT1𝐁^𝐁𝐇=oP(p/K)\varrho_{T}^{-1}\left\lVert\widehat{{\mathbf{B}}}-{\mathbf{B}}{\mathbf{H}}^{\prime}\right\rVert=o_{P}\Big{(}\sqrt{p/K}\Big{)} and 𝐁^=𝒪P(p)\left\lVert\widehat{{\mathbf{B}}}\right\rVert=\mathcal{O}_{P}(\sqrt{p}).

Proof.

Part (c) is a direct consequence of Parts (a)-(b); therefore, we only prove the first two parts in what follows.

  1. (a)

    Part (a) easily follows from (B.1): since tr(𝚺𝐁𝐁)=tr(𝚺)𝐁20\text{tr}({\bm{\Sigma}}-{\mathbf{B}}{\mathbf{B}}^{\prime})=\text{tr}({\bm{\Sigma}})-\left\lVert{\mathbf{B}}\right\rVert^{2}\geq 0 and tr(𝚺)=𝒪(p)\text{tr}({\bm{\Sigma}})=\mathcal{O}(p) by (B.1), we get 𝐁2=𝒪(p)\left\lVert{\mathbf{B}}\right\rVert^{2}=\mathcal{O}(p). The same bound applies to 𝐁𝐇\left\lVert{\mathbf{B}}{\mathbf{H}}^{\prime}\right\rVert because the linear space spanned by the rows of 𝐁{\mathbf{B}} is the same as that spanned by the rows of 𝐁𝐇{\mathbf{B}}{\mathbf{H}}^{\prime}, hence it does not matter which one is used.

  2. (b)

    From Theorem 1, we have maxip𝐛^i𝐇𝐛i=𝒪P(ω1T)\max_{i\leq p}\left\lVert\widehat{{\mathbf{b}}}_{i}-{\mathbf{H}}{\mathbf{b}}_{i}\right\rVert=\mathcal{O}_{P}(\omega_{1T}). Using the definition of ϱT\varrho_{T} from Theorem 2, it follows that ϱT1ω1T=oP(ω1Tω3T1)\varrho_{T}^{-1}\omega_{1T}=o_{P}(\omega_{1T}\omega_{3T}^{-1}). Let z~Tω1Tω3T1\widetilde{z}_{T}\equiv\omega_{1T}\omega_{3T}^{-1}. Consider
    ϱT1max1ip𝐛^i𝐇𝐛i=oP(zT)\varrho_{T}^{-1}\max_{1\leq i\leq p}\left\lVert\widehat{{\mathbf{b}}}_{i}-{\mathbf{H}}{\mathbf{b}}_{i}\right\rVert=o_{P}(z_{T}). The latter holds for any zTz~Tz_{T}\geq\widetilde{z}_{T}, with the tightest bound obtained when zT=z~Tz_{T}=\widetilde{z}_{T}. For ease of exposition, we use zT=1/Kz_{T}=1/\sqrt{K} instead of z~T\widetilde{z}_{T}.
    The second result in Part (b) is obtained using the fact that max1ip𝐛^iK𝐁max\max_{1\leq i\leq p}\left\lVert\widehat{{\mathbf{b}}}_{i}\right\rVert\leq\sqrt{K}\left\lVert{\mathbf{B}}\right\rVert_{\text{max}}, where 𝐁max=𝒪(1)\left\lVert{\mathbf{B}}\right\rVert_{\text{max}}=\mathcal{O}(1) by (B.1).

Lemma 9.

Let 𝚷[𝚯f+(𝐁𝐇)𝚯ε(𝐁𝐇)]1{\bm{\Pi}}\equiv\Big{[}{\bm{\Theta}}_{f}+({\mathbf{B}}{\mathbf{H}}^{\prime})^{\prime}{\bm{\Theta}}_{\varepsilon}({\mathbf{B}}{\mathbf{H}}^{\prime})\Big{]}^{-1}, 𝚷^[𝚯^f+𝐁^𝚯^ε𝐁^]1\widehat{{\bm{\Pi}}}\equiv\Big{[}\widehat{{\bm{\Theta}}}_{f}+\widehat{{\mathbf{B}}}^{\prime}\widehat{{\bm{\Theta}}}_{\varepsilon}\widehat{{\mathbf{B}}}\Big{]}^{-1}. Also, define 𝚺f=(1/T)t=1T𝐇𝐟t(𝐇𝐟t){\bm{\Sigma}}_{f}=(1/T)\sum_{t=1}^{T}{\mathbf{H}}{\mathbf{f}}_{t}({\mathbf{H}}{\mathbf{f}}_{t})^{\prime}, 𝚯f=𝚺f1{\bm{\Theta}}_{f}={\bm{\Sigma}}_{f}^{-1}, 𝚺^f(1/T)t=1T𝐟^t𝐟^t\widehat{{\bm{\Sigma}}}_{f}\equiv(1/T)\sum_{t=1}^{T}\widehat{{\mathbf{f}}}_{t}\widehat{{\mathbf{f}}}^{\prime}_{t}, and 𝚯^f=𝚺^f1\widehat{{\bm{\Theta}}}_{f}=\widehat{{\bm{\Sigma}}}_{f}^{-1}. Under the assumptions of Theorem 2, we have the following results:

  1. (a)

    Λmin(𝐁𝐁)1=𝒪(1/p)\Lambda_{\text{min}}({\mathbf{B}}^{\prime}{\mathbf{B}})^{-1}=\mathcal{O}(1/p).

  2. (b)

    |𝚷|2=𝒪(1/p){\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Pi}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}=\mathcal{O}(1/p).

  3. (c)

    ϱT1|𝚯^f𝚯f|2=oP(1/K)\varrho_{T}^{-1}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\widehat{{\bm{\Theta}}}_{f}-{\bm{\Theta}}_{f}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}=o_{P}\Big{(}1/\sqrt{K}\Big{)}.

  4. (d)

    ϱT1|𝚷^𝚷|2=𝒪P(sT/p)\varrho_{T}^{-1}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\widehat{{\bm{\Pi}}}-{\bm{\Pi}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}=\mathcal{O}_{P}\Big{(}s_{T}/p\Big{)} and |𝚷^|2=𝒪P(1/p){\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\widehat{{\bm{\Pi}}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}=\mathcal{O}_{P}(1/p).

Proof.
  1. (a)

    Using Assumption (A.2) and Weyl's inequality, we have |Λmin(p1𝐁𝐁)Λmin(𝐁˘)||p1𝐁𝐁𝐁˘|2\left\lvert\Lambda_{\text{min}}(p^{-1}{\mathbf{B}}^{\prime}{\mathbf{B}})-\Lambda_{\text{min}}(\breve{{\mathbf{B}}})\right\rvert\leq{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|p^{-1}{\mathbf{B}}^{\prime}{\mathbf{B}}-\breve{{\mathbf{B}}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}, which implies Part (a).

  2. (b)

    First, notice that |𝚷|2=Λmin(𝚯f+(𝐁𝐇)𝚯ε(𝐁𝐇))1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Pi}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}=\Lambda_{\text{min}}({\bm{\Theta}}_{f}+({\mathbf{B}}{\mathbf{H}}^{\prime})^{\prime}{\bm{\Theta}}_{\varepsilon}({\mathbf{B}}{\mathbf{H}}^{\prime}))^{-1}. Therefore, we get

    |𝚷|2Λmin((𝐁𝐇)𝚯ε(𝐁𝐇))1Λmin(𝐁𝐁)1Λmin(𝚯ε)1=Λmin(𝐁𝐁)1Λmax(𝚺ε),{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Pi}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}\leq\Lambda_{\text{min}}(({\mathbf{B}}{\mathbf{H}}^{\prime})^{\prime}{\bm{\Theta}}_{\varepsilon}({\mathbf{B}}{\mathbf{H}}^{\prime}))^{-1}\leq\Lambda_{\text{min}}({\mathbf{B}}^{\prime}{\mathbf{B}})^{-1}\Lambda_{\text{min}}({\bm{\Theta}}_{\varepsilon})^{-1}=\Lambda_{\text{min}}({\mathbf{B}}^{\prime}{\mathbf{B}})^{-1}\Lambda_{\text{max}}({\bm{\Sigma}}_{\varepsilon}),

    where the second inequality uses the fact that the linear space spanned by the rows of 𝐁{\mathbf{B}} is the same as that spanned by the rows of 𝐁𝐇{\mathbf{B}}{\mathbf{H}}^{\prime}, so the two can be used interchangeably. Therefore, the result in Part (b) follows from Part (a) and Assumptions (A.1) and (A.2).

  3. (c)

    From Lemma 7 we obtained:

    1Tt=1T𝐇𝐟t(𝐇𝐟t)1Tt=1T𝐟^t𝐟^tF=𝒪P(K3/2T+K5/2p).\displaystyle\left\lVert\frac{1}{T}\sum_{t=1}^{T}{\mathbf{H}}{\mathbf{f}}_{t}({\mathbf{H}}{\mathbf{f}}_{t})^{\prime}-\frac{1}{T}\sum_{t=1}^{T}\widehat{{\mathbf{f}}}_{t}\widehat{{\mathbf{f}}}^{\prime}_{t}\right\rVert_{F}=\mathcal{O}_{P}\Bigg{(}\frac{K^{3/2}}{\sqrt{T}}+\frac{K^{5/2}}{\sqrt{p}}\Bigg{)}.

    Since |𝚯f(𝚺^f𝚺f)|2<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Theta}}_{f}(\widehat{{\bm{\Sigma}}}_{f}-{\bm{\Sigma}}_{f})\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}<1, we have

    |𝚯^f𝚯f|2|𝚯f|2|𝚯f(𝚺^f𝚺f)|21|𝚯f(𝚺^f𝚺f)|2=𝒪P(K3/2T+K5/2p).\displaystyle{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\widehat{{\bm{\Theta}}}_{f}-{\bm{\Theta}}_{f}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}\leq\frac{{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Theta}}_{f}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Theta}}_{f}(\widehat{{\bm{\Sigma}}}_{f}-{\bm{\Sigma}}_{f})\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}}{1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Theta}}_{f}(\widehat{{\bm{\Sigma}}}_{f}-{\bm{\Sigma}}_{f})\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}}=\mathcal{O}_{P}\Bigg{(}\frac{K^{3/2}}{\sqrt{T}}+\frac{K^{5/2}}{\sqrt{p}}\Bigg{)}.

    Let ω4T=K3/2/T+K5/2/p\omega_{4T}=K^{3/2}/\sqrt{T}+K^{5/2}/\sqrt{p}. Using the definition of ϱT\varrho_{T} from Theorem 2, it follows that ϱT1ω4T=oP(ω4Tω3T1)\varrho_{T}^{-1}\omega_{4T}=o_{P}(\omega_{4T}\omega_{3T}^{-1}). Let γ~Tω4Tω3T1\widetilde{\gamma}_{T}\equiv\omega_{4T}\omega_{3T}^{-1}. Consider ϱT1|𝚯^f𝚯f|2=oP(γT)\varrho_{T}^{-1}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\widehat{{\bm{\Theta}}}_{f}-{\bm{\Theta}}_{f}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}=o_{P}(\gamma_{T}). The latter holds for any γTγ~T\gamma_{T}\geq\widetilde{\gamma}_{T}, with the tightest bound obtained when γT=γ~T\gamma_{T}=\widetilde{\gamma}_{T}. For ease of exposition, we use γT=1/K\gamma_{T}=1/\sqrt{K} instead of γ~T\widetilde{\gamma}_{T}.

  4. (d)

    We will bound each term in the definition of 𝚷^𝚷\widehat{{\bm{\Pi}}}-{\bm{\Pi}}. First, we have

    |𝐁^𝚯^ε𝐁^(𝐁𝐇)𝚯ε(𝐁𝐇)|2\displaystyle{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\widehat{{\mathbf{B}}}^{\prime}\widehat{{\bm{\Theta}}}_{\varepsilon}\widehat{{\mathbf{B}}}-({\mathbf{B}}{\mathbf{H}}^{\prime})^{\prime}{\bm{\Theta}}_{\varepsilon}({\mathbf{B}}{\mathbf{H}}^{\prime})\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2} |𝐁^𝐁𝐇|2|𝚯^ε|2|𝐁^|2+|𝐁𝐇|2|𝚯^ε𝚯ε|2|𝐁^|2\displaystyle\leq{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\widehat{{\mathbf{B}}}-{\mathbf{B}}{\mathbf{H}}^{\prime}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\widehat{{\bm{\Theta}}}_{\varepsilon}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\widehat{{\mathbf{B}}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\mathbf{B}}{\mathbf{H}}^{\prime}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\widehat{{\bm{\Theta}}}_{\varepsilon}-{\bm{\Theta}}_{\varepsilon}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\widehat{{\mathbf{B}}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}
    +|𝐁𝐇|2|𝚯ε|2|𝐁^𝐁𝐇|2=𝒪P(psTϱT).\displaystyle+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\mathbf{B}}{\mathbf{H}}^{\prime}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Theta}}_{\varepsilon}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\widehat{{\mathbf{B}}}-{\mathbf{B}}{\mathbf{H}}^{\prime}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}=\mathcal{O}_{P}\Bigg{(}p\cdot s_{T}\cdot\varrho_{T}\Bigg{)}. (B.4)

    Now we combine (B.4) with the results from Parts (b)-(c):

    ϱT1|𝚷(𝚷^1𝚷1)|2=𝒪P(sT).\displaystyle\varrho_{T}^{-1}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Pi}}\Big{(}\widehat{{\bm{\Pi}}}^{-1}-{\bm{\Pi}}^{-1}\Big{)}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}=\mathcal{O}_{P}\Big{(}s_{T}\Big{)}.

    Finally, since |𝚷(𝚷^1𝚷1)|2<1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Pi}}\Big{(}\widehat{{\bm{\Pi}}}^{-1}-{\bm{\Pi}}^{-1}\Big{)}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}<1, we have

    ϱT1|𝚷^𝚷|2ϱT1|𝚷|2|𝚷(𝚷^1𝚷1)|21|𝚷(𝚷^1𝚷1)|2=𝒪P(sTp).\displaystyle\varrho_{T}^{-1}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\widehat{{\bm{\Pi}}}-{\bm{\Pi}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}\leq\varrho_{T}^{-1}\frac{{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Pi}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Pi}}\Big{(}\widehat{{\bm{\Pi}}}^{-1}-{\bm{\Pi}}^{-1}\Big{)}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}}{1-{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Pi}}\Big{(}\widehat{{\bm{\Pi}}}^{-1}-{\bm{\Pi}}^{-1}\Big{)}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}}=\mathcal{O}_{P}\Bigg{(}\frac{s_{T}}{p}\Bigg{)}.

B.6 Proof of Theorem 3

Using the Sherman-Morrison-Woodbury formula, we have

|𝚯^𝚯|l\displaystyle{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\widehat{{\bm{\Theta}}}-{\bm{\Theta}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{l} |𝚯^ε𝚯ε|l+|(𝚯^ε𝚯ε)𝐁^𝚷^𝐁^𝚯^ε|l+|𝚯ε(𝐁^𝐁𝐇)𝚷^𝐁^𝚯^ε|l\displaystyle\leq{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\widehat{{\bm{\Theta}}}_{\varepsilon}-{\bm{\Theta}}_{\varepsilon}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{l}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|(\widehat{{\bm{\Theta}}}_{\varepsilon}-{\bm{\Theta}}_{\varepsilon})\widehat{{\mathbf{B}}}\widehat{{\bm{\Pi}}}\widehat{{\mathbf{B}}}^{\prime}\widehat{{\bm{\Theta}}}_{\varepsilon}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{l}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Theta}}_{\varepsilon}(\widehat{{\mathbf{B}}}-{\mathbf{B}}{\mathbf{H}}^{\prime})\widehat{{\bm{\Pi}}}\widehat{{\mathbf{B}}}^{\prime}\widehat{{\bm{\Theta}}}_{\varepsilon}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{l}
+|𝚯ε𝐁𝐇(𝚷^𝚷)𝐁^𝚯^ε|l+|𝚯ε𝐁𝐇𝚷(𝐁^𝐁)𝚯^ε|l+|𝚯ε𝐁𝐇𝚷(𝐁𝐇)(𝚯^ε𝚯ε)|l\displaystyle+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Theta}}_{\varepsilon}{\mathbf{B}}{\mathbf{H}}^{\prime}(\widehat{{\bm{\Pi}}}-{\bm{\Pi}})\widehat{{\mathbf{B}}}^{\prime}\widehat{{\bm{\Theta}}}_{\varepsilon}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{l}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Theta}}_{\varepsilon}{\mathbf{B}}{\mathbf{H}}^{\prime}{\bm{\Pi}}(\widehat{{\mathbf{B}}}-{\mathbf{B}})^{\prime}\widehat{{\bm{\Theta}}}_{\varepsilon}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{l}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Theta}}_{\varepsilon}{\mathbf{B}}{\mathbf{H}}^{\prime}{\bm{\Pi}}({\mathbf{B}}{\mathbf{H}}^{\prime})^{\prime}(\widehat{{\bm{\Theta}}}_{\varepsilon}-{\bm{\Theta}}_{\varepsilon})\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{l}
=Δ1+Δ2+Δ3+Δ4+Δ5+Δ6.\displaystyle=\Delta_{1}+\Delta_{2}+\Delta_{3}+\Delta_{4}+\Delta_{5}+\Delta_{6}. (B.5)

We now bound the terms in (B.5) for l=2l=2 and l=l=\infty. We start with l=2l=2. First, note that ϱT1Δ1=𝒪P(sT)\varrho_{T}^{-1}\Delta_{1}=\mathcal{O}_{P}(s_{T}) by Theorem 2. Second, using Lemmas 8-9 together with Theorem 2, we have ϱT1(Δ2+Δ6)=𝒪P(sTp(1/p)p1)=𝒪P(sT)\varrho_{T}^{-1}(\Delta_{2}+\Delta_{6})=\mathcal{O}_{P}(s_{T}\cdot\sqrt{p}\cdot(1/p)\cdot\sqrt{p}\cdot 1)=\mathcal{O}_{P}(s_{T}). Third, ϱT1(Δ3+Δ5)\varrho_{T}^{-1}(\Delta_{3}+\Delta_{5}) is negligible according to Lemma 8(c). Finally, ϱT1Δ4=𝒪P(1p(sT/p)p1)=𝒪P(sT)\varrho_{T}^{-1}\Delta_{4}=\mathcal{O}_{P}\Big{(}1\cdot\sqrt{p}\cdot(s_{T}/p)\cdot\sqrt{p}\cdot 1\Big{)}=\mathcal{O}_{P}(s_{T}) by Lemmas 8-9 and Theorem 2.
Now consider l=l=\infty. First, similarly to the previous case, ϱT1Δ1=𝒪P(sT)\varrho_{T}^{-1}\Delta_{1}=\mathcal{O}_{P}(s_{T}). Second, ϱT1(Δ2+Δ6)=𝒪P(sTpK(K/p)pKdT)=𝒪P(sTK3/2dT)\varrho_{T}^{-1}(\Delta_{2}+\Delta_{6})=\mathcal{O}_{P}\Big{(}s_{T}\cdot\sqrt{pK}\cdot(\sqrt{K}/p)\cdot\sqrt{pK}\cdot\sqrt{d_{T}}\Big{)}=\mathcal{O}_{P}(s_{T}K^{3/2}\sqrt{d_{T}}), where we used the fact that for any 𝐀𝒮p{\mathbf{A}}\in\mathcal{S}_{p} we have |𝐀|1=|𝐀|d(𝐀)|𝐀|2{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\mathbf{A}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{1}={\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\mathbf{A}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{\infty}\leq\sqrt{d({\mathbf{A}})}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\mathbf{A}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}, where d(𝐀)d({\mathbf{A}}) measures the maximum vertex degree as described at the beginning of Section 4. Third, the term ϱT1(Δ3+Δ5)\varrho_{T}^{-1}(\Delta_{3}+\Delta_{5}) is negligible according to Lemma 8(c). Finally, ϱT1Δ4=𝒪P(dTpKK(sT)/ppKdT)=𝒪P(dTK3/2sT)\varrho_{T}^{-1}\Delta_{4}=\mathcal{O}_{P}(\sqrt{d_{T}}\cdot\sqrt{pK}\cdot\sqrt{K}(s_{T})/p\cdot\sqrt{pK}\cdot\sqrt{d_{T}})=\mathcal{O}_{P}(d_{T}K^{3/2}s_{T}).

B.7 Lemmas for Theorem 4

Lemma 10.

Under the assumptions of Theorem 4, |𝚯|1=𝒪(dTK3/2){\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Theta}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{1}=\mathcal{O}(d_{T}K^{3/2}), where dTd_{T} was defined in Section 4.

Proof.

We use the Sherman-Morrison-Woodbury formula:

|𝚯|1\displaystyle{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Theta}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{1} |𝚯ε|1+|𝚯ε𝐁[𝚯f+𝐁𝚯ε𝐁]1𝐁𝚯ε|1\displaystyle\leq{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Theta}}_{\varepsilon}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{1}+{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Theta}}_{\varepsilon}{\mathbf{B}}[{\bm{\Theta}}_{f}+{\mathbf{B}}^{\prime}{\bm{\Theta}}_{\varepsilon}{\mathbf{B}}]^{-1}{\mathbf{B}}^{\prime}{\bm{\Theta}}_{\varepsilon}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{1}
=𝒪(dT)+𝒪(dTpKpKdT)=𝒪(dTK3/2).\displaystyle=\mathcal{O}(\sqrt{d_{T}})+\mathcal{O}\Big{(}\sqrt{d_{T}}\cdot p\cdot\frac{\sqrt{K}}{p}\cdot K\cdot\sqrt{d_{T}}\Big{)}=\mathcal{O}(d_{T}K^{3/2}). (B.6)

The last equality in (B.6) is obtained under the assumptions of Theorem 4. This result is important: it shows that the sparsity of the precision matrix of stock returns is controlled by the sparsity of the precision matrix of the idiosyncratic returns. Hence, one does not need to impose an unrealistic sparsity assumption on the precision matrix of returns a priori when the latter follow a factor structure; it suffices that the precision matrix is sparse once the common movements have been taken into account. ∎
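
To make the role of the Sherman-Morrison-Woodbury decomposition concrete, the following sketch (our numerical check, not part of the formal argument; dimensions and parameter values are illustrative) verifies that 𝚯 = 𝚯_ε − 𝚯_ε𝐁[𝚯_f + 𝐁′𝚯_ε𝐁]^{-1}𝐁′𝚯_ε equals the inverse of 𝚺 = 𝐁𝚺_f𝐁′ + 𝚺_ε for a small simulated factor model.

```python
import numpy as np

rng = np.random.default_rng(0)
p, K = 50, 3                          # small, illustrative dimensions

B = rng.standard_normal((p, K))       # factor loadings
Sigma_f = np.eye(K)                   # factor covariance
# Toeplitz idiosyncratic covariance with a sparse inverse
rho = 0.5
Sigma_eps = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))

Sigma = B @ Sigma_f @ B.T + Sigma_eps         # factor-structured covariance
Theta_eps = np.linalg.inv(Sigma_eps)
Theta_f = np.linalg.inv(Sigma_f)

# Sherman-Morrison-Woodbury: Theta = Theta_eps - Theta_eps B (Theta_f + B' Theta_eps B)^{-1} B' Theta_eps
middle = np.linalg.inv(Theta_f + B.T @ Theta_eps @ B)
Theta_smw = Theta_eps - Theta_eps @ B @ middle @ B.T @ Theta_eps

# agrees with the direct inverse up to numerical error
print(np.max(np.abs(Theta_smw - np.linalg.inv(Sigma))))
```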

Lemma 11.

Define a=𝛊p𝚯𝛊p/pa={\bm{\iota}}^{\prime}_{p}{\bm{\Theta}}{\bm{\iota}}_{p}/p, b=𝛊p𝚯𝐦/pb={\bm{\iota}}^{\prime}_{p}{\bm{\Theta}}{\mathbf{m}}/p, d=𝐦𝚯𝐦/pd={\mathbf{m}}^{\prime}{\bm{\Theta}}{\mathbf{m}}/p, g=𝐦𝚯𝐦/pg=\sqrt{{\mathbf{m}}^{\prime}{\bm{\Theta}}{\mathbf{m}}}/p and a^=𝛊p𝚯^𝛊p/p\widehat{a}={\bm{\iota}}^{\prime}_{p}\widehat{{\bm{\Theta}}}{\bm{\iota}}_{p}/p, b^=𝛊p𝚯^𝐦^/p\widehat{b}={\bm{\iota}}^{\prime}_{p}\widehat{{\bm{\Theta}}}\widehat{{\mathbf{m}}}/p, d^=𝐦^𝚯^𝐦^/p\widehat{d}=\widehat{{\mathbf{m}}}^{\prime}\widehat{{\bm{\Theta}}}\widehat{{\mathbf{m}}}/p, g^=𝐦^𝚯^𝐦^/p\widehat{g}=\sqrt{\widehat{{\mathbf{m}}}^{\prime}\widehat{{\bm{\Theta}}}\widehat{{\mathbf{m}}}}/p . Under the assumptions of Theorem 4 and assuming (adb2)>0(ad-b^{2})>0,

  1. (a)

    aC0>0a\geq C_{0}>0, b=𝒪(1)b=\mathcal{O}(1), d=𝒪(1)d=\mathcal{O}(1), where C0C_{0} is a positive constant that bounds the minimum eigenvalue of 𝚯{\bm{\Theta}} from below.

  2. (b)

    |a^a|=𝒪P(ϱTdTK3/2sT)=oP(1)\left\lvert\widehat{a}-a\right\rvert=\mathcal{O}_{P}(\varrho_{T}d_{T}K^{3/2}s_{T})=o_{P}(1).

  3. (c)

    |b^b|=𝒪P(ϱTdTK3/2sT)=oP(1)\left\lvert\widehat{b}-b\right\rvert=\mathcal{O}_{P}(\varrho_{T}d_{T}K^{3/2}s_{T})=o_{P}(1)

  4. (d)

    |d^d|=𝒪P(ϱTdTK3/2sT)=oP(1)\left\lvert\widehat{d}-d\right\rvert=\mathcal{O}_{P}(\varrho_{T}d_{T}K^{3/2}s_{T})=o_{P}(1).

  5. (e)

    |g^g|=𝒪P([ϱTdTK3/2sT]1/2)=oP(1)\left\lvert\widehat{g}-g\right\rvert=\mathcal{O}_{P}\Big{(}[\varrho_{T}d_{T}K^{3/2}s_{T}]^{1/2}\Big{)}=o_{P}(1).

  6. (f)

    |(a^d^b^2)(adb2)|=𝒪P(ϱTdTK3/2sT)=oP(1)\left\lvert(\widehat{a}\widehat{d}-\widehat{b}^{2})-(ad-b^{2})\right\rvert=\mathcal{O}_{P}\Big{(}\varrho_{T}d_{T}K^{3/2}s_{T}\Big{)}=o_{P}(1).

  7. (g)

    |adb2|=𝒪(1)\left\lvert ad-b^{2}\right\rvert=\mathcal{O}(1).

Proof.
  1. (a)

    Part (a) is trivial and follows directly from |𝚯|2=𝒪(1){\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Theta}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}=\mathcal{O}(1) and 𝐦=𝒪(1)\left\lVert{\mathbf{m}}\right\rVert_{\infty}=\mathcal{O}(1) from Assumption 1. We illustrate with dd: recall that d=𝐦𝚯𝐦/p|𝚯|2𝐦22/p=𝒪(1)d={\mathbf{m}}^{\prime}{\bm{\Theta}}{\mathbf{m}}/p\leq{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Theta}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}\left\lVert{\mathbf{m}}\right\rVert_{2}^{2}/p=\mathcal{O}(1).

  2. (b)

    Using Hölder's inequality, we have

    |a^a|=|𝜾p(𝚯^𝚯)𝜾pp|(𝚯^𝚯)𝜾p1𝜾pmaxp\displaystyle\left\lvert\widehat{a}-a\right\rvert=\left\lvert\frac{{\bm{\iota}}^{\prime}_{p}(\widehat{{\bm{\Theta}}}-{\bm{\Theta}}){\bm{\iota}}_{p}}{p}\right\rvert\leq\frac{\left\lVert(\widehat{{\bm{\Theta}}}-{\bm{\Theta}}){\bm{\iota}}_{p}\right\rVert_{1}\left\lVert{\bm{\iota}}_{p}\right\rVert_{\text{max}}}{p} |𝚯^𝚯|1\displaystyle\leq{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\widehat{{\bm{\Theta}}}-{\bm{\Theta}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{1}
    =𝒪P(ϱTdTK3/2(sT+(1/p)))=oP(1),\displaystyle=\mathcal{O}_{P}\Big{(}\varrho_{T}d_{T}K^{3/2}(s_{T}+(1/p))\Big{)}=o_{P}(1),

    where the last rate is obtained using the assumptions of Theorem 3.

  3. (c)

    First, rewrite the expression of interest:

    b^b=[𝜾p(𝚯^𝚯)(𝐦^𝐦)]/p+[𝜾p(𝚯^𝚯)𝐦]/p+[𝜾p𝚯(𝐦^𝐦)]/p.\displaystyle\widehat{b}-b=[{\bm{\iota}}^{\prime}_{p}(\widehat{{\bm{\Theta}}}-{\bm{\Theta}})(\widehat{{\mathbf{m}}}-{\mathbf{m}})]/p+[{\bm{\iota}}^{\prime}_{p}(\widehat{{\bm{\Theta}}}-{\bm{\Theta}}){\mathbf{m}}]/p+[{\bm{\iota}}^{\prime}_{p}{\bm{\Theta}}(\widehat{{\mathbf{m}}}-{\mathbf{m}})]/p. (B.7)

    We now bound each of the terms in (B.7) using the expressions derived in Callot et al. (2019) (see their Proof of Lemma A.2) and the fact that logp/T=o(1)\log p/T=o(1).

    |𝜾p(𝚯^𝚯)(𝐦^𝐦)|/p|𝚯^𝚯|1𝐦^𝐦max=𝒪P(ϱTdTK3/2sTlogpT).\displaystyle\left\lvert{\bm{\iota}}^{\prime}_{p}(\widehat{{\bm{\Theta}}}-{\bm{\Theta}})(\widehat{{\mathbf{m}}}-{\mathbf{m}})\right\rvert/p\leq{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\widehat{{\bm{\Theta}}}-{\bm{\Theta}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{1}\left\lVert\widehat{{\mathbf{m}}}-{\mathbf{m}}\right\rVert_{\text{max}}=\mathcal{O}_{P}\Big{(}\varrho_{T}d_{T}K^{3/2}s_{T}\cdot\sqrt{\frac{\log p}{T}}\Big{)}. (B.8)
    |𝜾p(𝚯^𝚯)𝐦|/p|𝚯^𝚯|1=𝒪P(ϱTdTK3/2sT).\displaystyle\left\lvert{\bm{\iota}}^{\prime}_{p}(\widehat{{\bm{\Theta}}}-{\bm{\Theta}}){\mathbf{m}}\right\rvert/p\leq{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\widehat{{\bm{\Theta}}}-{\bm{\Theta}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{1}=\mathcal{O}_{P}\Big{(}\varrho_{T}d_{T}K^{3/2}s_{T}\Big{)}. (B.9)
    |𝜾p𝚯(𝐦^𝐦)|/p|𝚯|1𝐦^𝐦max=𝒪P(dTK3/2logpT).\displaystyle\left\lvert{\bm{\iota}}^{\prime}_{p}{\bm{\Theta}}(\widehat{{\mathbf{m}}}-{\mathbf{m}})\right\rvert/p\leq{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Theta}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{1}\left\lVert\widehat{{\mathbf{m}}}-{\mathbf{m}}\right\rVert_{\text{max}}=\mathcal{O}_{P}\Big{(}d_{T}K^{3/2}\cdot\sqrt{\frac{\log p}{T}}\Big{)}. (B.10)
  4. (d)

    First, rewrite the expression of interest:

    d^d\displaystyle\widehat{d}-d =[(𝐦^𝐦)(𝚯^𝚯)(𝐦^𝐦)]/p+[(𝐦^𝐦)𝚯(𝐦^𝐦)]/p\displaystyle=[(\widehat{{\mathbf{m}}}-{\mathbf{m}})^{\prime}(\widehat{{\bm{\Theta}}}-{\bm{\Theta}})(\widehat{{\mathbf{m}}}-{\mathbf{m}})]/p+[(\widehat{{\mathbf{m}}}-{\mathbf{m}})^{\prime}{\bm{\Theta}}(\widehat{{\mathbf{m}}}-{\mathbf{m}})]/p
    +[2(𝐦^𝐦)𝚯𝐦]/p+[2𝐦(𝚯^𝚯)(𝐦^𝐦)]/p\displaystyle+[2(\widehat{{\mathbf{m}}}-{\mathbf{m}})^{\prime}{\bm{\Theta}}{\mathbf{m}}]/p+[2{\mathbf{m}}^{\prime}(\widehat{{\bm{\Theta}}}-{\bm{\Theta}})(\widehat{{\mathbf{m}}}-{\mathbf{m}})]/p
    +[𝐦(𝚯^𝚯)𝐦]/p.\displaystyle+[{\mathbf{m}}^{\prime}(\widehat{{\bm{\Theta}}}-{\bm{\Theta}}){\mathbf{m}}]/p. (B.11)

    We now bound each of the terms in (B.11) using the expressions derived in Callot et al. (2019) (see their Proof of Lemma A.3) and the facts that logp/T=o(1)\log p/T=o(1) and 𝐦^𝐦max=𝒪P(logp/T)\left\lVert\widehat{{\mathbf{m}}}-{\mathbf{m}}\right\rVert_{\text{max}}=\mathcal{O}_{P}(\sqrt{\log p/T}).

    |(𝐦^𝐦)(𝚯^𝚯)(𝐦^𝐦)|/p\displaystyle\left\lvert(\widehat{{\mathbf{m}}}-{\mathbf{m}})^{\prime}(\widehat{{\bm{\Theta}}}-{\bm{\Theta}})(\widehat{{\mathbf{m}}}-{\mathbf{m}})\right\rvert/p 𝐦^𝐦max2|𝚯^𝚯|1\displaystyle\leq\left\lVert\widehat{{\mathbf{m}}}-{\mathbf{m}}\right\rVert_{\text{max}}^{2}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\widehat{{\bm{\Theta}}}-{\bm{\Theta}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{1}
    =𝒪P(logpTϱTdTK3/2sT)\displaystyle=\mathcal{O}_{P}\Big{(}\frac{\log p}{T}\cdot\varrho_{T}d_{T}K^{3/2}s_{T}\Big{)} (B.12)
    |(𝐦^𝐦)𝚯(𝐦^𝐦)|/p𝐦^𝐦max2|𝚯|1=𝒪P(logpTdTK3/2).\displaystyle\left\lvert(\widehat{{\mathbf{m}}}-{\mathbf{m}})^{\prime}{\bm{\Theta}}(\widehat{{\mathbf{m}}}-{\mathbf{m}})\right\rvert/p\leq\left\lVert\widehat{{\mathbf{m}}}-{\mathbf{m}}\right\rVert_{\text{max}}^{2}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Theta}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{1}=\mathcal{O}_{P}\Big{(}\frac{\log p}{T}\cdot d_{T}K^{3/2}\Big{)}. (B.13)
    |(𝐦^𝐦)𝚯𝐦|/p𝐦^𝐦max|𝚯|1=𝒪P(logpTdTK3/2).\displaystyle\left\lvert(\widehat{{\mathbf{m}}}-{\mathbf{m}})^{\prime}{\bm{\Theta}}{\mathbf{m}}\right\rvert/p\leq\left\lVert\widehat{{\mathbf{m}}}-{\mathbf{m}}\right\rVert_{\text{max}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Theta}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{1}=\mathcal{O}_{P}\Big{(}\sqrt{\frac{\log p}{T}}\cdot d_{T}K^{3/2}\Big{)}. (B.14)
    |𝐦(𝚯^𝚯)(𝐦^𝐦)|/p\displaystyle\left\lvert{\mathbf{m}}^{\prime}(\widehat{{\bm{\Theta}}}-{\bm{\Theta}})(\widehat{{\mathbf{m}}}-{\mathbf{m}})\right\rvert/p 𝐦^𝐦max|𝚯^𝚯|1\displaystyle\leq\left\lVert\widehat{{\mathbf{m}}}-{\mathbf{m}}\right\rVert_{\text{max}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\widehat{{\bm{\Theta}}}-{\bm{\Theta}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{1}
    =𝒪P(logpTϱTdTK3/2sT).\displaystyle=\mathcal{O}_{P}\Big{(}\sqrt{\frac{\log p}{T}}\cdot\varrho_{T}d_{T}K^{3/2}s_{T}\Big{)}. (B.15)
    |𝐦(𝚯^𝚯)𝐦|/p|𝚯^𝚯|1=𝒪P(ϱTdTK3/2sT).\displaystyle\left\lvert{\mathbf{m}}^{\prime}(\widehat{{\bm{\Theta}}}-{\bm{\Theta}}){\mathbf{m}}\right\rvert/p\leq{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\widehat{{\bm{\Theta}}}-{\bm{\Theta}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{1}=\mathcal{O}_{P}\Big{(}\varrho_{T}d_{T}K^{3/2}s_{T}\Big{)}. (B.16)
  5. (e)

    This is a direct consequence of Part (d) and the fact that |d^d||d^d|\lvert\sqrt{\widehat{d}}-\sqrt{d}\rvert\leq\sqrt{\lvert\widehat{d}-d\rvert}.

  6. (f)

    First, rewrite the expression of interest:

    (a^d^b^2)(adb2)=[(a^a)+a][(d^d)+d][(b^b)+b]2,\displaystyle(\widehat{a}\widehat{d}-\widehat{b}^{2})-(ad-b^{2})=[(\widehat{a}-a)+a][(\widehat{d}-d)+d]-[(\widehat{b}-b)+b]^{2},

    therefore, using Parts (a)-(d) above, we have

    |(a^d^b^2)(adb2)|\displaystyle\left\lvert(\widehat{a}\widehat{d}-\widehat{b}^{2})-(ad-b^{2})\right\rvert [|a^a||d^d|+|a^a|d+a|d^d|+(b^b)2+2|b||b^b|]\displaystyle\leq\Big{[}\left\lvert\widehat{a}-a\right\rvert\left\lvert\widehat{d}-d\right\rvert+\left\lvert\widehat{a}-a\right\rvert d+a\left\lvert\widehat{d}-d\right\rvert+(\widehat{b}-b)^{2}+2\left\lvert b\right\rvert\left\lvert\widehat{b}-b\right\rvert\Big{]}
    =𝒪P(ϱTdTK3/2sT)=oP(1).\displaystyle=\mathcal{O}_{P}\Big{(}\varrho_{T}d_{T}K^{3/2}s_{T}\Big{)}=o_{P}(1).
  7. (g)

    This is a direct consequence of Part (a): adb2ad=𝒪(1)ad-b^{2}\leq ad=\mathcal{O}(1).

B.8 Proof of Theorem 4

Let us derive the convergence rates for each of the portfolio weight formulations one by one. We start with the GMV formulation.

𝐰^GMV𝐰GMV1a(𝚯^𝚯)𝜾p1p+|aa^|𝚯𝜾p1p|a^|a=𝒪P(ϱTdT2K3sT)=oP(1),\displaystyle\left\lVert\widehat{{\mathbf{w}}}_{\text{GMV}}-{\mathbf{w}}_{\text{GMV}}\right\rVert_{1}\leq\frac{a\frac{\left\lVert(\widehat{{\bm{\Theta}}}-{\bm{\Theta}}){\bm{\iota}}_{p}\right\rVert_{1}}{p}+\left\lvert a-\widehat{a}\right\rvert\frac{\left\lVert{\bm{\Theta}}{\bm{\iota}}_{p}\right\rVert_{1}}{p}}{\left\lvert\widehat{a}\right\rvert a}=\mathcal{O}_{P}\Big{(}\varrho_{T}d_{T}^{2}K^{3}s_{T}\Big{)}=o_{P}(1),

where the first inequality was shown in Callot et al. (2019) (see their expression A.50), and the rate follows from Lemmas 11 and 10.
We now proceed with the MWC weight formulation. First, let us simplify the weight expression as follows: 𝐰MWC=κ1(𝚯𝜾p/p)+κ2(𝚯𝐦/p){\mathbf{w}}_{\text{MWC}}=\kappa_{1}({\bm{\Theta}}{\bm{\iota}}_{p}/p)+\kappa_{2}({\bm{\Theta}}{\mathbf{m}}/p), where

κ1=dμbadb2\displaystyle\kappa_{1}=\frac{d-\mu b}{ad-b^{2}}
κ2=μabadb2.\displaystyle\kappa_{2}=\frac{\mu a-b}{ad-b^{2}}.

Let 𝐰^MWC=κ^1(𝚯^𝜾p/p)+κ^2(𝚯^𝐦^/p)\widehat{{\mathbf{w}}}_{\text{MWC}}=\widehat{\kappa}_{1}(\widehat{{\bm{\Theta}}}{\bm{\iota}}_{p}/p)+\widehat{\kappa}_{2}(\widehat{{\bm{\Theta}}}\widehat{{\mathbf{m}}}/p), where κ^1\widehat{\kappa}_{1} and κ^2\widehat{\kappa}_{2} are the estimators of κ1\kappa_{1} and κ2\kappa_{2} respectively. As shown in Callot et al. (2019) (see their equation A.57), we can bound the quantity of interest as follows:

𝐰^MWC𝐰MWC1\displaystyle\left\lVert\widehat{{\mathbf{w}}}_{\text{MWC}}-{\mathbf{w}}_{\text{MWC}}\right\rVert_{1} |(κ^1κ1)|(𝚯^𝚯)𝜾p1/p+|(κ^1κ1)|𝚯𝜾p1/p+|κ1|(𝚯^𝚯)𝜾p1/p\displaystyle\leq\left\lvert(\widehat{\kappa}_{1}-\kappa_{1})\right\rvert\left\lVert(\widehat{{\bm{\Theta}}}-{\bm{\Theta}}){\bm{\iota}}_{p}\right\rVert_{1}/p+\left\lvert(\widehat{\kappa}_{1}-\kappa_{1})\right\rvert\left\lVert{\bm{\Theta}}{\bm{\iota}}_{p}\right\rVert_{1}/p+\left\lvert\kappa_{1}\right\rvert\left\lVert(\widehat{{\bm{\Theta}}}-{\bm{\Theta}}){\bm{\iota}}_{p}\right\rVert_{1}/p
+|(κ^2κ2)|(𝚯^𝚯)(𝐦^𝐦)1/p+|(κ^2κ2)|𝚯(𝐦^𝐦)1/p\displaystyle+\left\lvert(\widehat{\kappa}_{2}-\kappa_{2})\right\rvert\left\lVert(\widehat{{\bm{\Theta}}}-{\bm{\Theta}})(\widehat{{\mathbf{m}}}-{\mathbf{m}})\right\rVert_{1}/p+\left\lvert(\widehat{\kappa}_{2}-\kappa_{2})\right\rvert\left\lVert{\bm{\Theta}}(\widehat{{\mathbf{m}}}-{\mathbf{m}})\right\rVert_{1}/p
+|(κ^2κ2)|(𝚯^𝚯)𝐦1/p+|(κ^2κ2)|𝚯𝐦1/p\displaystyle+\left\lvert(\widehat{\kappa}_{2}-\kappa_{2})\right\rvert\left\lVert(\widehat{{\bm{\Theta}}}-{\bm{\Theta}}){\mathbf{m}}\right\rVert_{1}/p+\left\lvert(\widehat{\kappa}_{2}-\kappa_{2})\right\rvert\left\lVert{\bm{\Theta}}{\mathbf{m}}\right\rVert_{1}/p
+|κ2|(𝚯^𝚯)(𝐦^𝐦)1/p+|κ2|(𝚯^𝚯)𝐦1/p.\displaystyle+\left\lvert\kappa_{2}\right\rvert\left\lVert(\widehat{{\bm{\Theta}}}-{\bm{\Theta}})(\widehat{{\mathbf{m}}}-{\mathbf{m}})\right\rVert_{1}/p+\left\lvert\kappa_{2}\right\rvert\left\lVert(\widehat{{\bm{\Theta}}}-{\bm{\Theta}}){\mathbf{m}}\right\rVert_{1}/p. (B.17)

For ease of notation, denote y=adb2y=ad-b^{2}. Then, using a similar technique to that in Callot et al. (2019), we get

|(κ^1κ1)|y|d^d|+yμ|b^b|+|y^y||dμb|y^y=𝒪P(ϱTdTK3/2sT)=oP(1),\displaystyle\left\lvert(\widehat{\kappa}_{1}-\kappa_{1})\right\rvert\leq\frac{y\left\lvert\widehat{d}-d\right\rvert+y\mu\left\lvert\widehat{b}-b\right\rvert+\left\lvert\widehat{y}-y\right\rvert\left\lvert d-\mu b\right\rvert}{\widehat{y}y}=\mathcal{O}_{P}\Big{(}\varrho_{T}d_{T}K^{3/2}s_{T}\Big{)}=o_{P}(1),

where the rate trivially follows from Lemma 11.
Similarly, we get

|(κ^2κ2)|=𝒪P(ϱTdTK3/2sT)=oP(1).\displaystyle\left\lvert(\widehat{\kappa}_{2}-\kappa_{2})\right\rvert=\mathcal{O}_{P}\Big{(}\varrho_{T}d_{T}K^{3/2}s_{T}\Big{)}=o_{P}(1).

Callot et al. (2019) showed that |κ1|=𝒪(1)\left\lvert\kappa_{1}\right\rvert=\mathcal{O}(1) and |κ2|=𝒪(1)\left\lvert\kappa_{2}\right\rvert=\mathcal{O}(1). Therefore, we can get the rate of (B.17):

𝐰^MWC𝐰MWC1=𝒪P(ϱTdT2K3sT)=oP(1).\displaystyle\left\lVert\widehat{{\mathbf{w}}}_{\text{MWC}}-{\mathbf{w}}_{\text{MWC}}\right\rVert_{1}=\mathcal{O}_{P}\Big{(}\varrho_{T}d_{T}^{2}K^{3}s_{T}\Big{)}=o_{P}(1).

We now proceed with the MRC weight formulation:

𝐰^MRC𝐰MRC1gp[(𝚯^𝚯)(𝐦^𝐦)1+(𝚯^𝚯)𝐦1+𝚯(𝐦^𝐦)1]+|g^g|𝚯𝐦1|g^|g\displaystyle\left\lVert\widehat{{\mathbf{w}}}_{\text{MRC}}-{\mathbf{w}}_{\text{MRC}}\right\rVert_{1}\leq\frac{\frac{g}{p}\Big{[}\left\lVert(\widehat{{\bm{\Theta}}}-{\bm{\Theta}})(\widehat{{\mathbf{m}}}-{\mathbf{m}})\right\rVert_{1}+\left\lVert(\widehat{{\bm{\Theta}}}-{\bm{\Theta}}){\mathbf{m}}\right\rVert_{1}+\left\lVert{\bm{\Theta}}(\widehat{{\mathbf{m}}}-{\mathbf{m}})\right\rVert_{1}\Big{]}+\left\lvert\widehat{g}-g\right\rvert\left\lVert{\bm{\Theta}}{\mathbf{m}}\right\rVert_{1}}{\left\lvert\widehat{g}\right\rvert g}
gp[p|𝚯^𝚯|1(𝐦^𝐦)max+p|𝚯^𝚯|1𝐦max+p|𝚯|1(𝐦^𝐦)max]+p|g^g||𝚯|1𝐦max|g^|g\displaystyle\leq\frac{\frac{g}{p}\Big{[}p{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\widehat{{\bm{\Theta}}}-{\bm{\Theta}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{1}\left\lVert(\widehat{{\mathbf{m}}}-{\mathbf{m}})\right\rVert_{\text{max}}+p{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\widehat{{\bm{\Theta}}}-{\bm{\Theta}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{1}\left\lVert{\mathbf{m}}\right\rVert_{\text{max}}+p{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Theta}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{1}\left\lVert(\widehat{{\mathbf{m}}}-{\mathbf{m}})\right\rVert_{\text{max}}\Big{]}+p\left\lvert\widehat{g}-g\right\rvert{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{\Theta}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{1}\left\lVert{\mathbf{m}}\right\rVert_{\text{max}}}{\left\lvert\widehat{g}\right\rvert g}
=𝒪P(ϱTdTK3/2sTlogpT)+𝒪P(ϱTdTK3/2sT)\displaystyle=\mathcal{O}_{P}\Big{(}\varrho_{T}d_{T}K^{3/2}s_{T}\cdot\sqrt{\frac{\log p}{T}}\Big{)}+\mathcal{O}_{P}\Big{(}\varrho_{T}d_{T}K^{3/2}s_{T}\Big{)}
+𝒪P(dTK3/2logpT)+𝒪P([ϱTdTK3/2sT]1/2dTK3/2)=oP(1),\displaystyle+\mathcal{O}_{P}\Big{(}d_{T}K^{3/2}\cdot\sqrt{\frac{\log p}{T}}\Big{)}+\mathcal{O}_{P}\Big{(}[\varrho_{T}d_{T}K^{3/2}s_{T}]^{1/2}\cdot d_{T}K^{3/2}\Big{)}=o_{P}(1),

where we used Lemmas 10-11 and the fact that 𝐦^𝐦max=𝒪P(logp/T)\left\lVert\widehat{{\mathbf{m}}}-{\mathbf{m}}\right\rVert_{\text{max}}=\mathcal{O}_{P}(\sqrt{\log p/T}).
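
For concreteness, the sketch below collects the scalars a, b, d, g and the three weight formulations appearing in the bounds above into a single helper. It is our illustration (the function name and inputs are hypothetical placeholders), not the authors' code, and the MRC expression is the one implied by the bound for 𝐰̂_MRC above.

```python
import numpy as np

def fgl_portfolio_weights(Theta_hat, m_hat, mu=None):
    """Portfolio weights implied by a precision-matrix estimate.

    Theta_hat : (p, p) estimated precision matrix of excess returns
    m_hat     : (p,)   estimated mean excess returns
    mu        : return target used by the MWC rule (optional)
    The scalars a, b, d, g mirror the quantities defined in Lemma 11.
    """
    p = Theta_hat.shape[0]
    iota = np.ones(p)

    a = iota @ Theta_hat @ iota / p
    b = iota @ Theta_hat @ m_hat / p
    d = m_hat @ Theta_hat @ m_hat / p
    g = np.sqrt(m_hat @ Theta_hat @ m_hat) / p

    # GMV: w = Theta iota / (iota' Theta iota)
    w_gmv = Theta_hat @ iota / (iota @ Theta_hat @ iota)

    # MWC: w = kappa_1 (Theta iota / p) + kappa_2 (Theta m / p)
    w_mwc = None
    if mu is not None:
        y = a * d - b ** 2
        kappa_1 = (d - mu * b) / y
        kappa_2 = (mu * a - b) / y
        w_mwc = kappa_1 * (Theta_hat @ iota) / p + kappa_2 * (Theta_hat @ m_hat) / p

    # MRC rule implied by the bound for w_hat_MRC above: w = (Theta m / p) / g
    w_mrc = (Theta_hat @ m_hat / p) / g

    return w_gmv, w_mwc, w_mrc
```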

B.9 Proof of Theorem 5

We start with the GMV formulation. Using Lemma 11 (a)-(b), we get

|a^1a11|=|aa^||a^|=𝒪P(ϱTdTK3/2sT)=oP(1).\displaystyle\left\lvert\frac{\hat{a}^{-1}}{a^{-1}}-1\right\rvert=\frac{\left\lvert a-\hat{a}\right\rvert}{\left\lvert\hat{a}\right\rvert}=\mathcal{O}_{P}(\varrho_{T}d_{T}K^{3/2}s_{T})=o_{P}(1).

Proceeding to the MWC risk exposure, we follow Callot et al. (2019) and introduce the following notation: x=aμ22bμ+dx=a\mu^{2}-2b\mu+d and x^=a^μ22b^μ+d^\hat{x}=\hat{a}\mu^{2}-2\hat{b}\mu+\hat{d} to rewrite Φ^MWC=p1(x^/y^)\widehat{\Phi}_{\text{MWC}}=p^{-1}(\hat{x}/\hat{y}). As shown in Callot et al. (2019), y/x=𝒪(1)y/x=\mathcal{O}(1) (see their equation A.42). Furthermore, by Lemma 11 (b)-(d)

|x^x||a^a|μ2+2|b^b|μ+|d^d|=𝒪P(ϱTdTK3/2sT)=oP(1),\displaystyle\left\lvert\hat{x}-x\right\rvert\leq\left\lvert\hat{a}-a\right\rvert\mu^{2}+2\left\lvert\hat{b}-b\right\rvert\mu+\left\lvert\hat{d}-d\right\rvert=\mathcal{O}_{P}(\varrho_{T}d_{T}K^{3/2}s_{T})=o_{P}(1),

and by Lemma 11 (f):

|y^y|=|a^d^b^2(adb2)|=𝒪P(ϱTdTK3/2sT)=oP(1).\displaystyle\left\lvert\hat{y}-y\right\rvert=\left\lvert\hat{a}\hat{d}-\hat{b}^{2}-(ad-b^{2})\right\rvert=\mathcal{O}_{P}(\varrho_{T}d_{T}K^{3/2}s_{T})=o_{P}(1).

Using the above and the facts that y=𝒪(1)y=\mathcal{O}(1) and x=𝒪(1)x=\mathcal{O}(1) (which were derived by Callot et al. (2019) in A.45 and A.46), we have

|Φ^MWCΦMWCΦMWC|=|(x^x)y+x(yy^)y^y|𝒪(1)=𝒪P(ϱTdTK3/2sT)=oP(1).\displaystyle\left\lvert\frac{\widehat{\Phi}_{\text{MWC}}-\Phi_{\text{MWC}}}{\Phi_{\text{MWC}}}\right\rvert=\left\lvert\frac{(\hat{x}-x)y+x(y-\hat{y})}{\hat{y}y}\right\rvert\cdot\mathcal{O}(1)=\mathcal{O}_{P}(\varrho_{T}d_{T}K^{3/2}s_{T})=o_{P}(1).

Finally, to bound MRC risk exposure, we use Lemma 11 (e) and rewrite

|gg^||g^|=𝒪P([ϱTdTK3/2sT]1/2)=oP(1).\displaystyle\frac{\left\lvert g-\hat{g}\right\rvert}{\left\lvert\hat{g}\right\rvert}=\mathcal{O}_{P}\Big{(}[\varrho_{T}d_{T}K^{3/2}s_{T}]^{1/2})=o_{P}(1).

B.10 Generalization: Sub-Gaussian and Elliptical Distributions

So far the consistency of the Factor Graphical Lasso in Theorem 4 relied on the assumption of exponential-type tails in Assumption 1(c). Since this tail behavior may be too restrictive for financial portfolios, we comment on the possibility of relaxing it. First, recall where Assumption 1(c) was used before: we required this assumption in order to establish the convergence of the unknown factors and loadings in Theorem 1, which was further used to obtain the convergence properties of 𝚺^ε\widehat{{\bm{\Sigma}}}_{\varepsilon} in Theorem 2. Hence, when Assumption 1(c) is relaxed, one needs to find another way to consistently estimate 𝚺ε{\bm{\Sigma}}_{\varepsilon}. We achieve this using the tools developed in Fan et al., (2018). Specifically, let 𝚺=𝚪𝚲𝚪{\bm{\Sigma}}={\bm{\Gamma}}{\bm{\Lambda}}{\bm{\Gamma}}^{{}^{\prime}}, where 𝚺{\bm{\Sigma}} is the covariance matrix of returns that follow the factor structure described in equation (3.1). Define 𝚺^,𝚲^K,𝚪^K\widehat{{\bm{\Sigma}}},\widehat{{\bm{\Lambda}}}_{K},\widehat{{\bm{\Gamma}}}_{K} to be the estimators of 𝚺,𝚲,𝚪{\bm{\Sigma}},{\bm{\Lambda}},{\bm{\Gamma}}. We further let 𝚲^K=diag(λ^1,,λ^K)\widehat{{\bm{\Lambda}}}_{K}=\text{diag}(\hat{\lambda}_{1},\ldots,\hat{\lambda}_{K}) and 𝚪^K=(v^1,,v^K)\widehat{{\bm{\Gamma}}}_{K}=(\hat{v}_{1},\ldots,\hat{v}_{K}) be constructed from the KK leading empirical eigenvalues and the corresponding eigenvectors of 𝚺^\widehat{{\bm{\Sigma}}}, and set 𝐁^𝐁^=𝚪^K𝚲^K𝚪^K\widehat{{\mathbf{B}}}\widehat{{\mathbf{B}}}^{\prime}=\widehat{{\bm{\Gamma}}}_{K}\widehat{{\bm{\Lambda}}}_{K}\widehat{{\bm{\Gamma}}}_{K}^{{}^{\prime}}. Similarly to Fan et al., (2018), we require the following bounds on the componentwise maximums of the estimators:

  1. (C.1)

    𝚺^𝚺max=𝒪P(logp/T)\left\lVert\widehat{{\bm{\Sigma}}}-{\bm{\Sigma}}\right\rVert_{\text{max}}=\mathcal{O}_{P}(\sqrt{\log p/T}),

  2. (C.2)

    (𝚲^K𝚲)𝚲1max=𝒪P(Klogp/T)\left\lVert(\widehat{{\bm{\Lambda}}}_{K}-{\bm{\Lambda}}){\bm{\Lambda}}^{-1}\right\rVert_{\text{max}}=\mathcal{O}_{P}(K\sqrt{\log p/T}),

  3. (C.3)

    𝚪^K𝚪max=𝒪P(K1/2logp/(Tp))\left\lVert\widehat{{\bm{\Gamma}}}_{K}-{\bm{\Gamma}}\right\rVert_{\text{max}}=\mathcal{O}_{P}(K^{1/2}\sqrt{\log p/(Tp)}).

Let 𝚺^SG\widehat{{\bm{\Sigma}}}^{SG} be the sample covariance matrix, with 𝚲^KSG\widehat{{\bm{\Lambda}}}_{K}^{SG} and 𝚪^KSG\widehat{{\bm{\Gamma}}}_{K}^{SG} constructed with the first KK leading empirical eigenvalues and eigenvectors of 𝚺^SG\widehat{{\bm{\Sigma}}}^{SG} respectively. Also, let 𝚺^EL1=𝐃^𝐑^1𝐃^\widehat{{\bm{\Sigma}}}^{EL1}=\widehat{{\mathbf{D}}}\widehat{{\mathbf{R}}}_{1}\widehat{{\mathbf{D}}}, where 𝐑^1\widehat{{\mathbf{R}}}_{1} is obtained using Kendall’s tau correlation coefficients and 𝐃^\widehat{{\mathbf{D}}} is a robust estimator of variances constructed using the Huber loss. Furthermore, let 𝚺^EL2=𝐃^𝐑^2𝐃^\widehat{{\bm{\Sigma}}}^{EL2}=\widehat{{\mathbf{D}}}\widehat{{\mathbf{R}}}_{2}\widehat{{\mathbf{D}}}, where 𝐑^2\widehat{{\mathbf{R}}}_{2} is obtained using the spatial Kendall’s tau estimator. Define 𝚲^KEL\widehat{{\bm{\Lambda}}}_{K}^{EL} to be the matrix of the first KK leading empirical eigenvalues of 𝚺^EL1\widehat{{\bm{\Sigma}}}^{EL1}, and 𝚪^KEL\widehat{{\bm{\Gamma}}}_{K}^{EL} to be the matrix of the first KK leading empirical eigenvectors of 𝚺^EL2\widehat{{\bm{\Sigma}}}^{EL2}. For more details on constructing 𝚺^SG\widehat{{\bm{\Sigma}}}^{SG}, 𝚺^EL1\widehat{{\bm{\Sigma}}}^{EL1} and 𝚺^EL2\widehat{{\bm{\Sigma}}}^{EL2}, see Fan et al., (2018), Sections 3 and 4.
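
To illustrate the construction of 𝚺^EL1 = 𝐃^𝐑^1𝐃^ just described, here is a minimal sketch of our own (assuming the standard sin(πτ/2) map from Kendall's tau to correlations for elliptical distributions); the MAD-based scale estimate below is a simple stand-in for the Huber-loss variance estimator used in Fan et al., (2018), not their implementation.

```python
import numpy as np
from scipy.stats import kendalltau

def robust_covariance_el1(X):
    """Sketch of Sigma_hat^{EL1} = D_hat R_hat_1 D_hat for a (T, p) matrix of returns X.

    R_hat_1 maps marginal Kendall's tau through sin(pi * tau / 2), the standard
    correlation transform for elliptical distributions; the MAD-based D_hat is a
    simple stand-in for the Huber-loss variance estimator used in the paper.
    """
    T, p = X.shape
    R1 = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            tau, _ = kendalltau(X[:, i], X[:, j])
            R1[i, j] = R1[j, i] = np.sin(np.pi * tau / 2.0)

    # robust scale proxy: 1.4826 * MAD is consistent for the standard deviation under normality
    mad = np.median(np.abs(X - np.median(X, axis=0)), axis=0)
    D = np.diag(1.4826 * mad)

    return D @ R1 @ D
```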

Proposition 1.

For sub-Gaussian distributions, 𝚺^SG\widehat{{\bm{\Sigma}}}^{SG}, 𝚲^KSG\widehat{{\bm{\Lambda}}}_{K}^{SG} and 𝚪^KSG\widehat{{\bm{\Gamma}}}_{K}^{SG} satisfy (C.1)-(C.3).
For elliptical distributions, 𝚺^EL1\widehat{{\bm{\Sigma}}}^{EL1}, 𝚲^KEL\widehat{{\bm{\Lambda}}}_{K}^{EL} and 𝚪^KEL\widehat{{\bm{\Gamma}}}_{K}^{EL} satisfy (C.1)-(C.3).
When (C.1)-(C.3) are satisfied, the bounds obtained in Theorems 2-5 continue to hold.

Proposition 1 is essentially a rephrasing of the results obtained in Fan et al., (2018), Sections 3 and 4. The difference arises due to the fact that we allow KK to increase, which is reflected in the modified rates in (C.1)-(C.3). As evidenced by the above Proposition, 𝚺^EL2\widehat{{\bm{\Sigma}}}^{EL2} is only used for estimating the eigenvectors. This is necessary because, in contrast with 𝚺^EL2\widehat{{\bm{\Sigma}}}^{EL2}, the theoretical properties of the eigenvectors of 𝚺^EL1\widehat{{\bm{\Sigma}}}^{EL1} are mathematically involved because of the sin function. The FGL for elliptical distributions will be called the Robust FGL.

Appendix C Additional Simulations

C.1 Verifying Theoretical Rates

To compare the empirical rate with the theoretical expressions derived in Theorems 3-5, we use the facts from Theorem 2 that ω3TK2logp/T+K3/p\omega_{3T}\equiv K^{2}\sqrt{\log p/T}+K^{3}/\sqrt{p} and ϱT1ω3T𝑝0\varrho_{T}^{-1}\omega_{3T}\xrightarrow{p}0 to introduce the following functions that correspond to the theoretical rates for the choice of parameters in the empirical setting:

f||||||2=C1+C2log2(sTϱT)g||||||1=C3+C2log2(dTK3/2sTϱT)}for𝚯^\displaystyle\begin{rcases*}f_{{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|\cdot\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}_{2}}=C_{1}+C_{2}\cdot\log_{2}(s_{T}\varrho_{T})\\ g_{{\left|\kern-0.75346pt\left|\kern-0.75346pt\left|\cdot\right|\kern-0.75346pt\right|\kern-0.75346pt\right|}_{1}}=C_{3}+C_{2}\cdot\log_{2}(d_{T}K^{3/2}s_{T}\varrho_{T})\end{rcases*}\text{for}\ \widehat{{\bm{\Theta}}} (C.1)
h1=C4+C2log2(ϱTdT2K3sT)for𝐰^GMV,𝐰^MWC\displaystyle h_{1}=C_{4}+C_{2}\cdot\log_{2}(\varrho_{T}d_{T}^{2}K^{3}s_{T})\hskip 30.5pt\text{for}\ \widehat{{\mathbf{w}}}_{\text{GMV}},\widehat{{\mathbf{w}}}_{\text{MWC}} (C.2)
h2=C5+C6log2([ϱTsT]1/2dT3/2K3)for𝐰^MRC\displaystyle h_{2}=C_{5}+C_{6}\cdot\log_{2}([\varrho_{T}s_{T}]^{1/2}d_{T}^{3/2}K^{3})\quad\text{for}\ \widehat{{\mathbf{w}}}_{\text{MRC}} (C.3)
h3=C7+C2log2(dTK3/2sTϱT)forΦ^GMV,Φ^MWC\displaystyle h_{3}=C_{7}+C_{2}\cdot\log_{2}(d_{T}K^{3/2}s_{T}\varrho_{T})\hskip 22.0pt\text{for}\ \widehat{\Phi}_{\text{GMV}},\widehat{\Phi}_{\text{MWC}} (C.4)
h4=C8+C9log2(dTK3/2sTϱT)forΦ^MRC\displaystyle h_{4}=C_{8}+C_{9}\cdot\log_{2}(d_{T}K^{3/2}s_{T}\varrho_{T})\hskip 22.0pt\text{for}\ \widehat{\Phi}_{\text{MRC}} (C.5)

where C1,,C9C_{1},\ldots,C_{9} are constants with C6>C2C_{6}>C_{2} (by Theorem 4), C9>C2C_{9}>C_{2} (by Theorem 5).
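
For reference, the sketch below evaluates (on a base-2 logarithmic scale) the rate sequences entering (C.1)-(C.5) under the design of Figure 4; the sequences d_T and ϱ_T are not fully pinned down by the text, so the placeholders below are our assumptions for illustration only, and the constants C_1, ..., C_9 are omitted.

```python
import numpy as np

# Design of Figure 4 (our reading): T = 2^h, p = T^0.85, K = 2(log T)^0.5, s_T ~ T^0.05
T = 2.0 ** np.arange(7, 10, 0.5)
p, K, s_T = T ** 0.85, 2 * np.log(T) ** 0.5, T ** 0.05
d_T = s_T                                # placeholder: d_T is not pinned down by the text

omega_3T = K ** 2 * np.sqrt(np.log(p) / T) + K ** 3 / np.sqrt(p)
rho_T = omega_3T * np.log(T)             # placeholder sequence satisfying rho_T^{-1} * omega_3T -> 0

# log2 of the rate sequences entering the functions above, up to the constants C_1, ..., C_9
rate_spectral = np.log2(s_T * rho_T)                        # precision matrix, spectral norm
rate_l1 = np.log2(d_T * K ** 1.5 * s_T * rho_T)             # precision matrix, l1 norm; GMV/MWC risk
rate_weights = np.log2(rho_T * d_T ** 2 * K ** 3 * s_T)     # GMV/MWC portfolio weights

for row in zip(T, rate_spectral, rate_l1, rate_weights):
    print("T=%6.0f  spectral: %6.2f  l1: %6.2f  weights: %6.2f" % row)
```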

Figure 4 shows the averaged (over Monte Carlo simulations) errors of the estimators of 𝚯{\bm{\Theta}}, 𝐰{\mathbf{w}} and Φ\Phi versus the sample size TT in the logarithmic scale (base 2). In order to confirm the theoretical findings from Theorems 3-5, we also plot the theoretical rates of convergence given by the functions in (C.1)-(C.5). We verify that the empirical and theoretical rates match. Since the convergence rates for GMV and MWC portfolio weights 𝐰{\mathbf{w}} and risk exposures Φ\Phi are very similar, we only report the former. Note that as predicted by Theorem 3, the rate of convergence of the precision matrix in ||||||2{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\cdot\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}-norm is faster than the rate in ||||||1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\cdot\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{1}-norm. Furthermore, the convergence rates of the GMV, MWC and MRC portfolio weights and risk exposures are close to the rate of the precision matrix 𝚯{\bm{\Theta}} in ||||||1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\cdot\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{1}-norm, which is confirmed by Theorem 4. As evidenced by Figure 4, the convergence rate of the MRC risk exposure is slower than the rates of the GMV and MWC exposures. This finding is in accordance with Theorem 5 and is also consistent with the empirical findings that indicate higher overall risk associated with MRC portfolios.

Figure 4: Averaged empirical errors (solid lines) and theoretical rates of convergence (dashed lines) on logarithmic scale: p=T0.85p=T^{0.85}, K=2(logT)0.5K=2(\log T)^{0.5}, sT=𝒪(T0.05)s_{T}=\mathcal{O}(T^{0.05}).

C.2 Results for Case 1

We compare the performance of FGL with the alternative models listed at the beginning of Section 5 for Case 1. The only instance in which FGL is dominated, and only slightly so, occurs in Figure 5: POET outperforms FGL in terms of the convergence of the precision matrix in the spectral norm. This is different from Case 2 in Figure 1, where FGL outperforms all the competing models.

Figure 5: Averaged errors of the estimators of 𝚯{\bm{\Theta}} for Case 1 on logarithmic scale: p=T0.85p=T^{0.85}, K=2(logT)0.5K=2(\log T)^{0.5}, sT=𝒪(T0.05)s_{T}=\mathcal{O}(T^{0.05}).
Figure 6: Averaged errors of the estimators of 𝐰GMV{\mathbf{w}}_{\text{GMV}} (left) and 𝐰MRC{\mathbf{w}}_{\text{MRC}} (right) for Case 1 on logarithmic scale: p=T0.85p=T^{0.85}, K=2(logT)0.5K=2(\log T)^{0.5}, sT=𝒪(T0.05)s_{T}=\mathcal{O}(T^{0.05}).
Figure 7: Averaged errors of the estimators of ΦGMV\Phi_{\text{GMV}} (left) and ΦMRC\Phi_{\text{MRC}} (right) for Case 1 on logarithmic scale: p=T0.85p=T^{0.85}, K=2(logT)0.5K=2(\log T)^{0.5}, sT=𝒪(T0.05)s_{T}=\mathcal{O}(T^{0.05}).

C.3 Robust FGL

The DGP for elliptical distributions is similar to Fan et al., (2018): let (𝐟t,𝜺t)({\mathbf{f}}_{t},{\bm{\varepsilon}}_{t}) from (3.1) jointly follow the multivariate t-distribution with ν\nu degrees of freedom. When ν=\nu=\infty, this corresponds to the multivariate normal distribution; smaller values of ν\nu are associated with thicker tails. We draw TT independent samples of (𝐟t,𝜺t)({\mathbf{f}}_{t},{\bm{\varepsilon}}_{t}) from the multivariate t-distribution with zero mean and covariance matrix 𝚺=diag(𝚺f,𝚺ε){\bm{\Sigma}}=\text{diag}({\bm{\Sigma}}_{f},{\bm{\Sigma}}_{\varepsilon}), where 𝚺f=𝐈K{\bm{\Sigma}}_{f}={\mathbf{I}}_{K}. To construct 𝚺ε{\bm{\Sigma}}_{\varepsilon} we use a Toeplitz structure parameterized by ρ=0.5\rho=0.5, which leads to a sparse 𝚯ε=𝚺ε1{\bm{\Theta}}_{\varepsilon}={\bm{\Sigma}}^{-1}_{\varepsilon}. The rows of 𝐁{\mathbf{B}} are drawn from 𝒩(𝟎,𝐈K)\mathcal{N}(\bm{0},{\mathbf{I}}_{K}). We let p=T0.85p=T^{0.85}, K=2(logT)0.5K=2(\log T)^{0.5} and T=[2h],forh{7,7.5,8,,9.5}T=[2^{h}],\ \text{for}\ h\in\{7,7.5,8,\ldots,9.5\}. Figures 8-9 report the averaged (over Monte Carlo simulations) estimation errors (in the logarithmic scale, base 2) for 𝚯{\bm{\Theta}} and two portfolio weights (GMV and MRC) using FGL and Robust FGL for ν=4.2\nu=4.2. Noticeably, the performance of FGL for estimating the precision matrix is comparable with that of Robust FGL: this suggests that our FGL algorithm is insensitive to heavy-tailed distributions even without additional modifications. Furthermore, FGL outperforms its Robust counterpart in terms of estimating portfolio weights, as evidenced by Figure 9. We further compare the performance of FGL and Robust FGL for different degrees of freedom: Figure 10 reports the log-ratios (base 2) of the averaged (over Monte Carlo simulations) estimation errors for ν=4.2\nu=4.2, ν=7\nu=7 and ν=\nu=\infty. The results for the estimation of 𝚯{\bm{\Theta}} presented in Figure 10 are consistent with the findings in Fan et al., (2018): Robust FGL outperforms the non-robust counterpart for thicker tails.
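
A sketch of this DGP is given below; it reflects our reading of the design (in particular, rescaling the scale matrix by (ν−2)/ν so that the draws have covariance 𝚺 is our choice), and the function name is a placeholder.

```python
import numpy as np
from scipy.linalg import toeplitz

def simulate_elliptical_returns(T, p, K, nu=4.2, rho=0.5, seed=0):
    """(f_t, eps_t) jointly multivariate t(nu) with covariance diag(I_K, Sigma_eps),
    Sigma_eps Toeplitz(rho); returns r_t = B f_t + eps_t with rows of B from N(0, I_K)."""
    rng = np.random.default_rng(seed)
    Sigma_eps = toeplitz(rho ** np.arange(p))
    Sigma = np.block([[np.eye(K), np.zeros((K, p))],
                      [np.zeros((p, K)), Sigma_eps]])

    # multivariate t with covariance Sigma: rescale the scale matrix by (nu - 2) / nu
    L = np.linalg.cholesky(Sigma * (nu - 2) / nu)
    Z = rng.standard_normal((T, K + p)) @ L.T
    W = rng.chisquare(nu, size=(T, 1))
    FE = Z / np.sqrt(W / nu)

    F, E = FE[:, :K], FE[:, K:]
    B = rng.standard_normal((p, K))
    R = F @ B.T + E                      # (T, p) matrix of simulated returns
    return R, F, B, E
```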

Figure 8: Averaged errors of the estimators of 𝚯{\bm{\Theta}} on logarithmic scale: p=T0.85p=T^{0.85}, K=2(logT)0.5K=2(\log T)^{0.5}, ν=4.2\nu=4.2.
Figure 9: Averaged errors of the estimators of 𝐰GMV{\mathbf{w}}_{\text{GMV}} (left) and 𝐰MRC{\mathbf{w}}_{\text{MRC}} (right) on logarithmic scale: p=T0.85p=T^{0.85}, K=2(logT)0.5K=2(\log T)^{0.5}, ν=4.2\nu=4.2.
Figure 10: Log ratios (base 2) of the averaged errors of the FGL and the Robust FGL estimators of 𝚯{\bm{\Theta}}: log2(|𝚯^𝚯|2|𝚯^R𝚯|2)\log_{2}\Big{(}\frac{{\left|\kern-0.67812pt\left|\kern-0.67812pt\left|\widehat{{\bm{\Theta}}}-{\bm{\Theta}}\right|\kern-0.67812pt\right|\kern-0.67812pt\right|}_{2}}{{\left|\kern-0.67812pt\left|\kern-0.67812pt\left|\widehat{{\bm{\Theta}}}_{\text{R}}-{\bm{\Theta}}\right|\kern-0.67812pt\right|\kern-0.67812pt\right|}_{2}}\Big{)} (left), log2(|𝚯^𝚯|1|𝚯^R𝚯|1)\log_{2}\Big{(}\frac{{\left|\kern-0.67812pt\left|\kern-0.67812pt\left|\widehat{{\bm{\Theta}}}-{\bm{\Theta}}\right|\kern-0.67812pt\right|\kern-0.67812pt\right|}_{1}}{{\left|\kern-0.67812pt\left|\kern-0.67812pt\left|\widehat{{\bm{\Theta}}}_{\text{R}}-{\bm{\Theta}}\right|\kern-0.67812pt\right|\kern-0.67812pt\right|}_{1}}\Big{)} (right): p=T0.85p=T^{0.85}, K=2(logT)0.5K=2(\log T)^{0.5}.

C.4 Relaxing Pervasiveness Assumption

As pointed out by Onatski, (2013), the data on 100 industrial portfolios shows that there are no large gaps between eigenvalues ii and i+1i+1 of the sample covariance matrix except for i=1i=1. However, as is commonly believed, such data contains at least three factors. Therefore, the factor pervasiveness assumption suggests the existence of a large gap at i=3i=3. In order to examine the sensitivity of the portfolios to the pervasiveness assumption and quantify the degree of pervasiveness, we use the same DGP as in (5.2)-(5.3), but with σε,ij=ρ|ij|\sigma_{\varepsilon,ij}=\rho^{\left\lvert i-j\right\rvert} and K=3K=3. We consider ρ{0.4,0.5,0.6,0.7,0.8,0.9}\rho\in\{0.4,0.5,0.6,0.7,0.8,0.9\}, which corresponds to λ3/λ4{3.1,2.7,2.6,2.2,1.5,1.1}\lambda_{3}/\lambda_{4}\in\{3.1,2.7,2.6,2.2,1.5,1.1\}. In other words, as ρ\rho increases, the systematic-idiosyncratic gap measured by λ^3/λ^4\hat{\lambda}_{3}/\hat{\lambda}_{4} decreases. Tables 3-4 report the mean quality of the estimators for portfolio weights and risk over 100 replications for T=300T=300 and p{300,400}p\in\{300,400\}. The sample size and the number of assets are chosen to closely match the values from the empirical application. POET and Projected POET are the most sensitive to a reduction in the gap between the leading and bounded eigenvalues, which is evident from a dramatic deterioration in the quality of these estimators. The remaining methods, including FGL, exhibit robust performance. Since the behavior of the estimators for portfolio weights is similar to that of the estimators of the precision matrix, we only report the former for ease of presentation. For (T,p)=(300,300)(T,p)=(300,300), FClime shows the best performance followed by FGL and FLW, whereas for (T,p)=(300,400)(T,p)=(300,400) FGL takes the lead. Interestingly, despite the inferior performance of POET and Projected POET in terms of estimating portfolio weights, the risk exposure of the portfolios based on these estimators is competitive with that of the other approaches.
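
The mapping from ρ to the systematic-idiosyncratic gap can be illustrated with the sketch below, which computes λ3/λ4 for a covariance matrix with K = 3 factors and a Toeplitz idiosyncratic block; since the loadings here are an arbitrary draw rather than the exact DGP (5.2)-(5.3), the ratios will differ somewhat from those reported in Tables 3-4.

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(0)
p, K = 300, 3
B = rng.standard_normal((p, K))                     # illustrative loadings, not the exact DGP (5.2)-(5.3)

for rho in [0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    Sigma_eps = toeplitz(rho ** np.arange(p))       # sigma_{eps,ij} = rho^{|i-j|}
    Sigma = B @ B.T + Sigma_eps                     # covariance with K = 3 factors
    lam = np.sort(np.linalg.eigvalsh(Sigma))[::-1]
    print("rho=%.1f  lambda_3/lambda_4 = %.2f" % (rho, lam[2] / lam[3]))
```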

Table 3: Sensitivity of portfolio weights and risk exposure when the gap between the diverging and bounded eigenvalues decreases: (T,p)=(300,300)(T,p)=(300,300).
ρ=0.4\rho=0.4 ρ=0.5\rho=0.5 ρ=0.6\rho=0.6 ρ=0.7\rho=0.7 ρ=0.8\rho=0.8 ρ=0.9\rho=0.9
(λ3/λ4=3.1\lambda_{3}/\lambda_{4}=3.1) (λ3/λ4=2.7\lambda_{3}/\lambda_{4}=2.7) (λ3/λ4=2.6\lambda_{3}/\lambda_{4}=2.6) (λ3/λ4=2.2\lambda_{3}/\lambda_{4}=2.2) (λ3/λ4=1.5\lambda_{3}/\lambda_{4}=1.5) (λ3/λ4=1.1\lambda_{3}/\lambda_{4}=1.1)
𝐰^GMV𝐰GMV1\left\lVert\widehat{{\mathbf{w}}}_{\text{GMV}}-{\mathbf{w}}_{\text{GMV}}\right\rVert_{1}
FGL 2.3198 2.3465 2.5177 2.4504 2.5010 2.7319
FClime 1.9554 1.9359 1.9795 1.9103 1.9813 1.9948
FLW 2.3445 2.3948 2.5328 2.4715 2.5918 3.0515
FNLW 2.2381 2.3009 2.3293 2.5497 2.9039 3.1980
POET 47.6746 82.1873 43.9722 54.1131 157.6963 235.8119
Projected POET 9.6335 7.8669 10.1546 10.6205 12.1795 15.2581
|Φ^GMVΦGMV|\left\lvert\widehat{\Phi}_{\text{GMV}}-\Phi_{\text{GMV}}\right\rvert
FGL 0.0033 0.0032 0.0034 0.0027 0.0021 0.0023
FClime 0.0012 0.0012 0.0012 0.0011 0.0010 0.0010
FLW 0.0049 0.0052 0.0061 0.0056 0.0049 0.0059
FNLW 0.0055 0.0060 0.0054 0.0052 0.0066 0.0057
POET 0.0070 0.0122 0.0058 0.0063 0.0103 0.0160
Projected POET 0.0021 0.0022 0.0019 0.0019 0.0018 0.0026
𝐰^MWC𝐰MWC1\left\lVert\widehat{{\mathbf{w}}}_{\text{MWC}}-{\mathbf{w}}_{\text{MWC}}\right\rVert_{1}
FGL 2.3766 2.4108 2.7411 2.6094 2.5669 3.4633
FClime 2.0502 2.0279 2.2901 2.1400 2.1028 3.0737
FLW 2.4694 2.5132 2.8902 2.7315 2.7210 4.0248
FNLW 2.7268 2.3060 2.8984 3.5902 2.9232 3.2076
POET 49.8603 34.2024 469.3605 108.1529 74.8016 99.4561
Projected POET 9.0261 7.4028 8.1899 9.4806 11.9642 13.3890
|Φ^MWCΦMWC|\left\lvert\widehat{\Phi}_{\text{MWC}}-\Phi_{\text{MWC}}\right\rvert
FGL 0.0033 0.0032 0.0034 0.0027 0.0021 0.0024
FClime 0.0012 0.0012 0.0013 0.0011 0.0010 0.0009
FLW 0.0050 0.0053 0.0062 0.0057 0.0050 0.0059
FNLW 0.0055 0.0060 0.0055 0.0053 0.0066 0.0057
POET 0.0068 0.0047 0.0363 0.0092 0.0060 0.0056
Projected POET 0.0022 0.0022 0.0020 0.0020 0.0018 0.0027
𝐰^MRC𝐰MRC1\left\lVert\widehat{{\mathbf{w}}}_{\text{MRC}}-{\mathbf{w}}_{\text{MRC}}\right\rVert_{1}
FGL 0.4872 0.1793 1.0044 0.6332 1.4568 2.3353
FClime 0.5160 0.2148 1.0188 0.6694 1.4855 2.3519
FLW 0.5333 0.2279 1.0345 0.6734 1.4904 2.3691
FNLW 0.8365 1.1285 1.1181 1.4419 1.7694 2.4612
POET NaN NaN NaN NaN NaN NaN
Projected POET 0.7414 0.6383 1.6686 1.8013 2.3297 3.2791
|Φ^MRCΦMRC|\left\lvert\widehat{\Phi}_{\text{MRC}}-\Phi_{\text{MRC}}\right\rvert
FGL 0.0004 0.0003 0.0025 0.0007 0.0021 0.0071
FClime 0.0005 0.0003 0.0024 0.0004 0.0016 0.0062
FLW 0.0002 0.0002 0.0021 0.0003 0.0018 0.0066
FNLW 0.0062 0.0062 0.0069 0.0119 0.0059 0.0143
POET NaN NaN NaN NaN NaN NaN
Projected POET 0.0003 0.0003 0.0027 0.0031 0.0069 0.0062
Table 4: Sensitivity of portfolio weights and risk exposure when the gap between the diverging and bounded eigenvalues decreases: (T,p)=(300,400)(T,p)=(300,400).
ρ=0.4\rho=0.4 ρ=0.5\rho=0.5 ρ=0.6\rho=0.6 ρ=0.7\rho=0.7 ρ=0.8\rho=0.8 ρ=0.9\rho=0.9
(λ3/λ4=3.1\lambda_{3}/\lambda_{4}=3.1) (λ3/λ4=2.7\lambda_{3}/\lambda_{4}=2.7) (λ3/λ4=2.6\lambda_{3}/\lambda_{4}=2.6) (λ3/λ4=2.2\lambda_{3}/\lambda_{4}=2.2) (λ3/λ4=1.5\lambda_{3}/\lambda_{4}=1.5) (λ3/λ4=1.1\lambda_{3}/\lambda_{4}=1.1)
𝐰^GMV𝐰GMV1\left\lVert\widehat{{\mathbf{w}}}_{\text{GMV}}-{\mathbf{w}}_{\text{GMV}}\right\rVert_{1}
FGL 1.6900 1.8134 1.8577 1.8839 1.9843 2.0692
FClime 1.9073 1.9524 1.9997 1.9490 1.9898 2.0330
FLW 2.0239 2.0945 2.1195 2.1235 2.2473 2.4745
FNLW 2.0316 2.0790 2.1927 2.2503 2.4143 2.4710
POET 18.7934 28.0493 155.8479 32.4197 41.8098 71.5811
Projected POET 7.8696 8.4915 8.8641 10.7522 11.2092 19.0424
|Φ^GMVΦGMV|\left\lvert\widehat{\Phi}_{\text{GMV}}-\Phi_{\text{GMV}}\right\rvert
FGL 8.62E-04 9.22E-04 7.23E-04 7.31E-04 6.83E-04 5.73E-04
FClime 8.40E-04 8.27E-04 8.02E-04 7.87E-04 7.36E-04 6.71E-04
FLW 1.59E-03 1.73E-03 1.57E-03 1.68E-03 1.69E-03 1.54E-03
FNLW 2.24E-03 2.10E-03 1.83E-03 1.88E-03 2.07E-03 1.29E-03
POET 1.11E-03 1.46E-03 3.59E-03 1.27E-03 1.88E-03 2.51E-03
Projected POET 8.97E-04 8.80E-04 6.83E-04 6.79E-04 7.98E-04 6.55E-04
𝐰^MWC𝐰MWC1\left\lVert\widehat{{\mathbf{w}}}_{\text{MWC}}-{\mathbf{w}}_{\text{MWC}}\right\rVert_{1}
FGL 1.9034 2.2843 1.9118 3.2569 2.7055 2.8812
FClime 2.1193 2.4024 2.0540 3.3487 2.7277 2.8593
FLW 2.2573 2.5809 2.1790 3.5728 3.0072 3.3164
FNLW 2.3207 3.3335 3.5518 3.4282 2.6446 4.8827
POET 15.8824 100.1419 56.9827 33.6483 38.8961 103.0434
Projected POET 6.5386 7.2169 7.8583 9.7342 12.1420 17.7368
|Φ^MWCΦMWC|\left\lvert\widehat{\Phi}_{\text{MWC}}-\Phi_{\text{MWC}}\right\rvert
FGL 8.72E-04 9.41E-04 7.26E-04 7.99E-04 7.12E-04 6.08E-04
FClime 8.52E-04 8.49E-04 8.06E-04 8.32E-04 7.50E-04 6.86E-04
FLW 1.59E-03 1.74E-03 1.57E-03 1.71E-03 1.70E-03 1.56E-03
FNLW 2.25E-03 2.22E-03 1.89E-03 1.91E-03 2.08E-03 1.56E-03
POET 1.14E-03 4.91E-03 1.78E-03 1.45E-03 1.57E-03 2.93E-03
Projected POET 9.19E-04 9.20E-04 7.11E-04 7.04E-04 8.26E-04 6.78E-04
𝐰^MRC𝐰MRC1\left\lVert\widehat{{\mathbf{w}}}_{\text{MRC}}-{\mathbf{w}}_{\text{MRC}}\right\rVert_{1}
FGL 0.6683 0.7390 1.3103 1.5195 1.7124 3.0935
FClime 0.6903 0.7635 1.3238 1.5403 1.7415 3.1180
FLW 0.7132 0.7828 1.3430 1.5549 1.7517 3.1364
FNLW 0.4909 1.2121 1.4974 1.1996 1.8020 3.2989
POET NaN NaN NaN NaN NaN NaN
Projected POET 1.6851 1.4434 1.9628 2.6182 2.7716 4.1753
|Φ^MRCΦMRC|\left\lvert\widehat{\Phi}_{\text{MRC}}-\Phi_{\text{MRC}}\right\rvert
FGL 1.02E-03 9.73E-04 4.63E-03 4.49E-03 3.23E-03 8.73E-03
FClime 1.14E-03 1.01E-03 4.55E-03 4.22E-03 2.70E-03 7.72E-03
FLW 6.62E-04 5.54E-04 4.19E-03 4.01E-03 2.71E-03 8.11E-03
FNLW 2.73E-04 6.93E-03 5.11E-03 1.93E-03 6.42E-03 2.98E-02
POET NaN NaN NaN NaN NaN NaN
Projected POET 3.59E-03 1.20E-03 1.49E-03 2.58E-03 7.86E-03 1.39E-02

Appendix D Additional Empirical Results

This Appendix contains the description of the procedure used to estimate unknown factors and loadings using PCA (Appendix D.1), and additional empirical results: portfolio performance for monthly data (Appendix D.2), robustness of FGL to different training periods (Appendix D.3) and to different target risk and return (Appendix D.4), and subperiod analyses for the MWC and GMV portfolios (Appendix D.5).

D.1 Estimating Unknown Factors and Loadings

Remark 1.

In practice, the number of common factors, KK, is unknown and needs to be estimated. One of the standard and commonly used approaches is to determine KK in a data-driven way (Bai and Ng, (2002); Kapetanios, (2010)); for example, Fan et al., (2013) adopt the approach of Bai and Ng, (2002). However, all of the aforementioned papers deal with a fixed number of factors, so we need to adopt a different criterion since KK is allowed to grow in our setup. For this reason, we use the methodology of Li et al., (2017): let 𝐛i,K{\mathbf{b}}_{i,K} and 𝐟t,K{\mathbf{f}}_{t,K} denote the K×1K\times 1 vectors of loadings and factors when KK needs to be estimated, and let 𝐁K{\mathbf{B}}_{K} be the p×Kp\times K matrix of stacked 𝐛i,K{\mathbf{b}}_{i,K}. Define

V(K)=\min_{{\mathbf{B}}_{K},{\mathbf{F}}_{K}}\frac{1}{pT}\sum_{i=1}^{p}\sum_{t=1}^{T}\Big{(}r_{it}-\frac{1}{\sqrt{K}}{\mathbf{b}}^{\prime}_{i,K}{\mathbf{f}}_{t,K}\Big{)}^{2}, (D.1)

where the minimum is taken over 1KKmax1\leq K\leq K_{\textup{max}}, subject to normalization 𝐁K𝐁K/p=𝐈K{\mathbf{B}}^{\prime}_{K}{\mathbf{B}}_{K}/p={\mathbf{I}}_{K}. Hence, 𝐅¯K=K𝐑𝐁K/p\bar{{\mathbf{F}}}^{\prime}_{K}=\sqrt{K}{\mathbf{R}}^{\prime}{\mathbf{B}}_{K}/p. Define 𝐅^K=𝐅¯K(𝐅¯K𝐅¯K/T)1/2\widehat{{\mathbf{F}}}^{\prime}_{K}=\bar{{\mathbf{F}}}^{\prime}_{K}(\bar{{\mathbf{F}}}_{K}\bar{{\mathbf{F}}}^{\prime}_{K}/T)^{1/2}, which is a rescaled estimator of the factors that is used to determine the number of factors when KK grows with the sample size. We then apply the following procedure described in Li et al., (2017) to estimate KK:

\widehat{K}=\arg\!\min_{1\leq K\leq K_{\textup{max}}}\ln(V(K,\hat{{\mathbf{F}}}_{K}))+Kg(p,T), (D.2)

where 1KKmax=o(min{p1/17,T1/16})1\leq K\leq K_{\textup{max}}=o(\min\{p^{1/17},T^{1/16}\}) and g(p,T)g(p,T) is a penalty function of (p,T)(p,T) such that (i) Kmaxg(p,T)0K_{\textup{max}}\cdot g(p,T)\rightarrow 0 and (ii) Cp,T,Kmax1g(p,T)C_{p,T,K_{\textup{max}}}^{-1}\cdot g(p,T)\rightarrow\infty with Cp,T,Kmax=𝒪P(max[Kmax3p,Kmax5/2T])C_{p,T,K_{\textup{max}}}=\mathcal{O}_{P}\Big{(}\max\Big{[}\frac{K^{3}_{\textup{max}}}{\sqrt{p}},\frac{K^{5/2}_{\textup{max}}}{\sqrt{T}}\Big{]}\Big{)}. The choice of the penalty function is similar to Bai and Ng, (2002). Throughout the paper we let K^\widehat{K} be the solution to (D.2).
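A minimal sketch of this selection rule is given below (in Python), assuming the returns are stored in a T×p array and that V(K) is evaluated at the standard PCA solution; the penalty g(p,T) shown at the end is only an illustrative assumption in the spirit of Bai and Ng, (2002) (conditions (i)-(ii) above restrict the admissible choices), and all names are ours.

import numpy as np

def estimate_num_factors(R, K_max, g):
    # Information criterion in (D.2): pick K minimizing ln V(K) + K * g(p, T),
    # where V(K) is the in-sample mean squared residual of the K-factor PCA fit.
    T, p = R.shape
    eigval, eigvec = np.linalg.eigh(R.T @ R)          # eigenvalues in ascending order
    best_K, best_ic = 1, np.inf
    for K in range(1, K_max + 1):
        B_K = np.sqrt(p) * eigvec[:, -K:]             # loadings with B_K' B_K / p = I_K
        F_bar = np.sqrt(K) * R @ B_K / p              # factors, as in the text
        fitted = F_bar @ B_K.T / np.sqrt(K)           # r_it ~ (1/sqrt(K)) b_i' f_t
        V_K = np.mean((R - fitted) ** 2)
        ic = np.log(V_K) + K * g(p, T)
        if ic < best_ic:
            best_K, best_ic = K, ic
    return best_K

# One possible penalty (an illustrative assumption, in the spirit of Bai and Ng, 2002):
g = lambda p, T: (p + T) / (p * T) * np.log(p * T / (p + T))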

D.2 Monthly Data

Similarly to the daily data, we use monthly returns of the components of the S&P500. The data are fetched from CRSP and Compustat using the SAS interface. The full sample for the monthly data has 480 observations on 355 stocks from January 1, 1980 to December 1, 2019. We use January 1, 1980 - December 1, 1994 (180 obs) as the training (estimation) period and January 1, 1995 - December 1, 2019 (300 obs) as the out-of-sample test period. At the end of each month, prior to portfolio construction, we remove stocks with fewer than 15 years of historical return data. We set the return target μ=0.7974%\mu=0.7974\%, which is equivalent to a 10%10\% yearly return when compounded. The target level of risk for the weight-constrained and risk-constrained Markowitz portfolios (MWC and MRC) is set at σ=0.05\sigma=0.05, which is the standard deviation of the monthly excess returns of the S&P500 index in the first training set. Transaction costs are taken to be the same as for the daily returns in Section 6.

Table 5 reports the results for monthly data. Some comments are in order: (1) interestingly, MRC produces a portfolio return and Sharpe Ratio that are mostly higher than those for the weight-constrained allocations MWC and GMV. This means that relaxing the constraint that portfolio weights sum up to one leads to a large increase in the out-of-sample Sharpe Ratio and portfolio return, a finding that has not been previously well-studied in the empirical finance literature. (2) Similarly to the results from Table 1, FGL outperforms the competitors, including EW and Index, in terms of the out-of-sample Sharpe Ratio and turnover. (3) Similarly to the results in Table 1, the observable Fama-French factors produce FGL portfolios with higher return and higher out-of-sample Sharpe Ratio compared to the FGL portfolios based on statistical factors. Again, this increase in return is not accompanied by higher risk. (4) To further verify that the shrinkage is functioning as desired and that the estimated 𝚯ε{\bm{\Theta}}_{\varepsilon} is indeed sparse, we include several visualizations. Figure 11 reports the optimally tuned values of λ\lambda (please refer to Section 3 of the main manuscript for a discussion of choosing the optimal shrinkage intensity) over the estimation period. Figure 12 plots the proportion of zero elements in the precision matrix of the idiosyncratic part, 𝚯^ε\widehat{{\bm{\Theta}}}_{\varepsilon}, corresponding to the optimal values of λ=λ^\lambda=\hat{\lambda} and several fixed values of λ\lambda for monthly data over the testing period. Extracting the common factors significantly reduces the partial correlations of the error terms, rendering 𝚯^ε\widehat{{\bm{\Theta}}}_{\varepsilon} sparse over the testing period: the proportion of zeros for the optimally tuned λ\lambda ranges from 74.5% to 98.8%. Figure 13 plots the Sharpe Ratio of GMV portfolios for a set of fixed values of λ{0.005,0.01,0.05,0.08,0.1,0.12,0.15,0.17,0.2,0.25,0.3,0.4,0.5}\lambda\in\{0.005,0.01,0.05,0.08,0.1,0.12,0.15,0.17,0.2,0.25,0.3,0.4,0.5\}. In other words, instead of using the optimally tuned λ\lambda, we fix its value throughout the whole testing period and report the corresponding SR of such portfolios. For comparison, the SR that corresponds to the optimally tuned λ\lambda is equal to 0.2023, which is significantly higher than the SR achieved for any fixed λ\lambda, confirming the importance of selecting the shrinkage intensity optimally.

We would like to emphasize that the selection of the tuning parameter is critically important in the literature on graphical models, which is why we build our tuning methodology on the Bayesian Information Criterion (BIC), as used and described in Koike, (2020); Bishop, (2006); Pourahmadi, (2013); Janková and van de Geer, (2018), among others (the detailed treatment relevant to our paper can be found on p.13 of the main manuscript). The SR advantage obtained with the optimally tuned λ\lambda highlights the importance of tuning and demonstrates that λ\lambda changes over time; hence, using a fixed value is expected to produce suboptimal performance.

We now elaborate on the discrepancy between the SR obtained with the optimal versus a fixed λ\lambda. Note that Figure 11 should not be compared with Figure 13: in contrast to Figure 13, the range of λ\lambda in Figure 11 is selected optimally by minimizing the BIC. In other words, the SR is not the objective function that we use for selecting the tuning parameter. To demonstrate the relevant range of λ\lambda selected by the BIC, we include Figure 14, which shows the optimally selected λ\lambda for six different rolling windows.
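To make the tuning step concrete, the sketch below selects λ by minimizing a BIC over a grid and reports the implied sparsity of the estimated idiosyncratic precision matrix (as plotted in Figure 12). It uses scikit-learn's graphical_lasso as a stand-in for the estimator; the exact BIC variant and the degrees-of-freedom count are illustrative assumptions, and the idiosyncratic residuals are assumed to be available after removing the estimated factors.

import numpy as np
from sklearn.covariance import graphical_lasso

def bic_tuned_glasso(residuals, grid):
    # residuals: T x p matrix of idiosyncratic errors; grid: candidate shrinkage intensities.
    T, p = residuals.shape
    S = np.cov(residuals, rowvar=False)
    best_lam, best_bic, best_Theta = None, np.inf, None
    for lam in grid:
        _, Theta = graphical_lasso(S, alpha=lam)
        loglik = T / 2.0 * (np.linalg.slogdet(Theta)[1] - np.trace(S @ Theta))
        df = np.count_nonzero(np.triu(Theta, k=1))     # nonzero partial correlations
        bic = -2.0 * loglik + np.log(T) * df
        if bic < best_bic:
            best_lam, best_bic, best_Theta = lam, bic, Theta
    # Proportion of zeros in the lower triangle of the selected Theta (diagonal excluded)
    sparsity = 1.0 - np.count_nonzero(np.tril(best_Theta, k=-1)) / (p * (p - 1) / 2)
    return best_lam, best_Theta, sparsity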

Figure 11: Optimally tuned values of λ\lambda over the testing period.
Figure 12: Proportion of zero elements in 𝚯^ε\widehat{{\bm{\Theta}}}_{\varepsilon} with respect to the total number of elements in a lower-triangular part of 𝚯^ε\widehat{{\bm{\Theta}}}_{\varepsilon} (diagonals are excluded) corresponding to the optimal values of λ=λ^\lambda=\hat{\lambda}, and several fixed values of λ\lambda.
Figure 13: Sharpe Ratios for GMV portfolios associated with fixed λ\lambda.
Figure 14: BICs for several rolling windows indexed by t=1,,300t=1,\ldots,300.
Table 5: Monthly portfolio returns, risk, SR and turnover. In the upper part, corresponding to the results without transaction costs, p-values are in parentheses. In the lower part, corresponding to the results with transaction costs, ∗∗∗ indicates p-value < 0.01, ∗∗ indicates p-value < 0.05, and ∗ indicates p-value < 0.10. In-sample: January 1, 1980 - December 31, 1994 (180 obs), Out-of-sample: January 1, 1995 - December 31, 2019 (300 obs).
Markowitz Risk-Constrained Markowitz Weight-Constrained Global Minimum-Variance
Return Risk SR Turnover Return Risk SR Turnover Return Risk SR Turnover
Without TC
EW 0.0081 0.0519 0.1553 - 0.0081 0.0519 0.1553 - 0.0081 0.0519 0.1553 -
Index 0.0063 0.0453 0.1389 - 0.0063 0.0453 0.1389 - 0.0063 0.0453 0.1389 -
FGL 0.0256 0.0828 0.3099 (0.0799) - 0.0059 0.0329 0.1804 (0.0430) - 0.0065 0.0321 0.2023 (0.046) -
FClime 0.0372 0.2337 0.1593 (0.2715) - 0.0067 0.0471 0.1434 (0.0791) - 0.0076 0.0466 0.1643 (0.047) -
FLW 0.0296 0.1049 0.2817 (0.0879) - 0.0059 0.0353 0.1662 (0.0791) - 0.0063 0.0353 0.1774 (0.047) -
FNLW 0.0264 0.0925 0.2853 (0.0879) - 0.0060 0.0333 0.1793 (0.0430) - 0.0064 0.0332 0.1930 (0.046) -
POET NaN NaN NaN - -0.1041 2.0105 -0.0518 (0.9925) - 0.5984 11.0064 0.0544 (0.6344) -
Projected POET 0.0583 0.3300 0.1766 (0.2715) - 0.0058 0.0546 0.1056 (0.0791) - 0.0069 0.0612 0.1128 (0.2693) -
FGL (FF1) 0.0275 0.0800 0.3433 (0.0659) - 0.0061 0.0316 0.1941 (0.0415) - 0.0073 0.0302 0.2427 (0.035) -
FGL (FF3) 0.0274 0.0797 0.3437 (0.0659) - 0.0061 0.0314 0.1955 (0.0415) - 0.0073 0.0300 0.2440 (0.035) -
FGL (FF5) 0.0273 0.0793 0.3443 (0.0659) - 0.0061 0.0314 0.1943 (0.0415) - 0.0073 0.0300 0.2426 (0.035) -
FF1 0.0403 0.2250 0.1789 (0.2715) - 0.0025 0.0548 0.0452 (0.9318) - 0.0043 0.0546 0.0781 (0.6344) -
FF3 0.0389 0.2022 0.1926 (0.2715) - 0.0032 0.0528 0.0610 (0.9318) - 0.0047 0.0517 0.0915 (0.6344) -
FF5 0.0354 0.1803 0.1962 (0.2715) - 0.0036 0.0531 0.0670 (0.9318) - 0.0048 0.0513 0.0945 (0.6344) -
With TC
EW 0.0080 0.0520 0.1538 0.0630 0.0080 0.0520 0.1538 0.0630 0.0080 0.0520 0.1538 0.0630
FGL 0.0222 0.0828 0.2682* 3.1202 0.0050 0.0329 0.1525* 0.8786 0.0056 0.0321 0.1740** 0.8570
FClime 0.0334 0.2334 0.1429 4.9174 0.0062 0.0471 0.1307 0.5945 0.0071 0.0466 0.1522* 0.5528
FLW 0.0237 0.1052 0.2257 5.5889 0.0043 0.0353 0.1231 1.5166 0.0048 0.0354 0.1343 1.5123
FNLW 0.0224 0.0927 0.2415* 3.7499 0.0049 0.0334 0.1463* 1.0812 0.0053 0.0333 0.1596* 1.0793
POET NaN NaN NaN NaN -0.1876 1.7274 -0.1086 152.3298 1.0287 14.2676 0.0721 354.6043
Projected POET 0.0166 0.2859 0.0579 69.7600 -0.0002 0.0540 -0.0044 5.9131 -0.0002 0.0613 -0.0027 7.0030
FGL (FF1) 0.0243 0.0800 0.3036* 2.8514 0.0054 0.0317 0.1692* 0.7513 0.0066 0.0302 0.2176** 0.7095
FGL (FF3) 0.0242 0.0797 0.3037* 2.8708 0.0054 0.0314 0.1703* 0.7545 0.0066 0.0300 0.2186** 0.7127
FGL (FF5) 0.0241 0.0793 0.3037* 2.8857 0.0053 0.0315 0.1686* 0.7630 0.0065 0.0300 0.2167** 0.7224
FF1 0.0169 0.2331 0.0767 23.3910 -0.0023 0.0545 -0.0415 4.6257 -0.0004 0.0543 -0.0079 4.5751
FF3 0.0185 0.2268 0.0924 20.6137 -0.0013 0.0524 -0.0243 4.3667 0.0003 0.0514 0.0059 4.2956
FF5 0.0164 0.2254 0.0918 18.5514 -0.0008 0.0527 -0.0145 4.2134 0.0005 0.0508 0.0108 4.1681

D.3 Portfolio Performance for Longer Training Periods

This section examines the performance of the methods when the training period is increased. Tables 6 and 7 report the results: the conclusions that we highlighted when analyzing Tables 5 and 1 continue to hold. We observed an interesting finding: for MRC portfolios (both monthly and daily), a longer training period changed the values of portfolio return and risk for all methods; however, their relative performance, as measured by the SR, remained unchanged. This is because MRC portfolios maximize the SR subject to either a target risk or a target return constraint:

\max_{{\mathbf{w}}}\frac{{\mathbf{m}}^{\prime}{\mathbf{w}}}{\sqrt{{\mathbf{w}}^{\prime}{\bm{\Sigma}}{\mathbf{w}}}}\quad\text{s.t.}\quad\text{(i)}\ {\mathbf{m}}^{\prime}{\mathbf{w}}\geq\mu\quad\text{or}\quad\text{(ii)}\ {\mathbf{w}}^{\prime}{\bm{\Sigma}}{\mathbf{w}}\leq\sigma^{2},

when μ=σ𝐦𝚯𝐦\mu=\sigma\sqrt{{\mathbf{m}}^{\prime}{\bm{\Theta}}{\mathbf{m}}}, the solution under either constraint is given by 𝐰MRC=σ𝐦𝚯𝐦𝚯𝐦{\mathbf{w}}_{\text{MRC}}=\frac{\sigma}{\sqrt{{\mathbf{m}}^{\prime}{\bm{\Theta}}{\mathbf{m}}}}{\bm{\Theta}}{\mathbf{m}}. Hence, even though the training period was increased, the maximum achievable SR remained the same, since neither the target risk nor the target return was changed.
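The invariance is easy to verify numerically; a minimal sketch (with an arbitrary positive definite Σ and names of our choosing) is given below: changing the target risk scales the portfolio return and risk proportionally, so the SR stays at the value implied by m and Θ.

import numpy as np

def mrc_weights(m, Theta, sigma):
    # MRC solution: w = sigma / sqrt(m' Theta m) * Theta m
    Theta_m = Theta @ m
    return sigma / np.sqrt(m @ Theta_m) * Theta_m

rng = np.random.default_rng(0)
p = 5
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)       # arbitrary positive definite covariance
Theta = np.linalg.inv(Sigma)
m = rng.standard_normal(p)            # expected excess returns
for sigma in (0.02, 0.05):
    w = mrc_weights(m, Theta, sigma)
    sr = (m @ w) / np.sqrt(w @ Sigma @ w)
    print(sigma, round(sr, 6))        # identical SR for both target risk levels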

Table 6: Monthly portfolio returns, risk, SR and turnover. In the upper part, corresponding to the results without transaction costs, p-values are in parentheses. In the lower part, corresponding to the results with transaction costs, ∗∗∗ indicates p-value < 0.01, ∗∗ indicates p-value < 0.05, and ∗ indicates p-value < 0.10. In-sample: January 1, 1980 - December 31, 1999 (240 obs), Out-of-sample: January 1, 2000 - December 31, 2019 (240 obs).
Markowitz Risk-Constrained Markowitz Weight-Constrained Global Minimum-Variance
Return Risk SR Turnover Return Risk SR Turnover Return Risk SR Turnover
Without TC
EW 0.0061 0.0029 0.1126 - 0.0061 0.0029 0.1126 - 0.0061 0.0029 0.1126 -
Index 0.0032 0.0022 0.0692 - 0.0032 0.0022 0.0692 - 0.0032 0.0022 0.0692 -
FGL 0.0223 0.0805 0.2770 (0.0609) - 0.0054 0.0343 0.1581 (0.0939) - 0.0062 0.0333 0.1848 (0.1189) -
FClime 0.0339 0.2642 0.1285 (0.0989) - 0.0068 0.0510 0.1341 (0.1119) - 0.0072 0.0483 0.1482 (0.1419) -
FLW 0.0268 0.1030 0.2606 (0.0609) - 0.0047 0.0400 0.1173 (0.1119) - 0.0056 0.0396 0.1423 (0.1419) -
FNLW 0.0234 0.0887 0.2633 (0.0609) - 0.0050 0.0360 0.1397 (0.1119) - 0.0060 0.0354 0.1701 (0.1189) -
POET NaN NaN NaN - 0.2800 3.7119 0.0754 (0.2617) - -0.0239 0.6870 -0.0348 (0.7729) -
Projected POET -0.0371 0.8637 -0.0430 (0.9137) - 0.0032 0.0447 0.0708 (0.2617) - 0.0014 0.0823 0.0174 (0.7729) -
FGL(FF1) 0.0224 0.0775 0.2891 (0.0579) - 0.0056 0.0325 0.1721 (0.0839) - 0.0064 0.0307 0.2070 (0.0889) -
FGL(FF3) 0.0223 0.0772 0.2885 - 0.0056 0.0324 0.1728 (0.0839) - 0.0064 0.0307 0.2075 (0.0889) -
FGL(FF5) 0.0222 0.0769 0.2887 (0.0579) - 0.0056 0.0325 0.1719 (0.0839) - 0.0063 0.0307 0.2058 (0.0889) -
FF1 -0.0074 1.0260 -0.0072 (0.9512) - 0.0090 0.1158 0.0777 (0.2617) - 0.0099 0.1162 0.0853 (0.1738) -
FF3 0.0234 0.9301 0.0252 (0.9512) - 0.0073 0.1011 0.0726 (0.2617) - 0.0083 0.1040 0.0796 (0.1738) -
FF5 0.0111 0.7406 0.0149 (0.9512) - 0.0064 0.0906 0.0708 (0.2617) - 0.0056 0.0942 0.0590 (0.1738) -
With TC
EW 0.0060 0.0029 0.1109 0.0627 0.0060 0.0029 0.1109 0.0627 0.0060 0.0029 0.1109 0.0627
FGL 0.0190 0.1826 0.2351* 3.0398 0.0044 0.0344 0.1280 0.9940 0.0052 0.0334 0.1548 0.9519
FClime 0.0305 0.7518 0.1158 5.1354 0.0063 0.0510 0.1231 0.5606 0.0067 0.0484 0.1378 0.4962
FLW 0.0203 0.1992 0.1961* 6.3365 0.0028 0.0401 0.0703 1.8813 0.0038 0.0397 0.0952 1.8528
FNLW 0.0195 0.2199 0.2193* 3.6238 0.0038 0.0361 0.1066 1.1847 0.0049 0.0355 0.1367 1.1596
POET NaN NaN NaN NaN 0.1707 3.4886 0.0489 151.5813 -0.0846 0.5652 -0.1496 76.9912
Projected POET -0.0556 0.1885 -0.0757* 33.2465 -0.0014 0.0451 -0.0313 4.3028 -0.0031 0.0734 -0.0429 15.5429
FGL(FF1) 0.0194 0.2708 0.2497* 2.7486 0.0047 0.0326 0.1454 0.8299 0.0055 0.0308 0.1803 0.7703
FGL(FF3) 0.0192 0.1754 0.2485* 2.7682 0.0047 0.0325 0.1460 0.8343 0.0055 0.0307 0.1807 0.7754
FGL(FF5) 0.0192 0.1753 0.2488* 2.7807 0.0047 0.0325 0.1451 0.8377 0.0055 0.0307 0.1789 0.7815
FF1 -0.1143 0.1753 -0.1499 338.4639 -0.0039 0.1160 -0.0337 13.1081 -0.0041 0.1167 -0.0353 14.2580
FF3 -0.0763 0.3416 -0.1047 285.5053 -0.0038 0.1010 -0.0380 11.3194 -0.0042 0.1043 -0.0407 12.6905
FF5 -0.0547 0.3229 -0.0922 604.2117 -0.0038 0.0904 -0.0423 10.3659 -0.0052 0.0944 -0.0548 10.9466
Table 7: Daily portfolio returns, risk, SR and turnover. In the upper part, corresponding to the results without transaction costs, p-values are in parentheses. In the lower part, corresponding to the results with transaction costs, ∗∗∗ indicates p-value < 0.01, ∗∗ indicates p-value < 0.05, and ∗ indicates p-value < 0.10. In-sample: January 20, 2000 - January 25, 2005 (1260 obs), Out-of-sample: January 26, 2005 - January 31, 2020 (3780 obs).
Markowitz Risk-Constrained Markowitz Weight-Constrained Global Minimum-Variance
Return Risk SR Turnover Return Risk SR Turnover Return Risk SR Turnover
Without TC
EW 2.19E-04 1.98E-02 0.0111 - 2.19E-04 1.98E-02 0.0111 - 2.19E-04 1.98E-02 0.0111 -
Index 2.15E-04 1.16E-02 0.0185 - 2.15E-04 1.16E-02 0.0185 - 2.15E-04 1.16E-02 0.0185 -
FGL 8.86E-04 2.90E-02 0.0305 (0.0450) - 3.51E-04 7.07E-03 0.0496 (0.0020) - 3.51E-04 6.98E-03 0.0503 (0.0025) -
FClime 1.30E-03 8.36E-02 0.0156 (0.2513) - 2.41E-04 1.04E-02 0.0231 (0.0315) - 2.75E-04 1.10E-02 0.0250 (0.0415) -
FLW 4.24E-04 2.88E-02 0.0147 (0.2513) - 3.12E-04 7.06E-03 0.0443 (0.0025) - 3.15E-04 7.41E-03 0.0425 (0.0033) -
FNLW 3.20E-04 5.33E-02 0.0060 (0.6397) - 3.23E-04 7.01E-03 0.0461 (0.0020) - 3.49E-04 8.44E-03 0.0414 (0.0033) -
POET NaN NaN NaN - 5.39E-03 3.82E-01 0.0141 (0.1384) - -8.23E-05 9.49E-02 -0.0009 (0.9218) -
Projected POET 7.86E-04 7.74E-02 0.0101 (0.2513) - -1.70E-04 1.09E-02 -0.0156 (0.9713) - -1.78E-04 1.15E-02 -0.0155 (0.9218) -
FGL(FF1) 6.03E-04 3.56E-02 0.0169 (0.2513) - 3.58E-04 6.98E-03 0.0513 (0.0010) - 3.68E-04 7.02E-03 0.0523 (0.0025) -
FGL(FF3) 6.02E-04 3.56E-02 0.0169 (0.2513) - 3.58E-04 6.98E-03 0.0514 (0.0010) - 3.68E-04 7.02E-03 0.0524 (0.0025) -
FGL(FF5) 6.01E-04 3.56E-02 0.0169 (0.2513) - 3.57E-04 6.98E-03 0.0512 (0.0010) - 3.67E-04 7.02E-03 0.0522 (0.0025) -
FF1 6.13E-04 5.22E-02 0.0117 (0.2513) - 2.93E-04 7.23E-03 0.0405 (0.0032) - 2.99E-04 8.06E-03 0.0371 (0.0285) -
FF3 6.13E-04 5.22E-02 0.0117 (0.2513) - 2.93E-04 7.23E-03 0.0405 (0.0032) - 2.99E-04 8.06E-03 0.0371 (0.0285) -
FF5 6.13E-04 5.22E-02 0.0117 (0.2513) - 2.93E-04 7.23E-03 0.0405 (0.0032) - 2.99E-04 8.06E-03 0.0371 (0.0285) -
With TC
EW 1.87E-04 1.98E-02 0.0094 0.0294 1.87E-04 1.98E-02 0.0094 0.0294 1.87E-04 1.98E-02 0.0094 0.0294
FGL 5.07E-04 8.37E-02 0.0175 0.3845 2.64E-04 7.09E-03 0.0372*** 0.0882 2.66E-04 7.00E-03 0.038** 0.0863
FClime 4.36E-04 3.03E-01 0.0052 0.9279 2.10E-04 1.04E-02 0.0201* 0.0333 2.51E-04 1.10E-02 0.0228** 0.0266
FLW -1.26E-05 8.62E-02 -0.0004 0.4399 1.86E-04 7.09E-03 0.0263** 0.1267 1.91E-04 7.44E-03 0.0256** 0.1251
FNLW -4.74E-04 1.05E-01 -0.009 0.806 1.38E-04 7.07E-03 0.0195 0.1856 1.65E-04 8.48E-03 0.0194 0.1846
POET NaN NaN NaN NaN -6.70E-03 1.95E-01 -0.0344 25.1772 -7.54E-03 9.20E-02 -0.082 11.6087
Projected POET -3.51E-03 9.21E-02 -0.0485 10.9077 -6.81E-04 1.12E-02 -0.0610 0.5127 -7.37E-04 1.19E-02 -0.0621 0.5609
FGL(FF1) 2.02E-04 1.09E-01 0.0057 0.4028 2.68E-04 7.00E-03 0.0383*** 0.0899 2.80E-04 7.04E-03 0.0397** 0.0877
FGL(FF3) 2.02E-04 8.39E-02 0.0057 0.4028 2.68E-04 7.00E-03 0.0383*** 0.0901 2.80E-04 7.04E-03 0.0397** 0.0879
FGL(FF5) 1.99E-04 8.39E-02 0.0056 0.4032 2.67E-04 6.99E-03 0.0382*** 0.0901 2.79E-04 7.04E-03 0.0396** 0.088
FF1 -7.16E-04 8.39E-02 -0.0139 1.3523 1.61E-05 7.34E-03 0.0022 0.2748 2.35E-05 8.15E-03 0.0029 0.2736
FF3 -7.16E-04 9.03E-02 -0.0139 1.3523 1.61E-05 7.34E-03 0.0022 0.2748 2.35E-05 8.15E-03 0.0029 0.2736
FF5 -7.16E-04 9.03E-02 -0.0139 1.3523 1.61E-05 7.34E-03 0.0022 0.2748 2.35E-05 8.15E-03 0.0029 0.2736

D.4 Less Risk-Averse Investors

Tables 8 and 9 provide the empirical results for higher target levels of risk and return for both monthly and daily data: the target risk for monthly and daily data is set at σ=0.08\sigma=0.08 and σ=0.02\sigma=0.02, respectively, and the target return for monthly and daily data is set at 1.1715%1.1715\% and 0.0555%0.0555\%, respectively; both are equivalent to a 15%15\% yearly return when compounded. Since the GMV portfolio weights are not affected by the target risk and return, only updated results for MRC and MWC are reported. Furthermore, since the EW and Index portfolios are also not affected by the target risk and return, their values are the same as in Table 5 and, hence, are not reported to avoid repetition. The conclusions that we highlighted when analyzing Tables 5 and 1 continue to hold for these updated results.
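As a quick sanity check of the compounding convention behind these targets (assuming 12 months and 252 trading days per year):

annual = 0.15
monthly_target = (1 + annual) ** (1 / 12) - 1    # ~0.011715, i.e., 1.1715% per month
daily_target = (1 + annual) ** (1 / 252) - 1     # ~0.000555, i.e., 0.0555% per day
print(round(100 * monthly_target, 4), round(100 * daily_target, 4))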

Table 8: Monthly portfolio returns, risk, SR and turnover. Targeted risk is set at σ=0.08\sigma=0.08; the monthly targeted return is 1.1715%1.1715\%, which is equivalent to a 15%15\% yearly return when compounded. In the upper part, corresponding to the results without transaction costs, p-values are in parentheses. In the lower part, corresponding to the results with transaction costs, ∗∗∗ indicates p-value < 0.01, ∗∗ indicates p-value < 0.05, and ∗ indicates p-value < 0.10. In-sample: January 1, 1980 - December 31, 1994 (180 obs), Out-of-sample: January 1, 1995 - December 31, 2019 (300 obs).
Markowitz Risk-Constrained Markowitz Weight-Constrained
Return Risk SR Turnover Return Risk SR Turnover
Without TC
FGL 0.041 0.1324 0.3099 (0.0769) - 0.0069 0.0317 0.2187 (0.028) -
FClime 0.0596 0.3739 0.1593 (0.1272) - 0.0076 0.0441 0.1717 (0.034) -
FLW 0.0473 0.1679 0.2817 (0.0849) - 0.007 0.0344 0.2047 (0.028) -
FNLW 0.0422 0.148 0.2853 (0.0849) - 0.0071 0.0324 0.2190 (0.028) -
POET NaN NaN NaN - -0.1144 1.9928 -0.0574 (0.9471) -
Projected POET 0.0933 0.5281 0.1766 (0.1272) - 0.0075 0.051 0.1471 (0.0837) -
FGL(FF1) 0.0439 0.128 0.3433 (0.0649) - 0.0072 0.0303 0.2369 (0.0220) -
FGL(FF3) 0.0438 0.1275 0.3437 (0.0649) - 0.0072 0.0301 0.2385 (0.0220) -
FGL(FF5) 0.0437 0.1269 0.3443 (0.0649) - 0.0072 0.0301 0.2377 (0.0220) -
FF1 0.0644 0.36 0.1789 (0.1272) - 0.0038 0.0538 0.0706 (0.4833) -
FF3 0.0623 0.3235 0.1926 (0.1272) - 0.0045 0.0513 0.0869 (0.4833) -
FF5 0.0566 0.2885 0.1962 (0.1272) - 0.0047 0.0513 0.0908 (0.4833) -
With TC
FGL 0.0353 0.1792 0.2666* 5.2184 0.006 0.0317 0.1897** 0.8622
FClime 0.0528 3.7772 0.1422 10.133 0.007 0.0442 0.1577* 0.5971
FLW 0.0375 0.1881 0.223 9.5001 0.0055 0.0345 0.1606* 1.5019
FNLW 0.0355 0.2159 0.2393* 6.3769 0.006 0.0325 0.185** 1.0653
POET NaN NaN NaN NaN -0.1933 1.7451 -0.1108 124.9832
Projected POET 0.0313 0.1825 0.073 85.8766 0.0014 0.0505 0.0277 5.9556
FGL(FF1) 0.0386 0.2476 0.3018* 4.8017 0.0064 0.0303 0.2113** 0.7219
FGL(FF3) 0.0385 0.1738 0.3018* 4.8312 0.0064 0.0301 0.2127** 0.7245
FGL(FF5) 0.0383 0.1733 0.3018* 4.8537 0.0064 0.0302 0.2112** 0.7335
FF1 0.0244 0.1733 0.0707 64.7017 -0.0009 0.0535 -0.0162 4.5438
FF3 0.028 0.2331 0.0896 168.9642 4.04E-05 0.051 0.0008 4.2854
FF5 0.0237 0.2268 0.0836 34.1596 0.0004 0.0509 0.0077 4.1438
Table 9: Daily portfolio returns, risk, SR and turnover. Targeted risk is set at σ=0.02\sigma=0.02; the daily targeted return is 0.0555%0.0555\%, which is equivalent to a 15%15\% yearly return when compounded. In the upper part, corresponding to the results without transaction costs, p-values are in parentheses. In the lower part, corresponding to the results with transaction costs, ∗∗∗ indicates p-value < 0.01, ∗∗ indicates p-value < 0.05, and ∗ indicates p-value < 0.10. In-sample: January 20, 2000 - January 24, 2002 (504 obs), Out-of-sample: January 17, 2002 - January 31, 2020 (4536 obs).
Markowitz Risk-Constrained Markowitz Weight-Constrained
Return Risk SR Turnover Return Risk SR Turnover
Without TC
FGL 1.25E-03 4.09E-02 0.0305 (0.0709) - 3.10E-04 7.86E-03 0.0394 (0.0260) -
FClime 3.30E-03 1.30E-01 0.0254 (0.0814) - 2.20E-04 9.61E-03 0.0229 (0.036) -
FLW 6.68E-04 4.08E-02 0.0164 (0.1539) - 3.21E-04 9.36E-03 0.0343 (0.0280) -
FNLW 7.56E-04 1.02E-01 0.0074 (0.7312) - 3.02E-04 1.16E-02 0.0261 (0.0360) -
POET NaN NaN NaN - -5.17E-04 2.89E-01 -0.0018 (0.7419) -
Projected POET 1.84E-03 2.63E-01 0.0070 (0.7312) - -6.76E-05 1.58E-02 -0.0043 (0.7419) -
FGL(FF1) 1.24E-03 4.10E-02 0.0303 (0.0709) - 3.10E-04 7.56E-03 0.0410 (0.0260) -
FGL(FF3) 1.25E-03 4.09E-02 0.0306 (0.0709) - 3.15E-04 7.54E-03 0.0417 (0.0260) -
FGL(FF5) 1.24E-03 4.11E-02 0.0301 (0.0709) - 3.15E-04 7.52E-03 0.0419 (0.0260) -
FF1 1.14E-03 1.71E-01 0.0067 (0.7312) - 3.78E-05 1.64E-02 0.0023 (0.5813) -
FF3 1.16E-03 1.70E-01 0.0068 (0.7312) - 3.14E-05 1.64E-02 0.0019 (0.5813) -
FF5 1.17E-03 1.70E-01 0.0069 (0.7312) - 2.47E-05 1.64E-02 0.0015 (0.5813) -
With TC
FGL 6.14E-04 8.67E-02 0.0150 0.6385 2.43E-04 7.86E-03 0.0310** 0.0673
FClime 1.31E-03 6.49E-01 0.0101 2.4056 1.84E-04 9.61E-03 0.0191 0.0382
FLW -1.58E-04 9.69E-02 -0.0039 0.8283 2.01E-04 9.38E-03 0.0214** 0.1218
FNLW -4.50E-03 1.03E-01 -0.0422 10.5211 5.71E-05 1.17E-02 0.0049 0.2461
POET NaN NaN NaN NaN -2.50E-02 6.21E-01 -0.0403 113.1667
Projected POET -2.93E-02 1.15E-01 -0.0315 84.1090 -1.02E-03 1.65E-02 -0.0615 0.9502
FGL(FF1) 5.81E-04 1.43E-01 0.0141 0.6642 2.43E-04 7.57E-03 0.0321** 0.0681
FGL(FF3) 5.89E-04 8.51E-02 0.0144 0.6642 2.47E-04 7.55E-03 0.0327** 0.0685
FGL(FF5) 5.76E-04 8.50E-02 0.0140 0.6646 2.47E-04 7.53E-03 0.0328** 0.0687
FF1 -1.33E-02 8.49E-02 -0.0858 15.6900 -5.30E-04 1.66E-02 -0.0319 0.5790
FF3 -1.32E-02 1.28E-01 -0.0854 15.6211 -5.36E-04 1.66E-02 -0.0323 0.5785
FF5 -1.32E-02 1.28E-01 -0.0852 15.5866 -5.43E-04 1.66E-02 -0.0327 0.5786

D.5 Subperiod Analyses: MWC and GMV

Tables 10 and 11 report subperiod analyses for the MWC and GMV portfolio formulations. The values of the EW and Index portfolios are the same as in Table 2 and, hence, are not reported to avoid repetition. In terms of the relative comparison between the competing models, the conclusions are similar to those drawn when examining Table 2 in the main text. However, in terms of relative magnitude, all models that use MWC or GMV portfolios exhibit deteriorated performance in terms of CER and SR during economic downturns (Downturn #1 and Downturn #2): the MRC formulation from Table 2 is the only type of portfolio that produces a positive CER during both recessions.

Table 10: Cumulative excess return (CER) and risk of MWC portfolios using daily data. Targeted risk is set at σ=0.013\sigma=0.013, daily targeted return is 0.0378%0.0378\%. P-values are in parentheses. In-sample: January 20, 2000 - January 24, 2002 (504 obs), Out-of-sample: January 17, 2002 - January 31, 2020 (4536 obs).
FGL FClime FLW FNLW POET ProjPOET FGL(FF1) FGL(FF3) FGL(FF5) FF1 FF3 FF5
Downturn #1: Argentine Great Depression (2002)
CER -0.0138 -0.1045 -0.0158 -0.0195 -0.2820 -0.0217 -0.0153 -0.0176 -0.0187 -0.0334 -0.0334 -0.0334
Risk 0.0082 0.0124 0.0080 0.0078 0.0324 0.0130 0.0078 0.0078 0.0078 0.0095 0.0095 0.0095
SR -0.0031 -0.0314 (0.6753) -0.0045 (0.6753) -0.0069 (0.6753) -0.0265 (0.6753) -0.0007 (0.6753) -0.0044 (0.6753) -0.0057 (0.6753) -0.0063 (0.6753) -0.0194 (0.6414) -0.0194 (0.6414) -0.0194 (0.6414)
Downturn #2: Financial Crisis (2008)
CER -0.1956 -0.3974 -0.2789 -0.2811 -0.9989 -0.0842 -0.2107 -0.2074 -0.2053 -0.2669 -0.2669 -0.2669
Risk 0.0135 0.0204 0.0126 0.0123 0.1198 0.0176 0.0134 0.0134 0.0133 0.0183 0.0183 0.0183
SR 0.0135 (0.4555) 0.0204 (0.4715) 0.0126 (0.4715) 0.0123 (0.4715) 0.1198 (0.4715) 0.0176 (0.4715) 0.0134 (0.4645) 0.0134 (0.4685) 0.0133 (0.4715) 0.0113 (0.4486) 0.0113 (0.4486) 0.0183 (0.4486)
Boom #1 (2017)
CER 0.1398 0.1309 0.1267 -0.0361 0.5720 -0.0877 0.1406 0.1407 0.1419 -0.0361 -0.0349 -0.0676
Risk 0.0044 0.0041 0.0037 0.0087 0.0630 0.0089 0.0046 0.0046 0.0046 0.0070 0.0070 0.0070
SR 0.1194 (0.5814) 0.1227 (0.5884) 0.1308 (0.5814) -0.0124 (0.7644) 0.0510 (0.5218) -0.0367 (0.7644) 0.1151 (0.5814) 0.1154 (0.5814) 0.1177 (0.5814) -0.0173 (0.7644) -0.0165 (0.7644) -0.0361 (0.7644)
Boom #2 (2019)
CER 0.3787 0.2595 0.3018 0.4078 1.4756 0.5300 0.2492 0.2497 0.2506 0.3839 0.3845 0.3896
Risk 0.0085 0.0078 0.0072 0.0098 0.0403 0.0176 0.0063 0.0064 0.0064 0.0175 0.0175 0.0175
SR 0.1533 (0.5715) 0.1215 (0.5920) 0.1495 (0.5920) 0.1432 (0.5920) 0.1092 (0.6512) 0.1046 (0.6512) 0.1423 (0.5920) 0.1423 (0.5920) 0.1427 (0.5920) 0.0816 (0.8912) 0.0817 (0.8912) 0.0826 (0.8912)
Table 11: Cumulative excess return (CER) and risk of GMV portfolios using daily data. Targeted risk is set at σ=0.013\sigma=0.013, daily targeted return is 0.0378%0.0378\%. P-values are in parentheses. In-sample: January 20, 2000 - January 24, 2002 (504 obs), Out-of-sample: January 17, 2002 - January 31, 2020 (4536 obs).
FGL FClime FLW FNLW POET ProjPOET FGL(FF1) FGL(FF3) FGL(FF5) FF1 FF3 FF5
Downturn #1: Argentine Great Depression (2002)
CER -0.0044 -0.1061 -0.0151 -0.0206 -0.3190 -0.0662 -0.0038 -0.0059 -0.0076 -0.0335 -0.0335 -0.0335
Risk 0.0081 0.0129 0.0080 0.0078 0.0330 0.0135 0.0077 0.0077 0.0077 0.0096 0.0096 0.0096
SR 0.0017 (0.6543) -0.0306 (0.7564) -0.0041 (0.6583) -0.0075 (0.6583) -0.0322 (0.7564) -0.0148 (0.6583) 0.0017 (0.6543) 0.0006 (0.6543) -0.0004 (0.6543) -0.0193 (0.6583) -0.0193 (0.6583) -0.0193 (0.6583)
Downturn #2: Financial Crisis (2008)
CER -0.2113 -0.4410 -0.2926 -0.2959 -0.9928 0.0829 -0.2291 -0.2251 -0.2226 -0.2938 -0.2938 -0.2938
Risk 0.0138 0.0241 0.0128 0.0124 0.0931 0.0247 0.0136 0.0136 0.0136 0.0186 0.0186 0.0186
SR 0.0138 (0.4146) 0.0241 (0.4296) 0.0128 (0.4296) 0.0124 (0.4296) 0.0931 (0.4296) 0.0247 (0.4296) 0.0136 (0.4196) 0.0136 (0.4276) 0.0136 (0.4296) 0.0116 (0.4096) 0.0116 (0.4096) 0.0186 (0.4096)
Boom #1 (2017)
CER 0.1384 0.1264 0.1323 -0.0388 -1.0000 -0.1106 0.1387 0.1388 0.1402 -0.0389 -0.0366 -0.0698
Risk 0.0045 0.0041 0.0037 0.0090 0.2414 0.0115 0.0047 0.0047 0.0046 0.0065 0.0065 0.0066
SR 0.1177 (0.5994) 0.1183 (0.6044) 0.1366 (0.6044) -0.0131 (0.8024) -0.0723 (0.9115) -0.0347 (0.8024) 0.1131 (0.6014) 0.1133 (0.6044) 0.1157 (0.6044) -0.0211 (0.8024) -0.0196 (0.8024) -0.0404 (0.9023)
Boom #2 (2019)
CER 0.3703 0.2829 0.2994 0.3287 1.6301 0.6870 0.2503 0.2504 0.2504 0.4031 0.4038 0.4087
Risk 0.0072 0.0081 0.0084 0.0097 0.0318 0.0186 0.0063 0.0063 0.0063 0.0185 0.0185 0.0184
SR 0.1521 (0.5644) 0.1266 (0.5714) 0.1478 (0.5714) 0.1419 (0.5714) 0.1366 (0.5714) 0.1209 (0.5714) 0.1441 (0.5684) 0.1440 (0.5684) 0.1439 (0.5684) 0.0810 (0.7592) 0.0811 (0.7592) 0.0819 (0.7592)