
Deep Risk Model: A Deep Learning Solution for Mining Latent Risk Factors to Improve Covariance Matrix Estimation

Hengxu Lin (Columbia Business School, New York, United States) [email protected]; Dong Zhou (Microsoft Research, Beijing, China) [email protected]; Weiqing Liu (Microsoft Research, Beijing, China) [email protected]; Jiang Bian (Microsoft Research, Beijing, China) [email protected]
(2021)
Abstract.

Modeling and managing portfolio risk is perhaps the most important step towards growing and preserving investment performance. Within the modern portfolio construction framework built on Markowitz's theory, the covariance matrix of stock returns is a required input for calculating portfolio risk. Traditional approaches to estimating the covariance matrix are based on human-designed risk factors, and designing better risk factors to improve the covariance estimation often requires tremendous time and effort. In this work, we formulate risk factor mining as a learning problem and propose a deep learning solution to effectively "design" risk factors with neural networks. The learning objective is carefully set so that the learned risk factors are effective in explaining the variance of stock returns while also exhibiting the desired orthogonality and stability. Our experiments on stock market data demonstrate the effectiveness of the proposed solution: our method obtains 1.9% higher explained variance measured by $R^2$ and also reduces the risk of a global minimum variance portfolio. The incremental analysis further supports our design of both the architecture and the learning objective.

risk management, covariance estimation, neural network, machine learning
journalyear: 2021; copyright: acmcopyright; conference: 2nd ACM International Conference on AI in Finance, November 3-5, 2021, Virtual Event, USA; booktitle: 2nd ACM International Conference on AI in Finance (ICAIF'21), November 3-5, 2021, Virtual Event, USA; price: 15.00; doi: 10.1145/3490354.3494377; isbn: 978-1-4503-9148-1/21/11; ccs: Computing methodologies (Factor analysis; Neural networks; Machine learning); Information systems (Data mining)

1. Introduction

Modern portfolio theory has witnessed the increasing significance of risk management in stock investment over the past decades. Successful risk management recognizes the equity market risks to which stocks are exposed, and thereby enables investors to adjust their strategies according to the estimated returns and the associated uncertainty when constructing their portfolios. The most widely adopted framework for selecting the best stock portfolio is built on Markowitz's mean-variance theory (Markowitz, 1952). Within this framework, the optimal portfolio is obtained by solving a constrained utility-maximization problem, with the utility defined as the expected portfolio return minus a cost of portfolio risk. To precisely determine the risk of any portfolio, a good estimation of the covariance matrix of stock returns is the fundamental requirement.

Nevertheless, the empirical covariance matrix can be extremely ill-conditioned as the number of stocks grows, and may even become rank deficient so that it cannot be used for portfolio optimization. To facilitate the estimation of the covariance matrix under such a high-dimensional setting, (Ledoit and Wolf, 2004, 2012) proposed to shrink the off-diagonal elements of the empirical covariance matrix towards zero, based on the assumption that the target covariance matrix is sparse. However, stock returns are often highly related to each other and thus the covariance matrix of stock returns cannot be sparse. A more realistic assumption is that the high-dimensional stock returns are driven by a set of low-dimensional risk factors (Sheikh, 1996). For example, it is often observed that stocks with similar market capitalization tend to have similar market behavior; thus market capitalization can serve as such a risk factor, widely known as the size factor (Eugene and French, 1992). By introducing multiple risk factors, the estimation of a high-dimensional covariance matrix of stock returns can be decomposed into the estimation of a low-dimensional covariance matrix of risk factor returns (factor returns are the cross-sectional regression coefficients; we will use the two terms interchangeably in the rest of the paper). Apparently, a good estimation of the covariance matrix requires a good design of the risk factors.

Risk factors have been carefully studied by investment professionals. Fundamental risk factors were proposed first since they are highly interpretable. The capital asset pricing model (CAPM) (Sharpe, 1964) first formulated the systematic risk factor known as beta. More factors were later added to this fundamental risk factor family: size and value (Eugene and French, 1992), momentum (Carhart, 1997) and non-linear size (BARRA) (Sheikh, 1996), to name a few. Despite their success, the design of these risk factors has been extremely slow: it took almost half a century from CAPM to BARRA, with only roughly ten new factors added along the way. This is because the discovery of such risk factors usually relies on tremendous human effort to verify hypotheses based on long-term observation. Statistical risk factors (Alexander, 2001; Avellaneda and Lee, 2010) were thereby proposed to address this limitation; they apply matrix decomposition algorithms like principal component analysis (PCA) or factor analysis (Harman, 1976) to obtain latent risk factors. However, such decomposition methods are linear and thus can hardly discover non-linear factors like the aforementioned non-linear size. Besides, unlike fundamental risk factors that have been tested in explaining forward returns, statistical risk factors are simply the principal components of historical returns and often struggle in out-of-sample generalization. Therefore, a more efficient and effective risk factor mining solution is still desired for improving the estimation of the covariance matrix of stock returns.

In this work, we propose a deep learning solution, Deep Risk Model, to facilitate the design of risk factors. From the perspective of machine learning, the task of designing risk factors could be effectively solved given a well-designed architecture that fully utilizes the available information and an appropriate learning objective that matches the desired properties of risk factors. First of all, the architecture should not only be compatible with the transformations used in human-designed risk factors, but also have the capacity to represent more complex transformations. To this end, we design a hybrid neural network with Gated Recurrent Units (GRU) (Chung et al., 2014) to model the temporal transformation (e.g., the standard deviation of historical stock returns), and a Graph Attention Networks (GAT) (Veličković et al., 2017) layer to represent the cross-sectional transformation (e.g., the excess return over the sector average). Furthermore, we design a learning objective to pursue the desired properties of the learned risk factors along three dimensions: the ability to explain the variance of stock returns, orthogonality, and stability.

To demonstrate the effectiveness of the proposed method, we further conduct extensive experiments on stock market data. Specifically, we first compare the variance explanation ability of the risk factors from the proposed method with those from the fundamental risk model and the statistical risk model. Experiment results show our method can consistently outperform the other two baselines with 1.9% higher $R^2$. We further use the learned risk factors to estimate the covariance matrix, with which we construct a Global Minimum Variance (GMV) portfolio. We find the factors from the proposed deep risk model are more reliable and achieve the lowest volatility for the GMV portfolio. Furthermore, we analyze the characteristics of the learned factors and find they exhibit low multi-collinearity and high stability. Ablation studies on our model also manifest the necessity and superiority of our architecture and learning objective design.

The main contributions are summarized as follows:

  • To the best of our knowledge, we are the first to formulate the quest of risk factor mining as a supervised learning task.

  • We propose Deep Risk Model as an effective and adaptive solution for learning risk factors with a comprehensive design of both the architecture and the learning objective.

  • We conduct extensive experiments with real-world stock market data and compare with existing state-of-the-art baselines. Experiment results demonstrate the effectiveness of the proposed method.

2. Related Works

Covariance Matrix Estimation

Large-scale covariance matrix estimation is fundamental in multivariate analysis and ubiquitous in financial economics panel data. The main challenge in this field is the singular matrix arising from a sample size that is insufficient relative to the large sample dimension. Related solutions mainly focus on two aspects: rank-based estimation and factor-based estimation. Rank-based estimation relies on the sparsity assumption of the matrix: a majority of the matrix elements are nearly zero, thus various thresholds (Bickel and Levina, 2008; Rigollet and Tsybakov, 2012; Lam and Fan, 2009) can be designed to control these elements and the number of parameters to be estimated is reduced significantly. Factor-based estimation is widely adopted in scenarios where the sparsity assumption is not applicable because variables are highly correlated with each other (e.g., stocks). By introducing factors and the factor covariance matrix into the decomposition, the residual covariance can be expressed as a conditionally sparse matrix: conditional on the common factors, the covariance matrix of the remaining components is sparse. The factor model is more reasonable and commonly used in economic forecasting for its capacity to perform well in out-of-sample tests (Fan et al., 2011; Feng et al., 2020; Fan et al., 2018).

Matrix decomposition

Matrix decomposition, or matrix factorization, is fundamental in linear algebra (Trefethen and Bau III, 1997; Fan et al., 2016); it factorizes a matrix into a product of matrices. Matrix decomposition algorithms are designed for specific purposes. LU decomposition, QR decomposition, Cholesky decomposition and rank factorization are used to solve systems of linear equations. Eigendecomposition, or spectral decomposition, is used to obtain eigenvectors and eigenvalues. Jordan decomposition, Schur decomposition, scale-invariant decomposition (Uhlmann, 2018) and singular value decomposition (SVD) (Van Loan, 1976) are further implementations derived from eigendecomposition. In addition, principal component analysis (PCA) (Wold et al., 1987) is an application of SVD.

Representation Learning

Representation learning, or feature learning, is a technique used to extract from raw features the effective representations needed for specific downstream tasks (e.g., classification, regression, feature detection). Unsupervised and supervised representation learning has been well studied for decades: matrix factorization as discussed above, unsupervised dictionary learning, principal components analysis, independent components analysis (Vasilescu and Terzopoulos, 2005), clustering techniques like K-Means, and linear discriminant analysis (Balakrishnama and Ganapathiraju, 1998). Recently, the success of deep learning has brought multiple supervised representation learning techniques into prosperity. Neural networks have demonstrated distinguished competency and achieved state-of-the-art representation learning performance across various kinds of tasks: vision computing (Caron et al., 2018; Asano et al., 2020) and natural language processing (Mikolov et al., 2013; Sutskever et al., 2014; Devlin et al., 2019).

3. Preliminaries

The main interest of this section is to give an overview of the modern multi-factor model theory used in factor-based risk models, and to demonstrate the necessity of adopting an advanced solution to discover and capture intrinsic and adaptive risk factors.

3.1. Factor-Based Risk Models

Driven by the assumption that the intrinsic dependency of stock returns could be explained by some common latent factors, a large volume of academic and industrial research discusses and studies the design of factors. Mathematically, the multi-factor model is a multi-variable linear model with risk factors as the independent variables and stock returns as the dependent variable. Let $\mathrm{y}_{i,t}$ denote the observable return of stock $i$ at time $t$; the multi-factor model takes the form of

(1) $\mathrm{y}_{i,t}=\mathbf{f}_{i}^{\intercal}\mathbf{b}_{t}+\mathrm{u}_{i,t}$

where $\mathbf{f}_{i}\in\mathbb{R}^{K}$ represents a vector of $K$ risk factors for the $i$-th stock, $\mathrm{u}_{i,t}$ is the idiosyncratic error term, and $\mathbf{b}_{t}\in\mathbb{R}^{K}$ is the vector of regression coefficients, which is also called factor returns.

Moreover, considering the forward returns of a specific stock over some horizon $H$, we can further decompose the stock returns into factor returns by

(2) $\begin{bmatrix}\mathrm{y}_{i,t}\\ \mathrm{y}_{i,t+1}\\ \vdots\\ \mathrm{y}_{i,t+H-1}\end{bmatrix}=\begin{bmatrix}\mathbf{f}_{i}^{\intercal}\mathbf{b}_{t}+\mathrm{u}_{i,t}\\ \mathbf{f}_{i}^{\intercal}\mathbf{b}_{t+1}+\mathrm{u}_{i,t+1}\\ \vdots\\ \mathbf{f}_{i}^{\intercal}\mathbf{b}_{t+H-1}+\mathrm{u}_{i,t+H-1}\end{bmatrix},$

or equivalently in its matrix form

(3) $\mathbf{y}_{i}=\mathbf{B}\mathbf{f}_{i}+\mathbf{u}_{i},$

where $\mathbf{B}\in\mathbb{R}^{H\times K}$ contains the factor returns from time $t$ to $t+H-1$. Without loss of generality, we assume that $\mathbf{E}[\bar{\mathrm{y}}_{i}]=0$ and $\mathbf{E}[\bar{\mathrm{u}}_{i}]=0$; hereby, the covariance $\Sigma_{ij}$ of any two stocks $i$, $j$ can be derived as:

(4) $\Sigma_{ij}=\mathbf{f}_{i}^{\intercal}\Sigma_{\mathbf{B}}\mathbf{f}_{j}+\sigma_{i}\rho_{ij}\sigma_{j}$

where $\Sigma_{\mathbf{B}}$ is the covariance of the factor return matrix $\mathbf{B}$, $\sigma_{i}$ is the standard deviation of the idiosyncratic errors $\mathbf{u}_{i}$, and $\rho_{ij}$ denotes the correlation between $\mathbf{u}_{i}$ and $\mathbf{u}_{j}$. A common assumption is that the idiosyncratic errors are independent (otherwise they could be explained by certain risk factors) and thus we have $\rho_{ij}=1$ if $i=j$ and $\rho_{ij}=0$ otherwise.

Additionally, let $\mathbf{F}:=[\mathbf{f}_{1},...,\mathbf{f}_{N}]^{\intercal}\in\mathbb{R}^{N\times K}$ denote the risk factor matrix for all $N$ stocks; the entire covariance matrix of stock returns can then be derived as

(5) $\Sigma=\mathbf{F}\Sigma_{\mathbf{B}}\mathbf{F}^{\intercal}+\Delta,$

where $\Delta$ is a diagonal matrix with $\Delta_{ii}=\sigma_{i}^{2}$. Thereby we have decomposed the covariance of stock returns $\Sigma\in\mathbb{R}^{N\times N}$ into the covariance of factor returns $\Sigma_{\mathbf{B}}\in\mathbb{R}^{K\times K}$, which has a lower dimension (as we cannot access the forward returns $\mathrm{y}_{i,t+1}$ at time $t$, a common practice is to use historical stock returns to estimate $\Sigma_{\mathbf{B}}$). Apparently, the design of the risk factors $\mathbf{F}$ is crucial for accurate covariance estimation, as both $\mathbf{b}$ and $\mathbf{u}$ are determined once $\mathbf{F}$ is specified in Equation 1.
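
As a concrete illustration, the following NumPy sketch assembles the covariance estimate of Equation 5 from given factor exposures, factor returns and residual variances; all names and shapes are illustrative assumptions, not part of the original method description.

```python
# A minimal sketch (NumPy) of the factor-based covariance decomposition in Eq. (5).
import numpy as np

def factor_covariance(F, B, resid_var):
    """F: (N, K) risk factor exposures, B: (H, K) factor returns over H periods,
    resid_var: (N,) idiosyncratic variances. Returns the (N, N) covariance estimate."""
    sigma_b = np.cov(B, rowvar=False)     # (K, K) covariance of factor returns
    delta = np.diag(resid_var)            # diagonal idiosyncratic covariance Delta
    return F @ sigma_b @ F.T + delta      # Eq. (5): F * Sigma_B * F^T + Delta
```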

3.2. Risk Factor Design

The most popular risk factors are designed by human experts (Sheikh, 1996); they are grounded in academic research and historically recognized as explanations of stock returns: Size (measures a company's scale of market capital), Value (measures a stock's intrinsic value), Beta (explains the market index return), Momentum (uses the price trend to forecast future returns), to name a few. Despite the achievements of human-designed risk factors, it took a long time for human specialists to derive these factors from long-term performance. Besides, evidence also shows that these factors can become invalid during unpredictable market movements. Statistical risk factors were thereby proposed to address these limitations of human-designed factors; they usually apply principal component analysis (PCA) to historical stock returns and take the first $K$ components as the risk factors. Although efficient, such a method often has poor out-of-sample performance in risk forecasting, as the obtained risk factors simply overfit the historical returns rather than explain the forward-looking returns. The limitations of both methods stimulate us to design a more efficient and effective solution for the design of risk factors.

4. Deep Risk Model

From the perspective of machine learning, the design of risk factors can be seen as learning a transformation function $g$ that maps raw data $\mathbf{X}$ to risk factors $\mathbf{F}$ by $\mathbf{F}=g(\mathbf{X})$. From this perspective, human-designed factors are equivalent to specifying the transformation $g$ as explicit formulas. In this work, we consider adopting a deep neural network to learn a parameterized transformation $g_{\theta}$ in a data-driven approach. Apparently, a well-designed neural network architecture should not only be compatible with human-designed transformation formulas but also have the capacity to capture more complex transformations. Therefore, we design a hybrid neural network to represent both temporal and cross-sectional transformations; as a neural network, it can also capture complex non-linear transformations. Moreover, we carefully design our learning objective to ensure the learned risk factors can explain the variance of stock returns while satisfying the requirements of both orthogonality and stability.

4.1. Architecture Design

Figure 1. The proposed architecture for risk factor mining consists of two branches of Gated Recurrent Units (GRU) modules to learn temporal transformations. One of the two branches has a Graph Attention Network (GAT) layer to learn the cross-sectional transformation.

4.1.1. Temporal Transformation

In order to learn the temporal transformation (e.g., the standard deviation of historical returns) to a full extent from the panel data, we employ Gated Recurrent Units (GRU) (Chung et al., 2014) as the backbone model. GRU has been widely used in natural language processing and has achieved remarkable results in modeling sequential information. Denote the input features as $\mathbf{x}_{i,t}\in\mathbb{R}^{P}$ for stock $i$ at time $t$, with $P$ the number of features. The general idea of GRU is to recurrently project the input time series into a hidden representation space. At each time step, the GRU learns the hidden representation $\mathbf{h}_{i,t}$ by jointly leveraging the current input $\mathbf{x}_{i,t}$ and the previous hidden representation $\mathbf{h}_{i,t-1}$ in a recursive manner:

(6) $\mathbf{h}_{i,t}=\mathrm{GRU}(\mathbf{x}_{i,t},\mathbf{h}_{i,t-1}).$

Besides, in order to adaptively select the most important information from different time steps, we also use the attention mechanism (Feng et al., 2019; Lin et al., 2021) to aggregate all hidden representations.
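
To make the temporal branch concrete, here is a minimal PyTorch sketch of a GRU followed by attention pooling over time steps; the hyper-parameters and the simple additive attention are illustrative assumptions rather than the exact implementation.

```python
# A minimal PyTorch sketch of the temporal branch: GRU over the feature history,
# then attention pooling over time steps (Eq. 6 plus the attention aggregation).
import torch
import torch.nn as nn

class TemporalBranch(nn.Module):
    def __init__(self, in_dim=10, hid_dim=32, num_layers=2):
        super().__init__()
        self.gru = nn.GRU(in_dim, hid_dim, num_layers, batch_first=True)
        self.att = nn.Linear(hid_dim, 1)        # scores each time step

    def forward(self, x):                        # x: (N stocks, T steps, P features)
        h, _ = self.gru(x)                       # h: (N, T, hid_dim)
        w = torch.softmax(self.att(h), dim=1)    # attention weights over time
        return (w * h).sum(dim=1)                # (N, hid_dim) aggregated representation
```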

4.1.2. Cross-Sectional Transformation

The aforementioned temporal transformation treats each stock independently and is thus unable to represent transformations that involve relations among different stocks. For example, a common transformation in human-designed factors is to subtract the average return of stocks in the same sector from a stock's return to obtain the residual momentum factor (Blitz et al., 2011). Such a transformation requires the model to be aware of the relations between individual stocks in addition to the temporal transformation. To fulfill this goal, we introduce another branch of GRU with an additional Graph Attention Networks (GAT) (Veličković et al., 2017) layer to model the cross-sectional transformation, as illustrated in Figure 1. The GAT layer not only can represent traditional stock relations like the sector, but can also dynamically capture latent relations with learnable attention weights.

Mathematically, given the raw features $\mathbf{X}\in\mathbb{R}^{N\times P}$ for $N$ stocks at time $t$, the GAT layer first projects the input features into another space with a shared linear transformation $\mathbf{W}\in\mathbb{R}^{P\times P}$, and then performs self-attention $a:\mathbb{R}^{P}\times\mathbb{R}^{P}\to\mathbb{R}$ on the nodes (i.e., stocks) to compute the attention coefficients:

(7) $e_{ij}=\mathrm{LeakyReLU}\big(a(\mathbf{W}\mathbf{x}_{i},\mathbf{W}\mathbf{x}_{j})\big).$

The attention coefficients indicate the similarity of stock $i$ and stock $j$. A prior graph structure, e.g., sector or business relations, can also be injected into GAT by performing masked attention (Veličković et al., 2017).

The second step in GAT is aggregating the information from stocks with high attention coefficients, which can be accomplished by normalizing the attention coefficients with the softmax function:

(8) $\alpha_{ij}=\mathrm{softmax}_{j}(e_{ij})=\frac{\exp(e_{ij})}{\sum_{k\in\mathcal{N}_{i}}\exp(e_{ik})},$

where $\mathcal{N}_{i}$ contains all neighbors of the $i$-th stock. The normalized attention coefficients $\alpha_{ij}$ can thereby be used to compute a linear combination of the features by

(9) $\mathbf{\tilde{x}}_{i}=\mathrm{LeakyReLU}\Big(\sum_{j\in\mathcal{N}_{i}}\alpha_{ij}\mathbf{W}\mathbf{x}_{j}\Big).$

A shared GAT layer is applied at all time stamps to obtain the aggregated information $\mathbf{\tilde{x}}_{i,t}$. As we are interested in learning, for the cross-sectional transformation, the information of a stock relative to its group of related stocks, we subtract the aggregated neighborhood information from the current features and pass the residuals to the GRU module:

(10) $\mathbf{\tilde{h}}_{i,t}=\mathrm{GRU}(\mathbf{x}_{i,t}-\mathbf{\tilde{x}}_{i,t},\mathbf{\tilde{h}}_{i,t-1}),$

where another GRU further accomplishes the temporal transformation on the residual features. The output hidden representations from both branches are projected into the risk factors by two separate linear projections $\mathbf{Q}\in\mathbb{R}^{P\times K_{1}}$ and $\mathbf{\tilde{Q}}\in\mathbb{R}^{P\times(K-K_{1})}$:

(11) $\mathbf{f}_{i,t}=\big[\mathrm{Norm}(\mathbf{Q}\mathbf{h}_{i,t})\,\|\,\mathrm{Norm}(\mathbf{\tilde{Q}}\mathbf{\tilde{h}}_{i,t})\big],$

where $\|$ represents the concatenation operation, and $\mathrm{Norm}$ is a normalization function that transforms the learned risk factors to have a capitalization-weighted zero mean and an equal-weighted unit standard deviation (Balakrishnama and Ganapathiraju, 1998).
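
Putting Equations 7-11 together, the following PyTorch sketch outlines one possible implementation of the cross-sectional branch. It is a minimal sketch under several assumptions: a fully connected stock universe (no prior graph mask), a simple concatenation-based attention mechanism, and an equal-weighted normalization (the capitalization weights are omitted).

```python
# A minimal PyTorch sketch of the cross-sectional branch (Eqs. 7-11): GAT-style
# attention over stocks, residual features fed to a second GRU, and a projection
# with normalization. All hyper-parameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as fn

class CrossSectionalBranch(nn.Module):
    def __init__(self, in_dim=10, hid_dim=32, num_layers=2, k=5):
        super().__init__()
        self.W = nn.Linear(in_dim, in_dim, bias=False)   # shared projection W
        self.a = nn.Linear(2 * in_dim, 1, bias=False)    # attention mechanism a(.,.)
        self.gru = nn.GRU(in_dim, hid_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hid_dim, k)                # plays the role of Q~ in Eq. (11)

    def gat(self, x):                     # x: (N, P) features at one time step
        z = self.W(x)                     # (N, P) projected features W x
        n = z.size(0)
        pair = torch.cat([z.unsqueeze(1).expand(n, n, -1),
                          z.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = fn.leaky_relu(self.a(pair).squeeze(-1))      # Eq. (7), (N, N)
        alpha = torch.softmax(e, dim=1)                  # Eq. (8)
        return fn.leaky_relu(alpha @ z)                  # Eq. (9), aggregated x~

    def forward(self, x):                 # x: (N, T, P)
        resid = torch.stack([x[:, t] - self.gat(x[:, t])
                             for t in range(x.size(1))], dim=1)
        h, _ = self.gru(resid)            # Eq. (10)
        f = self.proj(h[:, -1])           # (N, K - K1) factors before normalization
        return (f - f.mean(0)) / (f.std(0) + 1e-8)   # equal-weighted Norm (cap weights omitted)
```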

4.2. Loss Design

Now we consider the design of the loss function (i.e., the learning objective) to fulfill three desired properties of the mined risk factors.

4.2.1. Explained Variance

The most important metric for evaluating the quality of a given set of risk factors is the proportion of the variance of stock returns explained by the risk factors. In statistics, the coefficient of determination, $R^{2}$, measures this ratio of explained variance. Let $\mathbf{F}_{\cdot t}=g_{\theta}(\mathbf{X}_{\cdot t})$ denote the risk factors produced by our model at time $t$ for all stocks; then $R^{2}_{\cdot t}$ can be calculated as

(12) $R^{2}_{\cdot t}=1-\frac{\big\|\mathbf{y}_{\cdot t}-\mathbf{F}_{\cdot t}(\mathbf{F}_{\cdot t}^{\intercal}\mathbf{F}_{\cdot t})^{-1}\mathbf{F}_{\cdot t}^{\intercal}\mathbf{y}_{\cdot t}\big\|_{2}^{2}}{\|\mathbf{y}_{\cdot t}\|_{2}^{2}},$

where $\|\cdot\|_{2}^{2}$ denotes the squared $\ell^{2}$ norm. Hence, we can use the following empirical loss function to optimize the model towards a higher $R^{2}$:

(13) $\max\frac{1}{T}\sum_{t=1}^{T}R^{2}_{\cdot t}\;\triangleq\;\min_{\theta}\frac{1}{T}\sum_{t=1}^{T}\frac{\big\|\mathbf{y}_{\cdot t}-\mathbf{F}_{\cdot t}(\mathbf{F}_{\cdot t}^{\intercal}\mathbf{F}_{\cdot t})^{-1}\mathbf{F}_{\cdot t}^{\intercal}\mathbf{y}_{\cdot t}\big\|_{2}^{2}}{\|\mathbf{y}_{\cdot t}\|_{2}^{2}}.$
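
For a single cross-section, the per-date term of Equation 13 can be written compactly; the sketch below is an assumption-laden illustration that uses a least-squares solver in place of the explicit normal equations for numerical stability, not the exact training code.

```python
# A minimal PyTorch sketch of the explained-variance loss for one date: project the
# return vector onto the span of the learned factors and penalize the residual norm.
import torch

def r2_loss(F, y):
    """F: (N, K) learned risk factors for one cross-section, y: (N,) stock returns."""
    beta = torch.linalg.lstsq(F, y.unsqueeze(-1)).solution   # equals (F^T F)^{-1} F^T y
    resid = y.unsqueeze(-1) - F @ beta
    return resid.pow(2).sum() / y.pow(2).sum()               # 1 - R^2 for this date
```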

4.2.2. Multi-collinearity Regularization

Directly training the model to minimize the objective defined in Equation 13 often ends up with highly correlated risk factors, which makes the regression coefficients in Equation 1 unstable and thus harms the estimation of the covariance matrix. A practical statistical ratio for diagnosing multi-collinearity is the variance inflation factor (VIF) (Belsley et al., 2005), which quantifies the extent of correlation between the $i$-th variate $\mathbf{F}_{i\cdot}$ over all time and the remaining covariates:

(14) $\mathrm{VIF}_{i\cdot}=\frac{1}{1-R_{i\cdot}^{2}}$

where $R_{i\cdot}^{2}$ is the coefficient of determination for the regression of the $i$-th risk factor $\mathbf{F}_{i\cdot}$ on the other covariates. A straightforward idea for reducing multi-collinearity is to use $\sum_{i=1}^{K}\mathrm{VIF}_{i\cdot}$ as the regularization term, which, however, requires $K$ separate regressions to obtain all VIF ratios and is time consuming.

In this work, we introduce a computationally efficient alternative to obtain $\sum_{i=1}^{K}\mathrm{VIF}_{i\cdot}$. In the remainder of this section, we use $\mathbf{F}_{-i\cdot}$ to denote all the remaining factors excluding the $i$-th factor. Let $\mathbf{F}_{-i\cdot}$ be the independent variables and $\mathbf{F}_{i\cdot}$ the dependent variable; through linear regression we obtain the regression coefficients $\boldsymbol{\beta}_{-i}\in\mathbb{R}^{K-1}$. Then we have

(15) $\begin{aligned}\sum_{i=1}^{K}\mathrm{VIF}_{i\cdot}&=\sum_{i=1}^{K}(1-R_{i\cdot}^{2})^{-1}\\ &=\sum_{i=1}^{K}\mathbf{F}_{i\cdot}^{\intercal}\mathbf{F}_{i\cdot}\big[(\mathbf{F}_{i\cdot}-\mathbf{F}_{-i\cdot}\boldsymbol{\beta}_{-i})^{\intercal}(\mathbf{F}_{i\cdot}-\mathbf{F}_{-i\cdot}\boldsymbol{\beta}_{-i})\big]^{-1}\\ &=\sum_{i=1}^{K}\mathbf{F}_{i\cdot}^{\intercal}\mathbf{F}_{i\cdot}\big[\mathbf{F}_{i\cdot}^{\intercal}\mathbf{F}_{i\cdot}-2\boldsymbol{\beta}_{-i}^{\intercal}\mathbf{F}_{-i\cdot}^{\intercal}\mathbf{F}_{i\cdot}+\boldsymbol{\beta}_{-i}^{\intercal}\mathbf{F}_{-i\cdot}^{\intercal}\mathbf{F}_{-i\cdot}\boldsymbol{\beta}_{-i}\big]^{-1}\end{aligned}$

Since the factors are normalized to have unit standard deviation in Equation 11, we can substitute $\mathbf{F}_{i\cdot}^{\intercal}\mathbf{F}_{i\cdot}=N$. From the properties of standardized regression we also have $\mathbf{F}_{-i\cdot}^{\intercal}\mathbf{F}_{-i\cdot}\boldsymbol{\beta}_{-i}=\mathbf{F}_{-i\cdot}^{\intercal}\mathbf{F}_{i\cdot}$, then:

(16) $\begin{aligned}\sum_{i=1}^{K}\mathrm{VIF}_{i\cdot}&=N\sum_{i=1}^{K}\big(\mathbf{F}_{i\cdot}^{\intercal}\mathbf{F}_{i\cdot}-\boldsymbol{\beta}_{-i}^{\intercal}\mathbf{F}_{-i\cdot}^{\intercal}\mathbf{F}_{i\cdot}\big)^{-1}\\ &=N\sum_{i=1}^{K}\big(\mathbf{F}_{i\cdot}^{\intercal}\mathbf{F}_{i\cdot}-\mathbf{F}_{i\cdot}^{\intercal}\mathbf{F}_{-i\cdot}(\mathbf{F}_{-i\cdot}^{\intercal}\mathbf{F}_{-i\cdot})^{-1}\mathbf{F}_{-i\cdot}^{\intercal}\mathbf{F}_{i\cdot}\big)^{-1}\\ &=N\sum_{i=1}^{K}\big(\mathbf{F}^{\intercal}\mathbf{F}\big)^{-1}_{ii}=N\,\mathrm{tr}\big((\mathbf{F}^{\intercal}\mathbf{F})^{-1}\big).\end{aligned}$

The second-to-last equality follows from the Schur complement. Compared with regressing each variate $i$ on the remaining covariates to obtain $\sum_{i=1}^{K}\mathrm{VIF}_{i\cdot}$, computing the trace of the matrix $(\mathbf{F}^{\intercal}\mathbf{F})^{-1}$ is more feasible and efficient. Consequently, we can regularize multi-collinearity by minimizing $\mathrm{tr}\big((\mathbf{F}^{\intercal}\mathbf{F})^{-1}\big)$.
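
The identity can also be checked numerically. The NumPy snippet below compares the $K$ separate regressions against $N\,\mathrm{tr}\big((\mathbf{F}^{\intercal}\mathbf{F})^{-1}\big)$ on illustrative random data with standardized factor columns; the data and dimensions are assumptions made purely for the check.

```python
# A small NumPy check that sum_i VIF_i equals N * tr((F^T F)^{-1}) when every factor
# column is standardized to zero mean and unit standard deviation.
import numpy as np

rng = np.random.default_rng(0)
N, K = 500, 10
F = rng.standard_normal((N, K))
F = (F - F.mean(0)) / F.std(0)                       # standardized factors

vif_sum = 0.0
for i in range(K):
    Fi, Fmi = F[:, i], np.delete(F, i, axis=1)
    beta = np.linalg.lstsq(Fmi, Fi, rcond=None)[0]   # regress factor i on the others
    r2 = 1 - ((Fi - Fmi @ beta) ** 2).sum() / (Fi ** 2).sum()
    vif_sum += 1 / (1 - r2)

trace_form = N * np.trace(np.linalg.inv(F.T @ F))
print(vif_sum, trace_form)                           # the two quantities agree
```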

4.2.3. Multi-Task Learning

Another desired property of the risk factors is their stability, i.e., the values of the risk factors should be continuous over time. This is an important property because in Equation 3 and Equation 5 we assume the risk factors are constant over the estimation period. To ensure the stability of the risk factors, we design a multi-task learning objective that requires the learned risk factors $\mathbf{F}_{\cdot t}$ to explain the variance of stock returns in multiple forward periods:

(17) $\min_{\theta}\frac{1}{H}\sum_{h=1}^{H}\frac{\big\|\mathbf{y}_{\cdot,t+h}-\mathbf{F}_{\cdot t}(\mathbf{F}^{\intercal}_{\cdot t}\mathbf{F}_{\cdot t})^{-1}\mathbf{F}^{\intercal}_{\cdot t}\mathbf{y}_{\cdot t}\big\|_{2}^{2}}{\|\mathbf{y}_{\cdot,t+h}\|_{2}^{2}},$

where $H$ is the number of forward periods.

Finally, we put together all the loss terms from the above discussion and propose the following learning objective:

(18) $\min_{\theta}\frac{1}{T}\sum_{t=1}^{T}\Big[\frac{1}{H}\sum_{h=1}^{H}\frac{\big\|\mathbf{y}_{\cdot,t+h}-\mathbf{F}_{\cdot t}(\mathbf{F}^{\intercal}_{\cdot t}\mathbf{F}_{\cdot t})^{-1}\mathbf{F}^{\intercal}_{\cdot t}\mathbf{y}_{\cdot t}\big\|_{2}^{2}}{\|\mathbf{y}_{\cdot,t+h}\|_{2}^{2}}+\lambda\,\mathrm{tr}\big((\mathbf{F}^{\intercal}_{\cdot t}\mathbf{F}_{\cdot t})^{-1}\big)\Big]$

where $\lambda$ is a hyper-parameter that controls the strength of the multi-collinearity regularization.
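
A minimal PyTorch sketch of this objective for a single date $t$ is given below; the tensor shapes, argument names and the use of a least-squares solver are assumptions for illustration rather than the exact training code.

```python
# A minimal sketch of the full objective in Eq. (18) for one date: average the
# multi-horizon explained-variance loss and add the trace regularizer.
import torch

def drm_loss(F, y_t, y_fwd, lam=0.01):
    """F: (N, K) factors at date t, y_t: (N,) returns at t,
    y_fwd: (H, N) returns over the next H periods."""
    beta = torch.linalg.lstsq(F, y_t.unsqueeze(-1)).solution   # fitted on y_t
    pred = (F @ beta).squeeze(-1)                              # F (F^T F)^{-1} F^T y_t
    ev = torch.stack([(y - pred).pow(2).sum() / y.pow(2).sum() for y in y_fwd]).mean()
    vif = torch.trace(torch.linalg.inv(F.T @ F))               # multi-collinearity term
    return ev + lam * vif
```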

5. Experiments

5.1. Experiment Settings

5.1.1. Data

The proposed framework can accept all common structured data, such as prices, financial data, or even text, as input features to train the risk model. In this work, we use the 10 common style factors in (Orr and Mashtaler, 2012) as the input features, including Size, Beta, Momentum, Residual Volatility, Non-linear Size, Book-to-Price, Liquidity, Earnings Yield, Growth and Leverage. All these style factors also serve as the risk factors in the compared fundamental risk model, ensuring our method gains no advantage from external information in the experiments. In fact, this is a more challenging experiment setting, as these risk factors are summarized from decades of research efforts and our model must learn better risk factors to beat human experts. These factors are implemented for the China stock market and the studied universe covers all listed stocks. We follow the temporal order to split the data into training (2009/04/30 to 2014/12/31), validation (2015/01/01 to 2016/12/31) and testing (2017/01/01 to 2020/02/10) sets.

5.1.2. Model Implementation

We implement the architecture designed in Section 4.1 with PyTorch. Both GRU modules in the designed network have 2 hidden layers and 32 hidden units. We let the number of learned risk factors equal K/2 for each branch. The GAT module has the same hidden size as the input size, which is 10. We also add attention dropout in GAT with the dropout probability set to 0.5. We pack all stocks on the same day into a single batch to ensure GAT can well capture stock relations, and leverage gradient accumulation with the update frequency set to 64 to ensure the model sees as many diverse samples as possible. We use Adam (Kingma and Ba, 2014) as the optimizer with a fixed learning rate of 0.0002. To further ensure the stability of the predictions, we also apply exponential smoothing to the predictions with a smoothing factor of 0.99. Unless otherwise specified, the regularization parameter $\lambda$ is set to 0.01. The number of tasks in Equation 18 is set to 20, i.e., we encourage the learned risk factors to remain usable for the next 20 days. The model is trained on a single TITAN Xp GPU and it takes 6.5 hours to finish training for learning 10 risk factors.

5.2. Evaluation Protocol

5.2.1. Compared Methods

We compare the following methods for designing risk factors:

  • Fundamental Risk Model (FRM): FRM directly uses the 10 style factors described in Section 5.1.1 as risk factors. FRM shows the best performance of the risk factors designed by human experts.

  • Statistical Risk Model (SRM): SRM takes the 10 components with the largest eigenvalues obtained by applying PCA to stock returns over the last 252 trading days. SRM shows the best performance of the traditional data-driven approach for learning latent risk factors.

  • Deep Risk Model (DRM): DRM is the proposed method in this work, which learns 10 risk factors directly from the data described in Section 5.1.1.

As both SRM and DRM can learn any number of latent risk factors, we also add the results of SRM (x2) and DRM (x2), which learn 20 risk factors.

5.2.2. Covariance Estimation

After obtaining the risk factors from the different risk models, we combine them with the country factor and 29 industry factors for cross-sectional regression with Equation 1. As the country factor introduces an exact collinearity into this equation, we follow (Menchero et al., 2011) and add an extra constraint that forces the coefficients of the industry factors to sum to zero. After obtaining the factor returns, we then follow Equation 5 to derive the covariance matrix. To further ensure the estimated covariance is responsive to the market, we estimate the correlation and variance with exponential half-lives of 240 and 60 days respectively, and we also apply Volatility Regime Adjustment (Menchero et al., 2011) with an exponential half-life of 20 days to alleviate forecasting bias.
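
As an illustration of the exponentially weighted estimation step, the sketch below computes a single-half-life weighted covariance of factor returns; the separate correlation/variance half-lives and the Volatility Regime Adjustment used in the paper are omitted here, and all names are illustrative.

```python
# A minimal NumPy sketch of an exponentially weighted covariance estimate with a
# given half-life (simplified: one half-life for the whole matrix, no regime adjustment).
import numpy as np

def ewma_cov(B, halflife):
    """B: (T, K) factor returns, most recent row last. Returns a (K, K) covariance."""
    T = B.shape[0]
    w = 0.5 ** (np.arange(T - 1, -1, -1) / halflife)   # newest observation gets weight 1
    w = w / w.sum()
    mu = w @ B                                          # weighted mean of factor returns
    Bc = B - mu
    return (Bc * w[:, None]).T @ Bc                     # weighted covariance
```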

5.2.3. Evaluation Metrics

To measure the risk forecasting performances of different risk models, we consider the below metrics:

  • $R^2$: $R^2$ measures the proportion of the explained variance of stock returns when using different sets of risk factors. The calculation of $R^2$ is defined in Equation 12 and we report the average $R^2$ over the test periods. A good risk model should have a higher $R^2$.

  • GMV: We further show the performance of the estimated covariance matrix with different risk factors by constructing a Global Minimum Variance (GMV) portfolio and reporting its annualized volatility. Given a covariance matrix $\boldsymbol{\Sigma}$, the GMV portfolio can be solved in closed form as $\frac{\boldsymbol{\Sigma}^{-1}\mathbf{1}}{\mathbf{1}^{\intercal}\boldsymbol{\Sigma}^{-1}\mathbf{1}}$, where $\mathbf{1}$ is a vector of ones. A good risk model should ensure the GMV portfolio has lower out-of-sample volatility.

  • GMV+: This has the same objective as the GMV portfolio, except that an extra no-short-selling constraint is added. We use CVXPY (Diamond and Boyd, 2016) to solve the optimized portfolio; a small sketch of both portfolios follows this list.
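
The sketch below constructs both evaluation portfolios from an estimated covariance matrix: the closed-form GMV weights and the long-only GMV+ weights solved with CVXPY. It is an illustrative sketch that ignores any additional practical constraints and assumes a symmetric positive semidefinite covariance estimate.

```python
# A small sketch of the two evaluation portfolios given an (N, N) covariance Sigma.
import numpy as np
import cvxpy as cp

def gmv_weights(Sigma):
    ones = np.ones(Sigma.shape[0])
    w = np.linalg.solve(Sigma, ones)
    return w / w.sum()                    # Sigma^{-1} 1 / (1^T Sigma^{-1} 1)

def gmv_plus_weights(Sigma):              # Sigma must be symmetric PSD
    n = Sigma.shape[0]
    w = cp.Variable(n)
    prob = cp.Problem(cp.Minimize(cp.quad_form(w, Sigma)),
                      [cp.sum(w) == 1, w >= 0])   # fully invested, no short selling
    prob.solve()
    return w.value
```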

To further show the characteristics of the learned risk factors, we will also report the following statistics:

  • Mean |t|: This is the average absolute t-statistic of the regression coefficients in Equation 1. A higher absolute t-statistic means the factor is more significant.

  • Pct |t|>2: As we have multiple t-statistics across the whole studied period, we also report the percentage of days on which the corresponding risk factor is statistically significant ($|t|>2$).

  • VIF: This is the variance inflation factor (VIF) defined in Equation 14, which shows whether the current factor is collinear with the remaining factors. A lower VIF is always preferred.

  • Auto Corr.: Auto-correlation measures the stability of a risk factor. We report the auto-correlation of each factor with a lag of 1 day. A higher correlation is preferred.

5.3. Main Results

5.3.1. Risk Forecasting Performance

The risk forecasting metrics for the compared methods are summarized in Table 1. From the table, we have the following observations and conclusions:

  • DRM improves $R^2$ by 0.6% when generating the same number of risk factors as FRM, which shows the risk factors designed by human experts still have room for improvement. Besides, with DRM's ability to learn more risk factors, we can further improve $R^2$ by 1.9%. We also present $R^2$ over the entire testing period in Figure 2, from which we can see that DRM consistently outperforms the competitors.

  • DRM also reduces the volatility of both the GMV portfolio and the GMV+ portfolio, by 1.0% and 0.6% respectively. As the GMV portfolio is exactly determined by the covariance matrix, lower out-of-sample volatility means the estimated covariance is more accurate. This further demonstrates that the learned risk factors can consistently improve covariance estimation.

  • The other data-driven approach for risk factor mining, SRM, can also improve $R^2$ but still underperforms our method. Besides, the GMV portfolio solved with SRM often performs badly. Therefore, the proposed solution is superior to SRM for mining latent risk factors.

Table 1. The performance of the compared risk models measured by the explained variance of stock returns ($R^2$) and the volatility risk of two portfolios (GMV and GMV+). ↑ means the higher the better, while ↓ means the opposite.
Risk Model  $R^2$ (↑)  GMV (↓)  GMV+ (↓)
Market - 18.4% 18.4%
FRM 29.8% 12.6% 12.9%
SRM 30.1% 18.6% 13.5%
SRM (x2) 30.7% 16.5% 16.2%
DRM 30.4% 12.3% 12.9%
DRM (x2) 31.7% 11.6% 12.3%
Figure 2. $R^2$ of the compared methods over the entire testing period. DRM (x2) consistently outperforms the other methods.

5.3.2. Factor Characteristics

We further present the statistical metrics of the learned risk factors in Table 2. We can see that the average absolute t-statistics of all factors are higher than 2, and most factors have t-statistics above 2 on more than 50% of the days, which shows the learned risk factors are indeed significant. Further, all our risk factors have VIF scores around 1.0, which shows there is little multi-collinearity among our risk factors. Last, the auto-correlation of the learned risk factors is high, which shows our method can also ensure the stability of the learned risk factors.

Table 2. A summary statistics of the learned risk factors in terms of significance (t-statistics), orthogonality (VIF) and stability (auto correlation).
Factor ID  Mean |t|  Pct |t|>2  VIF  Auto Corr.
0 4.878 72.9% 0.984 0.995
1 3.559 65.9% 1.301 0.995
2 2.748 55.0% 1.180 0.997
3 2.173 45.9% 1.277 0.998
4 3.244 59.0% 1.309 0.997
5 2.831 56.0% 1.163 0.966
6 3.461 64.7% 1.007 0.994
7 2.697 56.8% 1.211 0.966
8 3.247 60.3% 1.293 0.990
9 2.699 55.6% 1.104 0.985

Figure 3 shows the cumulative factor returns, i.e., the regression coefficients in Equation 1. We can see that the regression coefficients may change from positive to negative at a specific time (e.g., factor 3), which indicates the investment philosophy of the entire market may have changed. This suggests the risk factors from our model do capture certain trading patterns. However, unlike fundamental risk factors, the learned risk factors still lack interpretability. We leave the explanation of the learned risk factors as future work.

Figure 3. Cumulative factor returns (i.e., regression coefficients) of the learned risk factors.

5.4. Incremental Analysis

In this section, we want to answer the following research questions through incremental experiments:

  • RQ1: Is GAT indeed a necessary component in the proposed neural network architecture?

  • RQ2: What is the effect of the multi-collinearity regularization parameter $\lambda$ in Equation 18?

  • RQ3: Does the multi-task learning objective help stabilize the learned risk factors?

5.4.1. RQ1

We train the proposed network with and without the GAT module in parallel. To give a more convincing conclusion, we compare these two models with different numbers of risk factors as the learning target. The final $R^2$ of these models is summarized in Table 3. We can see that introducing the GAT module obtains a higher $R^2$ in most cases. Besides, considering GAT is designed as a new transformation of the input data, we believe it will be necessary for mining more high-quality risk factors if more information is introduced for risk factor mining.

Table 3. $R^2$ of DRM with or without GAT.
# Factors 10 12 14 16 18 20
w/ GAT 30.4% 30.6% 30.8% 31.1% 31.4% 31.7%
w/o GAT 30.3% 30.5% 30.9% 31.1% 31.3% 31.5%

5.4.2. RQ2

We further compare training the network with different regularization parameters $\lambda$. We report the VIF of the learned risk factors in Figure 4. It can be observed that as the regularization strength increases, the VIF gradually goes down. When setting $\lambda=0.001$, the VIF can be higher than 5, which implies the existence of high collinearity among the learned risk factors.

Figure 4. Variance inflation factor (VIF) of the learned risk factors with different regularization strengths ($\lambda$) in Equation 18.

5.4.3. RQ3

Last, we verify the necessity of the multi-task objective in Equation 17 by comparing it with the single-task objective (setting $H=1$ in Equation 17). Figure 5 shows the average auto-correlation of all learned risk factors under either the single-task or the multi-task objective. We can see that without the multi-task objective, the risk factors become unstable. Therefore, we should use the multi-task objective to optimize the model.

Figure 5. Auto-correlation of the learned risk factors with the single-task ($H=1$) or multi-task ($H=20$) objective in Equation 18.

6. Conclusion

In order to improve the estimation of the covariance matrix of stock returns, we proposed a deep learning solution to effectively design risk factors. The fundamental risk model requires prodigious effort from human professionals, while the statistical risk model lacks non-linear capacity and is deficient in out-of-sample performance. Our Deep Risk Model (DRM) demonstrates impressive performance in forecasting risk with a more effective and adaptive estimate of the covariance matrix. Our method also exhibits better factor characteristics in terms of high stability and low multi-collinearity. Ablation studies exemplify the necessity and superiority of our design of the architecture and the learning objective.

References

  • Alexander (2001) Carol Alexander. 2001. Market models. A Guide to Financial Data Analysis 1 (2001).
  • Asano et al. (2020) Yuki M. Asano, Christian Rupprecht, and Andrea Vedaldi. 2020. Self-labelling via simultaneous clustering and representation learning. In International Conference on Learning Representations (ICLR).
  • Avellaneda and Lee (2010) Marco Avellaneda and Jeong-Hyun Lee. 2010. Statistical arbitrage in the US equities market. Quantitative Finance 10, 7 (2010), 761–782.
  • Balakrishnama and Ganapathiraju (1998) Suresh Balakrishnama and Aravind Ganapathiraju. 1998. Linear discriminant analysis-a brief tutorial. Institute for Signal and information Processing 18, 1998 (1998), 1–8.
  • Belsley et al. (2005) David A Belsley, Edwin Kuh, and Roy E Welsch. 2005. Regression diagnostics: Identifying influential data and sources of collinearity. Vol. 571. John Wiley & Sons.
  • Bickel and Levina (2008) Peter J Bickel and Elizaveta Levina. 2008. Covariance regularization by thresholding. The Annals of Statistics 36, 6 (2008), 2577–2604.
  • Blitz et al. (2011) David Blitz, Joop Huij, and Martin Martens. 2011. Residual momentum. Journal of Empirical Finance 18, 3 (2011), 506–521.
  • Carhart (1997) Mark M Carhart. 1997. On persistence in mutual fund performance. The Journal of finance 52, 1 (1997), 57–82.
  • Caron et al. (2018) Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. 2018. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV). 132–149.
  • Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs.CL]
  • Diamond and Boyd (2016) Steven Diamond and Stephen Boyd. 2016. CVXPY: A Python-embedded modeling language for convex optimization. The Journal of Machine Learning Research 17, 1 (2016), 2909–2913.
  • Eugene and French (1992) Fama Eugene and Kenneth French. 1992. The cross-section of expected stock returns. Journal of Finance 47, 2 (1992), 427–465.
  • Fan et al. (2016) Jianqing Fan, Yuan Liao, and Han Liu. 2016. An overview of the estimation of large covariance and precision matrices. The Econometrics Journal 19, 1 (2016), C1–C32.
  • Fan et al. (2011) Jianqing Fan, Yuan Liao, and Martina Mincheva. 2011. High dimensional covariance matrix estimation in approximate factor models. Annals of statistics 39, 6 (2011), 3320.
  • Fan et al. (2018) Jianqing Fan, Han Liu, and Weichen Wang. 2018. Large covariance estimation through elliptical factor models. Annals of statistics 46, 4 (2018), 1383.
  • Feng et al. (2019) Fuli Feng, Huimin Chen, Xiangnan He, Ji Ding, Maosong Sun, and Tat-Seng Chua. 2019. Enhancing Stock Movement Prediction with Adversarial Training. IJCAI (2019).
  • Feng et al. (2020) Guanhao Feng, Stefano Giglio, and Dacheng Xiu. 2020. Taming the factor zoo: A test of new factors. The Journal of Finance 75, 3 (2020), 1327–1370.
  • Harman (1976) Harry H Harman. 1976. Modern factor analysis. University of Chicago press.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Lam and Fan (2009) Clifford Lam and Jianqing Fan. 2009. Sparsistency and rates of convergence in large covariance matrix estimation. Annals of statistics 37, 6B (2009), 4254.
  • Ledoit and Wolf (2004) Olivier Ledoit and Michael Wolf. 2004. Honey, I shrunk the sample covariance matrix. The Journal of Portfolio Management 30, 4 (2004), 110–119.
  • Ledoit and Wolf (2012) Olivier Ledoit and Michael Wolf. 2012. Nonlinear shrinkage estimation of large-dimensional covariance matrices. The Annals of Statistics 40, 2 (2012), 1024–1060.
  • Lin et al. (2021) Hengxu Lin, Dong Zhou, Weiqing Liu, and Jiang Bian. 2021. Learning Multiple Stock Trading Patterns with Temporal Routing Adaptor and Optimal Transport. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (KDD ’21). ACM.
  • Markowitz (1952) Harry Markowitz. 1952. Portfolio selection. The journal of finance 7, 1 (1952), 77–91.
  • Menchero et al. (2011) J. Menchero, D. J. Orr, and J. Wang. 2011. The Barra US Equity Model ( USE 4 ) Methodology Notes.
  • Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
  • Orr and Mashtaler (2012) DJ Orr and Igor Mashtaler. 2012. Supplementary Notes on the Barra China Equity Model (CNE5). (2012).
  • Rigollet and Tsybakov (2012) Philippe Rigollet and Alexandre Tsybakov. 2012. Estimation of covariance matrices under sparsity constraints. arXiv preprint arXiv:1205.1210 (2012).
  • Sharpe (1964) William F Sharpe. 1964. Capital asset prices: A theory of market equilibrium under conditions of risk. The journal of finance 19, 3 (1964), 425–442.
  • Sheikh (1996) Aamir Sheikh. 1996. BARRA’s risk models. Barra Research Insights (1996), 1–24.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. 3104–3112.
  • Trefethen and Bau III (1997) Lloyd N Trefethen and David Bau III. 1997. Numerical linear algebra. Vol. 50. Siam.
  • Uhlmann (2018) Jeffrey Uhlmann. 2018. A generalized matrix inverse that is consistent with respect to diagonal transformations. SIAM J. Matrix Anal. Appl. 39, 2 (2018), 781–800.
  • Van Loan (1976) Charles F Van Loan. 1976. Generalizing the singular value decomposition. SIAM Journal on numerical Analysis 13, 1 (1976), 76–83.
  • Vasilescu and Terzopoulos (2005) M Alex O Vasilescu and Demetri Terzopoulos. 2005. Multilinear independent components analysis. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Vol. 1. IEEE, 547–553.
  • Veličković et al. (2017) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
  • Wold et al. (1987) Svante Wold, Kim Esbensen, and Paul Geladi. 1987. Principal component analysis. Chemometrics and intelligent laboratory systems 2, 1-3 (1987), 37–52.