
Theoretically Motivated Data Augmentation and Regularization for Portfolio Construction

Liu Ziyin1, Kentaro Minami2, Kentaro Imajo2
1Department of Physics, The University of Tokyo
2Preferred Networks, Inc.
Abstract

The task we consider is portfolio construction in a speculative market, a fundamental problem in modern finance. While various empirical works now exist that explore deep learning in finance, the theory side is almost non-existent. In this work, we develop a theoretical framework for understanding the use of data augmentation in deep-learning-based approaches to quantitative finance. The proposed theory clarifies the role and necessity of data augmentation for finance; moreover, it implies that a simple algorithm of injecting random noise of strength \sqrt{|r_{t-1}|} into the observed return r_{t} is better than injecting no noise at all and better than a few other financially irrelevant data augmentation techniques. (This is the full-length version of our work published at the 3rd ACM International Conference on AI in Finance (ICAIF'22); see https://doi.org/10.1145/3533271.3561720 for the shorter published version.)
The code is available at: https://github.com/pfnet-research/Finance_data_augmentation_ICAIF2022

1 Introduction

There is an increasing interest in applying machine learning methods to problems in the finance industry. This trend has been anticipated for almost forty years (Fama, 1970), since well-documented and fine-grained (minute-level) data of stock market prices became available. In fact, the essence of modern finance is fast and accurate large-scale data analysis (Goodhart and O'Hara, 1997), and it is hard to imagine that machine learning should not play an increasingly crucial role in this field. In contemporary research, the central theme in machine-learning-based finance is to apply existing deep learning models to financial time-series prediction problems (Imajo et al., 2020; Buehler et al., 2019; Jay et al., 2020; Imaki et al., 2021; Jiang et al., 2017; Fons et al., 2020; Lim et al., 2019; Zhang et al., 2020), which have demonstrated the hypothesized usefulness of deep learning for the financial industry.

However, one major gap in this interdisciplinary field of deep-learning finance is the lack of a theory suited to justifying finance-oriented algorithm design. The goal of this work is to propose such a framework, in which machine learning practices are analyzed in a traditional financial-economic utility theory setting. Our theory implies that a simple, theoretically motivated data augmentation technique is better for the task of portfolio construction than injecting no noise at all or using naive noise injection methods that have no theoretical justification. To summarize, our main contributions are (1) to demonstrate how utility theory can be used to analyze practices of deep-learning-based finance, and (2) to theoretically study the role of data augmentation in the deep-learning-based portfolio construction problem. Organization: the next section discusses the main related works; Section 3 provides the requisite finance background for understanding this work; Section 4 presents our theoretical contribution, a framework for understanding machine-learning practices in the portfolio construction problem; Section 5 describes how to practically implement the theoretically motivated algorithm; Section 6 validates the theory with experiments on toy and real data.

2 Related Works

Existing deep learning finance methods. In recent years, various empirical approaches to applying state-of-the-art deep learning methods to finance have been proposed (Imajo et al., 2020; Ito et al., 2020; Buehler et al., 2019; Jay et al., 2020; Imaki et al., 2021; Jiang et al., 2017; Fons et al., 2020). Interested readers are referred to (Ozbayoglu et al., 2020) for detailed descriptions of existing works. However, we notice one crucial gap: the near-complete lack of theoretical analysis or motivation in this interdisciplinary field of AI-finance. This work makes an initial step toward bridging this gap. One theme of this work is that finance-oriented prior knowledge and inductive biases are required to design the relevant algorithms. For example, Ziyin et al. (2020) show that incorporating prior knowledge into architecture design is key to the success of neural networks and apply neural networks with periodic activation functions to the problem of financial index prediction. Imaki et al. (2021) show how to incorporate no-transaction prior knowledge into network architecture design when transaction cost is present.

Figure 1: Performance (measured by the Sharpe ratio) of various algorithms on MSFT (Microsoft) from 2018-2020. Directly applying generic machine learning methods, such as weight decay, fails to improve the vanilla model. The proposed method shows a significant improvement.

In fact, most generic and popular machine learning techniques were proposed and tested on standard ML tasks such as image classification or language processing. Directly applying ML methods that work for image tasks is unlikely to work well for financial tasks, where the nature of the data is different. See Figure 1, where we show the performance of a neural network directly trained to maximize wealth return on MSFT during 2019-2020. Using popular, generic deep learning techniques such as weight decay or dropout does not result in any improvement over the baseline. In contrast, our theoretically motivated method does. Combining the proposed method with weight decay can improve the performance a little further, but the improvement is much smaller than the improvement of the proposed method over the baseline. This implies that a generic machine learning method is unlikely to capture the inductive biases required to tackle a financial task. The present work proposes to fill this gap by showing how finance knowledge can be incorporated into algorithm design.

Data augmentation. Consider a training loss function of the additive form L=\frac{1}{N}\sum_{i}\ell(x_{i},y_{i}) for N pairs of training data points \{(x_{i},y_{i})\}_{i=1}^{N}. Data augmentation amounts to defining an underlying data-dependent distribution and generating new data points stochastically from it. A general way to define data augmentation is to start with a datum-level training loss and transform it into an expectation over an augmentation distribution P(z,g|(x_{i},y_{i})) (Dao et al., 2019), \ell(x_{i},y_{i})\to\mathbb{E}_{(z_{i},g_{i})\sim P(z,g|(x_{i},y_{i}))}[\ell(z_{i},g_{i})], and the total training loss function becomes

L_{\rm aug}=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{(z_{i},g_{i})\sim P(z,g|(x_{i},y_{i}))}[\ell(z_{i},g_{i})]. (1)

One common example of data augmentation is injecting isotropic Gaussian noise into the input (Shorten and Khoshgoftaar, 2019; Fons et al., 2020), which is equivalent to setting P(z,g|(x_{i},y_{i}))\propto\delta(g-y_{i})\exp\left[-(z-x_{i})^{\rm T}(z-x_{i})/(2\sigma^{2})\right] for some specified strength \sigma^{2}. Despite the ubiquity of data augmentation in deep learning, existing works are often empirical in nature (Fons et al., 2020; Zhong et al., 2020; Shorten and Khoshgoftaar, 2019; Antoniou et al., 2017). For a relevant example, Fons et al. (2020) empirically evaluate the effect of different types of data augmentation in a financial series prediction task. Dao et al. (2019) is one major recent theoretical work that tries to understand modern data augmentation; it shows that data augmentation is approximately equivalent to learning with a special kernel. He et al. (2019) argue that data augmentation can be seen as an effective regularization. However, no theoretically motivated data augmentation method for finance exists yet. One major challenge and achievement of this work is to develop a theory that bridges traditional finance theory and machine learning methods. In the next section, we introduce portfolio theory.
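To make Eq. (1) concrete, the following is a minimal numpy sketch (function names are ours) of the isotropic-Gaussian case: the expectation over the augmentation distribution is approximated by Monte Carlo, with labels kept fixed by the \delta(g-y_{i}) factor.

```python
import numpy as np

def augmented_loss(loss_fn, X, y, sigma=0.1, n_samples=32, rng=None):
    """Monte Carlo estimate of the augmented loss in Eq. (1) for the
    isotropic-Gaussian case: labels stay fixed (the delta(g - y_i) factor)
    while each input is replaced by z = x + eps, eps ~ N(0, sigma^2 I)."""
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(n_samples):
        Z = X + rng.normal(0.0, sigma, size=X.shape)  # one augmented copy of the batch
        total += loss_fn(Z, y)
    return total / n_samples

# A toy squared loss on scalar inputs; with sigma = 0 the augmented loss
# reduces to the ordinary training loss.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 3.0])
mse = lambda Z, t: float(np.mean((Z[:, 0] - t) ** 2))
```

In practice, one typically draws a single fresh augmented copy per gradient step rather than averaging many samples at once; the averaged form above matches the expectation in Eq. (1) more directly.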

3 Background: Markowitz Portfolio Theory

How to invest optimally in a financial market is the central concern of portfolio theory. One unfamiliar with portfolio theory may easily confuse the task of portfolio construction with wealth-maximizing trading or future price prediction. Before we introduce the portfolio theory, we first stress that the task of portfolio construction is not equivalent to wealth maximization or accurate price prediction. One can construct an optimal portfolio without predicting the price or maximizing the wealth increase.

Consider a market with an equity (a stock) and a fixed-interest-rate bond (a government bond). We denote the price of the equity at time step t as S_{t}; the price return is defined as r_{t}=\frac{S_{t+1}-S_{t}}{S_{t}}, a random variable with variance C_{t} and expected return g_{t}:=\mathbb{E}[r_{t}]. Our wealth at time step t is W_{t}=M_{t}+n_{t}S_{t}, where M_{t} is the amount of cash we hold and n_{t} is the number of shares of the stock we hold. As in the standard finance literature, we assume that shares are infinitely divisible. Usually, a positive n denotes holding (long) and a negative n denotes borrowing (short). The wealth we hold initially is W_{0}>0, and we would like to invest our money in the equity. We denote the relative value of the stock we hold as \pi_{t}=\frac{n_{t}S_{t}}{W_{t}}; \pi is called a portfolio. The central challenge in portfolio theory is to find the best \pi. At time t, our wealth is W_{t}; after one time step, our wealth changes due to a change in the price of the stock (setting the interest rate to 0): \Delta W_{t}:=W_{t+1}-W_{t}=W_{t}\pi_{t}r_{t}. The goal is to maximize the wealth return G_{t}:=\pi_{t}r_{t} at every time step while minimizing risk (it is important not to confuse the price return r_{t} with the wealth return G_{t}). The risk is defined as the variance of the wealth change:

R_{t}:=R(\pi_{t}):=\text{Var}_{r_{t}}[G_{t}]=\left(\mathbb{E}[r_{t}^{2}]-g_{t}^{2}\right)\pi_{t}^{2}=\pi_{t}^{2}C_{t}. (2)

The standard way to control risk is to introduce a "risk regularizer" that penalizes portfolios with a large risk (Markowitz, 1959; Rubinstein, 2002). (In principle, any function concave in G_{t} can serve as a risk regularizer by classical economic theory (Von Neumann and Morgenstern, 1947); one common alternative is R(G)=\log(G) (Kelly Jr, 2011), and our framework can easily be extended to such cases.) Introducing a parameter \lambda for the strength of regularization (the factor of 1/2 appears by convention), we can now write down our objective:

\pi_{t}^{*}=\arg\max_{\pi}U(\pi):=\arg\max_{\pi}\left[\pi^{\rm T}G_{t}-\frac{\lambda}{2}R(\pi)\right]. (3)

Here, U stands for the utility function; \lambda can be set to the desired level of risk-aversion. When g_{t} and C_{t} are known, this problem can be solved explicitly. However, one main problem in finance is that data is highly limited: we only observe one particular realized trajectory, and g_{t} and C_{t} are hard to estimate. This fact motivates the need for data augmentation and synthetic data generation in finance (Assefa, 2020). In this paper, we treat the case where there is only one asset to trade in the market, so the task of utility maximization amounts to finding the best balance between cash-holding and investment. The equity we treat is allowed to be a weighted combination of multiple stocks (the portfolio of some public fund manager, for example), so our formalism is not limited to single-stock situations. In Section C.1, we discuss portfolio theory with multiple stocks.
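For the single-asset case with known g_{t} and C_{t}, the explicit solution mentioned above follows from elementary calculus; the short sketch below (our own helper names) encodes the derivation: setting dU/d\pi = g - \lambda C \pi = 0 gives \pi^{*} = g/(\lambda C), with optimal utility g^{2}/(2\lambda C).

```python
def optimal_portfolio(g, C, lam):
    """Maximizer of the single-asset mean-variance utility
    U(pi) = pi*g - (lam/2)*pi^2*C.  Setting dU/dpi = g - lam*C*pi = 0
    gives pi* = g / (lam * C)."""
    return g / (lam * C)

def utility(pi, g, C, lam):
    """The utility of Eq. (3) for a single asset with known g and C."""
    return pi * g - 0.5 * lam * C * pi ** 2
```

For example, with g = 0.01, C = 0.04, and \lambda = 1, the optimal portfolio is \pi^{*} = 0.25 and the optimal utility is g^{2}/(2\lambda C) = 0.00125.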

4 Portfolio Construction as a Training Objective

Recent advances have shown that financial objectives can be interpreted as training losses for an appropriately inserted neural-network model (Ziyin et al., 2019; Buehler et al., 2019). It should come as no surprise that the utility function (3) can be interpreted as a loss function. When the goal is portfolio construction, we parametrize the portfolio \pi_{t}=\pi_{\mathbf{w}}(x_{t}) by a neural network with weights \mathbf{w}, and the utility maximization problem becomes a maximization over the weights of the network. The time-dependence is modeled through the input to the network x_{t}, which possibly consists of the available information at time t for determining the future price (it is helpful to imagine x_{t} as, for example, the prices of the stocks over the past 10 days). The objective function (to be maximized), combined with a pre-specified data augmentation transform x_{t}\to z_{t} with underlying distribution p(z|x_{t}), is then

\pi^{*}_{t}=\arg\max_{\mathbf{w}}\left\{\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{t}\left[G_{t}(\pi_{\mathbf{w}}(z_{t}))\right]-\lambda\text{Var}_{t}[G_{t}(\pi_{\mathbf{w}}(z_{t}))]\right\}, (4)

where \mathbb{E}_{t}:=\mathbb{E}_{z_{t}\sim p(z|x_{t})}. In this work, we abstract away the details of the neural network used to approximate \pi. We instead focus on studying the maximizers of this equation, which is a suitable choice when the underlying model is a neural network, because one primary motivation for using neural networks in finance is that they are universal approximators and are often expected to find such maximizers (Buehler et al., 2019; Imaki et al., 2021).

The ultimate financial goal is to construct \pi^{*} such that the utility function is maximized with respect to the true underlying distribution of S_{t}, which can be used as the generalization loss (to be maximized):

\pi^{*}_{t}=\arg\max_{\pi_{t}}\left\{\mathbb{E}_{S_{t}}\left[G_{t}(\pi)\right]-\lambda\text{Var}_{S_{t}}[G_{t}(\pi)]\right\}. (5)

Note that the difference between the expectations in Eq. (4) and (5) is that \mathbb{E}_{t} is computed with respect to the training set we hold, while \mathbb{E}_{S_{t}}:=\mathbb{E}_{S_{t}\sim p(S_{t})} is computed with respect to the underlying distribution of S_{t} given its previous prices. We use the same shorthands for \text{Var}_{t} and \text{Var}_{S_{t}}. Technically, the true utility we defined is an in-sample counterfactual objective, which roughly evaluates the expected utility to be obtained if we were to restart from yesterday; this is a relevant measure for financial decision making. In Section 4.5, we also analyze the out-of-sample performance when the portfolio is static.

4.1 Standard Models of Stock Prices

The expectations in the true objective, Equation (5), need to be taken with respect to the true underlying price generation process. In general, the price follows the stochastic process \Delta S_{t}=f(\{S_{i}\}_{i=1}^{t})+g(\{S_{i}\}_{i=1}^{t})\eta_{t} for a zero-mean, unit-variance random noise \eta_{t}; the term f reflects the short-term predictability of the stock price based on past prices, and g reflects the extent of unpredictability in the price. A key observation in finance is that g is non-stationary (heteroskedastic) and price-dependent (multiplicative). One model is the geometric Brownian motion (GBM)

S_{t+1}=(1+r)S_{t}+\sigma_{t}S_{t}\eta_{t}, (6)

which is taken as the minimal standard model of stock price motion (Mandelbrot, 1997; Black and Scholes, 1973); this paper also assumes the GBM as the underlying model. We note that the theoretical problem we consider can be seen as a discrete-time version of the classical Merton's portfolio problem (Merton, 1969). The more flexible Heston model (Heston, 1993) takes the form dS_{t}=rS_{t}dt+\sqrt{\nu_{t}}S_{t}dW_{t}, where \nu_{t} is the instantaneous volatility, which follows its own random walk, and dW_{t} is drawn from a Gaussian distribution. Despite their simplicity, the statistical properties of these models agree well with the known statistical properties of real financial markets (Drǎgulescu and Yakovenko, 2002). The readers are referred to (Karatzas et al., 1998) for a detailed discussion of the meaning and financial significance of these models.
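The discrete-time GBM of Eq. (6) is straightforward to simulate; the sketch below (our own helper, with a constant volatility \sigma as in the propositions that follow, and default parameters borrowed from the toy experiment of Figure 2) can be used to generate the kind of synthetic paths used in Section 6.

```python
import numpy as np

def simulate_gbm(S0=1.0, r=0.005, sigma=0.04, T=600, rng=None):
    """Discrete-time GBM of Eq. (6): S_{t+1} = (1+r)*S_t + sigma*S_t*eta_t,
    with eta_t ~ N(0, 1).  Returns the price path (S_0, ..., S_T)."""
    rng = np.random.default_rng(rng)
    S = np.empty(T + 1)
    S[0] = S0
    for t in range(T):
        S[t + 1] = (1.0 + r) * S[t] + sigma * S[t] * rng.standard_normal()
    return S
```

With sigma = 0 the path is deterministic, S_t = S_0 (1+r)^t, which matches the expected-value curve shown in black in Figure 2.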

4.2 No Data Augmentation

In practice, there is no way to observe more than one data point for a given stock at a given time t. This means that it can be very risky to train directly on the raw observed data, since nothing prevents the model from overfitting to the data. Without additional assumptions, the risk is zero because there is no randomness in the training set conditioned on the time t. To control this risk, we thus need data augmentation. One can formalize this intuition through the following proposition, whose proof is given in Section C.3.

Proposition 1.

(Utility of the no-data-augmentation strategy.) Let the price trajectory be generated by the GBM in Eq. (6) with initial price S_{0}. Then the true utility of the no-data-augmentation strategy is

U_{\rm no-aug}=[1-2\Phi(-r/\sigma)]r-\frac{\lambda}{2}\sigma^{2}, (7)

where U(\pi) is the utility function defined in Eq. (3) and \Phi is the c.d.f. of the standard normal distribution.

This means that the larger the volatility \sigma, the smaller the utility of the no-data-augmentation strategy. This is because the model may easily overfit the data when no data augmentation is used. In the next section, we discuss the case where a simple data augmentation is used.

4.3 Additive Gaussian Noise

While it is still far from clear how the stock price correlates with past prices, it is now well-recognized that \text{Var}_{S_{t}}[S_{t}|S_{t-1}]\neq 0 (Mandelbrot, 1997; Cont, 2001). This motivates a simple data augmentation technique: add some randomness to the observed financial sequence \{S_{1},...,S_{T+1}\}. This section analyzes a vanilla version of data augmentation that injects simple Gaussian noise, to be compared with a more sophisticated data augmentation method in the next section. Here, we inject random Gaussian noise \epsilon_{t}\sim\mathcal{N}(0,\rho^{2}) into S_{t} during the training process, such that z_{t}=S_{t}+\epsilon_{t}. Note that the noisified return needs to be defined carefully, since noise in the denominator may cause divergence; to avoid this problem, we define the noisified return as \tilde{r}_{t}:=\frac{z_{t+1}-z_{t}}{S_{t}}, i.e., we do not add noise to the denominator. Theoretically, we can find the optimal strength \rho^{*} of the Gaussian data augmentation such that the true utility function is maximized for a fixed training set. The result can be shown to be

(\rho^{*})^{2}=\frac{\sigma^{2}}{2r}\frac{\sum_{t}(r_{t}S_{t}^{2})^{2}}{\sum_{t}r_{t}S_{t}^{2}}. (8)

The fact that \rho^{*} depends on the prices of the whole trajectory reflects the fact that a time-independent data augmentation is not suitable for the stock price dynamics prescribed by Eq. (6), whose inherent noise \sigma S_{t}\eta_{t} is time-dependent through S_{t}. Finally, we can plug in the optimal \rho^{*} to obtain the optimal achievable strategy for the additive Gaussian noise augmentation. As before, the above discussion can be formalized, with the true utility given in the next proposition (proof in Section C.4).
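Given an observed path, the optimal strength in Eq. (8) can be evaluated directly; the helper below is our own sketch, which returns 0 when the denominator \sum_{t}r_{t}S_{t}^{2} is non-positive, mirroring the Heaviside factor that appears in Proposition 2.

```python
import numpy as np

def optimal_additive_strength(S, r, sigma):
    """Optimal variance (rho*)^2 of the time-independent additive Gaussian
    augmentation, Eq. (8), computed from an observed price path S; the r_t
    are the realized returns of the path.  A non-positive denominator
    corresponds to the Heaviside factor in Proposition 2, in which case no
    noise is injected."""
    S = np.asarray(S, dtype=float)
    rt = np.diff(S) / S[:-1]        # realized returns r_t
    w = rt * S[:-1] ** 2            # the r_t S_t^2 terms
    den = w.sum()
    if den <= 0.0:
        return 0.0
    return sigma ** 2 / (2 * r) * np.sum(w ** 2) / den
```

Note that \rho^{*} depends on the whole trajectory, as discussed above, so it must be recomputed for each training set.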

Proposition 2.

(Utility of the additive Gaussian noise strategy.) Under the additive Gaussian noise strategy, with the other conditions the same as in Proposition 1, the true utility is

U_{\rm Add}=\frac{r^{2}}{2\lambda\sigma^{2}T}\mathbb{E}_{S_{t}}\left[\frac{(\sum_{t}r_{t}S_{t})^{2}}{\sum_{t}(r_{t}S_{t})^{2}}\Theta\left(\sum_{t}r_{t}S_{t}^{2}\right)\right], (9)

where \Theta is the Heaviside step function.

4.4 Multiplicative Gaussian Noise

In this section, we derive a more general kind of data augmentation for the price trajectories specified by the GBM and the Heston model. From the previous discussion, one might expect that a better kind of augmentation should have \rho=\rho_{0}S_{t}, i.e., the injected noise should be multiplicative. However, we do not start by imposing \rho\to\rho S_{t}; instead, we consider \rho\to\rho_{t}, i.e., a general time-dependent noise. In the derivation, one finds an interesting relation for the optimal augmentation strength:

(\rho_{t+1}^{*})^{2}+(\rho_{t}^{*})^{2}=\frac{\sigma^{2}}{2r}r_{t}S_{t}^{2}. (10)

The following proposition gives the true utility of using this data augmentation (derivations in Section C.5).

Proposition 3.

(Utility of the general multiplicative Gaussian noise strategy.) Under the general multiplicative noise augmentation strategy, with the other conditions the same as in Proposition 1, the true utility is

U_{\rm mult}=\frac{r^{2}}{2\lambda\sigma^{2}}[1-\Phi(-r/\sigma)]. (11)

Combining the above propositions, we can prove the main theorem of this work (proof in Section C.6), which shows that the mean-variance utility of the proposed method is strictly higher than that of no data augmentation and that of additive Gaussian noise.

Theorem 1.

If \sigma\neq 0, then U_{\rm mult}>U_{\rm add} and U_{\rm mult}>U_{\rm no-aug} with probability 1.

Heston Model and Real Price Augmentation. We also consider the more general Heston model. The derivation proceeds similarly by replacing \sigma^{2}\to\nu_{t}^{2}; one arrives at the relation for the optimal augmentation: (\rho_{t+1}^{*})^{2}+(\rho_{t}^{*})^{2}=\frac{1}{2r}\nu^{2}_{t}r_{t}S_{t}^{2}. One quantity we do not know is the volatility \nu_{t}, which has to be estimated by averaging over the neighboring price returns. One central message of the above results is that one should add noise with variance proportional to r_{t}S_{t}^{2} to the observed prices when augmenting the training set.
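A minimal way to estimate \nu_{t} from neighboring returns is a trailing-window standard deviation; the sketch below is our own simple proxy (the window length is a free choice, not fixed by the theory), and more refined realized-volatility estimators exist in the literature.

```python
import numpy as np

def estimate_volatility(returns, window=20):
    """Rough estimate of the instantaneous volatility nu_t: the standard
    deviation of returns over a trailing window, i.e., an average over
    neighboring price returns.  Early entries use the shorter prefix."""
    r = np.asarray(returns, dtype=float)
    return np.array([r[max(0, t - window + 1):t + 1].std() for t in range(len(r))])
```

The resulting \hat{\nu}_{t} can then be plugged into the augmentation relation above in place of the unknown \nu_{t}.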

4.5 Stationary Portfolio

In the previous sections, we discussed the case where the portfolio is dynamic (time-dependent). One slight limitation of the previous theory is that it only compares the in-sample counterfactual performance of a dynamic portfolio. Here, we alternatively motivate the proposed data augmentation technique when the model is a stationary portfolio. One can show that, for a stationary portfolio, the proposed data augmentation technique gives the overall optimal performance.

Theorem 2.

Under the multiplicative data augmentation strategy, the in-sample counterfactual utility and the out-of-sample utility are optimal among all stationary portfolios.

Remark.

See Section C.8 for a detailed discussion and the proof. Stationary portfolios are important in financial theory and can be shown to be optimal even among all dynamic portfolios in some situations (Cover and Thomas, 2006; Merton, 1969). While restricting to stationary portfolios allows us to also compare out-of-sample performance, the limitation is that a stationary portfolio is less relevant for a deep learning model than the dynamic portfolios considered in the previous sections.

4.6 General Framework

So far, we have analyzed data augmentation for specific examples of the utility function and the augmentation distribution to argue that certain types of data augmentation are preferable. We now outline how this formulation can be generalized to a wider range of problems, with different utility functions and different data augmentations. This general framework can be used to derive alternative data augmentation schemes when one wants to maximize financial metrics other than the Sharpe ratio, such as the Sortino ratio (Estrada, 2006), or to incorporate regularization effects that take into account the heavy tails of the price distribution.

For a general utility function U=U(x,\pi), for some data point x that describes the current state of the market and a \pi that describes our strategy in this market state, we would ultimately like to maximize

\max_{\pi}V(\pi),\quad\text{for }V(\pi)=\mathbb{E}_{x}[U(x,\pi)]. (12)

However, observing only finitely many data points, we can only optimize the empirical loss with respect to some \theta-parametrized augmentation distribution P_{\theta}:

\hat{\pi}(\theta)=\arg\max_{\pi}\frac{1}{N}\sum_{i}^{N}\mathbb{E}_{z_{i}\sim p_{\theta}(z|x_{i})}[U(z_{i},\pi_{i})]. (13)

The problem we would like to solve is to find the effect of using such a data augmentation on the true utility V and then, if possible, to compare different data augmentations and identify the better one. Surprisingly, this is achievable, since V=V(\hat{\pi}(\theta)) now also depends on the parameter \theta of the data augmentation. Note that the true utility has to be computed with respect to both the sampling over the test points and the sampling over the N-sized training set:

V(\hat{\pi}(\theta))=\mathbb{E}_{x\sim p(x)}\mathbb{E}_{\{x_{i}\}^{N}\sim p^{N}(x)}[U(x,\hat{\pi}(\theta))]. (14)

In principle, this allows one to identify the best data augmentation for the problem at hand:

\theta^{*}=\arg\max_{\theta}V(\hat{\pi}(\theta))=\arg\max_{\theta}\mathbb{E}_{x\sim p(x)}\mathbb{E}_{\{x_{i}\}^{N}\sim p^{N}(x)}
\left[U\left(x,\arg\max_{\pi}\frac{1}{N}\sum_{i}^{N}\mathbb{E}_{z_{i}\sim p_{\theta}(z|x_{i})}[U(z_{i},\pi_{i})]\right)\right], (15)

and the analysis we performed in the previous sections is simply a special case of obtaining solutions to this maximization problem. Moreover, one can also compare two different parametric augmentation distributions; let their parameters be denoted \theta_{\alpha} and \theta_{\beta}, respectively. Then we say that data augmentation \alpha is better than \beta if and only if \max_{\theta_{\alpha}}V(\hat{\pi}(\theta_{\alpha}))>\max_{\theta_{\beta}}V(\hat{\pi}(\theta_{\beta})). This general formulation can also be applied outside the field of finance, because one can interpret the utility U as a standard machine learning loss function and \pi as the model output. The procedure also mimics that of finding a Bayes estimator in statistical decision theory (Wasserman, 2013), with \theta being the estimator we want to find; we outline an alternative general formulation for finding the "minimax" augmentation in Section C.2.
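When Eq. (15) has no closed form, \theta^{*} can be approximated by Monte Carlo; the following is a sketch under the assumption that the user supplies the four callables (`simulate_train`, `fit`, `true_utility`, and the candidate grid `thetas` are all our own placeholder names, not part of the paper's formalism).

```python
import numpy as np

def select_augmentation(thetas, simulate_train, fit, true_utility, n_trials=50, rng=None):
    """Monte Carlo sketch of Eq. (15): for each candidate augmentation
    parameter theta, repeatedly draw a training set, solve the augmented
    empirical problem of Eq. (13) with `fit`, and score the resulting
    strategy under the true utility V; return the best-scoring theta."""
    rng = np.random.default_rng(rng)
    scores = []
    for theta in thetas:
        total = 0.0
        for _ in range(n_trials):
            data = simulate_train(rng)        # one N-sized training set
            pi_hat = fit(data, theta, rng)    # hat{pi}(theta) of Eq. (13)
            total += true_utility(pi_hat)     # outer expectation, sampled
        scores.append(total / n_trials)
    return thetas[int(np.argmax(scores))]
```

The closed-form analyses in the previous sections correspond to carrying out this outer maximization analytically rather than by simulation.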

Figure 2: Experiment on geometric Brownian motion; S_{0}=1, r=0.005, \sigma=0.04. Left: examples of price trajectories in green; the black line shows the expected value of the price. Right: comparison with other related data augmentation techniques. The black dashed line shows the optimal achievable Sharpe ratio. We see that the proposed method stays close to optimality across a 600-step trading period, as the theory predicts.

5 Algorithms

Our results strongly motivate a specially designed data augmentation for financial data. For a data point consisting purely of past prices (S_{t},...,S_{t+L},S_{t+L+1}) and the associated returns (r_{t},...,r_{t+L-1},r_{t+L}), we use x=(S_{t},...,S_{t+L}) as the input to our model f, possibly a neural network, and use S_{t+L+1} as the unseen future price for computing the training loss. Our results suggest that we should randomly noisify both the input x and S_{t+L+1} at every training step by

\begin{cases}S_{i}\to S_{i}+c\sqrt{\hat{\sigma}^{2}_{i}|r_{i}|S_{i}^{2}}\,\epsilon_{i}&\text{for }S_{i}\in x;\\ S_{t+L+1}\to S_{t+L+1}+c\sqrt{\hat{\sigma}^{2}_{t+L}|r_{t+L}|S_{t+L}^{2}}\,\epsilon_{t+L+1};\end{cases} (16)

where the \epsilon_{i} are i.i.d. samples from \mathcal{N}(0,1), and c is a hyperparameter to be tuned. While the theory suggests that c should be 1/2, it is better to make it a tunable parameter in the algorithm design for flexibility; \hat{\sigma}_{t} is the instantaneous volatility, which can be estimated using standard methods in finance (Degiannakis and Floros, 2015). One might also absorb \hat{\sigma} into c.
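A minimal numpy sketch of Eq. (16) follows (function and parameter names are ours; \hat{\sigma} is treated as a scalar for simplicity, which the text above permits by absorbing it into c):

```python
import numpy as np

def augment_prices(S, c=0.5, sigma_hat=1.0, rng=None):
    """One random draw of the augmentation in Eq. (16): each price S_i is
    shifted by c * sqrt(sigma_hat^2 * |r_i| * S_i^2) * eps_i, eps_i ~ N(0,1);
    the final (future) price reuses the last observed return and price in
    its noise scale, as in the second line of Eq. (16)."""
    S = np.asarray(S, dtype=float)
    rng = np.random.default_rng(rng)
    r = np.diff(S) / S[:-1]                              # r_i = (S_{i+1} - S_i) / S_i
    scale = c * sigma_hat * np.sqrt(np.abs(r)) * S[:-1]  # noise std per input price
    scale = np.append(scale, scale[-1])                  # last price: |r_{t+L}| S_{t+L}^2
    return S + scale * rng.standard_normal(S.shape)
```

A fresh draw should be taken at every training step, so that the network never sees the same noisified trajectory twice.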

5.1 Using Returns as Inputs

Practically and theoretically, it is better, and standard, to use the returns x=(r_{t},...,r_{t+L-1}) as the input, and the algorithm then takes a simpler form:

\begin{cases}r_{i}\to r_{i}+c\sqrt{\hat{\sigma}^{2}_{i}|r_{i}|}\,\epsilon_{i}&\text{for }r_{i}\in x;\\ r_{t+L}\to r_{t+L}+c\sqrt{\hat{\sigma}^{2}_{t+L}|r_{t+L}|}\,\epsilon_{t+L+1}.\end{cases} (17)
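In return space the augmentation of Eq. (17) reduces to an elementwise operation; the helper below is our own sketch, again with a scalar volatility estimate.

```python
import numpy as np

def augment_returns(r, c=0.5, sigma_hat=1.0, rng=None):
    """Return-space augmentation of Eq. (17), applied elementwise:
    r -> r + c * sqrt(sigma_hat^2 * |r|) * eps, with eps ~ N(0, 1).
    A fresh draw should be used at every training step."""
    r = np.asarray(r, dtype=float)
    rng = np.random.default_rng(rng)
    return r + c * sigma_hat * np.sqrt(np.abs(r)) * rng.standard_normal(r.shape)
```

Note that the noise scale \sqrt{|r_{i}|} vanishes when the observed return is zero, so quiet periods are left almost unperturbed while volatile periods receive proportionally larger noise.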

5.2 Equivalent Regularization on the Output

One additional simplification can be made by noticing that the effect of injecting noise into r_{t+L} on the training loss is equivalent to a regularization. We show in Section B that, under the GBM model, the training objective can be written as

\arg\max_{\pi_{t}}\left\{\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{E}_{z}\left[G_{t}(\pi)\right]-\lambda c^{2}\hat{\sigma}_{t}^{2}|r_{t}|\pi_{t}^{2}\right)\right\}, (18)

where the expectation over z is now taken only with respect to the input. This means that the noise injection on r_{t+L} is equivalent to adding an L_{2} regularization on the model output \pi_{t}. This completes the main proposed algorithm of this work. We discuss a few potential variants in Section B. Also, it is well known that the magnitude of |r_{t}| has a strong time-correlation (i.e., a large |r_{t}| suggests a large |r_{t+1}|) (Lux and Marchesi, 2000; Cont and Bouchaud, 1997; Cont, 2007); this suggests that one can also smooth the |r_{t}| factor in the last term with the average of the neighboring returns over some time-window of width \tau: |r_{t}|\to|\hat{r}_{t}|=\frac{1}{\tau}\sum_{s=0}^{\tau-1}|r_{t-s}|. In our S&P500 experiments, we use this smoothing technique with \tau=20.
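The regularized objective and the smoothing of |r_{t}| can be sketched as follows (our own helper names; the expectation over the input noise z is omitted here, and \hat{\sigma} is a scalar):

```python
import numpy as np

def smoothed_abs_return(r, tau=20):
    """|r_t| -> |r_hat_t|: trailing average of |r| over a window of width
    tau, exploiting the time-correlation of return magnitudes."""
    a = np.abs(np.asarray(r, dtype=float))
    return np.array([a[max(0, t - tau + 1):t + 1].mean() for t in range(len(a))])

def regularized_objective(pi, r, lam=1.0, c=0.5, sigma_hat=1.0, tau=20):
    """Sketch of the training objective in Eq. (18), to be maximized: the
    average wealth return pi_t * r_t minus the output regularizer
    lam * c^2 * sigma_hat^2 * |r_hat_t| * pi_t^2."""
    pi = np.asarray(pi, dtype=float)
    r = np.asarray(r, dtype=float)
    penalty = lam * c ** 2 * sigma_hat ** 2 * smoothed_abs_return(r, tau) * pi ** 2
    return float(np.mean(pi * r - penalty))
```

In training, the negative of this quantity would serve as the loss, with \pi_{t} produced by the network from the (noisified) inputs.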

Table 1: Sharpe ratio on S&P 500 by sector; larger is better. Best performances in bold.
Industry Sectors | # Stocks | Merton | no aug. | weight decay | additive aug. | naive mult. | proposed
Communication Services | 9 | -0.06±0.04 | -0.06±0.04 | -0.06±0.27 | **0.22±0.18** | **0.20±0.21** | **0.33±0.16**
Consumer Discretionary | 39 | -0.01±0.03 | -0.07±0.03 | -0.06±0.10 | 0.48±0.10 | 0.41±0.09 | **0.64±0.08**
Consumer Staples | 27 | 0.05±0.03 | 0.24±0.03 | 0.23±0.11 | **0.36±0.08** | **0.34±0.09** | **0.35±0.07**
Energy | 17 | 0.07±0.03 | 0.03±0.03 | -0.02±0.12 | 0.70±0.09 | 0.52±0.10 | **0.91±0.10**
Financials | 46 | -0.57±0.04 | -0.61±0.03 | -0.61±0.09 | -0.06±0.10 | -0.13±0.09 | **0.18±0.08**
Health Care | 44 | 0.23±0.04 | 0.60±0.04 | 0.61±0.11 | **0.86±0.09** | **0.81±0.09** | **0.83±0.07**
Industrials | 44 | -0.09±0.03 | -0.11±0.03 | -0.11±0.08 | 0.36±0.08 | 0.28±0.08 | **0.48±0.08**
Information Technology | 41 | 0.41±0.04 | 0.41±0.04 | 0.41±0.11 | 0.67±0.10 | **0.74±0.11** | **0.79±0.09**
Materials | 19 | 0.07±0.03 | 0.06±0.03 | 0.03±0.14 | **0.47±0.13** | **0.43±0.13** | **0.53±0.10**
Real Estate | 22 | -0.14±0.04 | -0.39±0.03 | -0.40±0.12 | 0.05±0.10 | 0.05±0.09 | **0.19±0.07**
Utilities | 24 | -0.29±0.02 | -0.29±0.02 | -0.28±0.07 | -0.01±0.06 | -0.00±0.06 | **0.15±0.04**
S&P500 Avg. | 365 | -0.02±0.04 | -0.00±0.04 | -0.01±0.04 | 0.39±0.03 | 0.35±0.03 | **0.51±0.03**

6 Experiments

We validate our theoretical claim that using multiplicative noise with strength $\sqrt{r}$ is better than not using any data augmentation or using a data augmentation that is not suited to the nature of portfolio construction (such as an additive Gaussian noise). We emphasize that the purpose of this section is to demonstrate the relevance of our theory to real financial problems, not to establish the proposed method as a strong competitor in the industry. We start with a toy dataset that follows the theoretical assumptions and then move on to real data with S&P500 prices. The detailed experimental settings are given in Section A. Unless otherwise specified, we use a feedforward neural network with layer widths $10\to 64\to 64\to 1$ and ReLU activations, trained with the Adam optimizer at its default parameter settings with a minibatch size of 64 for 100 epochs.555In our initial experiments, we also experimented with different architectures (different depths and widths of the FNN, RNN, LSTM), and our conclusion that the proposed augmentation outperforms the specified baselines remains unchanged.
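For concreteness, the forward pass of this architecture can be sketched in plain NumPy (a minimal illustration; the actual experiments use a standard deep learning framework with Adam, and the names `init_mlp` and `forward` are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(widths=(10, 64, 64, 1)):
    """He-initialized weights for the 10 -> 64 -> 64 -> 1 network used in the experiments."""
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(widths[:-1], widths[1:])]

def forward(params, x):
    """ReLU feedforward pass; x has shape (batch, 10), output is one pi_t per sample."""
    h = np.asarray(x, dtype=float)
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:   # ReLU on hidden layers only
            h = np.maximum(h, 0.0)
    return h

pi = forward(init_mlp(), rng.standard_normal((64, 10)))  # one minibatch of 64
```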

We use the Sharpe ratio as the performance metric (the larger the better). The Sharpe ratio is defined as $SR_t=\frac{\mathbb{E}[\Delta W_t]}{\sqrt{\mathrm{Var}[\Delta W_t]}}$, a measure of profitability per unit of risk. We choose this metric because, in the framework of portfolio theory, it is the only theoretically motivated metric of success (Sharpe,, 1966). In particular, our theory is based on the maximization of the mean-variance utility in Eq. (3), and it is well known that maximizing the mean-variance utility is equivalent to maximizing the Sharpe ratio. In fact, it is a classical result in financial research that all optimal strategies must have the same Sharpe ratio (Sharpe,, 1964) (also called the efficient capital frontier). For the synthetic tasks, we can generate arbitrarily many test points to compare Sharpe ratios unambiguously. We then move to experiments on real stock price series; the limitation is that the Sharpe ratio must be estimated and thus involves one additional source of uncertainty.666We caution the reader not to confuse the problem of portfolio construction with the problem of financial price prediction. Portfolio construction is the primary focus of our work and is fundamentally different from price prediction; our method is not designed for, and cannot be used directly for, predicting future prices. As in real life, one does not need to predict prices to decide which stock to purchase.

6.1 Geometric Brownian Motion

We first experiment with stock prices generated by a GBM, as specified in Eq. (6). We generate a fixed price trajectory of length $T=400$ for training; each training point consists of a sequence of past prices $(S_t,\dots,S_{t+9},S_{t+10})$, where the first ten prices are used as the input to the model and $S_{t+10}$ is used for computing the loss.
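Such a trajectory can be generated with a few lines of NumPy (a sketch with illustrative drift and volatility values; `gbm_path` is our name, and the exact parameters of Eq. (6) are assumptions here):

```python
import numpy as np

def gbm_path(T=400, mu=0.0, sigma=0.2, s0=1.0, dt=1.0, seed=0):
    """Simulate a GBM price path: S_{t+1} = S_t * exp((mu - sigma^2/2) dt + sigma sqrt(dt) eps_t)."""
    rng = np.random.default_rng(seed)
    steps = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * rng.standard_normal(T - 1)
    return s0 * np.exp(np.concatenate(([0.0], np.cumsum(steps))))

S = gbm_path()
# each training point: ten input prices plus the next price used for the loss
points = [S[t:t + 11] for t in range(len(S) - 10)]
```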

Results and discussion. See Figure 2. The proposed method is plotted in blue. The right figure compares the proposed method with the other two baseline data augmentations we studied in this work. As the theory shows, the proposed method is optimal for this problem, achieving the optimal Sharpe ratio across a 600-step trading period. This directly confirms our theory.

6.2 S&P 500 Prices

This section demonstrates the relevance of the proposed algorithm to real market data. We use S&P500 data from 2016 to 2020, 1000 days in total, and test on the 365 stocks that remained in the S&P500 from 2000 to 2020. The first 800 days serve as the training set and the last 200 days as the test set. The model and training settings are similar to the previous experiment. We treat each stock as a single dataset and compare on all 365 stocks (namely, the evaluation is performed independently on 365 different datasets). Because the full result is too long, we report the average Sharpe ratio per industry sector (categorized according to GICS) and the average Sharpe ratio over all 365 datasets. See Sections A.1 and A.4 for more detail.

Results and discussion. See Table 1. Without data augmentation, the model performs poorly because it cannot assess the underlying risk. Weight decay does not improve the performance (when it is not outright deteriorating it); we hypothesize that this is because weight decay does not capture the inductive bias required for a financial series prediction task. Using any kind of data augmentation improves upon using none. Among these, the proposed method works best, possibly due to its better capability for risk control. In this experiment, we did not allow short selling; when short selling is allowed, the proposed method also works best (see Section A.4). In Section A.5.1, we also perform a case study demonstrating the learned portfolio's ability to avoid the market crash in 2020. We additionally compare with Merton's portfolio (Merton,, 1969), the classical optimal stationary portfolio constructed from the training data; it does not perform well either. This is because the market during 2019-2020 was volatile and quite different from the previous years, and a stationary portfolio cannot capture the nuances of changing market conditions. This shows that it is also important to leverage the flexibility and generalization properties of modern neural networks, alongside financial prior knowledge.

Refer to caption
Figure 3: Available portfolios and the market capital line (MCL). The black dots are the return-risk combinations of the original stocks; the orange dots are the learned portfolios. The MCL of the proposed method is lower than that of the original stocks, suggesting improved return and lower risk.

6.3 Comparison with Data Generation Method

One common alternative to direct data augmentation in the field is to generate additional realistic synthetic data using a GAN. While it is not the purpose of this work to propose an industrial-level method, and we do not claim that the proposed method outperforms previous methods, we provide one experimental comparison in Section A.5 for the task of portfolio construction. We compare our theoretically motivated technique with QuantGAN (Wiese et al.,, 2020), a major recent technique in financial data augmentation/generation. The experimental setting is the same as in the S&P500 experiment. The result shows that directly applying QuantGAN to the portfolio construction problem in our setting does not significantly improve over the no-augmentation baseline and achieves a much lower Sharpe ratio than our suggested method, possibly because QuantGAN is not designed for Sharpe ratio maximization.

6.4 Market Capital Lines

In this section, we link the results of the previous section to the concept of the market capital line (MCL) in the capital asset pricing model (Sharpe,, 1964), a foundational theory in classical finance. The MCL of a set of portfolios is the line of the best return-risk combinations achievable when these portfolios are combined with a risk-free asset such as a government bond; in our convention, an MCL with a smaller slope offers better return at lower risk and is preferable to an MCL that lies to its upper left in the return-risk plane. See Figure 3. The risk-free rate $r_0$ is set to $0.01$, roughly the average 1-year treasury yield from 2018 to 2020. The learned portfolios achieve a better MCL than the original stocks: the slope of the S&P500 MCL is roughly $0.53$, while that of the proposed method is $0.35$, i.e., much better return-risk combinations can be achieved with the proposed method. For example, if we set the acceptable amount of risk to $0.1$, the proposed method yields roughly $10\%$ more annual return than investing in the best stock in the market. This example also shows how tools from classical finance theory can be used to visualize and better understand machine learning methods applied to finance, a crucial point that many previous works lack.
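As an illustration, the slope comparison can be reproduced from a portfolio's return-risk pair. The sketch below assumes, consistent with the paper's "smaller slope is better" convention, that the slope is measured as risk per unit of excess return over the risk-free rate; the exact axis convention of Figure 3 is an assumption here, and `mcl_slope` is our name:

```python
def mcl_slope(mean_return, risk, r0=0.01):
    """Slope of the capital line through the risk-free point and (mean_return, risk),
    measured as risk per unit of excess return (smaller is better under this convention)."""
    return risk / (mean_return - r0)
```

For example, a portfolio with annual return 0.11 and risk 0.05 has slope 0.5 under this convention.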

6.5 Case Study

For completeness, we also present the performance of the proposed method during the market crash in February 2020 for the interested reader. See Section A.5.1.

7 Outlook

In this work, we have presented a theoretical framework for understanding and analyzing methods in deep-learning-based finance. The result is a machine learning algorithm that incorporates prior knowledge about the underlying financial processes. The good performance of the proposed method agrees with the standard expectation in machine learning that performance improves when the right inductive biases are incorporated. We have thus shown that building machine learning algorithms firmly rooted in financial theory can have a considerable and yet-to-be-realized benefit. We hope that our work motivates more research into the theoretical aspects of machine learning algorithms used in finance.

The limitation of the present work is obvious: we only considered data augmentations that take the form of noise injection. Other kinds of data augmentation may also be useful in finance; for example, Fons et al., (2020) empirically find that magnify, time warp, and SPAWNER (Um et al.,, 2017; Le Guennec et al.,, 2016; Kamycki et al.,, 2020) are helpful for financial series prediction, and it would be interesting to apply our theoretical framework to analyze these methods as well; a correct theoretical analysis of these methods is likely to advance both deep-learning-based techniques for finance and our fundamental understanding of the underlying financial and economic mechanisms. Meanwhile, our understanding of the underlying financial dynamics is also rapidly advancing; we foresee better methods being designed, and the proposed method may soon be replaced by better algorithms. This work potentially has positive social effects, since it is widely believed that better financial prediction methods can make the economy more efficient by eliminating arbitrage (Fama,, 1970); the cautionary note is that this work is solely academic research, should not be taken as investment advice, and readers should evaluate their own risk when applying the proposed method.

References

  • Antoniou et al., (2017) Antoniou, A., Storkey, A., and Edwards, H. (2017). Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340.
  • Assefa, (2020) Assefa, S. (2020). Generating synthetic data in finance: opportunities, challenges and pitfalls. Challenges and Pitfalls (June 23, 2020).
  • Black and Scholes, (1973) Black, F. and Scholes, M. (1973). The pricing of options and corporate liabilities. Journal of political economy, 81(3):637–654.
  • Bouchaud and Potters, (2009) Bouchaud, J. P. and Potters, M. (2009). Financial applications of random matrix theory: a short review.
  • Buehler et al., (2019) Buehler, H., Gonon, L., Teichmann, J., and Wood, B. (2019). Deep hedging. Quantitative Finance, 19(8):1271–1291.
  • Cont, (2001) Cont, R. (2001). Empirical properties of asset returns: stylized facts and statistical issues. Quantitative Finance.
  • Cont, (2007) Cont, R. (2007). Volatility clustering in financial markets: empirical facts and agent-based models. In Long memory in economics, pages 289–309. Springer.
  • Cont and Bouchaud, (1997) Cont, R. and Bouchaud, J.-P. (1997). Herd behavior and aggregate fluctuations in financial markets. arXiv preprint cond-mat/9712318.
  • Cover and Thomas, (2006) Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, New York, NY, USA.
  • Dao et al., (2019) Dao, T., Gu, A., Ratner, A., Smith, V., De Sa, C., and Ré, C. (2019). A kernel theory of modern data augmentation. In International Conference on Machine Learning, pages 1528–1537. PMLR.
  • Degiannakis and Floros, (2015) Degiannakis, S. and Floros, C. (2015). Methods of volatility estimation and forecasting. Modelling and Forecasting High Frequency Financial Data, pages 58–109.
  • Drǎgulescu and Yakovenko, (2002) Drǎgulescu, A. A. and Yakovenko, V. M. (2002). Probability distribution of returns in the heston model with stochastic volatility. Quantitative finance, 2(6):443–453.
  • Estrada, (2006) Estrada, J. (2006). Downside risk in practice. Journal of Applied Corporate Finance, 18(1):117–125.
  • Fama, (1970) Fama, E. F. (1970). Efficient capital markets: A review of theory and empirical work. The journal of Finance, 25(2):383–417.
  • Fons et al., (2020) Fons, E., Dawson, P., jun Zeng, X., Keane, J., and Iosifidis, A. (2020). Evaluating data augmentation for financial time series classification.
  • Goodhart and O’Hara, (1997) Goodhart, C. A. and O’Hara, M. (1997). High frequency data in financial markets: Issues and applications. Journal of Empirical Finance, 4(2-3):73–114.
  • He et al., (2019) He, Z., Xie, L., Chen, X., Zhang, Y., Wang, Y., and Tian, Q. (2019). Data augmentation revisited: Rethinking the distribution gap between clean and augmented data.
  • Heston, (1993) Heston, S. L. (1993). A closed-form solution for options with stochastic volatility with applications to bond and currency options. The review of financial studies, 6(2):327–343.
  • Imajo et al., (2020) Imajo, K., Minami, K., Ito, K., and Nakagawa, K. (2020). Deep portfolio optimization via distributional prediction of residual factors.
  • Imaki et al., (2021) Imaki, S., Imajo, K., Ito, K., Minami, K., and Nakagawa, K. (2021). No-transaction band network: A neural network architecture for efficient deep hedging. Available at SSRN 3797564.
  • Ito et al., (2020) Ito, K., Minami, K., Imajo, K., and Nakagawa, K. (2020). Trader-company method: A metaheuristic for interpretable stock price prediction.
  • Jay et al., (2020) Jay, P., Kalariya, V., Parmar, P., Tanwar, S., Kumar, N., and Alazab, M. (2020). Stochastic neural networks for cryptocurrency price prediction. IEEE Access, 8:82804–82818.
  • Jiang et al., (2017) Jiang, Z., Xu, D., and Liang, J. (2017). A deep reinforcement learning framework for the financial portfolio management problem. arXiv preprint arXiv:1706.10059.
  • Kamycki et al., (2020) Kamycki, K., Kapuscinski, T., and Oszust, M. (2020). Data augmentation with suboptimal warping for time-series classification. Sensors, 20(1):98.
  • Karatzas et al., (1998) Karatzas, I., Shreve, S. E., Karatzas, I., and Shreve, S. E. (1998). Methods of mathematical finance, volume 39. Springer.
  • Kelly Jr, (2011) Kelly Jr, J. L. (2011). A new interpretation of information rate. In The Kelly capital growth investment criterion: theory and practice, pages 25–34. World Scientific.
  • Le Guennec et al., (2016) Le Guennec, A., Malinowski, S., and Tavenard, R. (2016). Data augmentation for time series classification using convolutional neural networks. In ECML/PKDD workshop on advanced analytics and learning on temporal data.
  • Lim et al., (2019) Lim, B., Zohren, S., and Roberts, S. (2019). Enhancing time-series momentum strategies using deep neural networks. The Journal of Financial Data Science, 1(4):19–38.
  • Lux and Marchesi, (2000) Lux, T. and Marchesi, M. (2000). Volatility clustering in financial markets: a microsimulation of interacting agents. International journal of theoretical and applied finance, 3(04):675–702.
  • Mandelbrot, (1997) Mandelbrot, B. B. (1997). The variation of certain speculative prices. In Fractals and scaling in finance, pages 371–418. Springer.
  • Markowitz, (1959) Markowitz, H. (1959). Portfolio selection.
  • Merton, (1969) Merton, R. C. (1969). Lifetime portfolio selection under uncertainty: The continuous-time case. The review of Economics and Statistics, pages 247–257.
  • Ozbayoglu et al., (2020) Ozbayoglu, A. M., Gudelek, M. U., and Sezer, O. B. (2020). Deep learning for financial applications: A survey. Applied Soft Computing, page 106384.
  • Rubinstein, (2002) Rubinstein, M. (2002). Markowitz's "Portfolio Selection": A fifty-year retrospective. The Journal of Finance, 57(3):1041–1045.
  • Sharpe, (1964) Sharpe, W. F. (1964). Capital asset prices: A theory of market equilibrium under conditions of risk. The journal of finance, 19(3):425–442.
  • Sharpe, (1966) Sharpe, W. F. (1966). Mutual fund performance. The Journal of business, 39(1):119–138.
  • Shorten and Khoshgoftaar, (2019) Shorten, C. and Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):1–48.
  • Um et al., (2017) Um, T. T., Pfister, F. M., Pichler, D., Endo, S., Lang, M., Hirche, S., Fietzek, U., and Kulić, D. (2017). Data augmentation of wearable sensor data for parkinson’s disease monitoring using convolutional neural networks. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 216–220.
  • Von Neumann and Morgenstern, (1947) Von Neumann, J. and Morgenstern, O. (1947). Theory of games and economic behavior, 2nd rev.
  • Wasserman, (2013) Wasserman, L. (2013). All of statistics: a concise course in statistical inference. Springer Science & Business Media.
  • Wiese et al., (2020) Wiese, M., Knobloch, R., Korn, R., and Kretschmer, P. (2020). Quant gans: Deep generation of financial time series. Quantitative Finance, 20(9):1419–1440.
  • Zhang et al., (2020) Zhang, Z., Zohren, S., and Roberts, S. (2020). Deep learning for portfolio optimization. The Journal of Financial Data Science, 2(4):8–20.
  • Zhong et al., (2020) Zhong, Z., Zheng, L., Kang, G., Li, S., and Yang, Y. (2020). Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 13001–13008.
  • Ziyin et al., (2020) Ziyin, L., Hartwig, T., and Ueda, M. (2020). Neural networks fail to learn periodic functions and how to fix it. arXiv preprint arXiv:2006.08195.
  • Ziyin et al., (2019) Ziyin, L., Wang, Z., Liang, P., Salakhutdinov, R., Morency, L., and Ueda, M. (2019). Deep gamblers: Learning to abstain with portfolio theory. In Proceedings of the Neural Information Processing Systems Conference.

Appendix A Experiments

This section describes the additional experiments and the experimental details in the main text. The experiments are all done on a single TITAN RTX GPU. The S&P500 data is obtained from Alphavantage.777https://www.alphavantage.co/documentation/ The code will be released on github.

A.1 Dataset Construction

For all tasks, we observe a single trajectory of a single stock's prices $S_1,\dots,S_T$. For the toy tasks, $T=400$; for the S&P500 task, $T=800$. We then transform this into $T-L$ input-target pairs $\{(x_i,y_i)\}_{i=1}^{T-L}$, where

\begin{cases}x_{i}=(S_{i},\dots,S_{i+L-1});\\ y_{i}=S_{i+L}.\end{cases} (19)

$x_i$ is used as the model input during training; $y_i$ is the unseen future price used to compute the loss. For the toy tasks, $L=10$; for the S&P500 task, $L=15$. In simple words, we use the most recent $L$ prices to construct the next-step portfolio.
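The windowing described above can be sketched as follows (a minimal NumPy illustration using the convention that the $L$ most recent prices predict the next one; `make_pairs` is our name):

```python
import numpy as np

def make_pairs(S, L):
    """Build the T - L input-target pairs: x_i = (S_i, ..., S_{i+L-1}), y_i = S_{i+L},
    using 0-based indexing on the price array S."""
    S = np.asarray(S, dtype=float)
    X = np.stack([S[i:i + L] for i in range(len(S) - L)])
    y = S[L:]
    return X, y

X, y = make_pairs(np.arange(1.0, 401.0), L=10)  # toy task: T = 400, L = 10
```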

A.2 Sharpe Ratio for S&P500

The empirical Sharpe ratios are calculated in the standard way (the same as, for example, in (Ito et al.,, 2020; Imajo et al.,, 2020)). Given a wealth trajectory $W_1,\dots,W_T$ of a strategy $\pi$, the empirical Sharpe ratio is estimated as

R_{i}=\frac{W_{i+1}}{W_{i}}-1; (20)
\hat{M}=\frac{1}{T-1}\sum_{i=1}^{T-1}R_{i}; (21)
\widehat{SR}=\frac{\hat{M}}{\sqrt{\frac{1}{T-1}\sum_{i=1}^{T-1}R_{i}^{2}-\hat{M}^{2}}}=\frac{\text{average wealth return}}{\text{std. of wealth return}}, (22)

and $\widehat{SR}$ is the reported Sharpe ratio for the S&P500 experiments.
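This estimator can be implemented in a few lines (a minimal NumPy sketch; `empirical_sharpe` is our name):

```python
import numpy as np

def empirical_sharpe(W):
    """Empirical Sharpe ratio of a wealth trajectory W_1, ..., W_T:
    mean of per-step wealth returns divided by their standard deviation."""
    W = np.asarray(W, dtype=float)
    R = W[1:] / W[:-1] - 1.0                  # per-step wealth returns R_i
    M = R.mean()
    return M / np.sqrt(np.mean(R**2) - M**2)  # mean over std of returns
```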

A.3 Variance of Sharpe Ratio

We do not report an uncertainty for the single-stock Sharpe ratios, but the uncertainty is easy to estimate. The Sharpe ratio is estimated over a period of $T$ time steps. For the S&P500 stocks, $T=200$, and by the law of large numbers, the estimated mean $\hat{M}$ has variance roughly $\sigma^2/T$, where $\sigma$ is the true volatility; the same holds for the estimated standard deviation. Therefore, the estimated Sharpe ratio can be written as

\widehat{SR} =\frac{\hat{M}}{\sqrt{\frac{1}{T}\sum_{i=1}^{T}R_{i}^{2}-\hat{M}^{2}}} (23)
=\frac{M+\frac{\sigma}{\sqrt{T}}\epsilon}{\sigma+\frac{c}{\sqrt{T}}\eta} (24)
\approx\frac{M+\frac{\sigma}{\sqrt{T}}\epsilon}{\sigma}=\frac{M}{\sigma}+\frac{1}{\sqrt{T}}\epsilon (25)

where $\epsilon$ and $\eta$ are zero-mean random variables with unit variance. This shows that the uncertainty in the estimated $\widehat{SR}$ is approximately $1/\sqrt{T}\approx 0.07$ for each single stock, which is often much smaller than the difference between methods.
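The $1/\sqrt{T}$ scale of this uncertainty can be checked numerically; the following Monte Carlo sketch (our own illustration, with i.i.d. unit-variance returns of mean $0.1$) measures the spread of the Sharpe ratio estimator at $T=200$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, trials = 200, 2000

# i.i.d. returns with mean 0.1 and unit variance, so the true Sharpe ratio is 0.1
estimates = []
for _ in range(trials):
    R = 0.1 + rng.standard_normal(T)
    M = R.mean()
    estimates.append(M / np.sqrt(np.mean(R**2) - M**2))

spread = np.std(estimates)  # should be close to 1 / sqrt(T) ~ 0.07
```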

A.4 S&P500 Experiment

This section gives more results and discussion of the S&P500 experiment.

A.4.1 Underperformance of Weight Decay

This section gives the details of the comparison made in Figure 1. The experimental setting is the same as in the S&P500 experiments. For illustration and motivation, we only show the result on MSFT (Microsoft); most other stocks give a qualitatively similar plot.

See Figure 1, where we show the performance of directly training a neural network to maximize wealth return on MSFT during 2018-2020. Popular generic deep learning techniques such as weight decay and dropout do not improve on the baseline, while our theoretically motivated method does. Combining the proposed method with weight decay can improve performance a little further, but that improvement is much smaller than the improvement of the proposed method over the baseline. This implies that generic machine learning techniques are unlikely to capture the inductive bias required to process a financial task.

In the plot, we did not interpolate the dropout rate between large and small $p$; the result is similar to the weight decay case in our experiments.

Refer to caption
Refer to caption
Figure 4: Case study of the performance of the model on MSFT from 2019 August to 2020 May. We see that the model learns to invest less and less as the price of the stock rises to an unreasonable level, thus avoiding the high risk of the market crash in February 2020.

A.5 Comparison with GAN

Sector Q-GAN Aug. Ours
Comm. Services 0.07±0.020.07_{\pm 0.02} 0.33±0.160.33_{\pm 0.16}
Consumer Disc. 0.00±0.030.00_{\pm 0.03} 0.64±0.080.64_{\pm 0.08}
Consumer Staples 0.10±0.020.10_{\pm 0.02} 0.35±0.070.35_{\pm 0.07}
Energy 0.08±0.050.08_{\pm 0.05} 0.91±0.100.91_{\pm 0.10}
Financials 0.22±0.03-0.22_{\pm 0.03} 0.18±0.080.18_{\pm 0.08}
Health Care 0.35±0.030.35_{\pm 0.03} 0.83±0.070.83_{\pm 0.07}
Industrials 0.12±0.03-0.12_{\pm 0.03} 0.48±0.080.48_{\pm 0.08}
Information Tech. 0.28±0.040.28_{\pm 0.04} 0.79±0.090.79_{\pm 0.09}
Materials 0.03±0.03-0.03_{\pm 0.03} 0.53±0.100.53_{\pm 0.10}
Real Estate 0.35±0.03-0.35_{\pm 0.03} 0.19±0.070.19_{\pm 0.07}
Utilities 0.20±0.02-0.20_{\pm 0.02} 0.15±0.040.15_{\pm 0.04}
S&P500 Avg. 0.05±0.030.05_{\pm 0.03} 0.51±0.03\mathbf{0.51}_{\pm 0.03}
Figure 5: Performance (Sharpe ratio) when the training set is augmented with QuantGAN.

We stress that the goal of this paper is not to compare methods but to understand, from the perspective of classical finance theory, why certain methods work or fail. With this caveat stated, we compare our theoretically motivated technique with QuantGAN, a major recent technique in financial data augmentation/generation. We first train a QuantGAN on each stock trajectory, use it to generate 10000 additional data points to augment the original training set (which contains roughly 1000 data points per stock), and train the same feedforward net described in the main text; this net is then used for evaluation. QuantGAN is implemented as closely to the original paper as possible, using a temporal convolutional network trained with RMSProp for 50 epochs. Other experimental settings are the same as those of the S&P500 experiment in the manuscript. See Figure 5, and compare with Table 1 in the manuscript. Directly applying QuantGAN to the portfolio construction problem in our setting does not significantly improve over the no-augmentation baseline and achieves a much lower Sharpe ratio than our suggested method, possibly because QuantGAN is not designed for Sharpe ratio maximization.

A.5.1 Case Study

In this section, we qualitatively study the behavior of the learned portfolio of the proposed method. The model is trained as in the other S&P500 experiments. See Figure 4. We see that the model learns to invest less and less as the stock price rises to an excessive level, thus avoiding the high risk of the market crash in February 2020. This avoidance demonstrates the effectiveness of the proposed method qualitatively.

A.5.2 List of Symbols for S&P500

The following are the stock symbols used for the S&P500 experiments, each enclosed in quotation marks.

[’A’ ’AAPL’ ’ABC’ ’ABT’ ’ADBE’ ’ADI’ ’ADM’ ’ADP’ ’ADSK’ ’AEE’ ’AEP’ ’AES’ ’AFL’ ’AIG’ ’AIV’ ’AJG’ ’AKAM’ ’ALB’ ’ALL’ ’ALXN’ ’AMAT’ ’AMD’ ’AME’ ’AMG’ ’AMGN’ ’AMT’ ’AMZN’ ’ANSS’ ’AON’ ’AOS’ ’APA’ ’APD’ ’APH’ ’ARE’ ’ATVI’ ’AVB’ ’AVY’ ’AXP’ ’AZO’ ’BA’ ’BAC’ ’BAX’ ’BBT’ ’BBY’ ’BDX’ ’BEN’ ’BIIB’ ’BK’ ’BKNG’ ’BLK’ ’BLL’ ’BMY’ ’BRK.B’ ’BSX’ ’BWA’ ’BXP’ ’C’ ’CAG’ ’CAH’ ’CAT’ ’CB’ ’CCI’ ’CCL’ ’CDNS’ ’CERN’ ’CHD’ ’CHRW’ ’CI’ ’CINF’ ’CL’ ’CLX’ ’CMA’ ’CMCSA’ ’CMI’ ’CMS’ ’CNP’ ’COF’ ’COG’ ’COO’ ’COP’ ’COST’ ’CPB’ ’CSCO’ ’CSX’ ’CTAS’ ’CTL’ ’CTSH’ ’CTXS’ ’CVS’ ’CVX’ ’D’ ’DE’ ’DGX’ ’DHI’ ’DHR’ ’DIS’ ’DLTR’ ’DOV’ ’DRE’ ’DRI’ ’DTE’ ’DUK’ ’DVA’ ’DVN’ ’EA’ ’EBAY’ ’ECL’ ’ED’ ’EFX’ ’EIX’ ’EL’ ’EMN’ ’EMR’ ’EOG’ ’EQR’ ’EQT’ ’ES’ ’ESS’ ’ETFC’ ’ETN’ ’ETR’ ’EW’ ’EXC’ ’EXPD’ ’F’ ’FAST’ ’FCX’ ’FDX’ ’FE’ ’FFIV’ ’FISV’ ’FITB’ ’FL’ ’FLIR’ ’FLS’ ’FMC’ ’FRT’ ’GD’ ’GE’ ’GILD’ ’GIS’ ’GLW’ ’GPC’ ’GPS’ ’GS’ ’GT’ ’GWW’ ’HAL’ ’HAS’ ’HBAN’ ’HCP’ ’HD’ ’HES’ ’HIG’ ’HOG’ ’HOLX’ ’HON’ ’HP’ ’HPQ’ ’HRB’ ’HRL’ ’HRS’ ’HSIC’ ’HST’ ’HSY’ ’HUM’ ’IBM’ ’IDXX’ ’IFF’ ’INCY’ ’INTC’ ’INTU’ ’IP’ ’IPG’ ’IRM’ ’IT’ ’ITW’ ’IVZ’ ’JBHT’ ’JCI’ ’JEC’ ’JNJ’ ’JNPR’ ’JPM’ ’JWN’ ’K’ ’KEY’ ’KLAC’ ’KMB’ ’KMX’ ’KO’ ’KR’ ’KSS’ ’KSU’ ’L’ ’LB’ ’LEG’ ’LEN’ ’LH’ ’LLY’ ’LMT’ ’LNC’ ’LNT’ ’LOW’ ’LRCX’ ’LUV’ ’M’ ’MAA’ ’MAC’ ’MAR’ ’MAS’ ’MAT’ ’MCD’ ’MCHP’ ’MCK’ ’MCO’ ’MDT’ ’MET’ ’MGM’ ’MHK’ ’MKC’ ’MLM’ ’MMC’ ’MMM’ ’MNST’ ’MO’ ’MOS’ ’MRK’ ’MRO’ ’MS’ ’MSFT’ ’MSI’ ’MTB’ ’MTD’ ’MU’ ’MYL’ ’NBL’ ’NEE’ ’NEM’ ’NI’ ’NKE’ ’NKTR’ ’NOC’ ’NOV’ ’NSC’ ’NTAP’ ’NTRS’ ’NUE’ ’NVDA’ ’NWL’ ’O’ ’OKE’ ’OMC’ ’ORCL’ ’ORLY’ ’OXY’ ’PAYX’ ’PBCT’ ’PCAR’ ’PCG’ ’PEG’ ’PEP’ ’PFE’ ’PG’ ’PGR’ ’PH’ ’PHM’ ’PKG’ ’PKI’ ’PLD’ ’PNC’ ’PNR’ ’PNW’ ’PPG’ ’PPL’ ’PRGO’ ’PSA’ ’PVH’ ’PWR’ ’PXD’ ’QCOM’ ’RCL’ ’RE’ ’REG’ ’REGN’ ’RF’ ’RJF’ ’RL’ ’RMD’ ’ROK’ ’ROP’ ’ROST’ ’RRC’ ’RSG’ ’SBAC’ ’SBUX’ ’SCHW’ ’SEE’ ’SHW’ ’SIVB’ ’SJM’ ’SLB’ ’SLG’ ’SNA’ ’SNPS’ ’SO’ ’SPG’ ’SRCL’ ’SRE’ ’STT’ ’STZ’ ’SWK’ ’SWKS’ ’SYK’ ’SYMC’ ’SYY’ ’T’ ’TAP’ ’TGT’ ’TIF’ ’TJX’ ’TMK’ ’TMO’ ’TROW’ ’TRV’ ’TSCO’ ’TSN’ ’TTWO’ ’TXN’ 
’TXT’ ’UDR’ ’UHS’ ’UNH’ ’UNM’ ’UNP’ ’UPS’ ’URI’ ’USB’ ’UTX’ ’VAR’ ’VFC’ ’VLO’ ’VMC’ ’VNO’ ’VRSN’ ’VRTX’ ’VTR’ ’VZ’ ’WAT’ ’WBA’ ’WDC’ ’WEC’ ’WFC’ ’WHR’ ’WM’ ’WMB’ ’WMT’ ’WY’ ’XEL’ ’XLNX’ ’XOM’ ’XRAY’ ’XRX’ ’YUM’ ’ZION’]

Appendix B Additional Discussion of the Proposed Algorithm

B.1 Derivation

We first derive Equation 18. The original training loss is

\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{t}\left[G_{t}(\pi)\right]-\lambda\text{Var}_{t}[G_{t}(\pi)]. (26)

The last term can be written as

\lambda\text{Var}_{t}[G_{t}(\pi)] =\lambda\left(\mathbb{E}_{z_{1},...,z_{t}}[z_{t}^{2}\pi_{t}^{2}]-\mathbb{E}_{z_{1},...,z_{t}}[z_{t}\pi_{t}]^{2}\right) (27)
=\lambda\left(\mathbb{E}_{z_{t}}[z_{t}^{2}]\mathbb{E}_{z_{1},...,z_{t-1}}[\pi_{t}^{2}]-\mathbb{E}_{z_{t}}[z_{t}]^{2}\mathbb{E}_{z_{1},...,z_{t-1}}[\pi_{t}]^{2}\right) (28)
=\lambda r_{t}^{2}\mathbb{E}_{z_{1},...,z_{t-1}}[\pi_{t}^{2}]+\lambda c^{2}\hat{\sigma}_{t}^{2}|r_{t}|\mathbb{E}_{z_{1},...,z_{t-1}}[\pi_{t}^{2}]-\lambda r_{t}^{2}\mathbb{E}_{z_{1},...,z_{t-1}}[\pi_{t}]^{2} (29)
=\lambda r_{t}^{2}\text{Var}_{z_{1},...,z_{t-1}}[\pi_{t}]+\lambda c^{2}\hat{\sigma}_{t}^{2}|r_{t}|\mathbb{E}_{z_{1},...,z_{t-1}}[\pi_{t}^{2}], (30)

where the second line uses the independence of $z_t$ from $z_1,\dots,z_{t-1}$, and the third uses $\mathbb{E}[z_t^2]=r_t^2+c^2\hat{\sigma}_t^2|r_t|$ and $\mathbb{E}[z_t]=r_t$.

Plugging this back into Eq. (26) leads to the following maximization problem, which is the desired equation:

\arg\max_{b_{t}}\left\{\frac{1}{T}\sum_{t=1}^{T}\underbrace{\mathbb{E}_{x}\left[G_{t}(\pi)\right]}_{A:\text{ wealth gain}}-\underbrace{\lambda r_{t}^{2}\text{Var}_{z_{1},...,z_{t-1}}[\pi_{t}]}_{B:\text{ risk due to uncertainty in past prices}}-\underbrace{\lambda c^{2}\hat{\sigma}_{t}^{2}|r_{t}|\mathbb{E}_{z_{1},...,z_{t-1}}[\pi_{t}^{2}]}_{C:\text{ risk due to future price}}\right\}, (31)

where we have given each term a name for reference; the expectation is taken with respect to the augmented data points $z_i$:

r_{i}\to z_{i}=r_{i}+c\sqrt{\hat{\sigma}^{2}_{i}|r_{i}|}\epsilon_{i}\quad\text{for }r_{i}\in x. (32)

Under the GBM model (or when the optimal portfolio depends only weakly on $z_1,\dots,z_{t-1}$), the optimal $\pi_t$ does not depend on $z_1,\dots,z_{t-1}$, and the objective can be further simplified to

\arg\max_{b_{t}}\left\{\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{z}\left[G_{t}(\pi)\right]-\lambda c^{2}\hat{\sigma}_{t}^{2}|r_{t}|\pi_{t}^{2}\right\}; (33)

the first term is the data augmentation for the wealth gain, and the second term is a regularization for risk control. Most of the experiments in this paper use this equation for training. When it does not work well, readers are encouraged to try the full training objective in Equation 31.
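The input-noise transformation of Eq. (32) can be sketched as follows (a minimal NumPy illustration; `augment_returns` is our name, and `sigma2_hat` denotes the estimated volatility $\hat{\sigma}_i^2$, assumed to be given):

```python
import numpy as np

def augment_returns(r, sigma2_hat, c=1.0, rng=None):
    """Inject noise z_i = r_i + c * sqrt(sigma_i^2 * |r_i|) * eps_i, eps_i ~ N(0, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    r = np.asarray(r, dtype=float)
    return r + c * np.sqrt(sigma2_hat * np.abs(r)) * rng.standard_normal(r.shape)
```

Setting $c=0$ recovers the unaugmented returns, and the injected noise is unbiased, so the augmented returns average back to the observed ones.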

B.2 Extension to Multi-Asset Setting

It is possible and interesting to derive the data augmentation for a multi-asset setting. However, this is hindered by the lack of a standard model describing the co-movement of multiple stocks; for example, it is unclear what the analogue of geometric Brownian motion should be in a multi-stock setting. We therefore tentatively suggest the following form for the injected noise, whose effectiveness and theoretical justification are left for future work. Let $\mathbf{S}_{t}=(S_{1,t},\dots,S_{N,t})$ be the prices of the stocks viewed as an $N$-dimensional vector. Assuming the return has covariance $\Sigma$, then, by analogy with the result of this work:

S_{i,t}\to S_{i,t}+c\sqrt{\sum_{j}\Sigma_{ij}|r_{j,t}|S_{j,t}^{2}}\,\epsilon_{i,t} (34)

for some white Gaussian noise $\epsilon$ and a tunable parameter $c$. The matrix $\Sigma$ has to be estimated from the data using standard methods for estimating multi-stock volatility.
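A minimal sketch of this tentative multi-asset noise follows. The function name and interface are ours, and the clipping of the per-asset variance is an added safeguard (negative covariances can otherwise make the term under the square root negative); $\Sigma$ is assumed to be an already-estimated $N \times N$ return covariance matrix.

```python
import numpy as np

def augment_prices_multi(S, r, Sigma, c=0.1, rng=None):
    """Tentative Eq. (34): S_i -> S_i + c * sqrt(sum_j Sigma_ij |r_j| S_j^2) * eps_i.
    Sigma is an estimated N x N return covariance matrix."""
    rng = np.random.default_rng(rng)
    S = np.asarray(S, dtype=float)
    r = np.asarray(r, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    # per-asset noise variance: sum_j Sigma_ij * |r_j| * S_j^2
    noise_var = Sigma @ (np.abs(r) * S**2)
    # guard against small negative values caused by negative covariances
    noise_var = np.clip(noise_var, 0.0, None)
    return S + c * np.sqrt(noise_var) * rng.standard_normal(S.shape)
```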

B.3 Non-Gaussian Noise

While the proposed algorithm uses Gaussian noise for injection, the theory developed in this work only requires the noise to have a finite second moment; therefore, any other such distribution works equally well for the particular utility function specified in Eq. (5). This is because this utility function involves only the second moment of the wealth return and is indifferent to higher moments. This points to one caveat of the present theory: when the utility function involves higher moments of the wealth return, the utility of a given type of noise injection is no longer indifferent to the choice of the injection distribution. Practitioners are recommended to analyze the specific utility function they use within our framework and to choose the strength and distribution of the injected noise accordingly.

Appendix C Additional Theory and Proofs

This section contains all the additional theoretical discussions and the proofs.

C.1 Background: Classical Solution to the Portfolio Construction Problem

Consider a market with $N$ stocks. We denote the prices of these stocks by $S_t\in\mathbb{R}^N$, and one can likewise define the stock price return as

r_{t}=\frac{S_{t+1}-S_{t}}{S_{t}}, (35)

which is a random variable with covariance matrix $C_t:=\text{Cov}[r_t]$ and expected return $g_t:=\mathbb{E}[r_t]$. Our wealth at time step $t$ is $W_t=M_t+\sum_i^N n_{t,i}S_{t,i}:=n_t^{\rm T}S_t$, where $M_t$ is the amount of cash we hold and $n_{t,i}$ is the number of shares we hold of the $i$-th stock; we also define the vectors $n_t:=(n_{t,1},\dots,n_{t,N},1)$ and $S_t:=(S_{t,1},\dots,S_{t,N},M_t)$, so that cash is included in the definition of $n_t$. As in the standard finance literature, we assume that shares are infinitely divisible; usually, a positive $n_{t,i}$ denotes holding (long) and a negative $n_{t,i}$ denotes borrowing (short). Our initial wealth is $W_0>0$, and we would like to invest it in the $N$ stocks; we denote the relative value of each stock we hold as a vector $\pi_t\in\mathbb{R}^{N+1}$ with $\pi_{t,i}=\frac{n_{t,i}S_{t,i}}{W_t}$; $\pi$ is called a portfolio, and the central challenge of portfolio theory is to find the best $\pi$. At time $t$, our wealth is $W_t$; after one time step, our wealth changes due to a change in the prices of the stocks: $\Delta W_t:=W_{t+1}-W_t=W_t\pi_t\cdot r_t$. The standard goal is to maximize the wealth return $G_t:=\pi_t\cdot r_t$ at every time step while minimizing risk888It is important not to confuse the price return $r_t$ with the wealth return $G_t$.. The risk is defined as the variance of the wealth change999In principle, any concave function of $G_t$ can serve as a risk regularizer from classical economic theory (Von Neumann and Morgenstern,, 1947), and our framework can be easily extended to such cases; one common alternative is $R(G)=\log(G)$.:

Rt:=R(πt):=Varrt[Gt]=πtT(𝔼[rtrtT]gtgtT)πt=πtTCtπt.R_{t}:=R(\pi_{t}):=\text{Var}_{r_{t}}[G_{t}]=\pi_{t}^{\rm T}\left(\mathbb{E}[r_{t}r_{t}^{\rm T}]-g_{t}g_{t}^{\rm T}\right)\pi_{t}=\pi_{t}^{\rm T}C_{t}\pi_{t}. (36)

The standard way to control risk is to introduce a “risk regularizer” that penalizes portfolios with large risk (Markowitz,, 1959; Rubinstein,, 2002). Introducing a parameter $\lambda$ for the strength of the regularization (the factor of $1/2$ appears by convention), we can now write down our objective:

\pi_{t}^{*}=\arg\max_{\pi}U(\pi):=\arg\max_{\pi}\left[\pi^{\rm T}g_{t}-\frac{\lambda}{2}R(\pi)\right]. (37)

Here, $U$ stands for the utility function, and $\lambda$ can be set to the desired level of risk aversion. When $g_t$ and $C_t$ are known, this problem can be solved explicitly (see below). However, a central problem in finance is that data is very limited: we only observe one particular realized trajectory, and therefore $g_t$ and $C_t$ cannot be accurately estimated.

Eq. (37) can be solved directly by taking the derivative and setting it to zero; we obtain

{πt=1λCt1gt;Rt(πt)=1λ2gtTCt1gt.\begin{cases}\pi_{t}^{*}=\frac{1}{\lambda}C_{t}^{-1}g_{t};\\ R_{t}(\pi_{t}^{*})=\frac{1}{\lambda^{2}}g_{t}^{\rm T}C_{t}^{-1}g_{t}.\end{cases} (38)

This is the standard formula to use when both $g_t$ and $C_t$ are known or can be accurately estimated (Bouchaud and Potters,, 2009). Meanwhile, when one has difficulty estimating $g_t$ or, more importantly, $C_t$, the above formula can go arbitrarily wrong. Let $\hat{C}$ denote the estimated covariance and $\hat{g}$ the estimated mean101010For example, using some Bayesian machine learning model., then the in-sample risk and the true risk are given, respectively, by

{R^t(πt)=1λ2g^tTC^t1g^t;Rt(πt)=1λ2g^tTC^t1CtC^t1g^t.\begin{cases}\hat{R}_{t}(\pi_{t}^{*})=\frac{1}{\lambda^{2}}\hat{g}_{t}^{\rm T}\hat{C}_{t}^{-1}\hat{g}_{t};\\ {R}_{t}(\pi_{t}^{*})=\frac{1}{\lambda^{2}}\hat{g}_{t}^{\rm T}\hat{C}_{t}^{-1}C_{t}\hat{C}_{t}^{-1}\hat{g}_{t}.\end{cases} (39)

The readers are encouraged to examine the differences between the two equations carefully.
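The difference between the two lines of Eq. (39) is easy to explore numerically. The sketch below (the helper names are ours) constructs the plug-in portfolio of Eq. (38) from estimated moments and evaluates its risk under both the estimated and the true covariance:

```python
import numpy as np

def markowitz_portfolio(g_hat, C_hat, lam=1.0):
    """Eq. (38): pi* = C^{-1} g / lambda, via a linear solve."""
    return np.linalg.solve(C_hat, g_hat) / lam

def in_sample_risk(g_hat, C_hat, lam=1.0):
    """First line of Eq. (39): risk under the estimated covariance."""
    pi = markowitz_portfolio(g_hat, C_hat, lam)
    return float(pi @ C_hat @ pi)

def true_risk(g_hat, C_hat, C_true, lam=1.0):
    """Second line of Eq. (39): same portfolio, risk under the true covariance."""
    pi = markowitz_portfolio(g_hat, C_hat, lam)
    return float(pi @ C_true @ pi)
```

When $\hat{C}=C$, the two risks coincide; an underestimated covariance makes the true risk exceed the in-sample risk, which is the failure mode described above.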

C.2 Analogy to Statistical Decision Theory and the Minimax Formulation

In Section 4.6, we mentioned that our procedure is analogous to finding a Bayesian estimator in statistical decision theory. Here, we explain this analogy in a little more detail (but keep in mind that this is only an analogy, not a rigorous equivalence). The empirical utility in Equation 14 can be seen as the analogue of the statistical risk function $R$; finding the optimal data augmentation strength is similar to finding the best Bayesian estimator. To make an exact correspondence with statistical decision theory, we also need to define a prior over the risk in Equation 14:

r:=𝔼p(Ω)[V(π^(θ))]=𝔼p(Ω)𝔼xp(x;Ω)𝔼{xi}NpN(x)[U(x,π^(θ))]r:=\mathbb{E}_{p(\Omega)}[V(\hat{\pi}(\theta))]=\mathbb{E}_{p(\Omega)}\mathbb{E}_{x\sim p(x;\Omega)}\mathbb{E}_{\{x_{i}\}^{N}\sim p^{N}(x)}[U(x,\hat{\pi}(\theta))] (40)

where we have written the distributions p(x;Ω)p(x;\Omega) as a function of the true parameters in our underlying model.111111For example, in the GBM model, the true parameters are the growth rate rr and the volatility σ\sigma, and so Ω=(r,σ)\Omega=(r,\sigma), and p(Ω)=p(r,σ)p(\Omega)=p(r,\sigma). In the main text, we have effectively assumed that p(Ω)p(\Omega) is a Dirac delta distribution, but, in the more general case, it is possible that the true parameter is not known or cannot be accurately estimated, and it makes sense to assign a prior distribution to them. One can then find the optimal data augmentation with respect to rr: θ=argmaxθr\theta^{*}=\arg\max_{\theta}r.

Financial Terms ↔ Statistical Terms
Utility $U$ ↔ Loss function $L$
Expected Utility $V$ ↔ Risk $R$
Data Augmentation Parameter $\theta$ ↔ Estimator $\hat{\theta}$
True Parameter $\Omega$ ↔ True Parameter $\theta$
Prior $p(\Omega)$ ↔ Prior $p(\theta)$
Table 2: Correspondence between the concepts in our general theory and those of classical statistical decision theory.

See Table 2 for the list of correspondences. The analogy breaks down at the following point: the Bayesian estimator tries to find $\hat{\theta}$ as close to $\theta$ as possible, while, in our formulation, the goal is not to make the data augmentation $\theta$ as close as possible to $\Omega$.

One might also hope to establish an analogous “minimax” theory for the portfolio construction problem. This can be done simply by replacing the expectation over p(Ω)p(\Omega) with a minimization over Ω\Omega:

rminimax:=minΩV(π^(θ))r_{minimax}:=\min_{\Omega}V(\hat{\pi}(\theta)) (41)

and the best augmentation parameter can be found as the maximizer of this risk: $\theta^{*}=\arg\max_{\theta}\min_{\Omega}V$.

C.3 Proof for no data augmentation

When there is no data augmentation, $\mathbb{E}_{t}\left[G_{t}(\pi)\right]=\pi_{t}r_{t}$ and $\text{Var}_{t}\left[G_{t}(\pi)\right]=0$. One immediately sees that the utility is then maximized at

πt={1,if rt01,if rt<0.\pi_{t}^{*}=\begin{cases}1,&\text{if }r_{t}\geq 0\\ -1,&\text{if }r_{t}<0.\end{cases} (42)

We restate the theorem here.

Proposition 4.

(Utility of no-data augmentation strategy.) Let the strategy be as specified in Eq. (42), and let the price trajectory be generated with GBM in Eq. (6) with initial price S0S_{0}, then the true utility is

U=[12Φ(r/σ)]rλ2σ2U=[1-2\Phi({-r}/{\sigma})]r-\frac{\lambda}{2}\sigma^{2} (43)

where U(π)U(\pi) is the utility function defined in Eq. (3).

Proof. For a time-dependent strategy $\pi^{*}_{t}$, the true utility is defined as121212While we mainly use $\Theta(x)$ as the Heaviside step function, we overload this notation slightly: when we write $\Theta(x>0)$, $\Theta$ is the indicator function. This is harmless because the difference is clearly indicated by the argument of the function.

U(\pi^{*})=\mathbb{E}_{S_{0}^{\prime},S_{1}^{\prime},...,S_{T+1}^{\prime}}\left[\frac{1}{T}\sum_{t=1}^{T}\pi_{t}^{*}r_{t}^{\prime}-\frac{\lambda}{2T}\sum_{t=1}^{T}\left((\pi_{t}^{*}r_{t}^{\prime})^{2}-\mathbb{E}_{S_{0}^{\prime},S_{1}^{\prime},...,S_{T+1}^{\prime}}[\pi_{t}^{*}r_{t}^{\prime}]^{2}\right)\right] (44)

where $S_{1}^{\prime},...,S_{T+1}^{\prime}$ is an independently sampled trajectory used for testing, and $r_{t}^{\prime}:=\frac{S_{t+1}^{\prime}-S_{t}^{\prime}}{S_{t}^{\prime}}$ are the respective returns. Now, note that we can write the price update equation (the GBM model) in terms of the returns:

S_{t+1}=(1+r)S_{t}+\sigma S_{t}\eta_{t}\quad\to\quad r_{t}=r+\sigma\eta_{t} (45)

which means that rt𝒩(r,σ2)r_{t}\sim\mathcal{N}(r,\sigma^{2}) obeys a Gaussian distribution. Therefore,

U(π)=rTt=1Tπtλσ22Tt=1T(πt)2.U(\pi^{*})=\frac{r}{T}\sum_{t=1}^{T}\pi_{t}^{*}-\frac{\lambda\sigma^{2}}{2T}\sum_{t=1}^{T}(\pi_{t}^{*})^{2}. (46)

Now we would like to average over $\pi^{*}_{t}$; that is, we average over the realizations of the training set so that the true utility is independent of the sampling of the training set (see Section 4.6 for an explanation).

Recall that the strategy is defined as

\pi_{t}^{*}=\begin{cases}1,&\text{if }r_{t}\geq 0\\ -1,&\text{if }r_{t}<0\end{cases}=\Theta(r_{t}\geq 0)-\Theta(r_{t}<0) (47)

for a training set {S0,,ST}\{S_{0},...,S_{T}\}, and Θ\Theta is the Heaviside step function. We thus have that

{(πt)2=1;𝔼S1,ST+1[πt]=𝔼S1,ST+1[Θ(rt0)Θ(rt<0)]=12Φ(r/σ)\begin{cases}(\pi^{*}_{t})^{2}=1;\\ \mathbb{E}_{S_{1},...S_{T+1}}[\pi_{t}^{*}]=\mathbb{E}_{S_{1},...S_{T+1}}[\Theta(r_{t}\geq 0)-\Theta(r_{t}<0)]=1-2\Phi({-r}/{\sigma})\end{cases} (48)

where $\Phi$ is the Gaussian c.d.f. We can use this to average the utility over the training set; noting that the training set and the test set are independent, we obtain

U\displaystyle U =𝔼S1,ST+1[U(π)]\displaystyle=\mathbb{E}_{S_{1},...S_{T+1}}[U(\pi^{*})] (49)
=1Tt=1T[12Φ(r/σ)]rλ21Tt=1Tσ2\displaystyle=\frac{1}{T}\sum_{t=1}^{T}[1-2\Phi({-r}/{\sigma})]r-\frac{\lambda}{2}\frac{1}{T}\sum_{t=1}^{T}\sigma^{2} (50)
=[12Φ(r/σ)]rλ2σ2.\displaystyle=[1-2\Phi({-r}/{\sigma})]r-\frac{\lambda}{2}\sigma^{2}. (51)

This finishes the proof. \square
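The closed form of Eq. (43) is easy to verify by Monte Carlo. The sketch below (helper names are ours) simulates training returns $r_t\sim\mathcal{N}(r,\sigma^2)$, applies the sign strategy of Eq. (42), and compares the averaged utility of Eq. (46) with Eq. (51):

```python
import numpy as np
from math import erf, sqrt

def Phi(x):
    """Standard Gaussian c.d.f."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def utility_no_aug_closed_form(r, sigma, lam):
    """Eq. (43)/(51): [1 - 2 Phi(-r/sigma)] r - lam sigma^2 / 2."""
    return (1.0 - 2.0 * Phi(-r / sigma)) * r - 0.5 * lam * sigma**2

def utility_no_aug_mc(r, sigma, lam, n=400_000, seed=0):
    """Monte Carlo over training sets: Eq. (46) with pi_t = sgn(r_t)."""
    rng = np.random.default_rng(seed)
    r_train = r + sigma * rng.standard_normal(n)
    pi = np.sign(r_train)              # strategy of Eq. (42)
    return float(r * pi.mean() - 0.5 * lam * sigma**2 * np.mean(pi**2))
```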

C.4 Proof for Additive Gaussian noise

Before proving the proposition, we first show that the strategy is indeed the one given in Eq. (52):

Lemma 1.

The maximizer of the utility function in Eq. 4 with additive Gaussian noise is

πt(ρ)={rtSt22λρ2,if 1<rtSt22λρ2<1;sgn(rt),otherwise.\pi^{*}_{t}(\rho)=\begin{cases}\frac{r_{t}S_{t}^{2}}{2\lambda\rho^{2}},&\text{if }-1<\frac{r_{t}S_{t}^{2}}{2\lambda\rho^{2}}<1;\\ {\rm sgn}(r_{t}),&\text{otherwise.}\end{cases} (52)

Proof. With additive Gaussian noise, we have

{𝔼t[Gt(π)]=πt𝔼t[r~t]=πt𝔼t[St+1+ρt+1ϵt+1StρtϵtSt]=πtSt+1StSt=πtrt;Vart[Gt(π)]=πt2Vart[r~t]=πt2Vart[ρt+1ϵt+1ρtϵtSt]=2ρ2πt2St2,\begin{cases}\mathbb{E}_{t}\left[G_{t}(\pi)\right]=\pi_{t}\mathbb{E}_{t}\left[\tilde{r}_{t}\right]=\pi_{t}\mathbb{E}_{t}\left[\frac{S_{t+1}+\rho_{t+1}\epsilon_{t+1}-S_{t}-\rho_{t}\epsilon_{t}}{S_{t}}\right]=\pi_{t}\frac{S_{t+1}-S_{t}}{S_{t}}=\pi_{t}r_{t};\\ \text{Var}_{t}\left[G_{t}(\pi)\right]=\pi_{t}^{2}\text{Var}_{t}[\tilde{r}_{t}]=\pi_{t}^{2}\text{Var}_{t}\left[\frac{\rho_{t+1}\epsilon_{t+1}-\rho_{t}\epsilon_{t}}{S_{t}}\right]=\frac{2\rho^{2}\pi_{t}^{2}}{S_{t}^{2}},\\ \end{cases} (53)

where the last line follows from the definition of additive Gaussian noise, i.e., $\rho_1=\dots=\rho_T=\rho$. The training objective becomes

πt\displaystyle\pi^{*}_{t} =argmaxπt{1Tt=1T𝔼t[Gt(π)]λ2Vart[Gt(π)]}\displaystyle=\arg\max_{\pi_{t}}\left\{\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{t}\left[G_{t}(\pi)\right]-\frac{\lambda}{2}\text{Var}_{t}[G_{t}(\pi)]\right\} (54)
=argmaxπt{1Tt=1Tπtrtλρ2πt2St2}.\displaystyle=\arg\max_{\pi_{t}}\left\{\frac{1}{T}\sum_{t=1}^{T}\pi_{t}r_{t}-\lambda\frac{\rho^{2}\pi_{t}^{2}}{S_{t}^{2}}\right\}. (55)

This maximization problem can be solved for each $t$ separately. Taking the derivative and setting it to zero, we find the condition that $\pi_t^*$ satisfies

πt(πtrtλρ2πt2St2)=0\displaystyle\frac{\partial}{\partial\pi_{t}}\left(\pi_{t}r_{t}-\lambda\frac{\rho^{2}\pi_{t}^{2}}{S_{t}^{2}}\right)=0 (56)
\displaystyle\longrightarrow\quad πt(ρ)=rtSt22λρ2.\displaystyle\pi^{*}_{t}(\rho)=\frac{r_{t}S_{t}^{2}}{2\lambda\rho^{2}}. (57)

By definition, we also have |πt|1|\pi_{t}|\leq 1, and so

πt(ρ)={rtSt22λρ2,if 1<rtSt22λρ2<1;sgn(rt),otherwise,\pi^{*}_{t}(\rho)=\begin{cases}\frac{r_{t}S_{t}^{2}}{2\lambda\rho^{2}},&\text{if }-1<\frac{r_{t}S_{t}^{2}}{2\lambda\rho^{2}}<1;\\ {\rm sgn}(r_{t}),&\text{otherwise,}\end{cases} (58)

which is the desired result. \square

We would like to comment that, although we paid special attention to enforcing the constraint $|\pi_t|\leq 1$, it is often not needed in practice because investors tend to be quite risk-averse, and it is hard to imagine that any investor would invest all of his or her money in the financial market such that $\pi_t=1$. Mathematically, this means that it is often the case that $\frac{|r_t|S_t^2}{2\lambda\rho^2}\leq 1$. Therefore, in what follows, we assume that $\frac{|r_t|S_t^2}{2\lambda\rho^2}\leq 1$ for all $t$ for notational simplicity; note that, even without this assumption, the conclusion that multiplicative noise is the better kind of data augmentation does not change. Now we are ready to prove the proposition.

Proposition 5.

(Utility of additive Gaussian noise strategy.) Let the strategy be as specified in Eq. (52), and other conditions the same as in Proposition 1, then the true utility is

U_{\rm Add}=\frac{r^{2}}{2\lambda\sigma^{2}T}\mathbb{E}_{S_{t}}\left[\frac{(\sum_{t}r_{t}S_{t}^{2})^{2}}{\sum_{t}(r_{t}S_{t}^{2})^{2}}\Theta\left(\sum_{t}r_{t}S_{t}^{2}\right)\right]. (59)

Proof. The beginning of the proof is similar to the case with no data augmentation. Following the same procedure, we obtain an equation that is the same as Eq. (46):

U(π)=rTt=1Tπtλσ22Tt=1T(πt)2.U(\pi^{*})=\frac{r}{T}\sum_{t=1}^{T}\pi_{t}^{*}-\frac{\lambda\sigma^{2}}{2T}\sum_{t=1}^{T}(\pi_{t}^{*})^{2}. (60)

Plugging in the preceding lemma, we have

U(π)=rTt=1TrtSt22λρ2λσ22Tt=1T(rtSt22λρ2)2.U(\pi^{*})=\frac{r}{T}\sum_{t=1}^{T}\frac{r_{t}S_{t}^{2}}{2\lambda\rho^{2}}-\frac{\lambda\sigma^{2}}{2T}\sum_{t=1}^{T}\left(\frac{r_{t}S_{t}^{2}}{2\lambda\rho^{2}}\right)^{2}. (61)

This utility is a function of the data augmentation strength $\rho$. For a fixed training set, we would like to find the best $\rho$ that maximizes the above utility. Note that the maximizer of the utility is different depending on the sign of $\sum_t r_tS_t^2$. Taking the derivative and setting it to zero, we obtain

(\rho^{*})^{2}=\begin{cases}\frac{\sigma^{2}}{2r}\frac{\sum_{t}^{T}(r_{t}S_{t}^{2})^{2}}{\sum_{t}^{T}r_{t}S_{t}^{2}},&\text{if }\sum_{t}^{T}r_{t}S_{t}^{2}>0\\ \infty,&\text{otherwise.}\\ \end{cases} (62)

Plugging this into the previous lemma, we have

\pi_{t}^{*}(\rho^{*})=\frac{r_{t}S_{t}^{2}}{2\lambda(\rho^{*})^{2}}=\frac{rr_{t}S_{t}^{2}}{\lambda\sigma^{2}}\frac{\sum_{t}r_{t}S_{t}^{2}}{\sum_{t}(r_{t}S_{t}^{2})^{2}}\Theta\left(\sum_{t}r_{t}S_{t}^{2}\right). (63)

Notice that the optimal strength is independent of $\lambda$, which is an arbitrary value dependent only on the investor's psychology. Plugging into the utility function and taking the expectation with respect to the training set, we obtain

U_{\rm add} =\mathbb{E}_{S_{1},...,S_{T+1}}\left[U(\pi^{*}(\rho^{*}))\right] (64)
=\mathbb{E}_{S_{1},...,S_{T+1}}\left[\frac{r}{T}\sum_{t=1}^{T}\pi_{t}^{*}(\rho^{*})-\frac{\lambda\sigma^{2}}{2T}\sum_{t=1}^{T}[\pi_{t}^{*}(\rho^{*})]^{2}\right] (65)
=\frac{r^{2}}{2\lambda\sigma^{2}T}\mathbb{E}_{S_{1},...,S_{T+1}}\left[\frac{(\sum_{t}r_{t}S_{t}^{2})^{2}}{\sum_{t}(r_{t}S_{t}^{2})^{2}}\Theta\left(\sum_{t}r_{t}S_{t}^{2}\right)\right]. (66)

This finishes the proof. \square

Remark.

Notice that the term inside the expectation generally depends on $T$ in a non-trivial way and cannot be evaluated explicitly. However, this causes no problem, since the final goal is to compare it with the result of the next section.

C.5 Proof for General Multiplicative Gaussian noise

Before proving the proposition, we first find the strategy for this case. Note that the term $\rho_t^2+\rho_{t+1}^2$ appears repeatedly in this section, so we define the shorthand notation $\gamma_t^2:=\frac{1}{2}(\rho_t^2+\rho_{t+1}^2)$.

Lemma 2.

The maximizer of the utility function in Eq. 4 with multiplicative Gaussian noise is

πt(ρ)={rtSt22λγt2=rtSt2λ(ρt2+ρt+12),if 1<rtSt22λγt2<1;sgn(rt),otherwise.\pi^{*}_{t}(\rho)=\begin{cases}\frac{r_{t}S_{t}^{2}}{2\lambda\gamma_{t}^{2}}=\frac{r_{t}S_{t}^{2}}{\lambda(\rho_{t}^{2}+\rho_{t+1}^{2})},&\text{if }-1<\frac{r_{t}S_{t}^{2}}{2\lambda\gamma_{t}^{2}}<1;\\ {\rm sgn}(r_{t}),&\text{otherwise.}\end{cases} (67)

Proof. With multiplicative Gaussian noise, we have

\begin{cases}\mathbb{E}_{t}\left[G_{t}(\pi)\right]=\pi_{t}\mathbb{E}_{t}\left[\tilde{r}_{t}\right]=\pi_{t}\mathbb{E}_{t}\left[\frac{S_{t+1}+\rho_{t+1}\epsilon_{t+1}-S_{t}-\rho_{t}\epsilon_{t}}{S_{t}}\right]=\pi_{t}\frac{S_{t+1}-S_{t}}{S_{t}}=\pi_{t}r_{t};\\ \text{Var}_{t}\left[G_{t}(\pi)\right]=\pi_{t}^{2}\text{Var}_{t}[\tilde{r}_{t}]=\pi_{t}^{2}\text{Var}_{t}\left[\frac{\rho_{t+1}\epsilon_{t+1}-\rho_{t}\epsilon_{t}}{S_{t}}\right]=\frac{(\rho_{t}^{2}+\rho_{t+1}^{2})\pi_{t}^{2}}{S_{t}^{2}}.\\ \end{cases} (68)

We see that $\gamma_t^2:=\frac{1}{2}(\rho_t^2+\rho_{t+1}^2)$ plays the role that $\rho^2$ played for additive Gaussian noise. The training objective becomes

πt\displaystyle\pi^{*}_{t} =argmaxπt{1Tt=1T𝔼t[Gt(π)]λ2Vart[Gt(π)]}\displaystyle=\arg\max_{\pi_{t}}\left\{\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{t}\left[G_{t}(\pi)\right]-\frac{\lambda}{2}\text{Var}_{t}[G_{t}(\pi)]\right\} (69)
=argmaxπt{1Tt=1Tπtrtλγt2πt2St2}.\displaystyle=\arg\max_{\pi_{t}}\left\{\frac{1}{T}\sum_{t=1}^{T}\pi_{t}r_{t}-\lambda\frac{\gamma_{t}^{2}\pi_{t}^{2}}{S_{t}^{2}}\right\}. (70)

This maximization problem can be solved for each $t$ separately. Taking the derivative and setting it to zero, we find the condition that $\pi_t^*$ satisfies

πt(πtrtλγt2πt2St2)=0\displaystyle\frac{\partial}{\partial\pi_{t}}\left(\pi_{t}r_{t}-\lambda\frac{\gamma_{t}^{2}\pi_{t}^{2}}{S_{t}^{2}}\right)=0 (71)
\displaystyle\longrightarrow\quad πt(γt)=rtSt22λγt2.\displaystyle\pi^{*}_{t}(\gamma_{t})=\frac{r_{t}S_{t}^{2}}{2\lambda\gamma_{t}^{2}}. (72)

By definition, we also have |πt|1|\pi_{t}|\leq 1, and so

πt(ρ)={rtSt22λγt2,if 1<rtSt22λγt2<1;sgn(rt),otherwise,\pi^{*}_{t}(\rho)=\begin{cases}\frac{r_{t}S_{t}^{2}}{2\lambda\gamma_{t}^{2}},&\text{if }-1<\frac{r_{t}S_{t}^{2}}{2\lambda\gamma_{t}^{2}}<1;\\ {\rm sgn}(r_{t}),&\text{otherwise,}\end{cases} (73)

which is the desired result. \square

Remark.

For a fair comparison with the previous section, we also assume that $\frac{|r_t|S_t^2}{2\lambda\gamma_t^2}\leq 1$. Again, this amounts to assuming that investors are reasonably risk-averse, which is the correct assumption in all practical circumstances.

Proposition 6.

(Utility of general multiplicative Gaussian noise strategy.) Let the strategy be as specified in Eq. (67), and other conditions be the same as in Proposition 1; then the true utility is

Umult=r22λσ2[1Φ(r/σ)].U_{\rm mult}=\frac{r^{2}}{2\lambda\sigma^{2}}[1-\Phi(-r/\sigma)]. (74)

Proof. Most of the proof is similar to the additive Gaussian case, with $\rho^2$ replaced by $\gamma_t^2$. Following the same procedure, we have:

U(π)=rTt=1Tπtλσ22Tt=1T(πt)2.U(\pi^{*})=\frac{r}{T}\sum_{t=1}^{T}\pi_{t}^{*}-\frac{\lambda\sigma^{2}}{2T}\sum_{t=1}^{T}(\pi_{t}^{*})^{2}. (75)

Plugging in the preceding lemma, we have

U(π)=rTt=1TrtSt22λγt2λσ22Tt=1T(rtSt22λγt2)2.U(\pi^{*})=\frac{r}{T}\sum_{t=1}^{T}\frac{r_{t}S_{t}^{2}}{2\lambda\gamma_{t}^{2}}-\frac{\lambda\sigma^{2}}{2T}\sum_{t=1}^{T}\left(\frac{r_{t}S_{t}^{2}}{2\lambda\gamma_{t}^{2}}\right)^{2}. (76)

This utility is a function of the data augmentation strength $\gamma_t$ and, unlike the additive Gaussian case, can be maximized term by term for each $t$. For a fixed training set, we would like to find the best $\gamma_t$ that maximizes the above utility. Note that the maximizer of the utility is different depending on the sign of $r_tS_t^2$. Taking the derivative and setting it to zero, we obtain

(γt)2={σ22rrtSt2,if rtSt2>0,otherwise.(\gamma_{t}^{*})^{2}=\begin{cases}\frac{\sigma^{2}}{2r}r_{t}S_{t}^{2},&\text{if }r_{t}S_{t}^{2}>0\\ \infty,&\text{otherwise.}\\ \end{cases} (77)

Plugging this into the previous lemma, we have

πt(γt)=rtSt22λ(γt)2=rλσ2Θ(rt).\pi_{t}^{*}(\gamma_{t}^{*})=\frac{r_{t}S_{t}^{2}}{2\lambda(\gamma_{t}^{*})^{2}}=\frac{r}{\lambda\sigma^{2}}\Theta(r_{t}). (78)

Notice that the optimal strength is independent of $\lambda$, which is an arbitrary value dependent only on the psychology of the investor. Plugging into the utility function and taking the expectation with respect to the training set, we obtain

U_{\rm mult} =\mathbb{E}_{S_{1},...,S_{T+1}}\left[U(\pi^{*}(\gamma^{*}))\right] (79)
=\mathbb{E}_{S_{1},...,S_{T+1}}\left[\frac{r}{T}\sum_{t=1}^{T}\pi_{t}^{*}(\gamma_{t}^{*})-\frac{\lambda\sigma^{2}}{2T}\sum_{t=1}^{T}[\pi_{t}^{*}(\gamma_{t}^{*})]^{2}\right] (80)
=\frac{r^{2}}{\lambda\sigma^{2}}\mathbb{E}_{t}[\Theta(r_{t})]-\frac{r^{2}}{2\lambda\sigma^{2}}\mathbb{E}_{t}[\Theta(r_{t})^{2}] (81)
=\frac{r^{2}}{2\lambda\sigma^{2}}[1-\Phi(-r/\sigma)] (82)
=\frac{r^{2}}{2\lambda\sigma^{2}}\Phi(r/\sigma). (83)

This finishes the proof. \square
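As for the no-augmentation case, the closed form $U_{\rm mult}=\frac{r^2}{2\lambda\sigma^2}\Phi(r/\sigma)$ can be checked by Monte Carlo, using the learned portfolio $\pi_t=\frac{r}{\lambda\sigma^2}\Theta(r_t)$ of Eq. (78) (the helper names are ours):

```python
import numpy as np
from math import erf, sqrt

def Phi(x):
    """Standard Gaussian c.d.f."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def utility_mult_closed_form(r, sigma, lam):
    """Eq. (83): r^2 Phi(r/sigma) / (2 lam sigma^2)."""
    return r**2 / (2.0 * lam * sigma**2) * Phi(r / sigma)

def utility_mult_mc(r, sigma, lam, n=400_000, seed=0):
    """Average Eq. (75) over training sets with pi_t from Eq. (78)."""
    rng = np.random.default_rng(seed)
    r_train = r + sigma * rng.standard_normal(n)
    pi = (r / (lam * sigma**2)) * (r_train > 0)   # pi_t = (r/lam sigma^2) Theta(r_t)
    return float(r * pi.mean() - 0.5 * lam * sigma**2 * np.mean(pi**2))
```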

This result can be directly compared to the results in the previous section, and the following remark shows that the multiplicative noise injection is the best kind of noise.

Remark.

(Infinite augmentation strength.) For all of the theoretical results, there is a corner case in which the optimal injection strength is infinite, which leads to a non-investing portfolio $\pi=0$. This case requires special interpretation. The corner case is due to the fact that the underlying model we use has a constant, positive expected price return equal to $r$, which leads to the bizarre data augmentation that effectively amounts to throwing away all training points with negative return. This is unnatural for a real market: the real market can have short-term negative expected return when conditioned on previous prices, so one should not simply discard the negative points. Therefore, in our algorithm section, we recommend treating training points with positive and negative returns equally by taking the absolute value inside the data augmentation strength and ignoring the $\infty$ case.

C.6 Proof of Theorem 1

Remark.

Combining the above propositions, one quickly obtains that, if $\sigma\neq 0$, then $U_{\rm mult}>U_{\rm add}$ and $U_{\rm mult}>U_{\rm no-aug}$ with probability $1$.

Proof. We first show that Umult>UaddU_{\rm mult}>U_{\rm add}. Recall that

U_{\rm Add} =\frac{r^{2}}{2\lambda\sigma^{2}T}\mathbb{E}_{S_{t}}\left[\frac{(\sum_{t}r_{t}S_{t}^{2})^{2}}{\sum_{t}(r_{t}S_{t}^{2})^{2}}\Theta\left(\sum_{t}r_{t}S_{t}^{2}\right)\right] (84)
\leq\frac{r^{2}}{2\lambda\sigma^{2}T}\mathbb{E}_{S_{t}}\left[\frac{\left(\sum_{t}r_{t}S_{t}^{2}\Theta(r_{t}>0)\right)^{2}}{\sum_{t}(r_{t}S_{t}^{2})^{2}}\Theta\left(\sum_{t}r_{t}S_{t}^{2}\right)\right] (85)
\leq\frac{r^{2}}{2\lambda\sigma^{2}T}\mathbb{E}_{S_{t}}\left[\frac{(\sum_{t}r_{t}S_{t}^{2}\Theta(r_{t}>0))^{2}}{\sum_{t}(r_{t}S_{t}^{2})^{2}}\right] (86)
\leq_{\text{(Cauchy inequality)}}\frac{r^{2}}{2\lambda\sigma^{2}T}\mathbb{E}_{S_{t}}\left[\sum_{t}\frac{(r_{t}S_{t}^{2}\Theta(r_{t}>0))^{2}}{(r_{t}S_{t}^{2})^{2}}\right] (87)
=\frac{r^{2}}{2\lambda\sigma^{2}T}\mathbb{E}_{S_{t}}\left[\sum_{t}\Theta(r_{t}>0)\right] (88)
=\frac{r^{2}}{2\lambda\sigma^{2}T}\sum_{t}\mathbb{P}(r_{t}>0) (89)
=\frac{r^{2}}{2\lambda\sigma^{2}}\Phi(r/\sigma)=U_{\rm mult}. (90)

The Cauchy inequality holds with equality if and only if $S_1=\dots=S_{T+1}$; this event has probability measure $0$, and so, with probability $1$, $U_{\rm add}<U_{\rm mult}$.

Now we prove the second inequality. Recall that

Unoaug=[12Φ(r/σ)]rλ2σ2.U_{\rm no-aug}=[1-2\Phi({-r}/{\sigma})]r-\frac{\lambda}{2}\sigma^{2}. (91)

Write $\phi:=\Phi(r/\sigma)$, so that $1-2\Phi(-r/\sigma)=2\phi-1$ and, since $r>0$, $\phi\in(1/2,1)$. Then

U_{\rm no-aug}=(2\phi-1)r-\frac{\lambda}{2}\sigma^{2},\qquad U_{\rm mult}=\frac{r^{2}}{2\lambda\sigma^{2}}\phi. (92)

By the AM–GM inequality,

U_{\rm mult}+\frac{\lambda}{2}\sigma^{2}=\frac{r^{2}\phi}{2\lambda\sigma^{2}}+\frac{\lambda\sigma^{2}}{2}\geq 2\sqrt{\frac{r^{2}\phi}{4}}=r\sqrt{\phi}, (93)

and therefore

U_{\rm mult}-U_{\rm no-aug}\geq r\sqrt{\phi}-(2\phi-1)r=r\left(\sqrt{\phi}-2\phi+1\right)>0, (94)

where the last inequality holds because $\sqrt{\phi}>2\phi-1$ for all $\phi\in(1/2,1)$.

This finishes the proof. \square
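A quick numerical check of the second inequality: over a grid of risk-aversion values $\lambda$, the closed-form $U_{\rm mult}$ of Eq. (74) dominates $U_{\rm no-aug}$ of Eq. (43) for $r>0$ (helper names are ours):

```python
import numpy as np
from math import erf, sqrt

def Phi(x):
    """Standard Gaussian c.d.f."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def utility_gap(r, sigma, lam):
    """U_mult - U_noaug from Eqs. (74) and (43); positive for all lam > 0 when r > 0."""
    u_mult = r**2 / (2.0 * lam * sigma**2) * Phi(r / sigma)
    u_noaug = (1.0 - 2.0 * Phi(-r / sigma)) * r - 0.5 * lam * sigma**2
    return u_mult - u_noaug

# sweep lambda over six orders of magnitude
gaps = [utility_gap(0.05, 0.2, lam) for lam in np.logspace(-3, 3, 61)]
```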

C.7 Augmentation for a naive multiplicative noise

In the discussion and experiment sections of the main text, we also mentioned a “naive” version of the multiplicative noise. The motivation for this kind of noise is simple: the underlying noise in the theoretical models is of the form $\sigma^2S_t^2$, and so it is tempting to inject noise that mimics the underlying noise. It turns out that this is not a good idea.

In this section, we let ρt=ρ0St\rho_{t}=\rho_{0}S_{t} for some positive, time-independent ρ0\rho_{0}. Our goal is to find the optimal ρ0\rho_{0}. With the same calculation, one finds that the learned portfolio is given by the same formula in Lemma 2:

πt(ρ)=rtSt22λγt2.\pi^{*}_{t}(\rho)=\frac{r_{t}S_{t}^{2}}{2\lambda\gamma_{t}^{2}}. (96)

With this strategy, one can find the optimal noise injection strength to be given by the following proposition.

Proposition 7.

Let the portfolio be given by Eq. (96) and let the price be generated by the GBM, then the optimal noise strength is

(\rho_{0}^{*})^{2}=\begin{cases}\frac{\sigma^{2}}{2r}\frac{\sum_{t}^{T}r_{t}^{2}}{\sum_{t}^{T}r_{t}},&\text{if }\sum_{t}^{T}r_{t}>0\\ \infty,&\text{otherwise.}\\ \end{cases} (97)

Proof. As before, we have:

U(π)=rTt=1Tπtλσ22Tt=1T(πt)2.U(\pi^{*})=\frac{r}{T}\sum_{t=1}^{T}\pi_{t}^{*}-\frac{\lambda\sigma^{2}}{2T}\sum_{t=1}^{T}(\pi_{t}^{*})^{2}. (98)

Plugging in the portfolio, we have

U(π)=rTt=1TrtSt22λγt2λσ22Tt=1T(rtSt22λγt2)2.U(\pi^{*})=\frac{r}{T}\sum_{t=1}^{T}\frac{r_{t}S_{t}^{2}}{2\lambda\gamma_{t}^{2}}-\frac{\lambda\sigma^{2}}{2T}\sum_{t=1}^{T}\left(\frac{r_{t}S_{t}^{2}}{2\lambda\gamma_{t}^{2}}\right)^{2}. (99)

Plugging in $\gamma_t^2=\rho_0^2S_t^2$ and taking the derivative, we obtain

(\rho_{0}^{*})^{2}=\begin{cases}\frac{\sigma^{2}}{2r}\frac{\sum_{t}^{T}r_{t}^{2}}{\sum_{t}^{T}r_{t}},&\text{if }\sum_{t}^{T}r_{t}>0\\ \infty,&\text{otherwise.}\\ \end{cases} (100)

This finishes the proof. \square

C.8 Data Augmentation for a Stationary Portfolio

While the main theory focused on a dynamic portfolio that is updated through time, we also present a study of the stationary portfolio in this section. While this kind of portfolio is less relevant for deep-learning-based finance, we study this case to show that, even in this setting, there is still strong motivation to inject noise of strength proportional to $r_tS_t^2$. We now state the formal definition of a stationary portfolio.

Definition 1.

A portfolio {πt}t=1T\{\pi_{t}\}_{t=1}^{T} is said to be stationary if πt=π\pi_{t}=\pi for some constant π\pi for all tt.

In the language of machine learning, this corresponds to choosing our model as having only a single parameter, whose output is input independent:

f(x)=π.f(x)=\pi. (101)

In traditional finance theory, stationary portfolios have been very important. In practice, most portfolios are “approximately” stationary, since most portfolio managers tend not to change their portfolio weights on a very short time scale unless the market is very unstable due to market failure or external information injection. For a stationary portfolio, one still would like to maximize the utility function given in Eq. 4.

For conciseness, we only examine the case of injecting a general time-dependent noise. The curious reader is encouraged to examine the cases with no data augmentation and with constant data augmentation. As before, the following lemma gives the portfolio that maximizes the empirical utility. Again, we use the shorthand $\gamma_t^2:=\frac{1}{2}(\rho_t^2+\rho_{t+1}^2)$. For illustrative purposes, we ignore the corner cases of $\pi$ being greater than $1$ or smaller than $-1$.

Lemma 3.

The stationary portfolio that maximizes the utility function in Eq. 4 with multiplicative Gaussian noise is

\pi^{*}(\rho)=\frac{\sum_{t}^{T}r_{t}}{2\lambda\sum_{t}^{T}(\gamma_{t}^{2}/S_{t}^{2})} (102)

Proof sketch. The proof follows almost the same steps as in the previous sections, with the slight difference that $\pi$ is no longer time-dependent and can be taken out of the sum.

Proposition 8.

Let the portfolio be given by Eq. (102). Then an augmentation strength satisfying the relation

(γt)2=crtSt2(\gamma_{t}^{*})^{2}=cr_{t}S_{t}^{2} (103)

is an optimal data augmentation for constant c=2σ2rc=\frac{2\sigma^{2}}{r}.

Proof Sketch. The proof is also simple and very similar to the proofs for the dynamic portfolio case. One first solves for the optimal augmentation strength γt\gamma_{t}^{*} and finds that it satisfies the following relation

t=1T(γt)2St2=ct=1Trt\sum_{t=1}^{T}\frac{(\gamma^{*}_{t})^{2}}{S_{t}^{2}}=c\sum_{t=1}^{T}r_{t} (104)

with c=2σ2rc=\frac{2\sigma^{2}}{r}, and then it suffices to check that the following is one solution

(γt)2=crtSt2.(\gamma^{*}_{t})^{2}=cr_{t}S_{t}^{2}. (105)

This finishes the proof sketch. \square
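The key step, that Eq. (105) solves Eq. (104), can be checked directly: substituting (γt)2=crtSt2(\gamma_{t}^{*})^{2}=cr_{t}S_{t}^{2} into the left-hand side makes the St2S_{t}^{2} cancel. A minimal numerical check (all data hypothetical; we use |rt||r_{t}| so the noise variance stays nonnegative, in line with the |rt1||r_{t-1}| prescription in the main text):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 100
r = np.abs(rng.normal(1e-3, 0.02, T))  # |r_t|: noise variance must be >= 0
S_sq = rng.uniform(0.5, 2.0, T)        # hypothetical squared prices S_t^2
sigma = 0.02
c = 2.0 * sigma**2 / r.mean()          # c = 2*sigma^2/r, with r a mean return

gamma_star_sq = c * r * S_sq           # Eq. (105)
lhs = (gamma_star_sq / S_sq).sum()     # left-hand side of Eq. (104)
rhs = c * r.sum()                      # right-hand side of Eq. (104)
assert np.isclose(lhs, rhs)            # the S_t^2 factors cancel exactly
```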

The interested reader is encouraged to check the intermediate steps. We see that, even in the stationary portfolio case, there is still strong motivation for using an augmentation with strength proportional to rtSt2r_{t}S_{t}^{2}.

We would like to compare this with the best achievable stationary portfolio, which is given by the following proposition.

Proposition 9.

(Optimal Stationary Portfolio). The optimal stationary portfolio for GBM is πstat=rλσ2\pi^{*}_{stat}=\frac{r}{\lambda\sigma^{2}}, i.e., for any other portfolio π\pi, U(πstat)U(π)U(\pi^{*}_{stat})\geq U(\pi).

Proof. It suffices to find the maximizer portfolio of the true utility:

πstat=argmaxπ{πrλ2π2σ2}.\pi^{*}_{stat}=\arg\max_{\pi}\left\{\pi r-\frac{\lambda}{2}\pi^{2}\sigma^{2}\right\}. (106)

The solution is simple and given by

πstat=rλσ2.\pi^{*}_{stat}=\frac{r}{\lambda\sigma^{2}}. (107)

This completes the proof. \square
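The closed form in Eq. (107) can be confirmed against a brute-force grid search over the objective in Eq. (106); the parameter values below are arbitrary illustrations:

```python
import numpy as np

r, lam, sigma = 0.05, 2.0, 0.2         # arbitrary illustrative parameters

def true_utility(pi):
    # objective of Eq. (106): pi*r - (lam/2)*pi^2*sigma^2
    return pi * r - 0.5 * lam * pi**2 * sigma**2

pi_closed = r / (lam * sigma**2)       # Eq. (107): r / (lam * sigma^2)
grid = np.linspace(-2.0, 2.0, 100001)  # brute-force search over pi
pi_grid = grid[np.argmax(true_utility(grid))]
assert abs(pi_closed - pi_grid) < 1e-3  # grid maximizer matches closed form
```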

Note that the above optimality result holds for both the in-sample counterfactual utility and the out-of-sample utility. This proposition can be seen as the discrete-time version of the famous Merton portfolio solution (Merton,, 1969), where the optimal stationary portfolio is also found to be rλσ2\frac{r}{\lambda\sigma^{2}}. In fact, it is well-known that, for a static market, stationary portfolios are optimal, but this is beyond the scope of this work (Cover and Thomas,, 2006).

Combining the above two propositions, one obtains the following theorem.

Theorem 3.

The stationary portfolio obtained by training with the data augmentation strength given in Proposition 8 is optimal, i.e., it is no worse than any other stationary portfolio.

Proof. Plugging in (γt)2=crtSt2(\gamma_{t}^{*})^{2}=cr_{t}S_{t}^{2}, we find that the trained portfolio is π=rλσ2\pi^{*}=\frac{r}{\lambda\sigma^{2}}, which equals the optimal stationary portfolio, and we are done. \square

This shows that, even for a stationary portfolio, it is useful to use the proposed data augmentation technique.