
Theoretically Motivated Data Augmentation and Regularization for Portfolio Construction

Liu Ziyin1, Kentaro Minami2, Kentaro Imajo2
1Department of Physics, The University of Tokyo
2Preferred Networks, Inc.
Abstract

The task we consider is portfolio construction in a speculative market, a fundamental problem in modern finance. While various empirical works now exist that explore deep learning in finance, the theory side is almost non-existent. In this work, we develop a theoretical framework for understanding the use of data augmentation in deep-learning-based approaches to quantitative finance. The proposed theory clarifies the role and necessity of data augmentation for finance; moreover, it implies that a simple algorithm of injecting random noise of strength \sqrt{|r_{t-1}|} into the observed return r_{t} is better than injecting no noise at all and better than a few other financially irrelevant data augmentation techniques. (This is the full-length version of our work published at the 3rd ACM International Conference on AI in Finance (ICAIF'22); see https://doi.org/10.1145/3533271.3561720 for the shorter published version.)
The code is available at: https://github.com/pfnet-research/Finance_data_augmentation_ICAIF2022

1 Introduction

There is an increasing interest in applying machine learning methods to problems in the finance industry. This trend has been anticipated for almost forty years (Fama, 1970), since well-documented and fine-grained (minute-level) data of stock market prices became available. In fact, the essence of modern finance is fast and accurate large-scale data analysis (Goodhart and O'Hara, 1997), and it is hard to imagine that machine learning should not play an increasingly crucial role in this field. In contemporary research, the central theme in machine-learning-based finance is to apply existing deep learning models to financial time-series prediction problems (Imajo et al., 2020; Buehler et al., 2019; Jay et al., 2020; Imaki et al., 2021; Jiang et al., 2017; Fons et al., 2020; Lim et al., 2019; Zhang et al., 2020), which have demonstrated the hypothesized usefulness of deep learning for the financial industry.

However, one major gap in this interdisciplinary field of deep-learning finance is the lack of a theory suited to justifying finance-oriented algorithm design. The goal of this work is to propose such a framework, in which machine learning practices are analyzed in a traditional financial-economic utility theory setting. Our theory implies that a simple, theoretically motivated data augmentation technique is better for the task of portfolio construction than injecting no noise at all or using naive noise injection methods that have no theoretical justification. To summarize, our main contributions are (1) to demonstrate how utility theory can be used to analyze practices of deep-learning-based finance, and (2) to theoretically study the role of data augmentation in the deep-learning-based portfolio construction problem. Organization: the next section discusses the main related works; Section 3 provides the requisite finance background for understanding this work; Section 4 presents our theoretical contribution, a framework for understanding machine-learning practices in the portfolio construction problem; Section 5 describes how to practically implement the theoretically motivated algorithm; Section 6 validates the theory with experiments on toy and real data.

2 Related Works

Existing deep learning finance methods. In recent years, various empirical approaches to applying state-of-the-art deep learning methods to finance have been proposed (Imajo et al., 2020; Ito et al., 2020; Buehler et al., 2019; Jay et al., 2020; Imaki et al., 2021; Jiang et al., 2017; Fons et al., 2020). Interested readers are referred to (Ozbayoglu et al., 2020) for detailed descriptions of existing works. However, we notice one crucial gap: the near-complete lack of theoretical analysis or motivation in this interdisciplinary field of AI-finance. This work makes an initial step toward bridging this gap. One theme of this work is that finance-oriented prior knowledge and inductive biases are required to design the relevant algorithms. For example, Ziyin et al. (2020) show that incorporating prior knowledge into architecture design is key to the success of neural networks and apply neural networks with periodic activation functions to the problem of financial index prediction. Imaki et al. (2021) show how to incorporate no-transaction prior knowledge into network architecture design when transaction cost is present.

Figure 1: Performance (measured by the Sharpe ratio) of various algorithms on MSFT (Microsoft) from 2018-2020. Directly applying generic machine learning methods, such as weight decay, fails to improve the vanilla model. The proposed method shows a significant improvement.

In fact, most generic and popular machine learning techniques were proposed and tested on standard ML tasks such as image classification or language processing. Directly applying ML methods that work for image tasks is unlikely to work well for financial tasks, where the nature of the data is different. See Figure 1, where we show the performance of a neural network directly trained to maximize wealth return on MSFT during 2019-2020. Using popular, generic deep learning techniques such as weight decay or dropout does not result in any improvement over the baseline. In contrast, our theoretically motivated method does. Combining the proposed method with weight decay can improve the performance a little further, but the improvement is much smaller than the improvement of the proposed method over the baseline. This implies that a generic machine learning method is unlikely to capture the inductive biases required to tackle a financial task. The present work proposes to fill this gap by showing how finance knowledge can be incorporated into algorithm design.

Data augmentation. Consider a training loss function of the additive form L=\frac{1}{N}\sum_{i}\ell(x_{i},y_{i}) for N pairs of training data points \{(x_{i},y_{i})\}_{i=1}^{N}. Data augmentation amounts to defining an underlying data-dependent distribution and generating new data points stochastically from it. A general way to define data augmentation is to start with a datum-level training loss and transform it into an expectation over an augmentation distribution P(z,g|(x_{i},y_{i})) (Dao et al., 2019), \ell(x_{i},y_{i})\to\mathbb{E}_{(z_{i},g_{i})\sim P(z,g|(x_{i},y_{i}))}[\ell(z_{i},g_{i})], and the total training loss function becomes

L_{\rm aug}=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{(z_{i},g_{i})\sim P(z,g|(x_{i},y_{i}))}[\ell(z_{i},g_{i})]. (1)

One common example of data augmentation is injecting isotropic Gaussian noise into the input (Shorten and Khoshgoftaar, 2019; Fons et al., 2020), which is equivalent to setting P(z,g|(x_{i},y_{i}))\propto\delta(g-y_{i})\exp\left[-(z-x_{i})^{\rm T}(z-x_{i})/(2\sigma^{2})\right] for some specified strength \sigma^{2}. Despite the ubiquity of data augmentation in deep learning, existing works are often empirical in nature (Fons et al., 2020; Zhong et al., 2020; Shorten and Khoshgoftaar, 2019; Antoniou et al., 2017). For a relevant example, Fons et al. (2020) empirically evaluate the effect of different types of data augmentation in a financial series prediction task. Dao et al. (2019) is one major recent theoretical work that tries to understand modern data augmentation; it shows that data augmentation is approximately equivalent to learning with a special kernel. He et al. (2019) argue that data augmentation can be seen as an effective regularization. However, no theoretically motivated data augmentation method for finance exists yet. One major challenge and achievement of this work is to develop a theory that bridges traditional finance theory and machine learning methods. In the next section, we introduce portfolio theory.
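To make Eq. (1) concrete, the following is a minimal numpy sketch (function names are ours) of the isotropic-Gaussian case: the expectation over the augmentation distribution is approximated by Monte Carlo, with labels kept fixed by the \delta(g-y_{i}) factor.

```python
import numpy as np

def augmented_loss(loss_fn, X, y, sigma=0.1, n_samples=32, rng=None):
    """Monte Carlo estimate of the augmented loss in Eq. (1) for the
    isotropic-Gaussian case: labels stay fixed (the delta(g - y_i) factor)
    while each input is replaced by z = x + eps, eps ~ N(0, sigma^2 I)."""
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(n_samples):
        Z = X + rng.normal(0.0, sigma, size=X.shape)  # one augmented copy of the batch
        total += loss_fn(Z, y)
    return total / n_samples

# A toy squared loss on scalar inputs; with sigma = 0 the augmented loss
# reduces to the ordinary training loss.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 3.0])
mse = lambda Z, t: float(np.mean((Z[:, 0] - t) ** 2))
```

In practice, one typically draws a single fresh augmented copy per gradient step rather than averaging many samples at once; the averaged form above matches the expectation in Eq. (1) more directly.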

3 Background: Markowitz Portfolio Theory

How to invest optimally in a financial market is the central concern of portfolio theory. One unfamiliar with portfolio theory may easily confuse the task of portfolio construction with wealth-maximizing trading or future price prediction. Before we introduce the portfolio theory, we first stress that the task of portfolio construction is not equivalent to wealth maximization or accurate price prediction. One can construct an optimal portfolio without predicting the price or maximizing the wealth increase.

Consider a market with an equity (a stock) and a fixed-interest-rate bond (a government bond). We denote the price of the equity at time step t as S_{t}; the price return is defined as r_{t}=\frac{S_{t+1}-S_{t}}{S_{t}}, a random variable with variance C_{t} and expected return g_{t}:=\mathbb{E}[r_{t}]. Our wealth at time step t is W_{t}=M_{t}+n_{t}S_{t}, where M_{t} is the amount of cash we hold and n_{t} is the number of shares of the stock we hold. As in the standard finance literature, we assume that shares are infinitely divisible. Usually, a positive n denotes holding (long) and a negative n denotes borrowing (short). The wealth we hold initially is W_{0}>0, and we would like to invest our money in the equity. We denote the relative value of the stock we hold as \pi_{t}=\frac{n_{t}S_{t}}{W_{t}}; \pi is called a portfolio. The central challenge in portfolio theory is to find the best \pi. At time t, our wealth is W_{t}; after one time step, our wealth changes due to a change in the price of the stock (setting the interest rate to 0): \Delta W_{t}:=W_{t+1}-W_{t}=W_{t}\pi_{t}r_{t}. The goal is to maximize the wealth return G_{t}:=\pi_{t}r_{t} at every time step while minimizing risk (it is important not to confuse the price return r_{t} with the wealth return G_{t}). The risk is defined as the variance of the wealth change:

R_{t}:=R(\pi_{t}):=\text{Var}_{r_{t}}[G_{t}]=\left(\mathbb{E}[r_{t}^{2}]-g_{t}^{2}\right)\pi_{t}^{2}=\pi_{t}^{2}C_{t}. (2)

The standard way to control risk is to introduce a "risk regularizer" that penalizes portfolios with a large risk (Markowitz, 1959; Rubinstein, 2002). (In principle, any function concave in G_{t} can serve as a risk regularizer by classical economic theory (Von Neumann and Morgenstern, 1947); one common alternative is R(G)=\log(G) (Kelly Jr, 2011), and our framework can easily be extended to such cases.) Introducing a parameter \lambda for the strength of regularization (the factor of 1/2 appears by convention), we can now write down our objective:

\pi_{t}^{*}=\arg\max_{\pi}U(\pi):=\arg\max_{\pi}\left[\pi^{\rm T}G_{t}-\frac{\lambda}{2}R(\pi)\right]. (3)

Here, U stands for the utility function; \lambda can be set to the desired level of risk-aversion. When g_{t} and C_{t} are known, this problem can be solved explicitly. However, one main problem in finance is that data is highly limited: we only observe one particular realized trajectory, and g_{t} and C_{t} are hard to estimate. This fact motivates the need for data augmentation and synthetic data generation in finance (Assefa, 2020). In this paper, we treat the case where there is only one asset to trade in the market, so the task of utility maximization amounts to finding the best balance between cash-holding and investment. The equity we treat is allowed to be a weighted combination of multiple stocks (the portfolio of some public fund manager, for example), so our formalism is not limited to single-stock situations. In Section C.1, we discuss portfolio theory with multiple stocks.
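For the single-asset case with known g_{t} and C_{t}, the explicit solution mentioned above follows from elementary calculus; the short sketch below (our own helper names) encodes the derivation: setting dU/d\pi = g - \lambda C \pi = 0 gives \pi^{*} = g/(\lambda C), with optimal utility g^{2}/(2\lambda C).

```python
def optimal_portfolio(g, C, lam):
    """Maximizer of the single-asset mean-variance utility
    U(pi) = pi*g - (lam/2)*pi^2*C.  Setting dU/dpi = g - lam*C*pi = 0
    gives pi* = g / (lam * C)."""
    return g / (lam * C)

def utility(pi, g, C, lam):
    """The utility of Eq. (3) for a single asset with known g and C."""
    return pi * g - 0.5 * lam * C * pi ** 2
```

For example, with g = 0.01, C = 0.04, and \lambda = 1, the optimal portfolio is \pi^{*} = 0.25 and the optimal utility is g^{2}/(2\lambda C) = 0.00125.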

4 Portfolio Construction as a Training Objective

Recent advances have shown that financial objectives can be interpreted as training losses for an appropriately inserted neural-network model (Ziyin et al., 2019; Buehler et al., 2019). It should come as no surprise that the utility function (3) can be interpreted as a loss function. When the goal is portfolio construction, we parametrize the portfolio \pi_{t}=\pi_{\mathbf{w}}(x_{t}) by a neural network with weights \mathbf{w}, and the utility maximization problem becomes a maximization over the weights of the network. The time-dependence is modeled through the input to the network x_{t}, which possibly consists of the available information at time t for determining the future price (it is helpful to imagine x_{t} as, for example, the prices of the stocks over the past 10 days). The objective function (to be maximized), combined with a pre-specified data augmentation transform x_{t}\to z_{t} with underlying distribution p(z|x_{t}), is then

\pi^{*}_{t}=\arg\max_{\mathbf{w}}\left\{\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{t}\left[G_{t}(\pi_{\mathbf{w}}(z_{t}))\right]-\lambda\text{Var}_{t}[G_{t}(\pi_{\mathbf{w}}(z_{t}))]\right\}, (4)

where \mathbb{E}_{t}:=\mathbb{E}_{z_{t}\sim p(z|x_{t})}. In this work, we abstract away the details of the neural network used to approximate \pi. We instead focus on studying the maximizers of this equation, which is a suitable choice when the underlying model is a neural network, because one primary motivation for using neural networks in finance is that they are universal approximators and are often expected to find such maximizers (Buehler et al., 2019; Imaki et al., 2021).

The ultimate financial goal is to construct \pi^{*} such that the utility function is maximized with respect to the true underlying distribution of S_{t}, which can be used as the generalization loss (to be maximized):

\pi^{*}_{t}=\arg\max_{\pi_{t}}\left\{\mathbb{E}_{S_{t}}\left[G_{t}(\pi)\right]-\lambda\text{Var}_{S_{t}}[G_{t}(\pi)]\right\}. (5)

Note that the difference between the expectations in Eq. (4) and (5) is that \mathbb{E}_{t} is computed with respect to the training set we hold, while \mathbb{E}_{S_{t}}:=\mathbb{E}_{S_{t}\sim p(S_{t})} is computed with respect to the underlying distribution of S_{t} given its previous prices. We use the same shorthands for \text{Var}_{t} and \text{Var}_{S_{t}}. Technically, the true utility we defined is an in-sample counterfactual objective, which roughly evaluates the expected utility to be obtained if we were to restart from yesterday; this is a relevant measure for financial decision making. In Section 4.5, we also analyze the out-of-sample performance when the portfolio is static.

4.1 Standard Models of Stock Prices

The expectations in the true objective, Equation (5), need to be taken with respect to the true underlying price generation process. In general, the price follows the stochastic process \Delta S_{t}=f(\{S_{i}\}_{i=1}^{t})+g(\{S_{i}\}_{i=1}^{t})\eta_{t} for a zero-mean, unit-variance random noise \eta_{t}; the term f reflects the short-term predictability of the stock price based on past prices, and g reflects the extent of unpredictability in the price. A key observation in finance is that g is non-stationary (heteroskedastic) and price-dependent (multiplicative). One model is the geometric Brownian motion (GBM)

S_{t+1}=(1+r)S_{t}+\sigma_{t}S_{t}\eta_{t}, (6)

which is taken as the minimal standard model of stock price motion (Mandelbrot, 1997; Black and Scholes, 1973); this paper also assumes the GBM as the underlying model. We note that the theoretical problem we consider can be seen as a discrete-time version of the classical Merton's portfolio problem (Merton, 1969). The more flexible Heston model (Heston, 1993) takes the form dS_{t}=rS_{t}dt+\sqrt{\nu_{t}}S_{t}dW_{t}, where \nu_{t} is the instantaneous volatility, which follows its own random walk, and dW_{t} is drawn from a Gaussian distribution. Despite their simplicity, the statistical properties of these models agree well with the known statistical properties of real financial markets (Drǎgulescu and Yakovenko, 2002). The readers are referred to (Karatzas et al., 1998) for a detailed discussion of the meaning and financial significance of these models.
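The discrete-time GBM of Eq. (6) is straightforward to simulate; the sketch below (our own helper, with a constant volatility \sigma as in the propositions that follow, and default parameters borrowed from the toy experiment of Figure 2) can be used to generate the kind of synthetic paths used in Section 6.

```python
import numpy as np

def simulate_gbm(S0=1.0, r=0.005, sigma=0.04, T=600, rng=None):
    """Discrete-time GBM of Eq. (6): S_{t+1} = (1+r)*S_t + sigma*S_t*eta_t,
    with eta_t ~ N(0, 1).  Returns the price path (S_0, ..., S_T)."""
    rng = np.random.default_rng(rng)
    S = np.empty(T + 1)
    S[0] = S0
    for t in range(T):
        S[t + 1] = (1.0 + r) * S[t] + sigma * S[t] * rng.standard_normal()
    return S
```

With sigma = 0 the path is deterministic, S_t = S_0 (1+r)^t, which matches the expected-value curve shown in black in Figure 2.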

4.2 No Data Augmentation

In practice, there is no way to observe more than one data point for a given stock at a given time t. This means that it can be very risky to train directly on the raw observed data, since nothing prevents the model from overfitting to the data. Without additional assumptions, the risk is zero because there is no randomness in the training set conditioned on the time t. To control this risk, we thus need data augmentation. One can formalize this intuition through the following proposition, whose proof is given in Section C.3.

Proposition 1.

(Utility of the no-data-augmentation strategy.) Let the price trajectory be generated by the GBM in Eq. (6) with initial price S_{0}. Then the true utility of the no-data-augmentation strategy is

U_{\rm no-aug}=[1-2\Phi(-r/\sigma)]r-\frac{\lambda}{2}\sigma^{2}, (7)

where U(\pi) is the utility function defined in Eq. (3) and \Phi is the c.d.f. of the standard normal distribution.

This means that the larger the volatility \sigma, the smaller the utility of the no-data-augmentation strategy. This is because the model may easily overfit the data when no data augmentation is used. In the next section, we discuss the case where a simple data augmentation is used.

4.3 Additive Gaussian Noise

While it is still far from clear how the stock price correlates with past prices, it is now well-recognized that \text{Var}_{S_{t}}[S_{t}|S_{t-1}]\neq 0 (Mandelbrot, 1997; Cont, 2001). This motivates a simple data augmentation technique: add some randomness to the observed financial sequence \{S_{1},...,S_{T+1}\}. This section analyzes a vanilla version of data augmentation that injects simple Gaussian noise, to be compared with a more sophisticated data augmentation method in the next section. Here, we inject random Gaussian noise \epsilon_{t}\sim\mathcal{N}(0,\rho^{2}) into S_{t} during the training process, such that z_{t}=S_{t}+\epsilon_{t}. Note that the noisified return needs to be defined carefully, since noise in the denominator may cause divergence; to avoid this problem, we define the noisified return as \tilde{r}_{t}:=\frac{z_{t+1}-z_{t}}{S_{t}}, i.e., we do not add noise to the denominator. Theoretically, we can find the optimal strength \rho^{*} of the Gaussian data augmentation such that the true utility function is maximized for a fixed training set. The result can be shown to be

(\rho^{*})^{2}=\frac{\sigma^{2}}{2r}\frac{\sum_{t}(r_{t}S_{t}^{2})^{2}}{\sum_{t}r_{t}S_{t}^{2}}. (8)

The fact that \rho^{*} depends on the prices of the whole trajectory reflects the fact that a time-independent data augmentation is not suitable for the stock price dynamics prescribed by Eq. (6), whose inherent noise \sigma S_{t}\eta_{t} is time-dependent through S_{t}. Finally, we can plug in the optimal \rho^{*} to obtain the optimal achievable strategy for the additive Gaussian noise augmentation. As before, the above discussion can be formalized, with the true utility given in the next proposition (proof in Section C.4).
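Given an observed path, the optimal strength in Eq. (8) can be evaluated directly; the helper below is our own sketch, which returns 0 when the denominator \sum_{t}r_{t}S_{t}^{2} is non-positive, mirroring the Heaviside factor that appears in Proposition 2.

```python
import numpy as np

def optimal_additive_strength(S, r, sigma):
    """Optimal variance (rho*)^2 of the time-independent additive Gaussian
    augmentation, Eq. (8), computed from an observed price path S; the r_t
    are the realized returns of the path.  A non-positive denominator
    corresponds to the Heaviside factor in Proposition 2, in which case no
    noise is injected."""
    S = np.asarray(S, dtype=float)
    rt = np.diff(S) / S[:-1]        # realized returns r_t
    w = rt * S[:-1] ** 2            # the r_t S_t^2 terms
    den = w.sum()
    if den <= 0.0:
        return 0.0
    return sigma ** 2 / (2 * r) * np.sum(w ** 2) / den
```

Note that \rho^{*} depends on the whole trajectory, as discussed above, so it must be recomputed for each training set.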

Proposition 2.

(Utility of the additive Gaussian noise strategy.) Under the additive Gaussian noise strategy, with the other conditions the same as in Proposition 1, the true utility is

U_{\rm Add}=\frac{r^{2}}{2\lambda\sigma^{2}T}\mathbb{E}_{S_{t}}\left[\frac{(\sum_{t}r_{t}S_{t})^{2}}{\sum_{t}(r_{t}S_{t})^{2}}\Theta\left(\sum_{t}r_{t}S_{t}^{2}\right)\right], (9)

where \Theta is the Heaviside step function.

4.4 Multiplicative Gaussian Noise

In this section, we derive a more general kind of data augmentation for the price trajectories specified by the GBM and the Heston model. From the previous discussion, one might expect that a better kind of augmentation should have \rho=\rho_{0}S_{t}, i.e., the injected noise should be multiplicative. However, we do not start by imposing \rho\to\rho S_{t}; instead, we consider \rho\to\rho_{t}, i.e., a general time-dependent noise. In the derivation, one finds an interesting relation for the optimal augmentation strength:

(\rho_{t+1}^{*})^{2}+(\rho_{t}^{*})^{2}=\frac{\sigma^{2}}{2r}r_{t}S_{t}^{2}. (10)

The following proposition gives the true utility of using this data augmentation (derivations in Section C.5).

Proposition 3.

(Utility of the general multiplicative Gaussian noise strategy.) Under the general multiplicative noise augmentation strategy, with the other conditions the same as in Proposition 1, the true utility is

U_{\rm mult}=\frac{r^{2}}{2\lambda\sigma^{2}}[1-\Phi(-r/\sigma)]. (11)

Combining the above propositions, we can prove the main theorem of this work (proof in Section C.6), which shows that the mean-variance utility of the proposed method is strictly higher than that of no data augmentation and that of additive Gaussian noise.

Theorem 1.

If \sigma\neq 0, then U_{\rm mult}>U_{\rm add} and U_{\rm mult}>U_{\rm no-aug} with probability 1.

Heston Model and Real Price Augmentation. We also consider the more general Heston model. The derivation proceeds similarly by replacing \sigma^{2}\to\nu_{t}^{2}; one arrives at the relation for the optimal augmentation: (\rho_{t+1}^{*})^{2}+(\rho_{t}^{*})^{2}=\frac{1}{2r}\nu^{2}_{t}r_{t}S_{t}^{2}. One quantity we do not know is the volatility \nu_{t}, which has to be estimated by averaging over the neighboring price returns. One central message of the above results is that one should add noise with variance proportional to r_{t}S_{t}^{2} to the observed prices when augmenting the training set.
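A minimal way to estimate \nu_{t} from neighboring returns is a trailing-window standard deviation; the sketch below is our own simple proxy (the window length is a free choice, not fixed by the theory), and more refined realized-volatility estimators exist in the literature.

```python
import numpy as np

def estimate_volatility(returns, window=20):
    """Rough estimate of the instantaneous volatility nu_t: the standard
    deviation of returns over a trailing window, i.e., an average over
    neighboring price returns.  Early entries use the shorter prefix."""
    r = np.asarray(returns, dtype=float)
    return np.array([r[max(0, t - window + 1):t + 1].std() for t in range(len(r))])
```

The resulting \hat{\nu}_{t} can then be plugged into the augmentation relation above in place of the unknown \nu_{t}.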

4.5 Stationary Portfolio

In the previous sections, we discussed the case where the portfolio is dynamic (time-dependent). One slight limitation of the previous theory is that it only compares the in-sample counterfactual performance of a dynamic portfolio. Here, we alternatively motivate the proposed data augmentation technique when the model is a stationary portfolio. One can show that, for a stationary portfolio, the proposed data augmentation technique gives the overall optimal performance.

Theorem 2.

Under the multiplicative data augmentation strategy, the in-sample counterfactual utility and the out-of-sample utility are optimal among all stationary portfolios.

Remark.

See Section C.8 for a detailed discussion and the proof. Stationary portfolios are important in financial theory and can be shown to be optimal even among all dynamic portfolios in some situations (Cover and Thomas, 2006; Merton, 1969). While restricting to stationary portfolios allows us to also compare out-of-sample performance, the limitation is that a stationary portfolio is less relevant for a deep learning model than the dynamic portfolios considered in the previous sections.

4.6 General Framework

So far, we have analyzed data augmentation for specific examples of the utility function and the augmentation distribution to argue that certain types of data augmentation are preferable. We now outline how this formulation can be generalized to a wider range of problems, with different utility functions and different data augmentations. This general framework can be used to derive alternative data augmentation schemes when one wants to maximize financial metrics other than the Sharpe ratio, such as the Sortino ratio (Estrada, 2006), or to incorporate regularization effects that take into account the heavy tails of the price distribution.

For a general utility function U=U(x,\pi), for some data point x that describes the current state of the market and a \pi that describes our strategy in this market state, we would ultimately like to maximize

\max_{\pi}V(\pi),\quad\text{for }V(\pi)=\mathbb{E}_{x}[U(x,\pi)]. (12)

However, observing only finitely many data points, we can only optimize the empirical loss with respect to some \theta-parametrized augmentation distribution P_{\theta}:

\hat{\pi}(\theta)=\arg\max_{\pi}\frac{1}{N}\sum_{i}^{N}\mathbb{E}_{z_{i}\sim p_{\theta}(z|x_{i})}[U(z_{i},\pi_{i})]. (13)

The problem we would like to solve is to find the effect of using such a data augmentation on the true utility V and then, if possible, to compare different data augmentations and identify the better one. Surprisingly, this is achievable, since V=V(\hat{\pi}(\theta)) now also depends on the parameter \theta of the data augmentation. Note that the true utility has to be computed with respect to both the sampling over the test points and the sampling over the N-sized training set:

V(\hat{\pi}(\theta))=\mathbb{E}_{x\sim p(x)}\mathbb{E}_{\{x_{i}\}^{N}\sim p^{N}(x)}[U(x,\hat{\pi}(\theta))]. (14)

In principle, this allows one to identify the best data augmentation for the problem at hand:

\theta^{*}=\arg\max_{\theta}V(\hat{\pi}(\theta))=\arg\max_{\theta}\mathbb{E}_{x\sim p(x)}\mathbb{E}_{\{x_{i}\}^{N}\sim p^{N}(x)}
\left[U\left(x,\arg\max_{\pi}\frac{1}{N}\sum_{i}^{N}\mathbb{E}_{z_{i}\sim p_{\theta}(z|x_{i})}[U(z_{i},\pi_{i})]\right)\right], (15)

and the analysis we performed in the previous sections is simply a special case of obtaining solutions to this maximization problem. Moreover, one can also compare two different parametric augmentation distributions; let their parameters be denoted \theta_{\alpha} and \theta_{\beta}, respectively. Then we say that data augmentation \alpha is better than \beta if and only if \max_{\theta_{\alpha}}V(\hat{\pi}(\theta_{\alpha}))>\max_{\theta_{\beta}}V(\hat{\pi}(\theta_{\beta})). This general formulation can also be applied outside the field of finance, because one can interpret the utility U as a standard machine learning loss function and \pi as the model output. The procedure also mimics that of finding a Bayes estimator in statistical decision theory (Wasserman, 2013), with \theta being the estimator we want to find; we outline an alternative general formulation for finding the "minimax" augmentation in Section C.2.
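When Eq. (15) has no closed form, \theta^{*} can be approximated by Monte Carlo; the following is a sketch under the assumption that the user supplies the four callables (`simulate_train`, `fit`, `true_utility`, and the candidate grid `thetas` are all our own placeholder names, not part of the paper's formalism).

```python
import numpy as np

def select_augmentation(thetas, simulate_train, fit, true_utility, n_trials=50, rng=None):
    """Monte Carlo sketch of Eq. (15): for each candidate augmentation
    parameter theta, repeatedly draw a training set, solve the augmented
    empirical problem of Eq. (13) with `fit`, and score the resulting
    strategy under the true utility V; return the best-scoring theta."""
    rng = np.random.default_rng(rng)
    scores = []
    for theta in thetas:
        total = 0.0
        for _ in range(n_trials):
            data = simulate_train(rng)        # one N-sized training set
            pi_hat = fit(data, theta, rng)    # hat{pi}(theta) of Eq. (13)
            total += true_utility(pi_hat)     # outer expectation, sampled
        scores.append(total / n_trials)
    return thetas[int(np.argmax(scores))]
```

The closed-form analyses in the previous sections correspond to carrying out this outer maximization analytically rather than by simulation.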

Figure 2: Experiment on geometric Brownian motion; S_{0}=1, r=0.005, \sigma=0.04. Left: examples of price trajectories in green; the black line shows the expected value of the price. Right: comparison with other related data augmentation techniques. The black dashed line shows the optimal achievable Sharpe ratio. We see that the proposed method stays close to optimality across a 600-step trading period, as the theory predicts.

5 Algorithms

Our results strongly motivate a specially designed data augmentation for financial data. For a data point consisting purely of past prices (S_{t},...,S_{t+L},S_{t+L+1}) and the associated returns (r_{t},...,r_{t+L-1},r_{t+L}), we use x=(S_{t},...,S_{t+L}) as the input to our model f, possibly a neural network, and use S_{t+L+1} as the unseen future price for computing the training loss. Our results suggest that we should randomly noisify both the input x and S_{t+L+1} at every training step by

\begin{cases}S_{i}\to S_{i}+c\sqrt{\hat{\sigma}^{2}_{i}|r_{i}|S_{i}^{2}}\,\epsilon_{i}&\text{for }S_{i}\in x;\\ S_{t+L+1}\to S_{t+L+1}+c\sqrt{\hat{\sigma}^{2}_{t+L}|r_{t+L}|S_{t+L}^{2}}\,\epsilon_{t+L+1};\end{cases} (16)

where the \epsilon_{i} are i.i.d. samples from \mathcal{N}(0,1), and c is a hyperparameter to be tuned. While the theory suggests that c should be 1/2, it is better to make it a tunable parameter in the algorithm design for flexibility; \hat{\sigma}_{t} is the instantaneous volatility, which can be estimated using standard methods in finance (Degiannakis and Floros, 2015). One might also absorb \hat{\sigma} into c.
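A minimal numpy sketch of Eq. (16) follows (function and parameter names are ours; \hat{\sigma} is treated as a scalar for simplicity, which the text above permits by absorbing it into c):

```python
import numpy as np

def augment_prices(S, c=0.5, sigma_hat=1.0, rng=None):
    """One random draw of the augmentation in Eq. (16): each price S_i is
    shifted by c * sqrt(sigma_hat^2 * |r_i| * S_i^2) * eps_i, eps_i ~ N(0,1);
    the final (future) price reuses the last observed return and price in
    its noise scale, as in the second line of Eq. (16)."""
    S = np.asarray(S, dtype=float)
    rng = np.random.default_rng(rng)
    r = np.diff(S) / S[:-1]                              # r_i = (S_{i+1} - S_i) / S_i
    scale = c * sigma_hat * np.sqrt(np.abs(r)) * S[:-1]  # noise std per input price
    scale = np.append(scale, scale[-1])                  # last price: |r_{t+L}| S_{t+L}^2
    return S + scale * rng.standard_normal(S.shape)
```

A fresh draw should be taken at every training step, so that the network never sees the same noisified trajectory twice.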

5.1 Using Returns as Inputs

Practically and theoretically, it is better, and standard, to use the returns x=(r_{t},...,r_{t+L-1}) as the input, and the algorithm then takes a simpler form:

\begin{cases}r_{i}\to r_{i}+c\sqrt{\hat{\sigma}^{2}_{i}|r_{i}|}\,\epsilon_{i}&\text{for }r_{i}\in x;\\ r_{t+L}\to r_{t+L}+c\sqrt{\hat{\sigma}^{2}_{t+L}|r_{t+L}|}\,\epsilon_{t+L+1}.\end{cases} (17)
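In return space the augmentation of Eq. (17) reduces to an elementwise operation; the helper below is our own sketch, again with a scalar volatility estimate.

```python
import numpy as np

def augment_returns(r, c=0.5, sigma_hat=1.0, rng=None):
    """Return-space augmentation of Eq. (17), applied elementwise:
    r -> r + c * sqrt(sigma_hat^2 * |r|) * eps, with eps ~ N(0, 1).
    A fresh draw should be used at every training step."""
    r = np.asarray(r, dtype=float)
    rng = np.random.default_rng(rng)
    return r + c * sigma_hat * np.sqrt(np.abs(r)) * rng.standard_normal(r.shape)
```

Note that the noise scale \sqrt{|r_{i}|} vanishes when the observed return is zero, so quiet periods are left almost unperturbed while volatile periods receive proportionally larger noise.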

5.2 Equivalent Regularization on the Output

One additional simplification can be made by noticing that the effect of injecting noise into r_{t+L} on the training loss is equivalent to a regularization. We show in Section B that, under the GBM model, the training objective can be written as

\arg\max_{\pi_{t}}\left\{\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{E}_{z}\left[G_{t}(\pi)\right]-\lambda c^{2}\hat{\sigma}_{t}^{2}|r_{t}|\pi_{t}^{2}\right)\right\}, (18)

where the expectation over z is now taken only with respect to the input. This means that the noise injection on r_{t+L} is equivalent to adding an L_{2} regularization on the model output \pi_{t}. This completes the main proposed algorithm of this work. We discuss a few potential variants in Section B. Also, it is well known that the magnitude of |r_{t}| has a strong time-correlation (i.e., a large |r_{t}| suggests a large |r_{t+1}|) (Lux and Marchesi, 2000; Cont and Bouchaud, 1997; Cont, 2007); this suggests that one can also smooth the |r_{t}| factor in the last term with the average of the neighboring returns over some time-window of width \tau: |r_{t}|\to|\hat{r}_{t}|=\frac{1}{\tau}\sum_{s=0}^{\tau-1}|r_{t-s}|. In our S&P500 experiments, we use this smoothing technique with \tau=20.
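The regularized objective and the smoothing of |r_{t}| can be sketched as follows (our own helper names; the expectation over the input noise z is omitted here, and \hat{\sigma} is a scalar):

```python
import numpy as np

def smoothed_abs_return(r, tau=20):
    """|r_t| -> |r_hat_t|: trailing average of |r| over a window of width
    tau, exploiting the time-correlation of return magnitudes."""
    a = np.abs(np.asarray(r, dtype=float))
    return np.array([a[max(0, t - tau + 1):t + 1].mean() for t in range(len(a))])

def regularized_objective(pi, r, lam=1.0, c=0.5, sigma_hat=1.0, tau=20):
    """Sketch of the training objective in Eq. (18), to be maximized: the
    average wealth return pi_t * r_t minus the output regularizer
    lam * c^2 * sigma_hat^2 * |r_hat_t| * pi_t^2."""
    pi = np.asarray(pi, dtype=float)
    r = np.asarray(r, dtype=float)
    penalty = lam * c ** 2 * sigma_hat ** 2 * smoothed_abs_return(r, tau) * pi ** 2
    return float(np.mean(pi * r - penalty))
```

In training, the negative of this quantity would serve as the loss, with \pi_{t} produced by the network from the (noisified) inputs.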

Table 1: Sharpe ratio on S&P 500 by sector; larger is better. Best performances in bold.
Industry Sectors | # Stocks | Merton | no aug. | weight decay | additive aug. | naive mult. | proposed
Communication Services | 9 | -0.06±0.04 | -0.06±0.04 | -0.06±0.27 | **0.22±0.18** | **0.20±0.21** | **0.33±0.16**
Consumer Discretionary | 39 | -0.01±0.03 | -0.07±0.03 | -0.06±0.10 | 0.48±0.10 | 0.41±0.09 | **0.64±0.08**
Consumer Staples | 27 | 0.05±0.03 | 0.24±0.03 | 0.23±0.11 | **0.36±0.08** | **0.34±0.09** | **0.35±0.07**
Energy | 17 | 0.07±0.03 | 0.03±0.03 | -0.02±0.12 | 0.70±0.09 | 0.52±0.10 | **0.91±0.10**
Financials | 46 | -0.57±0.04 | -0.61±0.03 | -0.61±0.09 | -0.06±0.10 | -0.13±0.09 | **0.18±0.08**
Health Care | 44 | 0.23±0.04 | 0.60±0.04 | 0.61±0.11 | **0.86±0.09** | **0.81±0.09** | **0.83±0.07**
Industrials | 44 | -0.09±0.03 | -0.11±0.03 | -0.11±0.08 | 0.36±0.08 | 0.28±0.08 | **0.48±0.08**
Information Technology | 41 | 0.41±0.04 | 0.41±0.04 | 0.41±0.11 | 0.67±0.10 | **0.74±0.11** | **0.79±0.09**
Materials | 19 | 0.07±0.03 | 0.06±0.03 | 0.03±0.14 | **0.47±0.13** | **0.43±0.13** | **0.53±0.10**
Real Estate | 22 | -0.14±0.04 | -0.39±0.03 | -0.40±0.12 | 0.05±0.10 | 0.05±0.09 | **0.19±0.07**
Utilities | 24 | -0.29±0.02 | -0.29±0.02 | -0.28±0.07 | -0.01±0.06 | -0.00±0.06 | **0.15±0.04**
S&P500 Avg. | 365 | -0.02±0.04 | -0.00±0.04 | -0.01±0.04 | 0.39±0.03 | 0.35±0.03 | **0.51±0.03**

6 Experiments

We validate our theoretical claim that using multiplicative noise with strength $\sqrt{r}$ is better than not using any data augmentation or using a data augmentation that is not suited to the nature of portfolio construction (such as an additive Gaussian noise). We emphasize that the purpose of this section is to demonstrate the relevance of our theory to real financial problems, not to establish the proposed method as a strong competitor in the industry. We start with a toy dataset that follows the theoretical assumptions and then move on to real data with S&P500 prices. The detailed experimental settings are given in Section A. Unless otherwise specified, we use a feedforward neural network with layer widths $10\to 64\to 64\to 1$ and ReLU activations, trained with the Adam optimizer at its default parameter settings with a minibatch size of 64 for 100 epochs.555In our initial experiments, we also experimented with different architectures (different depths and widths of the FNN, RNN, LSTM), and our conclusion that the proposed augmentation outperforms the specified baselines remains unchanged.
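For concreteness, the forward pass of this architecture can be sketched in plain NumPy (a minimal illustration; the actual experiments use a standard deep learning framework with Adam, and the names `init_mlp` and `forward` are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(widths=(10, 64, 64, 1)):
    """He-initialized weights for the 10 -> 64 -> 64 -> 1 network used in the experiments."""
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(widths[:-1], widths[1:])]

def forward(params, x):
    """ReLU feedforward pass; x has shape (batch, 10), output is one pi_t per sample."""
    h = np.asarray(x, dtype=float)
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:   # ReLU on hidden layers only
            h = np.maximum(h, 0.0)
    return h

pi = forward(init_mlp(), rng.standard_normal((64, 10)))  # one minibatch of 64
```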

We use the Sharpe ratio as the performance metric (the larger the better). The Sharpe ratio is defined as $SR_t=\frac{\mathbb{E}[\Delta W_t]}{\sqrt{\mathrm{Var}[\Delta W_t]}}$, a measure of profitability per unit of risk. We choose this metric because, in the framework of portfolio theory, it is the only theoretically motivated metric of success (Sharpe,, 1966). In particular, our theory is based on the maximization of the mean-variance utility in Eq. (3), and it is well known that maximizing the mean-variance utility is equivalent to maximizing the Sharpe ratio. In fact, it is a classical result in financial research that all optimal strategies must have the same Sharpe ratio (Sharpe,, 1964) (also called the efficient capital frontier). For the synthetic tasks, we can generate arbitrarily many test points to compare Sharpe ratios unambiguously. We then move to experiments on real stock price series; the limitation is that the Sharpe ratio must be estimated and thus involves one additional source of uncertainty.666We caution the reader not to confuse the problem of portfolio construction with the problem of financial price prediction. Portfolio construction is the primary focus of our work and is fundamentally different from price prediction; our method is not designed for, and cannot be used directly for, predicting future prices. As in real life, one does not need to predict prices to decide which stock to purchase.

6.1 Geometric Brownian Motion

We first experiment with stock prices generated by a GBM, as specified in Eq. (6). We generate a fixed price trajectory of length $T=400$ for training; each training point consists of a sequence of past prices $(S_t,\dots,S_{t+9},S_{t+10})$, where the first ten prices are used as the input to the model and $S_{t+10}$ is used for computing the loss.
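Such a trajectory can be generated with a few lines of NumPy (a sketch with illustrative drift and volatility values; `gbm_path` is our name, and the exact parameters of Eq. (6) are assumptions here):

```python
import numpy as np

def gbm_path(T=400, mu=0.0, sigma=0.2, s0=1.0, dt=1.0, seed=0):
    """Simulate a GBM price path: S_{t+1} = S_t * exp((mu - sigma^2/2) dt + sigma sqrt(dt) eps_t)."""
    rng = np.random.default_rng(seed)
    steps = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * rng.standard_normal(T - 1)
    return s0 * np.exp(np.concatenate(([0.0], np.cumsum(steps))))

S = gbm_path()
# each training point: ten input prices plus the next price used for the loss
points = [S[t:t + 11] for t in range(len(S) - 10)]
```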

Results and discussion. See Figure 2. The proposed method is plotted in blue. The right figure compares the proposed method with the other two baseline data augmentations we studied in this work. As the theory shows, the proposed method is optimal for this problem, achieving the optimal Sharpe ratio across a 600-step trading period. This directly confirms our theory.

6.2 S&P 500 Prices

This section demonstrates the relevance of the proposed algorithm to real market data. We use S&P500 data from 2016 to 2020, 1000 days in total, and test on the 365 stocks that remained in the S&P500 from 2000 to 2020. The first 800 days serve as the training set and the last 200 days as the test set. The model and training settings are similar to the previous experiment. We treat each stock as a single dataset and compare on all 365 stocks (namely, the evaluation is performed independently on 365 different datasets). Because the full result is too long, we report the average Sharpe ratio per industry sector (categorized according to GICS) and the average Sharpe ratio over all 365 datasets. See Sections A.1 and A.4 for more detail.

Results and discussion. See Table 1. Without data augmentation, the model performs poorly because it cannot assess the underlying risk. Weight decay does not improve the performance (when it is not outright deteriorating it); we hypothesize that this is because weight decay does not capture the inductive bias required for a financial series prediction task. Using any kind of data augmentation improves upon using none. Among these, the proposed method works best, possibly due to its better capability for risk control. In this experiment, we did not allow short selling; when short selling is allowed, the proposed method also works best (see Section A.4). In Section A.5.1, we also perform a case study demonstrating the learned portfolio's ability to avoid the market crash in 2020. We additionally compare with Merton's portfolio (Merton,, 1969), the classical optimal stationary portfolio constructed from the training data; it does not perform well either. This is because the market during 2019-2020 was volatile and quite different from the previous years, and a stationary portfolio cannot capture the nuances of changing market conditions. This shows that it is also important to leverage the flexibility and generalization properties of modern neural networks, alongside financial prior knowledge.

Refer to caption
Figure 3: Available portfolios and the market capital line (MCL). The black dots are the return-risk combinations of the original stocks; the orange dots are the learned portfolios. The MCL of the proposed method is lower than that of the original stocks, suggesting improved return and lower risk.

6.3 Comparison with Data Generation Method

One common alternative to direct data augmentation in the field is to generate additional realistic synthetic data using a GAN. While it is not the purpose of this work to propose an industrial-level method, and we do not claim that the proposed method outperforms previous methods, we provide one experimental comparison in Section A.5 for the task of portfolio construction. We compare our theoretically motivated technique with QuantGAN (Wiese et al.,, 2020), a major recent technique in financial data augmentation/generation. The experimental setting is the same as in the S&P500 experiment. The result shows that directly applying QuantGAN to the portfolio construction problem in our setting does not significantly improve over the no-augmentation baseline and achieves a much lower Sharpe ratio than our suggested method, possibly because QuantGAN is not designed for Sharpe ratio maximization.

6.4 Market Capital Lines

In this section, we link the results of the previous section to the concept of the market capital line (MCL) in the capital asset pricing model (Sharpe,, 1964), a foundational theory in classical finance. The MCL of a set of portfolios is the line of the best return-risk combinations achievable when these portfolios are combined with a risk-free asset such as a government bond; in our convention, an MCL with a smaller slope offers better return at lower risk and is preferable to an MCL that lies to its upper left in the return-risk plane. See Figure 3. The risk-free rate $r_0$ is set to $0.01$, roughly the average 1-year treasury yield from 2018 to 2020. The learned portfolios achieve a better MCL than the original stocks: the slope of the S&P500 MCL is roughly $0.53$, while that of the proposed method is $0.35$, i.e., much better return-risk combinations can be achieved with the proposed method. For example, if we set the acceptable amount of risk to $0.1$, the proposed method yields roughly $10\%$ more annual return than investing in the best stock in the market. This example also shows how tools from classical finance theory can be used to visualize and better understand machine learning methods applied to finance, a crucial point that many previous works lack.
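As an illustration, the slope comparison can be reproduced from a portfolio's return-risk pair. The sketch below assumes, consistent with the paper's "smaller slope is better" convention, that the slope is measured as risk per unit of excess return over the risk-free rate; the exact axis convention of Figure 3 is an assumption here, and `mcl_slope` is our name:

```python
def mcl_slope(mean_return, risk, r0=0.01):
    """Slope of the capital line through the risk-free point and (mean_return, risk),
    measured as risk per unit of excess return (smaller is better under this convention)."""
    return risk / (mean_return - r0)
```

For example, a portfolio with annual return 0.11 and risk 0.05 has slope 0.5 under this convention.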

6.5 Case Study

For completeness, we also present the performance of the proposed method during the market crash in February 2020 for the interested reader. See Section A.5.1.

7 Outlook

In this work, we have presented a theoretical framework for understanding and analyzing methods in deep-learning-based finance. The result is a machine learning algorithm that incorporates prior knowledge about the underlying financial processes. The good performance of the proposed method agrees with the standard expectation in machine learning that performance improves when the right inductive biases are incorporated. We have thus shown that building machine learning algorithms firmly rooted in financial theory can have a considerable and yet-to-be-realized benefit. We hope that our work motivates more research into the theoretical aspects of machine learning algorithms used in finance.

The limitation of the present work is obvious: we only considered data augmentations that take the form of noise injection. Other kinds of data augmentation may also be useful in finance; for example, Fons et al., (2020) empirically find that magnify, time warp, and SPAWNER (Um et al.,, 2017; Le Guennec et al.,, 2016; Kamycki et al.,, 2020) are helpful for financial series prediction, and it would be interesting to apply our theoretical framework to analyze these methods as well; a correct theoretical analysis of these methods is likely to advance both deep-learning-based techniques for finance and our fundamental understanding of the underlying financial and economic mechanisms. Meanwhile, our understanding of the underlying financial dynamics is also rapidly advancing; we foresee better methods being designed, and the proposed method may soon be replaced by better algorithms. This work potentially has positive social effects, since it is widely believed that better financial prediction methods can make the economy more efficient by eliminating arbitrage (Fama,, 1970); the cautionary note is that this work is solely academic research, should not be taken as investment advice, and readers should evaluate their own risk when applying the proposed method.

References

  • Antoniou et al., (2017) Antoniou, A., Storkey, A., and Edwards, H. (2017). Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340.
  • Assefa, (2020) Assefa, S. (2020). Generating synthetic data in finance: opportunities, challenges and pitfalls. Challenges and Pitfalls (June 23, 2020).
  • Black and Scholes, (1973) Black, F. and Scholes, M. (1973). The pricing of options and corporate liabilities. Journal of political economy, 81(3):637–654.
  • Bouchaud and Potters, (2009) Bouchaud, J. P. and Potters, M. (2009). Financial applications of random matrix theory: a short review.
  • Buehler et al., (2019) Buehler, H., Gonon, L., Teichmann, J., and Wood, B. (2019). Deep hedging. Quantitative Finance, 19(8):1271–1291.
  • Cont, (2001) Cont, R. (2001). Empirical properties of asset returns: stylized facts and statistical issues. Quantitative Finance.
  • Cont, (2007) Cont, R. (2007). Volatility clustering in financial markets: empirical facts and agent-based models. In Long memory in economics, pages 289–309. Springer.
  • Cont and Bouchaud, (1997) Cont, R. and Bouchaud, J.-P. (1997). Herd behavior and aggregate fluctuations in financial markets. arXiv preprint cond-mat/9712318.
  • Cover and Thomas, (2006) Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, New York, NY, USA.
  • Dao et al., (2019) Dao, T., Gu, A., Ratner, A., Smith, V., De Sa, C., and Ré, C. (2019). A kernel theory of modern data augmentation. In International Conference on Machine Learning, pages 1528–1537. PMLR.
  • Degiannakis and Floros, (2015) Degiannakis, S. and Floros, C. (2015). Methods of volatility estimation and forecasting. Modelling and Forecasting High Frequency Financial Data, pages 58–109.
  • Drǎgulescu and Yakovenko, (2002) Drǎgulescu, A. A. and Yakovenko, V. M. (2002). Probability distribution of returns in the heston model with stochastic volatility. Quantitative finance, 2(6):443–453.
  • Estrada, (2006) Estrada, J. (2006). Downside risk in practice. Journal of Applied Corporate Finance, 18(1):117–125.
  • Fama, (1970) Fama, E. F. (1970). Efficient capital markets: A review of theory and empirical work. The journal of Finance, 25(2):383–417.
  • Fons et al., (2020) Fons, E., Dawson, P., jun Zeng, X., Keane, J., and Iosifidis, A. (2020). Evaluating data augmentation for financial time series classification.
  • Goodhart and O’Hara, (1997) Goodhart, C. A. and O’Hara, M. (1997). High frequency data in financial markets: Issues and applications. Journal of Empirical Finance, 4(2-3):73–114.
  • He et al., (2019) He, Z., Xie, L., Chen, X., Zhang, Y., Wang, Y., and Tian, Q. (2019). Data augmentation revisited: Rethinking the distribution gap between clean and augmented data.
  • Heston, (1993) Heston, S. L. (1993). A closed-form solution for options with stochastic volatility with applications to bond and currency options. The review of financial studies, 6(2):327–343.
  • Imajo et al., (2020) Imajo, K., Minami, K., Ito, K., and Nakagawa, K. (2020). Deep portfolio optimization via distributional prediction of residual factors.
  • Imaki et al., (2021) Imaki, S., Imajo, K., Ito, K., Minami, K., and Nakagawa, K. (2021). No-transaction band network: A neural network architecture for efficient deep hedging. Available at SSRN 3797564.
  • Ito et al., (2020) Ito, K., Minami, K., Imajo, K., and Nakagawa, K. (2020). Trader-company method: A metaheuristic for interpretable stock price prediction.
  • Jay et al., (2020) Jay, P., Kalariya, V., Parmar, P., Tanwar, S., Kumar, N., and Alazab, M. (2020). Stochastic neural networks for cryptocurrency price prediction. IEEE Access, 8:82804–82818.
  • Jiang et al., (2017) Jiang, Z., Xu, D., and Liang, J. (2017). A deep reinforcement learning framework for the financial portfolio management problem. arXiv preprint arXiv:1706.10059.
  • Kamycki et al., (2020) Kamycki, K., Kapuscinski, T., and Oszust, M. (2020). Data augmentation with suboptimal warping for time-series classification. Sensors, 20(1):98.
  • Karatzas et al., (1998) Karatzas, I., Shreve, S. E., Karatzas, I., and Shreve, S. E. (1998). Methods of mathematical finance, volume 39. Springer.
  • Kelly Jr, (2011) Kelly Jr, J. L. (2011). A new interpretation of information rate. In The Kelly capital growth investment criterion: theory and practice, pages 25–34. World Scientific.
  • Le Guennec et al., (2016) Le Guennec, A., Malinowski, S., and Tavenard, R. (2016). Data augmentation for time series classification using convolutional neural networks. In ECML/PKDD workshop on advanced analytics and learning on temporal data.
  • Lim et al., (2019) Lim, B., Zohren, S., and Roberts, S. (2019). Enhancing time-series momentum strategies using deep neural networks. The Journal of Financial Data Science, 1(4):19–38.
  • Lux and Marchesi, (2000) Lux, T. and Marchesi, M. (2000). Volatility clustering in financial markets: a microsimulation of interacting agents. International journal of theoretical and applied finance, 3(04):675–702.
  • Mandelbrot, (1997) Mandelbrot, B. B. (1997). The variation of certain speculative prices. In Fractals and scaling in finance, pages 371–418. Springer.
  • Markowitz, (1959) Markowitz, H. (1959). Portfolio selection.
  • Merton, (1969) Merton, R. C. (1969). Lifetime portfolio selection under uncertainty: The continuous-time case. The review of Economics and Statistics, pages 247–257.
  • Ozbayoglu et al., (2020) Ozbayoglu, A. M., Gudelek, M. U., and Sezer, O. B. (2020). Deep learning for financial applications: A survey. Applied Soft Computing, page 106384.
  • Rubinstein, (2002) Rubinstein, M. (2002). Markowitz's "Portfolio Selection": A fifty-year retrospective. The Journal of Finance, 57(3):1041–1045.
  • Sharpe, (1964) Sharpe, W. F. (1964). Capital asset prices: A theory of market equilibrium under conditions of risk. The journal of finance, 19(3):425–442.
  • Sharpe, (1966) Sharpe, W. F. (1966). Mutual fund performance. The Journal of business, 39(1):119–138.
  • Shorten and Khoshgoftaar, (2019) Shorten, C. and Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):1–48.
  • Um et al., (2017) Um, T. T., Pfister, F. M., Pichler, D., Endo, S., Lang, M., Hirche, S., Fietzek, U., and Kulić, D. (2017). Data augmentation of wearable sensor data for parkinson’s disease monitoring using convolutional neural networks. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 216–220.
  • Von Neumann and Morgenstern, (1947) Von Neumann, J. and Morgenstern, O. (1947). Theory of games and economic behavior, 2nd rev.
  • Wasserman, (2013) Wasserman, L. (2013). All of statistics: a concise course in statistical inference. Springer Science & Business Media.
  • Wiese et al., (2020) Wiese, M., Knobloch, R., Korn, R., and Kretschmer, P. (2020). Quant gans: Deep generation of financial time series. Quantitative Finance, 20(9):1419–1440.
  • Zhang et al., (2020) Zhang, Z., Zohren, S., and Roberts, S. (2020). Deep learning for portfolio optimization. The Journal of Financial Data Science, 2(4):8–20.
  • Zhong et al., (2020) Zhong, Z., Zheng, L., Kang, G., Li, S., and Yang, Y. (2020). Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 13001–13008.
  • Ziyin et al., (2020) Ziyin, L., Hartwig, T., and Ueda, M. (2020). Neural networks fail to learn periodic functions and how to fix it. arXiv preprint arXiv:2006.08195.
  • Ziyin et al., (2019) Ziyin, L., Wang, Z., Liang, P., Salakhutdinov, R., Morency, L., and Ueda, M. (2019). Deep gamblers: Learning to abstain with portfolio theory. In Proceedings of the Neural Information Processing Systems Conference.

Appendix A Experiments

This section describes the additional experiments and the experimental details in the main text. The experiments are all done on a single TITAN RTX GPU. The S&P500 data is obtained from Alphavantage.777https://www.alphavantage.co/documentation/ The code will be released on github.

A.1 Dataset Construction

For all tasks, we observe a single trajectory of a single stock's prices $S_1,\dots,S_T$. For the toy tasks, $T=400$; for the S&P500 task, $T=800$. We then transform this into $T-L$ input-target pairs $\{(x_i,y_i)\}_{i=1}^{T-L}$, where

\begin{cases}x_{i}=(S_{i},\dots,S_{i+L-1});\\ y_{i}=S_{i+L}.\end{cases} (19)

$x_i$ is used as the model input during training; $y_i$ is the unseen future price used to compute the loss. For the toy tasks, $L=10$; for the S&P500 task, $L=15$. In simple words, we use the most recent $L$ prices to construct the next-step portfolio.
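The windowing described above can be sketched as follows (a minimal NumPy illustration using the convention that the $L$ most recent prices predict the next one; `make_pairs` is our name):

```python
import numpy as np

def make_pairs(S, L):
    """Build the T - L input-target pairs: x_i = (S_i, ..., S_{i+L-1}), y_i = S_{i+L},
    using 0-based indexing on the price array S."""
    S = np.asarray(S, dtype=float)
    X = np.stack([S[i:i + L] for i in range(len(S) - L)])
    y = S[L:]
    return X, y

X, y = make_pairs(np.arange(1.0, 401.0), L=10)  # toy task: T = 400, L = 10
```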

A.2 Sharpe Ratio for S&P500

The empirical Sharpe ratios are calculated in the standard way (the same as, for example, in (Ito et al.,, 2020; Imajo et al.,, 2020)). Given a wealth trajectory $W_1,\dots,W_T$ of a strategy $\pi$, the empirical Sharpe ratio is estimated as

R_{i}=\frac{W_{i+1}}{W_{i}}-1; (20)
\hat{M}=\frac{1}{T-1}\sum_{i=1}^{T-1}R_{i}; (21)
\widehat{SR}=\frac{\hat{M}}{\sqrt{\frac{1}{T-1}\sum_{i=1}^{T-1}R_{i}^{2}-\hat{M}^{2}}}=\frac{\text{average wealth return}}{\text{std. of wealth return}}, (22)

and $\widehat{SR}$ is the reported Sharpe ratio for the S&P500 experiments.
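This estimator can be implemented in a few lines (a minimal NumPy sketch; `empirical_sharpe` is our name):

```python
import numpy as np

def empirical_sharpe(W):
    """Empirical Sharpe ratio of a wealth trajectory W_1, ..., W_T:
    mean of per-step wealth returns divided by their standard deviation."""
    W = np.asarray(W, dtype=float)
    R = W[1:] / W[:-1] - 1.0                  # per-step wealth returns R_i
    M = R.mean()
    return M / np.sqrt(np.mean(R**2) - M**2)  # mean over std of returns
```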

A.3 Variance of Sharpe Ratio

We do not report an uncertainty for the single-stock Sharpe ratios, but the uncertainty is easy to estimate. The Sharpe ratio is estimated over a period of $T$ time steps. For the S&P500 stocks, $T=200$, and by the law of large numbers, the estimated mean $\hat{M}$ has variance roughly $\sigma^2/T$, where $\sigma$ is the true volatility; the same holds for the estimated standard deviation. Therefore, the estimated Sharpe ratio can be written as

\widehat{SR} =\frac{\hat{M}}{\sqrt{\frac{1}{T}\sum_{i=1}^{T}R_{i}^{2}-\hat{M}^{2}}} (23)
=\frac{M+\frac{\sigma}{\sqrt{T}}\epsilon}{\sigma+\frac{c}{\sqrt{T}}\eta} (24)
\approx\frac{M+\frac{\sigma}{\sqrt{T}}\epsilon}{\sigma}=\frac{M}{\sigma}+\frac{1}{\sqrt{T}}\epsilon (25)

where $\epsilon$ and $\eta$ are zero-mean random variables with unit variance. This shows that the uncertainty in the estimated $\widehat{SR}$ is approximately $1/\sqrt{T}\approx 0.07$ for each single stock, which is often much smaller than the difference between methods.
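The $1/\sqrt{T}$ scale of this uncertainty can be checked numerically; the following Monte Carlo sketch (our own illustration, with i.i.d. unit-variance returns of mean $0.1$) measures the spread of the Sharpe ratio estimator at $T=200$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, trials = 200, 2000

# i.i.d. returns with mean 0.1 and unit variance, so the true Sharpe ratio is 0.1
estimates = []
for _ in range(trials):
    R = 0.1 + rng.standard_normal(T)
    M = R.mean()
    estimates.append(M / np.sqrt(np.mean(R**2) - M**2))

spread = np.std(estimates)  # should be close to 1 / sqrt(T) ~ 0.07
```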

A.4 S&P500 Experiment

This section gives more results and discussion of the S&P500 experiment.

A.4.1 Underperformance of Weight Decay

This section gives the details of the comparison made in Figure 1. The experimental setting is the same as in the S&P500 experiments. For illustration and motivation, we only show the result on MSFT (Microsoft); most other stocks give a qualitatively similar plot.

See Figure 1, where we show the performance of directly training a neural network to maximize wealth return on MSFT during 2018-2020. Popular generic deep learning techniques such as weight decay and dropout do not improve on the baseline, while our theoretically motivated method does. Combining the proposed method with weight decay can improve performance a little further, but that improvement is much smaller than the improvement of the proposed method over the baseline. This implies that generic machine learning techniques are unlikely to capture the inductive bias required to process a financial task.

In the plot, we did not interpolate the dropout rate between large and small $p$; the result is similar to the weight decay case in our experiments.

Refer to caption
Refer to caption
Figure 4: Case study of the performance of the model on MSFT from 2019 August to 2020 May. We see that the model learns to invest less and less as the price of the stock rises to an unreasonable level, thus avoiding the high risk of the market crash in February 2020.

A.5 Comparison with GAN

Sector Q-GAN Aug. Ours
Comm. Services 0.07±0.020.07_{\pm 0.02} 0.33±0.160.33_{\pm 0.16}
Consumer Disc. 0.00±0.030.00_{\pm 0.03} 0.64±0.080.64_{\pm 0.08}
Consumer Staples 0.10±0.020.10_{\pm 0.02} 0.35±0.070.35_{\pm 0.07}
Energy 0.08±0.050.08_{\pm 0.05} 0.91±0.100.91_{\pm 0.10}
Financials 0.22±0.03-0.22_{\pm 0.03} 0.18±0.080.18_{\pm 0.08}
Health Care 0.35±0.030.35_{\pm 0.03} 0.83±0.070.83_{\pm 0.07}
Industrials 0.12±0.03-0.12_{\pm 0.03} 0.48±0.080.48_{\pm 0.08}
Information Tech. 0.28±0.040.28_{\pm 0.04} 0.79±0.090.79_{\pm 0.09}
Materials 0.03±0.03-0.03_{\pm 0.03} 0.53±0.100.53_{\pm 0.10}
Real Estate 0.35±0.03-0.35_{\pm 0.03} 0.19±0.070.19_{\pm 0.07}
Utilities 0.20±0.02-0.20_{\pm 0.02} 0.15±0.040.15_{\pm 0.04}
S&P500 Avg. 0.05±0.030.05_{\pm 0.03} 0.51±0.03\mathbf{0.51}_{\pm 0.03}
Figure 5: Performance (Sharpe ratio) when the training set is augmented with QuantGAN.

We stress that the goal of this paper is not to compare methods but to understand, from the perspective of classical finance theory, why certain methods work or fail. With this caveat stated, we compare our theoretically motivated technique with QuantGAN, a major recent technique in financial data augmentation/generation. We first train a QuantGAN on each stock trajectory, use it to generate 10000 additional data points to augment the original training set (which contains roughly 1000 data points per stock), and train the same feedforward net described in the main text; this net is then used for evaluation. QuantGAN is implemented as closely to the original paper as possible, using a temporal convolutional network trained with RMSProp for 50 epochs. Other experimental settings are the same as those of the S&P500 experiment in the manuscript. See Figure 5, and compare with Table 1 in the manuscript. Directly applying QuantGAN to the portfolio construction problem in our setting does not significantly improve over the no-augmentation baseline and achieves a much lower Sharpe ratio than our suggested method, possibly because QuantGAN is not designed for Sharpe ratio maximization.

A.5.1 Case Study

In this section, we qualitatively study the behavior of the learned portfolio of the proposed method. The model is trained as in the other S&P500 experiments. See Figure 4. We see that the model learns to invest less and less as the stock price rises to an excessive level, thus avoiding the high risk of the market crash in February 2020. This avoidance demonstrates the effectiveness of the proposed method qualitatively.

A.5.2 List of Symbols for S&P500

The following are the stock symbols used for the S&P500 experiments, each enclosed in quotation marks.

[’A’ ’AAPL’ ’ABC’ ’ABT’ ’ADBE’ ’ADI’ ’ADM’ ’ADP’ ’ADSK’ ’AEE’ ’AEP’ ’AES’ ’AFL’ ’AIG’ ’AIV’ ’AJG’ ’AKAM’ ’ALB’ ’ALL’ ’ALXN’ ’AMAT’ ’AMD’ ’AME’ ’AMG’ ’AMGN’ ’AMT’ ’AMZN’ ’ANSS’ ’AON’ ’AOS’ ’APA’ ’APD’ ’APH’ ’ARE’ ’ATVI’ ’AVB’ ’AVY’ ’AXP’ ’AZO’ ’BA’ ’BAC’ ’BAX’ ’BBT’ ’BBY’ ’BDX’ ’BEN’ ’BIIB’ ’BK’ ’BKNG’ ’BLK’ ’BLL’ ’BMY’ ’BRK.B’ ’BSX’ ’BWA’ ’BXP’ ’C’ ’CAG’ ’CAH’ ’CAT’ ’CB’ ’CCI’ ’CCL’ ’CDNS’ ’CERN’ ’CHD’ ’CHRW’ ’CI’ ’CINF’ ’CL’ ’CLX’ ’CMA’ ’CMCSA’ ’CMI’ ’CMS’ ’CNP’ ’COF’ ’COG’ ’COO’ ’COP’ ’COST’ ’CPB’ ’CSCO’ ’CSX’ ’CTAS’ ’CTL’ ’CTSH’ ’CTXS’ ’CVS’ ’CVX’ ’D’ ’DE’ ’DGX’ ’DHI’ ’DHR’ ’DIS’ ’DLTR’ ’DOV’ ’DRE’ ’DRI’ ’DTE’ ’DUK’ ’DVA’ ’DVN’ ’EA’ ’EBAY’ ’ECL’ ’ED’ ’EFX’ ’EIX’ ’EL’ ’EMN’ ’EMR’ ’EOG’ ’EQR’ ’EQT’ ’ES’ ’ESS’ ’ETFC’ ’ETN’ ’ETR’ ’EW’ ’EXC’ ’EXPD’ ’F’ ’FAST’ ’FCX’ ’FDX’ ’FE’ ’FFIV’ ’FISV’ ’FITB’ ’FL’ ’FLIR’ ’FLS’ ’FMC’ ’FRT’ ’GD’ ’GE’ ’GILD’ ’GIS’ ’GLW’ ’GPC’ ’GPS’ ’GS’ ’GT’ ’GWW’ ’HAL’ ’HAS’ ’HBAN’ ’HCP’ ’HD’ ’HES’ ’HIG’ ’HOG’ ’HOLX’ ’HON’ ’HP’ ’HPQ’ ’HRB’ ’HRL’ ’HRS’ ’HSIC’ ’HST’ ’HSY’ ’HUM’ ’IBM’ ’IDXX’ ’IFF’ ’INCY’ ’INTC’ ’INTU’ ’IP’ ’IPG’ ’IRM’ ’IT’ ’ITW’ ’IVZ’ ’JBHT’ ’JCI’ ’JEC’ ’JNJ’ ’JNPR’ ’JPM’ ’JWN’ ’K’ ’KEY’ ’KLAC’ ’KMB’ ’KMX’ ’KO’ ’KR’ ’KSS’ ’KSU’ ’L’ ’LB’ ’LEG’ ’LEN’ ’LH’ ’LLY’ ’LMT’ ’LNC’ ’LNT’ ’LOW’ ’LRCX’ ’LUV’ ’M’ ’MAA’ ’MAC’ ’MAR’ ’MAS’ ’MAT’ ’MCD’ ’MCHP’ ’MCK’ ’MCO’ ’MDT’ ’MET’ ’MGM’ ’MHK’ ’MKC’ ’MLM’ ’MMC’ ’MMM’ ’MNST’ ’MO’ ’MOS’ ’MRK’ ’MRO’ ’MS’ ’MSFT’ ’MSI’ ’MTB’ ’MTD’ ’MU’ ’MYL’ ’NBL’ ’NEE’ ’NEM’ ’NI’ ’NKE’ ’NKTR’ ’NOC’ ’NOV’ ’NSC’ ’NTAP’ ’NTRS’ ’NUE’ ’NVDA’ ’NWL’ ’O’ ’OKE’ ’OMC’ ’ORCL’ ’ORLY’ ’OXY’ ’PAYX’ ’PBCT’ ’PCAR’ ’PCG’ ’PEG’ ’PEP’ ’PFE’ ’PG’ ’PGR’ ’PH’ ’PHM’ ’PKG’ ’PKI’ ’PLD’ ’PNC’ ’PNR’ ’PNW’ ’PPG’ ’PPL’ ’PRGO’ ’PSA’ ’PVH’ ’PWR’ ’PXD’ ’QCOM’ ’RCL’ ’RE’ ’REG’ ’REGN’ ’RF’ ’RJF’ ’RL’ ’RMD’ ’ROK’ ’ROP’ ’ROST’ ’RRC’ ’RSG’ ’SBAC’ ’SBUX’ ’SCHW’ ’SEE’ ’SHW’ ’SIVB’ ’SJM’ ’SLB’ ’SLG’ ’SNA’ ’SNPS’ ’SO’ ’SPG’ ’SRCL’ ’SRE’ ’STT’ ’STZ’ ’SWK’ ’SWKS’ ’SYK’ ’SYMC’ ’SYY’ ’T’ ’TAP’ ’TGT’ ’TIF’ ’TJX’ ’TMK’ ’TMO’ ’TROW’ ’TRV’ ’TSCO’ ’TSN’ ’TTWO’ ’TXN’ 
’TXT’ ’UDR’ ’UHS’ ’UNH’ ’UNM’ ’UNP’ ’UPS’ ’URI’ ’USB’ ’UTX’ ’VAR’ ’VFC’ ’VLO’ ’VMC’ ’VNO’ ’VRSN’ ’VRTX’ ’VTR’ ’VZ’ ’WAT’ ’WBA’ ’WDC’ ’WEC’ ’WFC’ ’WHR’ ’WM’ ’WMB’ ’WMT’ ’WY’ ’XEL’ ’XLNX’ ’XOM’ ’XRAY’ ’XRX’ ’YUM’ ’ZION’]

Appendix B Additional Discussion of the Proposed Algorithm

B.1 Derivation

We first derive Equation 18. The original training loss is

\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{t}\left[G_{t}(\pi)\right]-\lambda\text{Var}_{t}[G_{t}(\pi)]. (26)

The last term can be written as

\lambda\text{Var}_{t}[G_{t}(\pi)] =\lambda\left(\mathbb{E}_{z_{1},...,z_{t}}[z_{t}^{2}\pi_{t}^{2}]-\mathbb{E}_{z_{1},...,z_{t}}[z_{t}\pi_{t}]^{2}\right) (27)
=\lambda\left(\mathbb{E}_{z_{t}}[z_{t}^{2}]\mathbb{E}_{z_{1},...,z_{t-1}}[\pi_{t}^{2}]-\mathbb{E}_{z_{t}}[z_{t}]^{2}\mathbb{E}_{z_{1},...,z_{t-1}}[\pi_{t}]^{2}\right) (28)
=\lambda r_{t}^{2}\mathbb{E}_{z_{1},...,z_{t-1}}[\pi_{t}^{2}]+\lambda c^{2}\hat{\sigma}_{t}^{2}|r_{t}|\mathbb{E}_{z_{1},...,z_{t-1}}[\pi_{t}^{2}]-\lambda r_{t}^{2}\mathbb{E}_{z_{1},...,z_{t-1}}[\pi_{t}]^{2} (29)
=\lambda r_{t}^{2}\text{Var}_{z_{1},...,z_{t-1}}[\pi_{t}]+\lambda c^{2}\hat{\sigma}_{t}^{2}|r_{t}|\mathbb{E}_{z_{1},...,z_{t-1}}[\pi_{t}^{2}], (30)

where the second line uses the independence of $z_t$ from $z_1,\dots,z_{t-1}$, and the third uses $\mathbb{E}[z_t^2]=r_t^2+c^2\hat{\sigma}_t^2|r_t|$ and $\mathbb{E}[z_t]=r_t$.

Plugging this back into Eq. (26) leads to the following maximization problem, which is the desired equation:

\arg\max_{b_{t}}\left\{\frac{1}{T}\sum_{t=1}^{T}\underbrace{\mathbb{E}_{x}\left[G_{t}(\pi)\right]}_{A:\text{ wealth gain}}-\underbrace{\lambda r_{t}^{2}\text{Var}_{z_{1},...,z_{t-1}}[\pi_{t}]}_{B:\text{ risk due to uncertainty in past prices}}-\underbrace{\lambda c^{2}\hat{\sigma}_{t}^{2}|r_{t}|\mathbb{E}_{z_{1},...,z_{t-1}}[\pi_{t}^{2}]}_{C:\text{ risk due to future price}}\right\}, (31)

where we have given each term a name for reference; the expectation is taken with respect to the augmented data points $z_i$:

r_{i}\to z_{i}=r_{i}+c\sqrt{\hat{\sigma}^{2}_{i}|r_{i}|}\epsilon_{i}\quad\text{for }r_{i}\in x. (32)

Under the GBM model (or when the optimal portfolio depends only weakly on $z_1,\dots,z_{t-1}$), the optimal $\pi_t$ does not depend on $z_1,\dots,z_{t-1}$, and the objective can be further simplified to

\arg\max_{b_{t}}\left\{\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{z}\left[G_{t}(\pi)\right]-\lambda c^{2}\hat{\sigma}_{t}^{2}|r_{t}|\pi_{t}^{2}\right\}; (33)

the first term is the data augmentation for the wealth gain, and the second term is a regularization for risk control. Most of the experiments in this paper use this equation for training. When it does not work well, readers are encouraged to try the full training objective in Equation 31.
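The input-noise transformation of Eq. (32) can be sketched as follows (a minimal NumPy illustration; `augment_returns` is our name, and `sigma2_hat` denotes the estimated volatility $\hat{\sigma}_i^2$, assumed to be given):

```python
import numpy as np

def augment_returns(r, sigma2_hat, c=1.0, rng=None):
    """Inject noise z_i = r_i + c * sqrt(sigma_i^2 * |r_i|) * eps_i, eps_i ~ N(0, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    r = np.asarray(r, dtype=float)
    return r + c * np.sqrt(sigma2_hat * np.abs(r)) * rng.standard_normal(r.shape)
```

Setting $c=0$ recovers the unaugmented returns, and the injected noise is unbiased, so the augmented returns average back to the observed ones.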

B.2 Extension to Multi-Asset Setting

It is possible and interesting to derive the data augmentation for a multi-asset setting. However, this is hindered by the lack of a standard model describing the co-movement of multiple stocks; for example, it is unclear what the analogue of geometric Brownian motion should be in a multi-stock setting. We therefore tentatively suggest the following form for the injected noise, whose effectiveness and theoretical justification are left for future work. Let $\mathbf{S}_{t}=(S_{1,t},\dots,S_{N,t})$ be the prices of the stocks viewed as an $N$-dimensional vector. Assuming the return has covariance $\Sigma$, then, by analogy with the result of this work:

S_{i,t}\to S_{i,t}+c\sqrt{\sum_{j}\Sigma_{ij}|r_{j,t}|S_{j,t}^{2}}\,\epsilon_{i,t} (34)

for some white Gaussian noise $\epsilon$ and a tunable parameter $c$. The matrix $\Sigma$ has to be estimated from the data using standard methods for estimating multi-stock volatility.
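A minimal sketch of this tentative multi-asset noise follows. The function name and interface are ours, and the clipping of the per-asset variance is an added safeguard (negative covariances can otherwise make the term under the square root negative); $\Sigma$ is assumed to be an already-estimated $N \times N$ return covariance matrix.

```python
import numpy as np

def augment_prices_multi(S, r, Sigma, c=0.1, rng=None):
    """Tentative Eq. (34): S_i -> S_i + c * sqrt(sum_j Sigma_ij |r_j| S_j^2) * eps_i.
    Sigma is an estimated N x N return covariance matrix."""
    rng = np.random.default_rng(rng)
    S = np.asarray(S, dtype=float)
    r = np.asarray(r, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    # per-asset noise variance: sum_j Sigma_ij * |r_j| * S_j^2
    noise_var = Sigma @ (np.abs(r) * S**2)
    # guard against small negative values caused by negative covariances
    noise_var = np.clip(noise_var, 0.0, None)
    return S + c * np.sqrt(noise_var) * rng.standard_normal(S.shape)
```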

B.3 Non-Gaussian Noise

While the proposed algorithm uses Gaussian noise for injection, the theory developed in this work only requires the noise to have a finite second moment; therefore, any other such distribution works equally well for the particular utility function specified in Eq. (5). This is because this utility function involves only the second moment of the wealth return and is indifferent to higher moments. This points to one caveat of the present theory: when the utility function involves higher moments of the wealth return, the utility of a given type of noise injection is no longer indifferent to the choice of the injection distribution. Practitioners are recommended to analyze the specific utility function they use within our framework and to choose the strength and distribution of the injected noise accordingly.

Appendix C Additional Theory and Proofs

This section contains all the additional theoretical discussions and the proofs.

C.1 Background: Classical Solution to the Portfolio Construction Problem

Consider a market with $N$ stocks. We denote the prices of these stocks by $S_t\in\mathbb{R}^N$, and one can likewise define the stock price return as

r_{t}=\frac{S_{t+1}-S_{t}}{S_{t}}, (35)

which is a random variable with covariance matrix $C_t:=\text{Cov}[r_t]$ and expected return $g_t:=\mathbb{E}[r_t]$. Our wealth at time step $t$ is $W_t=M_t+\sum_i^N n_{t,i}S_{t,i}:=n_t^{\rm T}S_t$, where $M_t$ is the amount of cash we hold and $n_{t,i}$ is the number of shares we hold of the $i$-th stock; we also define the vectors $n_t:=(n_{t,1},\dots,n_{t,N},1)$ and $S_t:=(S_{t,1},\dots,S_{t,N},M_t)$, so that cash is included in the definition of $n_t$. As in the standard finance literature, we assume that shares are infinitely divisible; usually, a positive $n_{t,i}$ denotes holding (long) and a negative $n_{t,i}$ denotes borrowing (short). Our initial wealth is $W_0>0$, and we would like to invest it in the $N$ stocks; we denote the relative value of each stock we hold as a vector $\pi_t\in\mathbb{R}^{N+1}$ with $\pi_{t,i}=\frac{n_{t,i}S_{t,i}}{W_t}$; $\pi$ is called a portfolio, and the central challenge of portfolio theory is to find the best $\pi$. At time $t$, our wealth is $W_t$; after one time step, our wealth changes due to a change in the prices of the stocks: $\Delta W_t:=W_{t+1}-W_t=W_t\pi_t\cdot r_t$. The standard goal is to maximize the wealth return $G_t:=\pi_t\cdot r_t$ at every time step while minimizing risk888It is important not to confuse the price return $r_t$ with the wealth return $G_t$.. The risk is defined as the variance of the wealth change999In principle, any concave function of $G_t$ can serve as a risk regularizer from classical economic theory (Von Neumann and Morgenstern,, 1947), and our framework can be easily extended to such cases; one common alternative is $R(G)=\log(G)$.:

Rt:=R(πt):=Varrt[Gt]=πtT(𝔼[rtrtT]gtgtT)πt=πtTCtπt.R_{t}:=R(\pi_{t}):=\text{Var}_{r_{t}}[G_{t}]=\pi_{t}^{\rm T}\left(\mathbb{E}[r_{t}r_{t}^{\rm T}]-g_{t}g_{t}^{\rm T}\right)\pi_{t}=\pi_{t}^{\rm T}C_{t}\pi_{t}. (36)

The standard way to control risk is to introduce a “risk regularizer” that penalizes portfolios with large risk (Markowitz,, 1959; Rubinstein,, 2002). Introducing a parameter $\lambda$ for the strength of the regularization (the factor of $1/2$ appears by convention), we can now write down our objective:

\pi_{t}^{*}=\arg\max_{\pi}U(\pi):=\arg\max_{\pi}\left[\pi^{\rm T}g_{t}-\frac{\lambda}{2}R(\pi)\right]. (37)

Here, $U$ stands for the utility function, and $\lambda$ can be set to the desired level of risk aversion. When $g_t$ and $C_t$ are known, this problem can be solved explicitly (see below). However, a central problem in finance is that data is very limited: we only observe one particular realized trajectory, and therefore $g_t$ and $C_t$ cannot be accurately estimated.

Eq. (37) can be solved directly by taking the derivative and setting it to zero; we obtain

{πt=1λCt1gt;Rt(πt)=1λ2gtTCt1gt.\begin{cases}\pi_{t}^{*}=\frac{1}{\lambda}C_{t}^{-1}g_{t};\\ R_{t}(\pi_{t}^{*})=\frac{1}{\lambda^{2}}g_{t}^{\rm T}C_{t}^{-1}g_{t}.\end{cases} (38)

This is the standard formula to use when both $g_t$ and $C_t$ are known or can be accurately estimated (Bouchaud and Potters,, 2009). Meanwhile, when one has difficulty estimating $g_t$ or, more importantly, $C_t$, the above formula can go arbitrarily wrong. Let $\hat{C}$ denote the estimated covariance and $\hat{g}$ the estimated mean101010For example, using some Bayesian machine learning model., then the in-sample risk and the true risk are given, respectively, by

{R^t(πt)=1λ2g^tTC^t1g^t;Rt(πt)=1λ2g^tTC^t1CtC^t1g^t.\begin{cases}\hat{R}_{t}(\pi_{t}^{*})=\frac{1}{\lambda^{2}}\hat{g}_{t}^{\rm T}\hat{C}_{t}^{-1}\hat{g}_{t};\\ {R}_{t}(\pi_{t}^{*})=\frac{1}{\lambda^{2}}\hat{g}_{t}^{\rm T}\hat{C}_{t}^{-1}C_{t}\hat{C}_{t}^{-1}\hat{g}_{t}.\end{cases} (39)

The readers are encouraged to examine the differences between the two equations carefully.
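The difference between the two lines of Eq. (39) is easy to explore numerically. The sketch below (the helper names are ours) constructs the plug-in portfolio of Eq. (38) from estimated moments and evaluates its risk under both the estimated and the true covariance:

```python
import numpy as np

def markowitz_portfolio(g_hat, C_hat, lam=1.0):
    """Eq. (38): pi* = C^{-1} g / lambda, via a linear solve."""
    return np.linalg.solve(C_hat, g_hat) / lam

def in_sample_risk(g_hat, C_hat, lam=1.0):
    """First line of Eq. (39): risk under the estimated covariance."""
    pi = markowitz_portfolio(g_hat, C_hat, lam)
    return float(pi @ C_hat @ pi)

def true_risk(g_hat, C_hat, C_true, lam=1.0):
    """Second line of Eq. (39): same portfolio, risk under the true covariance."""
    pi = markowitz_portfolio(g_hat, C_hat, lam)
    return float(pi @ C_true @ pi)
```

When $\hat{C}=C$, the two risks coincide; an underestimated covariance makes the true risk exceed the in-sample risk, which is the failure mode described above.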

C.2 Analogy to Statistical Decision Theory and the Minimax Formulation

In Section 4.6, we mentioned that our procedure is analogous to finding a Bayesian estimator in statistical decision theory. Here, we explain this analogy in a little more detail (but keep in mind that this is only an analogy, not a rigorous equivalence). The empirical utility in Equation 14 can be seen as the analogue of the statistical risk function $R$; finding the optimal data augmentation strength is similar to finding the best Bayesian estimator. To make an exact correspondence with statistical decision theory, we also need to define a prior over the risk in Equation 14:

r:=𝔼p(Ω)[V(π^(θ))]=𝔼p(Ω)𝔼xp(x;Ω)𝔼{xi}NpN(x)[U(x,π^(θ))]r:=\mathbb{E}_{p(\Omega)}[V(\hat{\pi}(\theta))]=\mathbb{E}_{p(\Omega)}\mathbb{E}_{x\sim p(x;\Omega)}\mathbb{E}_{\{x_{i}\}^{N}\sim p^{N}(x)}[U(x,\hat{\pi}(\theta))] (40)

where we have written the distributions p(x;Ω)p(x;\Omega) as a function of the true parameters in our underlying model.111111For example, in the GBM model, the true parameters are the growth rate rr and the volatility σ\sigma, and so Ω=(r,σ)\Omega=(r,\sigma), and p(Ω)=p(r,σ)p(\Omega)=p(r,\sigma). In the main text, we have effectively assumed that p(Ω)p(\Omega) is a Dirac delta distribution, but, in the more general case, it is possible that the true parameter is not known or cannot be accurately estimated, and it makes sense to assign a prior distribution to them. One can then find the optimal data augmentation with respect to rr: θ=argmaxθr\theta^{*}=\arg\max_{\theta}r.

Financial Terms ↔ Statistical Terms
Utility $U$ ↔ Loss function $L$
Expected Utility $V$ ↔ Risk $R$
Data Augmentation Parameter $\theta$ ↔ Estimator $\hat{\theta}$
True Parameter $\Omega$ ↔ True Parameter $\theta$
Prior $p(\Omega)$ ↔ Prior $p(\theta)$
Table 2: Correspondence between the concepts in our general theory and those of classical statistical decision theory.

See Table 2 for the list of correspondences. The analogy breaks down at the following point: the Bayesian estimator tries to find $\hat{\theta}$ as close to $\theta$ as possible, while, in our formulation, the goal is not to make the data augmentation $\theta$ as close as possible to $\Omega$.

One might also hope to establish an analogous “minimax” theory for the portfolio construction problem. This can be done simply by replacing the expectation over p(Ω)p(\Omega) with a minimization over Ω\Omega:

rminimax:=minΩV(π^(θ))r_{minimax}:=\min_{\Omega}V(\hat{\pi}(\theta)) (41)

and the best augmentation parameter can be found as the maximizer of this risk: $\theta^{*}=\arg\max_{\theta}\min_{\Omega}V$.

C.3 Proof for no data augmentation

When there is no data augmentation, $\mathbb{E}_{t}\left[G_{t}(\pi)\right]=\pi_{t}r_{t}$ and $\text{Var}_{t}\left[G_{t}(\pi)\right]=0$. One immediately sees that the utility is then maximized at

πt={1,if rt01,if rt<0.\pi_{t}^{*}=\begin{cases}1,&\text{if }r_{t}\geq 0\\ -1,&\text{if }r_{t}<0.\end{cases} (42)

We restate the theorem here.

Proposition 4.

(Utility of no-data augmentation strategy.) Let the strategy be as specified in Eq. (42), and let the price trajectory be generated with GBM in Eq. (6) with initial price S0S_{0}, then the true utility is

U=[12Φ(r/σ)]rλ2σ2U=[1-2\Phi({-r}/{\sigma})]r-\frac{\lambda}{2}\sigma^{2} (43)

where U(π)U(\pi) is the utility function defined in Eq. (3).

Proof. For a time-dependent strategy $\pi^{*}_{t}$, the true utility is defined as121212While we mainly use $\Theta(x)$ as the Heaviside step function, we overload this notation slightly: when we write $\Theta(x>0)$, $\Theta$ is the indicator function. This is harmless because the difference is clearly indicated by the argument of the function.

U(\pi^{*})=\mathbb{E}_{S_{0}^{\prime},S_{1}^{\prime},...,S_{T+1}^{\prime}}\left[\frac{1}{T}\sum_{t=1}^{T}\pi_{t}^{*}r_{t}^{\prime}-\frac{\lambda}{2T}\sum_{t=1}^{T}\left((\pi_{t}^{*}r_{t}^{\prime})^{2}-\mathbb{E}_{S_{0}^{\prime},S_{1}^{\prime},...,S_{T+1}^{\prime}}[\pi_{t}^{*}r_{t}^{\prime}]^{2}\right)\right] (44)

where $S_{1}^{\prime},...,S_{T+1}^{\prime}$ is an independently sampled trajectory used for testing, and $r_{t}^{\prime}:=\frac{S_{t+1}^{\prime}-S_{t}^{\prime}}{S_{t}^{\prime}}$ are the respective returns. Now, note that we can write the price update equation (the GBM model) in terms of the returns:

S_{t+1}=(1+r)S_{t}+\sigma S_{t}\eta_{t}\quad\to\quad r_{t}=r+\sigma\eta_{t} (45)

which means that rt𝒩(r,σ2)r_{t}\sim\mathcal{N}(r,\sigma^{2}) obeys a Gaussian distribution. Therefore,

U(π)=rTt=1Tπtλσ22Tt=1T(πt)2.U(\pi^{*})=\frac{r}{T}\sum_{t=1}^{T}\pi_{t}^{*}-\frac{\lambda\sigma^{2}}{2T}\sum_{t=1}^{T}(\pi_{t}^{*})^{2}. (46)

Now we would like to average over $\pi^{*}_{t}$; that is, we average over the realizations of the training set so that the true utility is independent of the sampling of the training set (see Section 4.6 for an explanation).

Recall that the strategy is defined as

\pi_{t}^{*}=\begin{cases}1,&\text{if }r_{t}\geq 0\\ -1,&\text{if }r_{t}<0\end{cases}=\Theta(r_{t}\geq 0)-\Theta(r_{t}<0) (47)

for a training set {S0,,ST}\{S_{0},...,S_{T}\}, and Θ\Theta is the Heaviside step function. We thus have that

{(πt)2=1;𝔼S1,ST+1[πt]=𝔼S1,ST+1[Θ(rt0)Θ(rt<0)]=12Φ(r/σ)\begin{cases}(\pi^{*}_{t})^{2}=1;\\ \mathbb{E}_{S_{1},...S_{T+1}}[\pi_{t}^{*}]=\mathbb{E}_{S_{1},...S_{T+1}}[\Theta(r_{t}\geq 0)-\Theta(r_{t}<0)]=1-2\Phi({-r}/{\sigma})\end{cases} (48)

where $\Phi$ is the Gaussian c.d.f. We can use this to average the utility over the training set; noting that the training set and the test set are independent, we obtain

U\displaystyle U =𝔼S1,ST+1[U(π)]\displaystyle=\mathbb{E}_{S_{1},...S_{T+1}}[U(\pi^{*})] (49)
=1Tt=1T[12Φ(r/σ)]rλ21Tt=1Tσ2\displaystyle=\frac{1}{T}\sum_{t=1}^{T}[1-2\Phi({-r}/{\sigma})]r-\frac{\lambda}{2}\frac{1}{T}\sum_{t=1}^{T}\sigma^{2} (50)
=[12Φ(r/σ)]rλ2σ2.\displaystyle=[1-2\Phi({-r}/{\sigma})]r-\frac{\lambda}{2}\sigma^{2}. (51)

This finishes the proof. \square
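The closed form of Eq. (43) is easy to verify by Monte Carlo. The sketch below (helper names are ours) simulates training returns $r_t\sim\mathcal{N}(r,\sigma^2)$, applies the sign strategy of Eq. (42), and compares the averaged utility of Eq. (46) with Eq. (51):

```python
import numpy as np
from math import erf, sqrt

def Phi(x):
    """Standard Gaussian c.d.f."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def utility_no_aug_closed_form(r, sigma, lam):
    """Eq. (43)/(51): [1 - 2 Phi(-r/sigma)] r - lam sigma^2 / 2."""
    return (1.0 - 2.0 * Phi(-r / sigma)) * r - 0.5 * lam * sigma**2

def utility_no_aug_mc(r, sigma, lam, n=400_000, seed=0):
    """Monte Carlo over training sets: Eq. (46) with pi_t = sgn(r_t)."""
    rng = np.random.default_rng(seed)
    r_train = r + sigma * rng.standard_normal(n)
    pi = np.sign(r_train)              # strategy of Eq. (42)
    return float(r * pi.mean() - 0.5 * lam * sigma**2 * np.mean(pi**2))
```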

C.4 Proof for Additive Gaussian noise

Before proving the proposition, we first show that the strategy is indeed the one given in Eq. (52):

Lemma 1.

The maximizer of the utility function in Eq. 4 with additive Gaussian noise is

πt(ρ)={rtSt22λρ2,if 1<rtSt22λρ2<1;sgn(rt),otherwise.\pi^{*}_{t}(\rho)=\begin{cases}\frac{r_{t}S_{t}^{2}}{2\lambda\rho^{2}},&\text{if }-1<\frac{r_{t}S_{t}^{2}}{2\lambda\rho^{2}}<1;\\ {\rm sgn}(r_{t}),&\text{otherwise.}\end{cases} (52)

Proof. With additive Gaussian noise, we have

{𝔼t[Gt(π)]=πt𝔼t[r~t]=πt𝔼t[St+1+ρt+1ϵt+1StρtϵtSt]=πtSt+1StSt=πtrt;Vart[Gt(π)]=πt2Vart[r~t]=πt2Vart[ρt+1ϵt+1ρtϵtSt]=2ρ2πt2St2,\begin{cases}\mathbb{E}_{t}\left[G_{t}(\pi)\right]=\pi_{t}\mathbb{E}_{t}\left[\tilde{r}_{t}\right]=\pi_{t}\mathbb{E}_{t}\left[\frac{S_{t+1}+\rho_{t+1}\epsilon_{t+1}-S_{t}-\rho_{t}\epsilon_{t}}{S_{t}}\right]=\pi_{t}\frac{S_{t+1}-S_{t}}{S_{t}}=\pi_{t}r_{t};\\ \text{Var}_{t}\left[G_{t}(\pi)\right]=\pi_{t}^{2}\text{Var}_{t}[\tilde{r}_{t}]=\pi_{t}^{2}\text{Var}_{t}\left[\frac{\rho_{t+1}\epsilon_{t+1}-\rho_{t}\epsilon_{t}}{S_{t}}\right]=\frac{2\rho^{2}\pi_{t}^{2}}{S_{t}^{2}},\\ \end{cases} (53)

where the last line follows from the definition of additive Gaussian noise, i.e., $\rho_1=\dots=\rho_T=\rho$. The training objective becomes

πt\displaystyle\pi^{*}_{t} =argmaxπt{1Tt=1T𝔼t[Gt(π)]λ2Vart[Gt(π)]}\displaystyle=\arg\max_{\pi_{t}}\left\{\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{t}\left[G_{t}(\pi)\right]-\frac{\lambda}{2}\text{Var}_{t}[G_{t}(\pi)]\right\} (54)
=argmaxπt{1Tt=1Tπtrtλρ2πt2St2}.\displaystyle=\arg\max_{\pi_{t}}\left\{\frac{1}{T}\sum_{t=1}^{T}\pi_{t}r_{t}-\lambda\frac{\rho^{2}\pi_{t}^{2}}{S_{t}^{2}}\right\}. (55)

This maximization problem can be solved for each $t$ separately. Taking the derivative and setting it to zero, we find the condition that $\pi_t^*$ satisfies

πt(πtrtλρ2πt2St2)=0\displaystyle\frac{\partial}{\partial\pi_{t}}\left(\pi_{t}r_{t}-\lambda\frac{\rho^{2}\pi_{t}^{2}}{S_{t}^{2}}\right)=0 (56)
\displaystyle\longrightarrow\quad πt(ρ)=rtSt22λρ2.\displaystyle\pi^{*}_{t}(\rho)=\frac{r_{t}S_{t}^{2}}{2\lambda\rho^{2}}. (57)

By definition, we also have |πt|1|\pi_{t}|\leq 1, and so

πt(ρ)={rtSt22λρ2,if 1<rtSt22λρ2<1;sgn(rt),otherwise,\pi^{*}_{t}(\rho)=\begin{cases}\frac{r_{t}S_{t}^{2}}{2\lambda\rho^{2}},&\text{if }-1<\frac{r_{t}S_{t}^{2}}{2\lambda\rho^{2}}<1;\\ {\rm sgn}(r_{t}),&\text{otherwise,}\end{cases} (58)

which is the desired result. \square

We would like to comment that, although we paid special attention to enforcing the constraint $|\pi_t|\leq 1$, it is often not needed in practice because investors tend to be quite risk-averse, and it is hard to imagine that any investor would invest all of his or her money in the financial market such that $\pi_t=1$. Mathematically, this means that it is often the case that $\frac{|r_t|S_t^2}{2\lambda\rho^2}\leq 1$. Therefore, in what follows, we assume that $\frac{|r_t|S_t^2}{2\lambda\rho^2}\leq 1$ for all $t$ for notational simplicity; note that, even without this assumption, the conclusion that multiplicative noise is the better kind of data augmentation does not change. Now we are ready to prove the proposition.

Proposition 5.

(Utility of additive Gaussian noise strategy.) Let the strategy be as specified in Eq. (52), and other conditions the same as in Proposition 1, then the true utility is

U_{\rm Add}=\frac{r^{2}}{2\lambda\sigma^{2}T}\mathbb{E}_{S_{t}}\left[\frac{(\sum_{t}r_{t}S_{t}^{2})^{2}}{\sum_{t}(r_{t}S_{t}^{2})^{2}}\Theta\left(\sum_{t}r_{t}S_{t}^{2}\right)\right]. (59)

Proof. The beginning of the proof is similar to the case with no data augmentation. Following the same procedure, we obtain an equation that is the same as Eq. (46):

U(π)=rTt=1Tπtλσ22Tt=1T(πt)2.U(\pi^{*})=\frac{r}{T}\sum_{t=1}^{T}\pi_{t}^{*}-\frac{\lambda\sigma^{2}}{2T}\sum_{t=1}^{T}(\pi_{t}^{*})^{2}. (60)

Plugging in the preceding lemma, we have

U(π)=rTt=1TrtSt22λρ2λσ22Tt=1T(rtSt22λρ2)2.U(\pi^{*})=\frac{r}{T}\sum_{t=1}^{T}\frac{r_{t}S_{t}^{2}}{2\lambda\rho^{2}}-\frac{\lambda\sigma^{2}}{2T}\sum_{t=1}^{T}\left(\frac{r_{t}S_{t}^{2}}{2\lambda\rho^{2}}\right)^{2}. (61)

This utility is a function of the data augmentation strength $\rho$. For a fixed training set, we would like to find the best $\rho$ that maximizes the above utility. Note that the maximizer of the utility is different depending on the sign of $\sum_t r_tS_t^2$. Taking the derivative and setting it to zero, we obtain

(\rho^{*})^{2}=\begin{cases}\frac{\sigma^{2}}{2r}\frac{\sum_{t}^{T}(r_{t}S_{t}^{2})^{2}}{\sum_{t}^{T}r_{t}S_{t}^{2}},&\text{if }\sum_{t}^{T}r_{t}S_{t}^{2}>0\\ \infty,&\text{otherwise.}\\ \end{cases} (62)

Plugging this into the previous lemma, we have

\pi_{t}^{*}(\rho^{*})=\frac{r_{t}S_{t}^{2}}{2\lambda(\rho^{*})^{2}}=\frac{rr_{t}S_{t}^{2}}{\lambda\sigma^{2}}\frac{\sum_{t}r_{t}S_{t}^{2}}{\sum_{t}(r_{t}S_{t}^{2})^{2}}\Theta\left(\sum_{t}r_{t}S_{t}^{2}\right). (63)

Notice that the optimal strength is independent of $\lambda$, which is an arbitrary value dependent only on the investor's psychology. Plugging into the utility function and taking the expectation with respect to the training set, we obtain

U_{\rm add} =\mathbb{E}_{S_{1},...,S_{T+1}}\left[U(\pi^{*}(\rho^{*}))\right] (64)
=\mathbb{E}_{S_{1},...,S_{T+1}}\left[\frac{r}{T}\sum_{t=1}^{T}\pi_{t}^{*}(\rho^{*})-\frac{\lambda\sigma^{2}}{2T}\sum_{t=1}^{T}[\pi_{t}^{*}(\rho^{*})]^{2}\right] (65)
=\frac{r^{2}}{2\lambda\sigma^{2}T}\mathbb{E}_{S_{1},...,S_{T+1}}\left[\frac{(\sum_{t}r_{t}S_{t}^{2})^{2}}{\sum_{t}(r_{t}S_{t}^{2})^{2}}\Theta\left(\sum_{t}r_{t}S_{t}^{2}\right)\right]. (66)

This finishes the proof. \square

Remark.

Notice that the term inside the expectation generally depends on $T$ in a non-trivial way and cannot be evaluated explicitly. However, this causes no problem, since the final goal is to compare it with the result of the next section.

C.5 Proof for General Multiplicative Gaussian noise

Before proving the proposition, we first find the strategy for this case. Note that the term $\rho_t^2+\rho_{t+1}^2$ appears repeatedly in this section, so we define the shorthand notation $\gamma_t^2:=\frac{1}{2}(\rho_t^2+\rho_{t+1}^2)$.

Lemma 2.

The maximizer of the utility function in Eq. 4 with multiplicative Gaussian noise is

πt(ρ)={rtSt22λγt2=rtSt2λ(ρt2+ρt+12),if 1<rtSt22λγt2<1;sgn(rt),otherwise.\pi^{*}_{t}(\rho)=\begin{cases}\frac{r_{t}S_{t}^{2}}{2\lambda\gamma_{t}^{2}}=\frac{r_{t}S_{t}^{2}}{\lambda(\rho_{t}^{2}+\rho_{t+1}^{2})},&\text{if }-1<\frac{r_{t}S_{t}^{2}}{2\lambda\gamma_{t}^{2}}<1;\\ {\rm sgn}(r_{t}),&\text{otherwise.}\end{cases} (67)

Proof. With multiplicative Gaussian noise, we have

\begin{cases}\mathbb{E}_{t}\left[G_{t}(\pi)\right]=\pi_{t}\mathbb{E}_{t}\left[\tilde{r}_{t}\right]=\pi_{t}\mathbb{E}_{t}\left[\frac{S_{t+1}+\rho_{t+1}\epsilon_{t+1}-S_{t}-\rho_{t}\epsilon_{t}}{S_{t}}\right]=\pi_{t}\frac{S_{t+1}-S_{t}}{S_{t}}=\pi_{t}r_{t};\\ \text{Var}_{t}\left[G_{t}(\pi)\right]=\pi_{t}^{2}\text{Var}_{t}[\tilde{r}_{t}]=\pi_{t}^{2}\text{Var}_{t}\left[\frac{\rho_{t+1}\epsilon_{t+1}-\rho_{t}\epsilon_{t}}{S_{t}}\right]=\frac{(\rho_{t}^{2}+\rho_{t+1}^{2})\pi_{t}^{2}}{S_{t}^{2}}.\\ \end{cases} (68)

We see that $\gamma_t^2:=\frac{1}{2}(\rho_t^2+\rho_{t+1}^2)$ plays the role that $\rho^2$ played for additive Gaussian noise. The training objective becomes

πt\displaystyle\pi^{*}_{t} =argmaxπt{1Tt=1T𝔼t[Gt(π)]λ2Vart[Gt(π)]}\displaystyle=\arg\max_{\pi_{t}}\left\{\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{t}\left[G_{t}(\pi)\right]-\frac{\lambda}{2}\text{Var}_{t}[G_{t}(\pi)]\right\} (69)
=argmaxπt{1Tt=1Tπtrtλγt2πt2St2}.\displaystyle=\arg\max_{\pi_{t}}\left\{\frac{1}{T}\sum_{t=1}^{T}\pi_{t}r_{t}-\lambda\frac{\gamma_{t}^{2}\pi_{t}^{2}}{S_{t}^{2}}\right\}. (70)

This maximization problem can be solved for each $t$ separately. Taking the derivative and setting it to zero, we find the condition that $\pi_t^*$ satisfies

πt(πtrtλγt2πt2St2)=0\displaystyle\frac{\partial}{\partial\pi_{t}}\left(\pi_{t}r_{t}-\lambda\frac{\gamma_{t}^{2}\pi_{t}^{2}}{S_{t}^{2}}\right)=0 (71)
\displaystyle\longrightarrow\quad πt(γt)=rtSt22λγt2.\displaystyle\pi^{*}_{t}(\gamma_{t})=\frac{r_{t}S_{t}^{2}}{2\lambda\gamma_{t}^{2}}. (72)

By definition, we also have |πt|1|\pi_{t}|\leq 1, and so

πt(ρ)={rtSt22λγt2,if 1<rtSt22λγt2<1;sgn(rt),otherwise,\pi^{*}_{t}(\rho)=\begin{cases}\frac{r_{t}S_{t}^{2}}{2\lambda\gamma_{t}^{2}},&\text{if }-1<\frac{r_{t}S_{t}^{2}}{2\lambda\gamma_{t}^{2}}<1;\\ {\rm sgn}(r_{t}),&\text{otherwise,}\end{cases} (73)

which is the desired result. \square

Remark.

For a fair comparison with the previous section, we also assume that $\frac{|r_t|S_t^2}{2\lambda\gamma_t^2}\leq 1$. Again, this amounts to assuming that investors are reasonably risk-averse, which is the correct assumption in all practical circumstances.

Proposition 6.

(Utility of general multiplicative Gaussian noise strategy.) Let the strategy be as specified in Eq. (67), and other conditions be the same as in Proposition 1; then the true utility is

Umult=r22λσ2[1Φ(r/σ)].U_{\rm mult}=\frac{r^{2}}{2\lambda\sigma^{2}}[1-\Phi(-r/\sigma)]. (74)

Proof. Most of the proof is similar to the additive Gaussian case, with $\rho^2$ replaced by $\gamma_t^2$. Following the same procedure, we have:

U(π)=rTt=1Tπtλσ22Tt=1T(πt)2.U(\pi^{*})=\frac{r}{T}\sum_{t=1}^{T}\pi_{t}^{*}-\frac{\lambda\sigma^{2}}{2T}\sum_{t=1}^{T}(\pi_{t}^{*})^{2}. (75)

Plugging in the preceding lemma, we have

U(π)=rTt=1TrtSt22λγt2λσ22Tt=1T(rtSt22λγt2)2.U(\pi^{*})=\frac{r}{T}\sum_{t=1}^{T}\frac{r_{t}S_{t}^{2}}{2\lambda\gamma_{t}^{2}}-\frac{\lambda\sigma^{2}}{2T}\sum_{t=1}^{T}\left(\frac{r_{t}S_{t}^{2}}{2\lambda\gamma_{t}^{2}}\right)^{2}. (76)

This utility is a function of the data augmentation strength $\gamma_t$ and, unlike the additive Gaussian case, can be maximized term by term for each $t$. For a fixed training set, we would like to find the best $\gamma_t$ that maximizes the above utility. Note that the maximizer of the utility is different depending on the sign of $r_tS_t^2$. Taking the derivative and setting it to zero, we obtain

(γt)2={σ22rrtSt2,if rtSt2>0,otherwise.(\gamma_{t}^{*})^{2}=\begin{cases}\frac{\sigma^{2}}{2r}r_{t}S_{t}^{2},&\text{if }r_{t}S_{t}^{2}>0\\ \infty,&\text{otherwise.}\\ \end{cases} (77)

Plugging this into the previous lemma, we have

πt(γt)=rtSt22λ(γt)2=rλσ2Θ(rt).\pi_{t}^{*}(\gamma_{t}^{*})=\frac{r_{t}S_{t}^{2}}{2\lambda(\gamma_{t}^{*})^{2}}=\frac{r}{\lambda\sigma^{2}}\Theta(r_{t}). (78)

Notice that the optimal strength is independent of $\lambda$, which is an arbitrary value dependent only on the psychology of the investor. Plugging into the utility function and taking the expectation with respect to the training set, we obtain

U_{\rm mult} =\mathbb{E}_{S_{1},...,S_{T+1}}\left[U(\pi^{*}(\gamma^{*}))\right] (79)
=\mathbb{E}_{S_{1},...,S_{T+1}}\left[\frac{r}{T}\sum_{t=1}^{T}\pi_{t}^{*}(\gamma_{t}^{*})-\frac{\lambda\sigma^{2}}{2T}\sum_{t=1}^{T}[\pi_{t}^{*}(\gamma_{t}^{*})]^{2}\right] (80)
=\frac{r^{2}}{\lambda\sigma^{2}}\mathbb{E}_{t}[\Theta(r_{t})]-\frac{r^{2}}{2\lambda\sigma^{2}}\mathbb{E}_{t}[\Theta(r_{t})^{2}] (81)
=\frac{r^{2}}{2\lambda\sigma^{2}}[1-\Phi(-r/\sigma)] (82)
=\frac{r^{2}}{2\lambda\sigma^{2}}\Phi(r/\sigma). (83)

This finishes the proof. \square
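As for the no-augmentation case, the closed form $U_{\rm mult}=\frac{r^2}{2\lambda\sigma^2}\Phi(r/\sigma)$ can be checked by Monte Carlo, using the learned portfolio $\pi_t=\frac{r}{\lambda\sigma^2}\Theta(r_t)$ of Eq. (78) (the helper names are ours):

```python
import numpy as np
from math import erf, sqrt

def Phi(x):
    """Standard Gaussian c.d.f."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def utility_mult_closed_form(r, sigma, lam):
    """Eq. (83): r^2 Phi(r/sigma) / (2 lam sigma^2)."""
    return r**2 / (2.0 * lam * sigma**2) * Phi(r / sigma)

def utility_mult_mc(r, sigma, lam, n=400_000, seed=0):
    """Average Eq. (75) over training sets with pi_t from Eq. (78)."""
    rng = np.random.default_rng(seed)
    r_train = r + sigma * rng.standard_normal(n)
    pi = (r / (lam * sigma**2)) * (r_train > 0)   # pi_t = (r/lam sigma^2) Theta(r_t)
    return float(r * pi.mean() - 0.5 * lam * sigma**2 * np.mean(pi**2))
```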

This result can be directly compared to the results in the previous section, and the following remark shows that the multiplicative noise injection is the best kind of noise.

Remark.

(Infinite augmentation strength.) For all of the theoretical results, there is a corner case in which the optimal injection strength is infinite, which leads to a non-investing portfolio $\pi=0$. This case requires special interpretation. The corner case is due to the fact that the underlying model we use has a constant, positive expected price return equal to $r$, which leads to the bizarre data augmentation that effectively amounts to throwing away all training points with negative return. This is unnatural for a real market: the real market can have short-term negative expected return when conditioned on previous prices, so one should not simply discard the negative points. Therefore, in our algorithm section, we recommend treating training points with positive and negative returns equally by taking the absolute value inside the data augmentation strength and ignoring the $\infty$ case.

C.6 Proof of Theorem 1

Remark.

Combining the above propositions, one quickly obtains that, if $\sigma\neq 0$, then $U_{\rm mult}>U_{\rm add}$ and $U_{\rm mult}>U_{\rm no-aug}$ with probability $1$.

Proof. We first show that Umult>UaddU_{\rm mult}>U_{\rm add}. Recall that

U_{\rm Add} =\frac{r^{2}}{2\lambda\sigma^{2}T}\mathbb{E}_{S_{t}}\left[\frac{(\sum_{t}r_{t}S_{t}^{2})^{2}}{\sum_{t}(r_{t}S_{t}^{2})^{2}}\Theta\left(\sum_{t}r_{t}S_{t}^{2}\right)\right] (84)
\leq\frac{r^{2}}{2\lambda\sigma^{2}T}\mathbb{E}_{S_{t}}\left[\frac{\left(\sum_{t}r_{t}S_{t}^{2}\Theta(r_{t}>0)\right)^{2}}{\sum_{t}(r_{t}S_{t}^{2})^{2}}\Theta\left(\sum_{t}r_{t}S_{t}^{2}\right)\right] (85)
\leq\frac{r^{2}}{2\lambda\sigma^{2}T}\mathbb{E}_{S_{t}}\left[\frac{(\sum_{t}r_{t}S_{t}^{2}\Theta(r_{t}>0))^{2}}{\sum_{t}(r_{t}S_{t}^{2})^{2}}\right] (86)
\leq_{\text{(Cauchy inequality)}}\frac{r^{2}}{2\lambda\sigma^{2}T}\mathbb{E}_{S_{t}}\left[\sum_{t}\frac{(r_{t}S_{t}^{2}\Theta(r_{t}>0))^{2}}{(r_{t}S_{t}^{2})^{2}}\right] (87)
=\frac{r^{2}}{2\lambda\sigma^{2}T}\mathbb{E}_{S_{t}}\left[\sum_{t}\Theta(r_{t}>0)\right] (88)
=\frac{r^{2}}{2\lambda\sigma^{2}T}\sum_{t}\mathbb{P}(r_{t}>0) (89)
=\frac{r^{2}}{2\lambda\sigma^{2}}\Phi(r/\sigma)=U_{\rm mult}. (90)

The Cauchy inequality holds with equality if and only if $S_1=\dots=S_{T+1}$; this event has probability measure $0$, and so, with probability $1$, $U_{\rm add}<U_{\rm mult}$.

Now we prove the second inequality. Recall that

Unoaug=[12Φ(r/σ)]rλ2σ2.U_{\rm no-aug}=[1-2\Phi({-r}/{\sigma})]r-\frac{\lambda}{2}\sigma^{2}. (91)

Write $\phi:=\Phi(r/\sigma)$, so that $1-2\Phi(-r/\sigma)=2\phi-1$ and, since $r>0$, $\phi\in(1/2,1)$. Then

U_{\rm no-aug}=(2\phi-1)r-\frac{\lambda}{2}\sigma^{2},\qquad U_{\rm mult}=\frac{r^{2}}{2\lambda\sigma^{2}}\phi. (92)

By the AM–GM inequality,

U_{\rm mult}+\frac{\lambda}{2}\sigma^{2}=\frac{r^{2}\phi}{2\lambda\sigma^{2}}+\frac{\lambda\sigma^{2}}{2}\geq 2\sqrt{\frac{r^{2}\phi}{4}}=r\sqrt{\phi}, (93)

and therefore

U_{\rm mult}-U_{\rm no-aug}\geq r\sqrt{\phi}-(2\phi-1)r=r\left(\sqrt{\phi}-2\phi+1\right)>0, (94)

where the last inequality holds because $\sqrt{\phi}>2\phi-1$ for all $\phi\in(1/2,1)$.

This finishes the proof. \square
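A quick numerical check of the second inequality: over a grid of risk-aversion values $\lambda$, the closed-form $U_{\rm mult}$ of Eq. (74) dominates $U_{\rm no-aug}$ of Eq. (43) for $r>0$ (helper names are ours):

```python
import numpy as np
from math import erf, sqrt

def Phi(x):
    """Standard Gaussian c.d.f."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def utility_gap(r, sigma, lam):
    """U_mult - U_noaug from Eqs. (74) and (43); positive for all lam > 0 when r > 0."""
    u_mult = r**2 / (2.0 * lam * sigma**2) * Phi(r / sigma)
    u_noaug = (1.0 - 2.0 * Phi(-r / sigma)) * r - 0.5 * lam * sigma**2
    return u_mult - u_noaug

# sweep lambda over six orders of magnitude
gaps = [utility_gap(0.05, 0.2, lam) for lam in np.logspace(-3, 3, 61)]
```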

C.7 Augmentation for a naive multiplicative noise

In the discussion and experiment sections of the main text, we also mentioned a “naive” version of the multiplicative noise. The motivation for this kind of noise is simple: the underlying noise in the theoretical models is of the form $\sigma^2S_t^2$, and so it is tempting to inject noise that mimics the underlying noise. It turns out that this is not a good idea.

In this section, we let ρt=ρ0St\rho_{t}=\rho_{0}S_{t} for some positive, time-independent ρ0\rho_{0}. Our goal is to find the optimal ρ0\rho_{0}. With the same calculation, one finds that the learned portfolio is given by the same formula in Lemma 2:

πt(ρ)=rtSt22λγt2.\pi^{*}_{t}(\rho)=\frac{r_{t}S_{t}^{2}}{2\lambda\gamma_{t}^{2}}. (96)

With this strategy, one can find the optimal noise injection strength to be given by the following proposition.

Proposition 7.

Let the portfolio be given by Eq. (96) and let the price be generated by the GBM, then the optimal noise strength is

(\rho_{0}^{*})^{2}=\begin{cases}\frac{\sigma^{2}}{2r}\frac{\sum_{t}^{T}r_{t}^{2}}{\sum_{t}^{T}r_{t}},&\text{if }\sum_{t}^{T}r_{t}>0\\ \infty,&\text{otherwise.}\\ \end{cases} (97)

Proof. As before, we have:

U(π)=rTt=1Tπtλσ22Tt=1T(πt)2.U(\pi^{*})=\frac{r}{T}\sum_{t=1}^{T}\pi_{t}^{*}-\frac{\lambda\sigma^{2}}{2T}\sum_{t=1}^{T}(\pi_{t}^{*})^{2}. (98)

Plugging in the portfolio, we have

U(π)=rTt=1TrtSt22λγt2λσ22Tt=1T(rtSt22λγt2)2.U(\pi^{*})=\frac{r}{T}\sum_{t=1}^{T}\frac{r_{t}S_{t}^{2}}{2\lambda\gamma_{t}^{2}}-\frac{\lambda\sigma^{2}}{2T}\sum_{t=1}^{T}\left(\frac{r_{t}S_{t}^{2}}{2\lambda\gamma_{t}^{2}}\right)^{2}. (99)

Plugging in $\gamma_t^2=\rho_0^2S_t^2$ and taking the derivative, we obtain

(\rho_{0}^{*})^{2}=\begin{cases}\frac{\sigma^{2}}{2r}\frac{\sum_{t}^{T}r_{t}^{2}}{\sum_{t}^{T}r_{t}},&\text{if }\sum_{t}^{T}r_{t}>0\\ \infty,&\text{otherwise.}\\ \end{cases} (100)

This finishes the proof. \square

C.8 Data Augmentation for a Stationary Portfolio

While the main theory focused on a dynamic portfolio that is updated through time, we also present a study of the stationary portfolio in this section. While this kind of portfolio is less relevant for deep-learning-based finance, we study this case to show that, even in this setting, there is still strong motivation to inject noise of strength proportional to $r_tS_t^2$. We now state the formal definition of a stationary portfolio.

Definition 1.

A portfolio {πt}t=1T\{\pi_{t}\}_{t=1}^{T} is said to be stationary if πt=π\pi_{t}=\pi for some constant π\pi for all tt.

In the language of machine learning, this corresponds to choosing our model as having only a single parameter, whose output is input independent:

f(x)=π.f(x)=\pi. (101)

In traditional finance theory, stationary portfolios have been very important. In practice, most portfolios are “approximately” stationary, since most portfolio managers tend not to change their portfolio weights on a very short time scale unless the market is very unstable due to market failure or external information injection. For a stationary portfolio, one still would like to maximize the utility function given in Eq. 4.

For conciseness, we only examine the case of injecting a general time-dependent noise. The curious reader is encouraged to examine the cases with no data augmentation and with constant data augmentation. As before, the following lemma gives the portfolio that maximizes the empirical utility. Again, we use the shorthand $\gamma_t^2:=\frac{1}{2}(\rho_t^2+\rho_{t+1}^2)$. For illustrative purposes, we ignore the corner cases of $\pi$ being greater than $1$ or smaller than $-1$.

Lemma 3.

The stationary portfolio that maximizes the utility function in Eq. 4 with multiplicative Gaussian noise is

\pi^{*}(\rho)=\frac{\sum_{t}^{T}r_{t}}{2\lambda\sum_{t}^{T}(\gamma_{t}^{2}/S_{t}^{2})} (102)

Proof sketch. The proof follows almost the same steps as in the previous sections, with the slight difference that $\pi$ is no longer time-dependent and can be taken out of the sum.

Proposition 8.

Let the portfolio be given by Eq. (102). Then an augmentation strength satisfying the relation

(γt)2=crtSt2(\gamma_{t}^{*})^{2}=cr_{t}S_{t}^{2} (103)

is an optimal data augmentation for constant c=2σ2rc=\frac{2\sigma^{2}}{r}.

Proof Sketch. The proof is also simple and very similar to the proofs for the dynamic portfolio case. One first solves for the optimal augmentation strength γt\gamma_{t}^{*} and finds that it satisfies the following relation

t=1T(γt)2St2=ct=1Trt\sum_{t=1}^{T}\frac{(\gamma^{*}_{t})^{2}}{S_{t}^{2}}=c\sum_{t=1}^{T}r_{t} (104)

with c=2σ2rc=\frac{2\sigma^{2}}{r}, and then it suffices to check that the following is one solution

(γt)2=crtSt2.(\gamma^{*}_{t})^{2}=cr_{t}S_{t}^{2}. (105)

This finishes the proof sketch. \square
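The key step, that Eq. (105) solves Eq. (104), can be checked directly: substituting (γt)2=crtSt2(\gamma_{t}^{*})^{2}=cr_{t}S_{t}^{2} into the left-hand side makes the St2S_{t}^{2} cancel. A minimal numerical check (all data hypothetical; we use |rt||r_{t}| so the noise variance stays nonnegative, in line with the |rt1||r_{t-1}| prescription in the main text):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 100
r = np.abs(rng.normal(1e-3, 0.02, T))  # |r_t|: noise variance must be >= 0
S_sq = rng.uniform(0.5, 2.0, T)        # hypothetical squared prices S_t^2
sigma = 0.02
c = 2.0 * sigma**2 / r.mean()          # c = 2*sigma^2/r, with r a mean return

gamma_star_sq = c * r * S_sq           # Eq. (105)
lhs = (gamma_star_sq / S_sq).sum()     # left-hand side of Eq. (104)
rhs = c * r.sum()                      # right-hand side of Eq. (104)
assert np.isclose(lhs, rhs)            # the S_t^2 factors cancel exactly
```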

The interested reader is encouraged to check the intermediate steps. We see that, even in the stationary portfolio case, there is still strong motivation for using an augmentation with strength proportional to rtSt2r_{t}S_{t}^{2}.

We would like to compare this with the best achievable stationary portfolio, which is given by the following proposition.

Proposition 9.

(Optimal Stationary Portfolio). The optimal stationary portfolio for GBM is πstat=rλσ2\pi^{*}_{stat}=\frac{r}{\lambda\sigma^{2}}, i.e., for any other portfolio π\pi, U(πstat)U(π)U(\pi^{*}_{stat})\geq U(\pi).

Proof. It suffices to find the maximizer portfolio of the true utility:

πstat=argmaxπ{πrλ2π2σ2}.\pi^{*}_{stat}=\arg\max_{\pi}\left\{\pi r-\frac{\lambda}{2}\pi^{2}\sigma^{2}\right\}. (106)

The solution is simple and given by

πstat=rλσ2.\pi^{*}_{stat}=\frac{r}{\lambda\sigma^{2}}. (107)

This completes the proof. \square
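The closed form in Eq. (107) can be confirmed against a brute-force grid search over the objective in Eq. (106); the parameter values below are arbitrary illustrations:

```python
import numpy as np

r, lam, sigma = 0.05, 2.0, 0.2         # arbitrary illustrative parameters

def true_utility(pi):
    # objective of Eq. (106): pi*r - (lam/2)*pi^2*sigma^2
    return pi * r - 0.5 * lam * pi**2 * sigma**2

pi_closed = r / (lam * sigma**2)       # Eq. (107): r / (lam * sigma^2)
grid = np.linspace(-2.0, 2.0, 100001)  # brute-force search over pi
pi_grid = grid[np.argmax(true_utility(grid))]
assert abs(pi_closed - pi_grid) < 1e-3  # grid maximizer matches closed form
```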

Note that the above optimality result holds for both the in-sample counterfactual utility and the out-of-sample utility. This proposition can be seen as the discrete-time version of the famous Merton portfolio solution (Merton,, 1969), where the optimal stationary portfolio is also found to be rλσ2\frac{r}{\lambda\sigma^{2}}. In fact, it is well-known that, for a static market, stationary portfolios are optimal, but this is beyond the scope of this work (Cover and Thomas,, 2006).

Combining the above two propositions, one obtains the following theorem.

Theorem 3.

The stationary portfolio obtained by training with the data augmentation strength given in Proposition 8 is optimal, i.e., it is no worse than any other stationary portfolio.

Proof. Plugging in (γt)2=crtSt2(\gamma_{t}^{*})^{2}=cr_{t}S_{t}^{2}, we find that the trained portfolio is π=rλσ2\pi^{*}=\frac{r}{\lambda\sigma^{2}}, which equals the optimal stationary portfolio, and we are done. \square

This shows that, even for a stationary portfolio, it is useful to use the proposed data augmentation technique.