An intelligent algorithmic trading based on a risk-return reinforcement learning algorithm
Abstract
It is a challenging problem to automatically generate trading signals based on historical transaction information and the financial status of assets. This paper proposes a novel portfolio optimization model using an improved deep reinforcement learning algorithm. The objective function of the optimization model considers both the risks and the returns of a financial portfolio. The proposed algorithm is based on an actor-critic architecture, in which the main task of the critic network is to learn the distribution of cumulative return using quantile regression, while the actor network outputs the optimal portfolio weights by maximizing the objective function mentioned above. Meanwhile, we exploit a linear transformation function to realize asset short selling. Finally, a multi-process method, called Ape-x, is used to accelerate deep reinforcement learning training. To validate the proposed approach, we conduct backtesting on two representative portfolios and observe that the proposed model is superior to the benchmark strategies.
Keywords: Deep reinforcement learning; Algorithmic trading; Actor-critic architecture; Trading strategy
1 Introduction
Algorithmic trading, which has been widely used in the financial market, is a technique that uses a computer program to automate the buying and selling of stocks, options, futures, cryptocurrencies, etc. Institutional investors such as pension, mutual, and hedge funds usually use algorithmic trading to find the most favorable execution price, reduce the impact of severe market fluctuations, and improve execution efficiency. Algorithmic trading has a far-reaching effect on the overall efficiency and microstructure of the capital market. Therefore, asset pricing, portfolio investment, and risk measurement may undergo revolutionary changes.
Classical algorithmic strategies include the arrival price strategy [1, 2], the volume-weighted average price strategy [3, 4], the time-weighted average price strategy [5], the implementation shortfall strategy [6, 7], the guerrilla strategy [8], etc. In recent years, with the rapid development of artificial intelligence (AI), more researchers have begun utilizing deep reinforcement learning to implement algorithmic trading. Deep reinforcement learning mainly uses the excellent feature representation ability of deep neural networks to fit the state, action, value, and other reinforcement learning functions in order to maximize portfolio return.
Currently, researchers have exploited different reinforcement learning algorithms to study financial trading problems, including the critic-only method (also called the value-based method), the actor-only method (also called the policy-based method), and the actor-critic method. Nevertheless, we believe these studies are not yet mature for practical application, for the following reasons. First, most studies maximize the average cumulative reward, i.e., the expectation of cumulative rewards, of a single asset or a portfolio by training a policy (such as policy gradient) or an action-value function (such as Q-learning). However, these algorithms do not consider risk and are only suitable for risk-neutral investors. A large number of studies [9, 10, 11] have shown that stock returns have fat-tail characteristics, so low-probability tail risk deserves close attention.
Additionally, short selling is allowed in many stock markets. Short selling can not only make a profit in a bear market but also reduce the speculation and volatility of the stock market. Although [12] and [13] have considered short selling, these studies focus on single asset trading. Most traders generally hold multiple securities. Unfortunately, most existing studies do not consider the short-selling problem in this common case.
Finally, the interaction between agent and environment is often very time-consuming. However, profit opportunities are fleeting in the stock market. Therefore, how to improve the training speed is also worth further study.
The main contributions of this paper are as follows:
Firstly, this study proposes a new algorithm for the actor-critic framework, called risk-return reinforcement learning(R3L). Motivated by the modern portfolio theory proposed by [14], we construct an optimization model based on risk and return. The objective function of the optimization model is the weighted sum of mean and value at risk(VaR) of portfolio cumulative return. The main goal of the actor network is to learn the portfolio weights by maximizing the objective function mentioned above. The primary purpose of the critic network is to learn the cumulative portfolio return distribution. Inspired by distributional reinforcement learning proposed by [15], we use quantile regression to estimate the parameters of the critic network.
Secondly, similar to the approach proposed by [16], we use a soft-max function to transform the output of the actor network into a new variable to meet the self-financing constraint of the portfolio. Then, a linear transformation is implemented to convert the variable mentioned above into the final output, i.e., portfolio weight. This transformation still meets the self-financing constraint. Meanwhile, it can realize short selling, control the scale of short selling and prevent radical investment strategies.
Thirdly, we leverage a multi-process algorithm, the Ape-x algorithm proposed by [17], to speed up training. The Ape-x algorithm separates data collection from strategy learning: it uses multiple parallel agents to collect experience data, shares a large experience replay buffer, and sends the collected data to the learner.
The rest of this paper is organized as follows. In Section 2, we review the related literature and present the differences between our study and previous studies. Section 3 describes the definition of our problem, and Section 4 introduces our proposed R3L algorithm. In Section 5, we detail the setups of our experiments. In Section 6, we provide the experiment result. Finally, Section 7 presents the conclusions and directions of future work.
2 Related works
In this section, we review literature works regarding deep reinforcement learning in financial trading. As mentioned in the introduction, the algorithms used in financial trading mainly include the critic-only method, the actor-only method, and the actor-critic method. Then, we will elaborate on the application of the three methods.
2.1 The critic-only method
The critic-only method is also called the value-based method. DQN (Deep Q-Learning Network) is a well-known value-based deep reinforcement learning algorithm, the core of which is to use a neural network to replace the Q-table, i.e., the action-value function. The input of the neural network is the state information, and the output is the value of each action. Therefore, the DQN algorithm can be used to solve problems with a continuous state space and a discrete action space, but it cannot handle a continuous action space.
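As a concrete illustration of the value-based setup described above, the following minimal PyTorch sketch shows a Q-network that maps a state vector to one estimated value per discrete action; the layer sizes and dimensions are illustrative and not taken from any of the cited works.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Minimal Q-network: input is a state vector, output is one Q-value per action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one value per discrete action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Greedy action selection: pick the action with the largest estimated value.
q_net = QNetwork(state_dim=10, n_actions=3)
action = q_net(torch.randn(1, 10)).argmax(dim=1)
```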
DQN and its improved algorithms have received extensive attention in research on financial trading. [18] proposed the league championship algorithm (LCA), which can extract and save various stock trading rules according to different stock market conditions. Every agent in LCA is a member of a sports team and can learn from the weaknesses and strengths of others to improve its performance. They proved that their algorithm outperforms the Buy-and-Hold and GNP-RA strategies. [19] trained an LSTM network using double Q-learning [20] and obtained positive rewards in a cryptocurrency market with a decreasing tendency. [21] proposed a Deep Recurrent Q-Network (DRQN), which can process sequences better; they found that the proposed algorithm outperformed the buy-and-hold strategy on the S&P 500 ETF history dataset. [22] also adopted a DRQN-based algorithm, applied to the foreign exchange market from 2012 to 2017, and developed an action augmentation technique to mitigate the lack of exploration in Q-learning. Experimental results showed that the annual return of some currencies was 60%, with an average of about 10%. [23] proposed a novel deep Q-learning portfolio management strategy. The framework consists of two main components: a group of local agents responsible for trading single assets and a global agent that rewards each local agent based on the same reward function. The results showed that the proposed approach is a promising method for dynamic portfolio optimization. The deep Q-learning network proposed by [24] is based on an improved DNN structure consisting of two branches, one of which learns action values while the other learns the number of shares to trade in order to maximize the objective function. [25] used an LSTM-based Q-network to implement portfolio optimization; in addition, to account for risk factors in portfolio management, the volatility of portfolio return was added to the reward function. [12] discretized the action space by fixing the trading amount, so that the trader can buy or sell a specific asset only in that fixed amount; short selling is not allowed, and a mapping function is designed to transform infeasible actions into feasible ones.
2.2 Actor-only method
In the actor-only method, the action taken by the agent is learned directly by a neural network. The advantage of this method is that it can handle continuous action spaces. [26] constructed a recursive deep neural network for environmental perception and recursive decision-making. Deep learning is used for feature extraction and combined with fuzzy learning to reduce the uncertainty of the input data. Another novelty of that paper is a variant of backpropagation through time (BPTT) to handle the vanishing gradient problem. [27] used the policy gradient (PG) approach for financial trading. Its main contribution is to analyze the advantages of the LSTM network structure over a fully connected network and to compare the impact of some combinations of technical indicators on revenue. The drawback of the actor-only method is that, because the policy is learned directly from interaction, the number of interactions between the agent and the environment is greatly increased compared with a value-based method. As a result, training is very time-consuming.
2.3 Actor-Critic method
The actor-critic method contains two parts: an actor, which selects actions based on probability or deterministically, and a critic, which evaluates the score of the action taken by the actor. The actor-critic method combines the advantages of the value-based and policy-based methods: it can deal with both continuous and discrete problems and can carry out one-step updates to improve learning efficiency. [28] exploited the Gated Recurrent Unit (GRU) to extract financial features and proposed a critic-only method called the Gated Deep Q-learning trading strategy (GDQN) and an actor-critic method called the Gated Deterministic Policy Gradient trading strategy (GDPG). Experimental results show that GDQN and GDPG outperformed the Turtle trading strategy [29] and the DRL trading strategy [26], and the performance of GDPG is more stable than that of GDQN in the ever-evolving stock market. [30] adopted three versions of RL algorithms based on Deep Deterministic Policy Gradient (DDPG), Proximal Policy Optimization (PPO), and Policy Gradient (PG) for portfolio management. A so-called adversarial training was proposed to reach a robust result and to consider possible risks more carefully when optimizing the portfolio. [16] proposed a novel reinforcement learning algorithm for portfolio optimization using Proximal Policy Optimization (PPO). An essential characteristic of the proposed model is that the number of assets is dynamic. Experiments on a cryptocurrency market showed that the proposed algorithm outperformed three state-of-the-art algorithms presented by [19, 31, 32]. [33] compared the performance of deep double Q-learning and proximal policy optimization (PPO) with several benchmark execution policies and found that PPO realizes the lowest total implementation shortfall across exchanges and currency pairs.
To sum up, these studies used various RL algorithms for financial trading problems in different settings. The existing literature also has the following problems. First, most works determine the optimal action (portfolio decision) based on the mean of the portfolio cumulative return (for example, the Q-value) and pay less attention to risk. Although [23, 28, 16] consider a risk factor in the reward function, these approaches cannot measure the risk of the portfolio cumulative return. Second, short selling is allowed in several strategies [34, 26, 13, 24], but these strategies consider trading only one asset, which is often inconsistent with reality; most investors hold multiple assets to diversify risk. Although some strategies [31, 13, 12] consider trading various assets, short selling is not allowed, which means that investors cannot make profits in a bear market. Third, the interaction between agent and environment is very time-consuming during training, and most existing literature does not consider how to improve the training speed. The algorithm proposed in this paper effectively overcomes the shortcomings mentioned above.
3 Preliminaries
In this section, we introduce some preliminaries about the MDP and elaborate on the state space, action space, and reward function of the R3L algorithm.
3.1 Markov decision process
The Markov decision process (MDP) is a mathematical model for sequential decision problems, which is used to simulate the random strategies and rewards that agents can realize in an environment where the system state has the Markov property. Portfolio optimization is a typical sequential decision problem: investors dynamically adjust portfolio weights according to market information. Thus, we can consider the portfolio allocation problem as a Markov decision process. If a state transition is Markovian, the next state only depends on the current state and has nothing to do with earlier states. A finite MDP (as considered here) is a four-tuple, denoted by $(S, A, R, P)$,
where:
S is a finite set of states.
A is a finite set of actions (and $A_s$ is the finite set of actions available from state $s$).
$R(s, a)$ is the reward function.
$P(s' \mid s, a)$ is the state transition probability.
The interaction between an agent and the financial environment produces a trajectory $\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \ldots)$. $G_t$ denotes the discounted cumulative reward that the agent can obtain from time $t$, expressed as follows:

$$G_t = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k+1} \tag{1}$$

where $\gamma \in [0, 1]$ is the discount rate.

To learn the optimal strategy, we use value functions. There are two types of value functions in reinforcement learning: the state value function, denoted by $V^{\pi}(s)$, and the action-value function, denoted by $Q^{\pi}(s, a)$. The state value function, shown in Eq.(2), represents the expectation of cumulative rewards starting from a certain state $s$.

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[G_t \mid s_t = s\right] \tag{2}$$

The action-value function, given in Eq.(3), is the expected return obtained after the agent executes action $a$ in the current state $s$.

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[G_t \mid s_t = s,\, a_t = a\right] \tag{3}$$
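To make Eq.(1) concrete, the short helper below computes the discounted cumulative reward for every step of a finished trajectory; it is a generic sketch rather than part of the implementation described later in the paper.

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute G_t of Eq.(1) for every time step of a finished trajectory."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g      # G_t = r_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns

# Example: three one-step rewards discounted at gamma = 0.9.
print(discounted_returns([1.0, 0.0, 2.0]))   # [2.62, 1.8, 2.0]
```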
3.2 State space
An important and difficult problem in algorithmic trading is the low observability of the market environment: the information available to investors is extremely limited compared to the complexity of the market. Therefore, how to deal with this limited information is very important. In this paper, the state at period $t$, denoted by $s_t$, consists of three types of variables: the historical data of the assets ($X_t$), the portfolio weights ($w_{t-1}$), and the trading time step ($t$). The input of the actor network is the historical data, and the input of the critic network is the concatenation of the historical data, the portfolio weights, and the time index. The historical data of the selected portfolio consist of raw data and technical indicators. The raw data include the open, close, high, and low prices and the volume (OCHLV). The technical indicators consist of a list of candlestick patterns, including bearish, bullish, significant, hammer, inverse hammer, bullish engulfing, piercing line, morning star, bullish harami, hanging man, shooting star, bearish engulfing, evening star, three black crows, dark cloud cover, bearish harami, doji, and spinning top; see [35] for details. All historical data can be represented by a tensor of dimension $n \times g \times h$, where $n$ is the number of risky assets, $g$ is the number of features per asset, and $h$ denotes the window size, i.e., the number of latest feature values the agent can observe. The portfolio weight vector, also called the portfolio vector, gives the percentage of the total portfolio represented by each single asset. The rationale for adding a time index to the critic network input is that it reflects the time value of money.

$$s_t = \left(X_t,\; w_{t-1},\; t\right) \tag{4}$$
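The sketch below assembles the state described above from a pre-computed feature array; the array layout, function name, and dimensions are assumptions for illustration only.

```python
import numpy as np

def build_state(history: np.ndarray, weights: np.ndarray, t: int, h: int = 60):
    """Assemble s_t = (X_t, w_{t-1}, t) from a feature array of shape
    (n_assets, n_features, T) holding OCHLV data and candlestick indicators."""
    X_t = history[:, :, t - h:t]        # latest h observations, shape (n, g, h)
    return {"features": X_t, "weights": weights, "time": t}

# Example with 4 risky assets, 23 features per asset, and a 60-step window;
# the weight vector also includes the risk-free asset.
hist = np.random.rand(4, 23, 500)
state = build_state(hist, weights=np.full(5, 0.2), t=200)
```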
3.3 Action space
At each time step $t$, the agent executes a trading action resulting from the actor network. Specifically, the agent needs to re-determine the optimal portfolio weights according to the updated state at the end of each period. In addition, to realize short selling, the output of the actor network is modified as follows. Firstly, to meet the self-financing condition, we use the softmax function to transform the output of the actor network into a new variable, denoted by $\tilde{w}_i$. This transformation is given in Eq.(5).

$$\tilde{w}_i = \frac{\exp(o_i)}{\sum_{j=1}^{n} \exp(o_j)}, \qquad i = 1, \ldots, n \tag{5}$$

where $o_i$ is the $i$-th element of the initial output of the actor network.

The shortcoming of this transformation is that the portfolio weight of each asset is greater than zero, which makes short selling impossible. Short selling occurs when an investor borrows a security and sells it on the open market, planning to repurchase it later for less money. It allows investors not only to benefit from a bear market but also to use the capital proceeds to overweight the long-only component of the portfolio. Different from the existing literature, we realize short selling of assets through the transformation in Eq.(6), in which $w_i$ represents the proportion of asset $i$ after the transformation and $\delta$ (called delta) is an adjustment parameter.

$$w_i = \delta\, \tilde{w}_i + \frac{1 - \delta}{n}, \qquad i = 1, \ldots, n \tag{6}$$

The self-financing constraint $\sum_{i=1}^{n} w_i = 1$ is still satisfied after this linear transformation, and the transformation can also realize short selling. In particular, $\delta = 1$ indicates no change of the portfolio weights, while $\delta = 0$ indicates that the portfolio weight of each asset is equal to $1/n$. Furthermore, in this paper, our portfolio includes four risky assets and a risk-free asset ($n = 5$), and the adjustment parameter is set to 3 ($\delta = 3$), so the portfolio weight of a single asset ranges from -40% to 260%, which means the maximum proportion of short selling in total assets is 160%.
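The following sketch implements the two-step mapping of Eq.(5) and Eq.(6) as reconstructed above, so that the resulting weights always sum to one while individual weights may be negative; it is an illustration, not the authors' code.

```python
import numpy as np

def portfolio_weights(actor_output: np.ndarray, delta: float = 3.0) -> np.ndarray:
    """Map raw actor outputs to portfolio weights via Eq.(5) and Eq.(6)."""
    o = actor_output - actor_output.max()       # numerically stable softmax
    w_tilde = np.exp(o) / np.exp(o).sum()       # Eq.(5): positive weights summing to 1
    n = len(actor_output)
    w = delta * w_tilde + (1.0 - delta) / n     # Eq.(6): allows negative (short) weights
    return w                                    # still sums to 1 (self-financing)

# With n = 5 and delta = 3, each weight lies between -40% and 260%.
w = portfolio_weights(np.array([0.2, -1.0, 0.5, 0.0, 1.3]))
print(w, w.sum())
```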
3.4 Rewards
The reward reflects the performance of the agent's action. Let $v_t$ denote the portfolio value at the end of period $t$. The reward, denoted by $r_t$, is defined as the portfolio return in period $t$, computed as follows:

$$r_t = \frac{v_t - v_{t-1}}{v_{t-1}} \tag{7}$$

If we do not take transaction costs into account, the portfolio value at the end of period $t$ equals the portfolio value at the end of period $t-1$ plus the portfolio gain over the period:

$$v_t = v_{t-1} + v_{t-1} \sum_{i=1}^{n} w_{i,t-1}\, \rho_{i,t} \tag{8}$$

where $\rho_{i,t}$ denotes the return of asset $i$ in period $t$.

If we consider transaction costs, the portfolio value at the end of period $t$ equals $v_{t-1}$ plus the portfolio gain minus the transaction costs:

$$v_t = v_{t-1} + v_{t-1} \sum_{i=1}^{n} w_{i,t-1}\, \rho_{i,t} - \sum_{i=1}^{n}\left(c_1\, y^{s}_{i,t}\, p_{i,t}\, x^{s}_{i,t} + c_2\, y^{b}_{i,t}\, p_{i,t}\, x^{b}_{i,t}\right) \tag{10}$$
where:

$c_1$ and $c_2$ are the transaction cost rates for selling and buying, respectively;

$y^{s}_{i,t}$ and $y^{b}_{i,t}$ are dummy variables indicating whether asset $i$ is sold or bought at the end of period $t$;

$x^{s}_{i,t}$ and $x^{b}_{i,t}$ denote the trade sizes (in shares) of asset $i$ sold and bought at the end of period $t$, which should be greater than or equal to zero, and $p_{i,t}$ is the price of asset $i$ at the end of period $t$.

In addition, the number of shares of asset $i$ held by the investor at the end of period $t$, represented by $h_{i,t}$, satisfies the following equation:

$$h_{i,t} = h_{i,t-1} + x^{b}_{i,t} - x^{s}_{i,t} \tag{11}$$

Obviously, $h_{i,t}$ equals the holding of asset $i$ at the end of period $t-1$ plus the shares bought ($x^{b}_{i,t}$) minus the shares sold ($x^{s}_{i,t}$). Since it does not make sense to buy and sell the same asset simultaneously, which only increases transaction costs, at least one of $x^{b}_{i,t}$ and $x^{s}_{i,t}$ must be zero.
If the target portfolio weights $w_t$, the asset prices $p_{i,t}$, and the previous holdings $h_{i,t-1}$ are given, we can adjust the trading decisions to maximize the portfolio value $v_t$. This optimization problem can be formulated as a nonlinear program whose objective is to maximize $v_t$ and whose decision variables are $x^{b}_{i,t}$, $x^{s}_{i,t}$, $y^{b}_{i,t}$, and $y^{s}_{i,t}$. The nonlinear programming model can be expressed as follows:

$$\begin{aligned} \max_{x^{b}_{i,t},\, x^{s}_{i,t},\, y^{b}_{i,t},\, y^{s}_{i,t}} \quad & v_t \\ \text{s.t.} \quad & h_{i,t} = h_{i,t-1} + x^{b}_{i,t} - x^{s}_{i,t}, \\ & h_{i,t}\, p_{i,t} = w_{i,t}\, v_t, \\ & x^{b}_{i,t} \ge 0,\; x^{s}_{i,t} \ge 0,\; y^{b}_{i,t},\, y^{s}_{i,t} \in \{0, 1\}, \qquad i = 1, \ldots, n \end{aligned} \tag{12}$$
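For illustration, the toy function below performs one rebalancing step in the spirit of Eqs.(10)-(11): it converts target weights into share trades and charges proportional costs on the buy and sell legs. The function name, the starting-holdings argument, and the cost rates are hypothetical choices, not the paper's exact procedure.

```python
import numpy as np

def rebalance(value, prices, h_old, w_target, c_sell=0.0002, c_buy=0.0002):
    """One illustrative rebalancing step consistent with Eqs.(10)-(11)."""
    h_new = value * w_target / prices                 # shares implied by the target weights
    buys = np.maximum(h_new - h_old, 0.0)             # x^b_{i,t}
    sells = np.maximum(h_old - h_new, 0.0)            # x^s_{i,t}
    cost = c_buy * (buys * prices).sum() + c_sell * (sells * prices).sum()
    return h_new, value - cost                        # updated holdings and net portfolio value

prices = np.array([100.0, 50.0, 20.0, 10.0])
holdings, net_value = rebalance(10_000.0, prices,
                                h_old=np.array([25.0, 50.0, 125.0, 250.0]),
                                w_target=np.array([0.4, 0.3, 0.2, 0.1]))
```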
4 Methodology
The first part of this section describes the risk-return reinforcement learning (R3L) algorithm in detail. The second part introduces the architecture of the neural networks. The third part presents the Ape-x distributed framework.
4.1 Proposed algorithm
The algorithm adopted in this paper is based on the actor-critic architecture, which includes an actor network and a critic network. In addition, each network has its corresponding target network, so the algorithm includes four networks in total: the actor network, denoted by $\mu(s; \theta^{\mu})$, the critic network, denoted by $Z(s, a; \theta^{Z})$, the target actor network, denoted by $\mu'(s; \theta^{\mu'})$, and the target critic network, denoted by $Z'(s, a; \theta^{Z'})$. In classical actor-critic architectures such as A3C, TD3, and DDPG, the actor network is updated by maximizing the expectation of cumulative rewards, and the critic network is updated by minimizing the error between the evaluated value and the target value. The algorithm proposed in this paper is different: inspired by distributional reinforcement learning (DRL), initially proposed by [36], we estimate the distribution, rather than the expectation, of cumulative rewards with the critic network.
Distributional reinforcement learning is a class of reinforcement learning algorithms that learns the distribution of cumulative rewards. The distributional Bellman operator $\mathcal{T}$ is shown in Eq.(13).

$$\mathcal{T} Z(s, a) \overset{D}{=} R(s, a) + \gamma\, Z(s', a') \tag{13}$$

where $Z(s, a)$ represents the cumulative return obtained by taking action $a$ in state $s$, which is a random variable, and $R(s, a)$ is the reward function.
Under the Wasserstein metric, [36] proved that the distributional Bellman operator is a $\gamma$-contraction. The learning task of DRL is to make the estimated distribution and the target distribution as similar as possible. Following [15], we utilize quantile regression to estimate the network parameters. Quantile regression projects the distributional Bellman update onto a quantile distribution, i.e., a parameterized quantile distribution is used to approximate the value distribution. Let $\{q_i(s, a; \theta^{Z})\}_{i=1}^{N}$, the output of the critic network, denote the $N$ quantiles of $Z(s, a)$, estimated at the quantile midpoints $\hat{\tau}_i = \frac{2i - 1}{2N}$. The target distribution is shown in Eq.(14), which can be regarded as the ground truth.

$$y_j = r + \gamma\, q_j\big(s', \mu'(s'; \theta^{\mu'});\, \theta^{Z'}\big), \qquad j = 1, \ldots, N \tag{14}$$
The loss function of the critic network is defined in Eq.(15):

$$L(\theta^{Z}) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \rho_{\hat{\tau}_i}\big(y_j - q_i(s, a; \theta^{Z})\big) \tag{15}$$

where

$$\rho_{\tau}(u) = u\,\big(\tau - \mathbb{1}\{u < 0\}\big) \tag{16}$$

Because $\rho_{\tau}$ is not differentiable at zero, we use the Huber loss function, given in Eq.(17), in its place:

$$L_{\kappa}(u) = \begin{cases} \frac{1}{2} u^2, & \text{if } |u| \le \kappa \\[2pt] \kappa\left(|u| - \frac{1}{2}\kappa\right), & \text{otherwise} \end{cases} \tag{17}$$

Thus, we obtain a new loss function, also called the quantile Huber loss, expressed as follows:

$$L(\theta^{Z}) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \big|\hat{\tau}_i - \mathbb{1}\{y_j - q_i < 0\}\big|\, \frac{L_{\kappa}\big(y_j - q_i(s, a; \theta^{Z})\big)}{\kappa} \tag{18}$$
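A compact PyTorch sketch of the quantile Huber loss in Eq.(18) is given below; it assumes the critic outputs N quantile estimates per sample and is written against standard PyTorch rather than the authors' code base.

```python
import torch

def quantile_huber_loss(pred_quantiles, target_quantiles, kappa=1.0):
    """Quantile Huber loss of Eq.(18); both inputs have shape (batch, N)."""
    N = pred_quantiles.shape[1]
    tau = (torch.arange(N, dtype=torch.float32) + 0.5) / N        # midpoints tau_i
    # Pairwise TD errors u_{ij} = y_j - q_i, shape (batch, N, N).
    u = target_quantiles.unsqueeze(1) - pred_quantiles.unsqueeze(2)
    huber = torch.where(u.abs() <= kappa,
                        0.5 * u.pow(2),
                        kappa * (u.abs() - 0.5 * kappa))           # Eq.(17)
    loss = (tau.view(1, -1, 1) - (u < 0).float()).abs() * huber / kappa
    return loss.mean()

# Example: a batch of 32 samples with N = 200 quantiles each.
loss = quantile_huber_loss(torch.randn(32, 200), torch.randn(32, 200))
```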
Since the portfolio weight is a continuous variable, we choose a deterministic policy to generate actions, given in Eq.(19):

$$a_t = \mu(s_t; \theta^{\mu}) \tag{19}$$

As mentioned in the introduction, the objective function of portfolio optimization is the weighted sum of the mean and the VaR of the portfolio cumulative return. If we take $N$ uniform quantiles, the mean of the portfolio cumulative return, denoted by $\bar{Z}(s, a)$, is approximately equal to the average of the quantiles of the cumulative return, as shown in Eq.(20):

$$\bar{Z}(s, a) \approx \frac{1}{N} \sum_{i=1}^{N} q_i(s, a; \theta^{Z}) \tag{20}$$

VaR is used to measure risk in this paper. VaR refers to the maximum portfolio loss for a given confidence level over a specific period. It can be described by Eq.(21), in which $R_p$ and $\alpha$ denote the portfolio cumulative return and the confidence level, respectively:

$$P\big(R_p < -\mathrm{VaR}_{\alpha}\big) = 1 - \alpha \tag{21}$$

Because $-\mathrm{VaR}_{\alpha}$ is a quantile of the portfolio cumulative return, in this paper it can also be expressed as follows:

$$\mathrm{VaR}_{\alpha}(s, a) = -\,q_{\lceil (1-\alpha) N \rceil}(s, a; \theta^{Z}) \tag{22}$$
Given the values of $N$ and $\alpha$, we can easily obtain $\mathrm{VaR}_{\alpha}$. For example, if $N = 200$ and $\alpha = 0.95$, then $\mathrm{VaR}_{\alpha} = -q_{10}$; if $N = 100$ and $\alpha = 0.95$, then $\mathrm{VaR}_{\alpha} = -q_{5}$.
According to classical modern portfolio theory, portfolio selection aims to construct an optimal portfolio that maximizes the expected return under a given acceptable risk level ($\epsilon$). This optimization model is shown below:

$$\max_{\theta^{\mu}} \ \bar{Z}\big(s, \mu(s; \theta^{\mu})\big) \qquad \text{s.t.} \quad \mathrm{VaR}_{\alpha}\big(s, \mu(s; \theta^{\mu})\big) \le \epsilon \tag{23}$$

According to the above optimization model, different risk levels correspond to different optimal weights. In other words, investors face a family of portfolio choices, which undoubtedly increases the difficulty of decision-making. To obtain a single optimal solution, the above programming problem can be transformed into a single-objective problem through the following function:

$$J(\theta^{\mu}) = \bar{Z}\big(s, \mu(s; \theta^{\mu})\big) - \zeta\, \mathrm{VaR}_{\alpha}\big(s, \mu(s; \theta^{\mu})\big) \tag{24}$$

where $\zeta$ (called zeta) denotes the risk attitude of the investor: the higher $\zeta$, the higher the risk aversion of the investor, and the more conservative the adopted investment strategy. The actor network is updated by gradient ascent on $J(\theta^{\mu})$.
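The actor objective of Eq.(24) can be computed directly from the critic's quantile outputs, as in the sketch below; it assumes the quantile estimates are (approximately) sorted in increasing order so that a lower quantile can be read off as the VaR.

```python
import torch

def actor_objective(quantiles: torch.Tensor, zeta: float = 0.5, alpha: float = 0.95):
    """Risk-return objective of Eq.(24) from the critic's N quantiles (shape: batch x N)."""
    N = quantiles.shape[1]
    mean_return = quantiles.mean(dim=1)                  # Eq.(20): mean of cumulative return
    k = max(int(round((1 - alpha) * N)), 1)              # index of the (1 - alpha) quantile
    var = -quantiles[:, k - 1]                           # Eq.(22): VaR as a lower quantile
    return (mean_return - zeta * var).mean()             # maximized w.r.t. the actor parameters

# In training, the actor loss would simply be the negative of this objective.
```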
Exploration is crucial for the agent, and a deterministic strategy cannot explore by itself, so we artificially add noise to the output actions. In the DDPG algorithm, [37] used an Ornstein-Uhlenbeck (OU) process as action noise. [38] found that noise drawn from the OU process offered no performance benefit, so in this paper, following [38], we add Gaussian noise, as given in Eq.(25), to each action. The standard deviation of the Gaussian noise decreases exponentially as the number of parameter updates grows.

$$a_t = \mu(s_t; \theta^{\mu}) + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}\big(0, \sigma_k^2\big) \tag{25}$$

where $\sigma_k$ decays exponentially with the number of parameter updates $k$.
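The exploration scheme of Eq.(25) can be sketched as follows; the initial scale and decay rate are illustrative values, not the ones used in the paper.

```python
import numpy as np

def noisy_action(actor, state, update_step, sigma0=0.2, decay=1e-4):
    """Gaussian exploration noise of Eq.(25) with an exponentially decaying scale."""
    a = np.asarray(actor(state))
    sigma = sigma0 * np.exp(-decay * update_step)   # shrink the noise as training progresses
    return a + np.random.normal(0.0, sigma, size=a.shape)
```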
4.2 Neural network
We need to design the neural network structure to explore functional patterns and extract informative features. Based on the time-series nature of stock data, the gated recurrent unit (GRU) is utilized to construct an informative feature representation. The GRU has two gates, i.e., a reset gate and an update gate. These two gates determine which information is ultimately used as the output of the gated recurrent unit. They are unique in that they can preserve information over long sequences; the information is neither cleared over time nor discarded as irrelevant to the prediction. Intuitively, the reset gate determines how to combine the new input information with the previous memory, while the update gate defines the amount of prior memory carried over to the current time step. The actor and critic network structures are shown in Fig.1 and Fig.2. Since GRU, LSTM, and RNN have similar structural features, we also use LSTM and RNN networks in the sensitivity analysis.
[Fig. 1 and Fig. 2: structures of the actor network and the critic network]
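A minimal sketch of a GRU-based actor is shown below: it encodes the window of asset features with a two-layer GRU and maps the last hidden state to one raw score per asset (the $o_i$ fed into Eq.(5)). The layer sizes and the flattened input layout are assumptions, not the exact architecture of Fig.1.

```python
import torch
import torch.nn as nn

class GRUActor(nn.Module):
    """GRU feature extractor followed by a linear head producing one score per asset."""
    def __init__(self, n_assets: int, n_features: int, hidden: int = 64):
        super().__init__()
        self.gru = nn.GRU(input_size=n_assets * n_features, hidden_size=hidden,
                          num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_assets)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, h, n_assets * n_features), i.e. the window of flattened features
        _, h_n = self.gru(x)
        return self.head(h_n[-1])          # raw scores o_i, one per asset

actor = GRUActor(n_assets=5, n_features=23)
scores = actor(torch.randn(8, 60, 5 * 23))   # batch of 8 windows of length 60
```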
In addition, several authors have leveraged convolutional neural networks (CNN) to implement financial trading strategies [39, 40], treating the stock trading problem as a computer vision application by using time-series representative images as input. The convolutional neural network has unique advantages in speech recognition and image processing owing to its particular structure with local weight sharing, and its design is closer to an actual biological neural network. Weight sharing reduces the complexity of the network; in particular, images, i.e., multi-dimensional input vectors, can be fed directly into the network, which avoids the complexity of data reconstruction during feature extraction and classification. Because of CNN's strong feature representation ability, we also study the impact of CNN on the performance of the proposed algorithm in the sensitivity analysis.
4.3 Distributed framework
In this paper, to make the training result robust, the agent must interact with the environment over many epochs (or iterations), which is very time-consuming. We utilize the Ape-x architecture, proposed by [17], to speed up the training procedure. This algorithm decouples acting from learning and decomposes the learning process into three parts. In the first part, there are multiple actors. Each actor interacts with its respective environment based on the shared neural network, accumulates experience, and puts it into the shared experience replay memory. We refer to this part, running on CPUs, as acting. In the second part, the (single) learner samples data from the replay memory and then updates the neural network. We refer to this part, running on a GPU, as learning. The third part is mainly responsible for data transmission. We refer to this part, running on CPUs, as communication.
We use a multiprocessing method to implement Ape-x. Specifically, there are 22 parallel processes, of which 20 are responsible for interacting with the environment, one is responsible for updating network parameters, and one is responsible for data exchange. The general architecture of the proposed method and the pseudocode of the algorithm are shown in Fig.3 and Table 1.
[Fig. 3: general architecture of the proposed distributed framework]
Table 1: Pseudocode of the R3L algorithm with the Ape-x framework

Learner
Input: batch size M, number of actors K, replay size R, exploration constant, learning rates
1: Initialize the network weights ($\theta^{\mu}$, $\theta^{Z}$) and the target network weights ($\theta^{\mu'}$, $\theta^{Z'}$)
2: Launch K actors and replicate the network weights to each actor
3: for t = 1, 2, …, T do
4:   Sample M transitions from the replay buffer
5:   Compute the N quantiles of $Z(s, a)$: $q_i(s, a; \theta^{Z})$
6:   Construct the target distribution $y_j$ according to Eq.(14)
7:   Update the parameters of the critic network ($\theta^{Z}$) by minimizing the quantile Huber loss in Eq.(18)
8:   Update the parameters of the actor network ($\theta^{\mu}$) by gradient ascent on the objective in Eq.(24)
9:   If t mod $t_{target}$ = 0, softly update the target networks: $\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1-\tau)\theta^{\mu'}$, $\theta^{Z'} \leftarrow \tau \theta^{Z} + (1-\tau)\theta^{Z'}$
10:  If t mod $t_{actor}$ = 0, replicate the network weights to the actors
11: end for

Actor
Input: actor network parameters $\theta^{\mu}$
1: repeat
2:   Sample action $a_t = \mu(s_t; \theta^{\mu}) + \epsilon_t$
3:   Execute action $a_t$, observe reward $r_t$ and next state $s_{t+1}$
4:   Store $(s_t, a_t, r_t, s_{t+1})$ in the replay buffer
5: until the learner finishes
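The multiprocessing skeleton below illustrates how the roles of Table 1 can be wired together with Python's multiprocessing queues; the transition and weight payloads are placeholders, and a real implementation would exchange tensors and network state dictionaries.

```python
import multiprocessing as mp

def actor_proc(actor_id, param_queue, replay_queue):
    """Acting process: interact with its own environment copy and push transitions."""
    weights = param_queue.get()                     # wait for the initial weights
    while True:
        transition = ("s", "a", "r", "s_next")      # placeholder for a collected transition
        replay_queue.put(transition)
        if not param_queue.empty():
            weights = param_queue.get()             # refresh weights when available

def learner_proc(param_queues, replay_queue):
    """Learning process: drain the buffer, update the networks, broadcast new weights."""
    replay = []
    while True:
        while not replay_queue.empty():
            replay.append(replay_queue.get())
        # ... sample a minibatch, apply the updates of Eq.(18) and Eq.(24) ...
        for q in param_queues:
            q.put("new_weights")                    # placeholder for a state_dict

if __name__ == "__main__":
    K = 4                                           # number of acting processes (illustrative)
    replay_q = mp.Queue()
    param_qs = [mp.Queue() for _ in range(K)]
    procs = [mp.Process(target=actor_proc, args=(i, param_qs[i], replay_q)) for i in range(K)]
    procs.append(mp.Process(target=learner_proc, args=(param_qs, replay_q)))
    for p in procs:
        p.start()
```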
5 Experimental setup
This section details the setup of our experiments, including the datasets, performance measures, benchmark models, and technical details. In addition, we make the following assumptions. Firstly, all transactions are made at the close price at the end of each trading period. Secondly, the market is large enough that security prices and the market environment are not affected by our transactions. Thirdly, since frequent adjustment of the portfolio weights would generate large transaction costs, we adjust the portfolio weights once a week. Finally, the investment period is set to one year.
5.1 Datasets
We experiment with two different portfolios. The first portfolio consists of four famous exchange-traded funds (ETFs) in the US market and a risk-free asset. The ETFs portfolio includes the SPDR S&P 500 ETF Trust (SPY), the Invesco QQQ Trust ETF (QQQ), the SPDR Dow Jones Industrial Average ETF (DIA), and the iShares Russell 2000 ETF (IWM). The second portfolio consists of the stocks of four technology companies (ORCL, AAPL, TSLA, and GOOG) and a risk-free asset. All data used in this paper are available on Yahoo Finance. The trading horizon of 13 years is divided into training and testing sets as follows:
5.2 Performance measures
Risk and return are inseparable in the investment decision process. Under normal circumstances, risk is highly correlated with return, and a high return implies high risk. Therefore, the performance measures must cover both aspects. In this article, we use four types of performance measures to evaluate the proposed trading strategy. The first type measures the profitability of the investment strategy, i.e., the total return. The second type measures investment risk, including the standard deviation and VaR. The third type considers both risk and return, including the Sharpe ratio and the Sortino ratio. The last metric is the average turnover. More details about the performance measures are given below.
The total return is the rate of return over the whole investment period. The total return, computed using Eq.(26), includes capital gains, interest, realized distributions, and dividends.

$$TR = \frac{V_T - V_0}{V_0} \tag{26}$$

where $V_0$ is the value of the initial investment and $V_T$ is the value of the portfolio at the end of the investment period.
As mentioned above, VaR measures the maximum loss of a portfolio over a given period at a specified confidence level.
The Sharpe ratio, first proposed by Sharpe, is a measure of risk-adjusted return. This ratio, computed using Eq.(27), reflects how much return is earned per unit of risk in excess of the risk-free return. If the Sharpe ratio is positive, the portfolio's average net value growth rate exceeds the risk-free interest rate.

$$SR = \frac{E(R_p) - R_f}{\sigma_p} \tag{27}$$

where $E(R_p)$ is the expectation of the portfolio return, $R_f$ is the risk-free rate, and $\sigma_p$ is the standard deviation of the portfolio return.
The Sortino ratio is a risk-adjusted metric used to determine the additional return for each unit of downside risk. It is similar to the Sharpe ratio, but the Sortino ratio uses the lower partial standard deviation rather than the total standard deviation, so as to distinguish between adverse and favorable fluctuations. The Sortino ratio can be expressed as follows:

$$Sortino = \frac{E(R_p) - R_f}{\sigma_d} \tag{28}$$

where $\sigma_d$ is the lower partial standard deviation of the portfolio return.
The standard deviation of the portfolio return is calculated as the square root of the portfolio variance.
The average turnover represents the average level of change in the portfolio weights and is defined in Eq.(29).

$$AT = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{n} \sum_{i=1}^{n} \left| w_{i,t} - w_{i,t-1} \right| \tag{29}$$

where $T$ is the investment horizon and $w_{i,t}$ denotes the weight of asset $i$ in investment period $t$.
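The sketch below computes the Sharpe ratio, Sortino ratio, and average turnover from a series of periodic returns and weight vectors; annualization conventions and the exact turnover normalization used in the paper are not specified, so these helpers are indicative only.

```python
import numpy as np

def sharpe(returns, rf=0.00038):
    """Sharpe ratio of Eq.(27) from periodic portfolio returns."""
    excess = np.asarray(returns) - rf
    return excess.mean() / excess.std()

def sortino(returns, rf=0.00038):
    """Sortino ratio of Eq.(28): only downside deviations enter the denominator."""
    excess = np.asarray(returns) - rf
    downside = np.minimum(excess, 0.0)
    return excess.mean() / np.sqrt((downside ** 2).mean())

def average_turnover(weights):
    """Average turnover of Eq.(29) from a (T+1, n) array of weight vectors."""
    w = np.asarray(weights)
    return np.abs(np.diff(w, axis=0)).mean()

r = np.array([0.012, -0.005, 0.020, 0.001])
print(sharpe(r), sortino(r))
```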
5.3 Benchmark models
To analyze the effectiveness of the proposed strategy, some benchmark strategies, summarised hereafter, are selected for comparison.
B&H (buy-and-hold) is used as a benchmark by many researchers for comparison with their proposed strategies. We suppose that the holding proportion of each of the five assets is 20% in the B&H strategy and remains unchanged throughout the investment period.
S&H (sell-and-hold) is also widely used as a benchmark strategy. We assume that in the S&H strategy the holding proportion of each of the four risky assets is -25% and the proportion of the risk-free asset is 200%, all of which remain unchanged throughout the investment period.
According to the efficient market hypothesis (EMH), all valuable information is timely, accurately, and fully reflected in the stock price, including the current and future value of the enterprise. Without market manipulation, investors cannot obtain excess profits above the market average by analyzing past prices, so any trading strategy based only on historical data should not perform differently from a randomly selected strategy. The RN (random) benchmark therefore selects the portfolio weights at random in each period.
The mean-variance (MV) model, introduced by Markowitz in 1952, aims to find the best portfolio using only the first two moments of the return distribution. Suppose there are $n$ assets, $\mu_R$ is the vector of expected returns, $w$ is the weight vector, $\Sigma$ is the covariance matrix of returns, $\lambda$ is the risk-aversion coefficient, and $\mathbf{1}$ is the $n$-dimensional unit vector. We establish the following optimization model based on utility maximization:

$$\max_{w} \quad w^{\top} \mu_R - \frac{\lambda}{2}\, w^{\top} \Sigma\, w \tag{30}$$

$$\text{s.t.} \quad w^{\top} \mathbf{1} = 1 \tag{31}$$
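For the mean-variance benchmark, the closed-form solution of Eqs.(30)-(31) under the budget constraint alone can be written as below; the paper may impose additional constraints in practice, so this is a minimal sketch.

```python
import numpy as np

def mean_variance_weights(mu, Sigma, risk_aversion=1.0):
    """Maximize w'mu - (lambda/2) w'Sigma w subject to sum(w) = 1 (Eqs.(30)-(31));
    no sign constraint is imposed, so short positions are allowed."""
    inv = np.linalg.inv(Sigma)
    ones = np.ones(len(mu))
    w_unc = inv @ mu / risk_aversion                      # unconstrained optimum
    # Shift along inv @ ones so that the weights sum to one.
    w = w_unc + (1.0 - ones @ w_unc) / (ones @ inv @ ones) * (inv @ ones)
    return w

mu = np.array([0.05, 0.03, 0.04])
Sigma = np.diag([0.04, 0.02, 0.03])
print(mean_variance_weights(mu, Sigma, risk_aversion=2.0))
```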
5.4 Technical details
We obtained the time window size and other hyper-parameters, including the replay buffer size, batch size, discount factor, etc., through several rounds of tuning. In addition, we need to set some other hyper-parameters before training. The risk-free rate of return is assumed to be 2% per year, equivalent to a weekly return of 0.038%. The transaction cost of buying and selling an asset is set to 0.02%. The risk attitude parameter ($\zeta$) and the short selling parameter ($\delta$) are set to 0.5 and 3, respectively; because these two parameters have significant impacts on investment decisions and portfolio returns, we perform a sensitivity analysis on them in subsection 6.3. The networks are trained with the ADAM optimizer with a learning rate of $10^{-5}$. The activation function is the Leaky ReLU. All parameters are summarized in Table 2. Finally, the algorithms proposed in this paper are implemented in Python 3.7.11 using PyTorch 1.10.2 and were run on a PC with a sixteen-core 2.50 GHz CPU, 16 GB RAM, and an NVIDIA GeForce RTX 3060 GPU.
Table 2: Hyper-parameter settings

hyper-parameter | value | hyper-parameter | value
---|---|---|---
time window size (h) | 60 | replay memory | 2000
learning rate (lr) | 1e-5 | number of parameter updates | 80000
batch size | 32 | discount factor ($\gamma$) | 0.9
confidence level of VaR ($\alpha$) | 0.95 | parameter of Huber loss ($\kappa$) | 1.0
short selling parameter ($\delta$) | 3.0 | soft update parameter ($\tau$) | 0.5
initial money | 10000 | risk-free rate | 3.8e-4
risk attitude parameter ($\zeta$) | 0.5 | number of layers for GRU | 2
output number of critic network (N) | 200 | transaction cost | 0.0020
6 Results and discussion
6.1 The ETFs portfolio
The first detailed analysis concerns the execution of the proposed algorithm on the ETFs portfolio. Fig.4 illustrates the average performance of the proposed algorithm on both the training and testing sets as the number of parameter updates grows. It can be noticed that the cumulative return of the ETFs portfolio tends to converge on both the training and testing sets after about 20000 parameter updates. Moreover, we note that the convergence value of the cumulative return on the testing set is greater than on the training set. On the other hand, the risk levels measured by VaR are almost identical on the training and testing sets. One possible explanation is that, as pointed out by [41], the training and testing sets do not share identical distributions, and the testing period simply corresponds to an easier-to-trade and more profitable market. Although the training and testing sets do not share identical distributions, the training set is representative enough for the R3L algorithm to achieve good results in testing, so we can still use historical data to train the model and then use it for future trading. Finally, from the perspective of the Sharpe ratio, the performance of the R3L algorithm on the testing set is still better than on the training set.
[Fig. 4: average performance of the proposed algorithm on the training and testing sets (ETFs portfolio)]
Fig.5 illustrates the portfolio value trend when applying the proposed algorithm and benchmark strategies in the testing set. We observe that the final portfolio value of the proposed algorithm is 19.51% higher than the S&H strategy, 18.0% higher than the RN strategy, 8.1% higher than the MV strategy, and 7.9% higher than the B&H strategy.
[Fig. 5: portfolio value trend of the R3L algorithm and the benchmark strategies in the testing set]
Table 3 further presents the performance measures obtained with the R3L algorithm and the benchmark strategies. The R3L algorithm is optimal regarding risk, return, and overall performance. The Sharpe ratio of the proposed algorithm is 19.51% higher than the S&H strategy, 18.0% higher than the RN strategy, 8.1% higher than the MV strategy, and 18.0% higher than the B&H strategy. In addition, as the stock market was primarily bullish throughout the test period, the B&H strategy outperformed the other benchmark strategies most of the time, while the performance of the S&H strategy was the worst. The performance of the RN strategy is also poor; since we did not consider the companies' financial statements or internal non-public information, this result does not imply that the efficient market hypothesis is untenable.
Table 3: Performance of the R3L algorithm and the benchmark strategies (TR: total return; SD: standard deviation; SR1: Sharpe ratio; VaR: value at risk; SR2: Sortino ratio; AT: average turnover)

ETFs portfolio

Strategy | TR | SD | SR1 | VaR | SR2 | AT
---|---|---|---|---|---|---
B&H | 5.00% | 0.0264 | 0.0889 | 0.0374 | 0.1260 | 0.0000
S&H | -5.24% | 0.0330 | -0.0870 | 0.0419 | -0.1392 | 0.0000
RN | -4.00% | 0.0329 | -0.0414 | 0.0435 | -0.0541 | 0.3291
MV | 4.86% | 0.0605 | 0.0968 | 0.0875 | 0.1128 | 0.0929
R3L | 13.32% | 0.0266 | 0.1100 | 0.0338 | 0.1681 | 0.0966

Stock portfolio

Strategy | TR | SD | SR1 | VaR | SR2 | AT
---|---|---|---|---|---|---
B&H | 10.09% | 0.0317 | 0.0780 | 0.0203 | 0.1173 | 0.0000
S&H | -20.62% | 0.0396 | -0.1280 | 0.0575 | -0.1657 | 0.0000
RN | -8.76% | 0.0423 | -0.0469 | 0.0584 | -0.0536 | 0.3046
MV | 6.75% | 0.0947 | 0.0893 | 0.0115 | 0.1718 | 0.0929
R3L | 14.47% | 0.0298 | 0.0943 | 0.0369 | 0.1571 | 0.0849
6.2 The stock portfolio
The same detailed analysis is performed on the stock portfolio, which shows different characteristics compared to the ETFs portfolio. Fig.6 illustrates the average performance of the proposed algorithm on both the training and testing sets. We observe that the cumulative return of the stock portfolio is almost identical on the training and testing sets after 20000 parameter updates. At the same time, the risk level (VaR) is higher on the training set than on the testing set, which is very different from the ETFs portfolio. Finally, from the Sharpe ratio perspective, the algorithm's performance on the training set is better than on the testing set. Fig.5 illustrates the portfolio value trend of the different strategies. Similar to the ETFs portfolio, the R3L strategy has the highest cumulative value at the end of the investment period. Table 3 presents the performance measures of the different trading strategies; from the perspective of the Sharpe ratio, Sortino ratio, and VaR, the R3L algorithm is also optimal.
[Fig. 6: average performance of the proposed algorithm on the training and testing sets (stock portfolio)]
6.3 Sensitivity analysis
In this paper, delta ($\delta$) and zeta ($\zeta$) are two critical parameters that affect the overall performance of the proposed algorithm. Delta determines the maximum ratio of short selling allowed; the larger delta is, the more profit opportunities investors have in a bear market. Zeta represents the investors' risk attitude; the larger zeta is, the higher the risk aversion, and the more conservative the adopted investment strategy. Finally, as mentioned in subsection 4.2, different network structures have different feature extraction abilities, so it is necessary to analyze the impact of the network structure on the algorithm's effectiveness in order to select the best one. We therefore conduct a sensitivity analysis of delta, zeta, and the network structure for the R3L strategy, evaluating the performance of the proposed model for different values of delta and zeta and for different network structures.
Fig.7 illustrates the portfolio value trend when applying the R3L algorithm with different values of delta. The portfolio value shows an upward trend over time for both the ETFs and stock portfolios when delta equals 1 or 3; in these cases, investors obtain positive returns at the end of the investment period. In contrast, in the other cases, the portfolio value shows significant volatility over time, and investors receive negative returns at the end of the investment period. Table 4 further shows the overall performance of the proposed strategy for different delta values. From the perspective of total return, Sharpe ratio, and Sortino ratio, $\delta = 3$ is optimal for the ETFs portfolio, and $\delta = 1$ is optimal for the stock portfolio.
It can be seen that a larger maximum short-selling ratio is not always better: although short selling can provide investors with profit opportunities in a bear market, it also brings more risk. Therefore, it is essential to control the scale of short selling appropriately.
[Fig. 7: portfolio value trend of the R3L algorithm for different values of delta]
Table 4: Performance of the R3L strategy for different values of delta

ETFs portfolio

delta | TR | SD | SR1 | VaR | SR2 | AT
---|---|---|---|---|---|---
1 | 9.70% | 0.0266 | 0.0815 | 0.0367 | 0.1177 | 0.0247
3 | 11.16% | 0.0270 | 0.0912 | 0.0362 | 0.1391 | 0.1029
6 | -0.05% | 0.0323 | 0.0060 | 0.0484 | 0.0183 | 0.2362
9 | -3.59% | 0.0349 | -0.0346 | 0.0605 | -0.0452 | 0.3681
15 | -15.74% | 0.0383 | -0.1034 | 0.0620 | -0.1302 | 0.6059

Stock portfolio

delta | TR | SD | SR1 | VaR | SR2 | AT
---|---|---|---|---|---|---
1 | 18.85% | 0.0332 | 0.1117 | 0.0432 | 0.1962 | 0.0288
3 | 15.26% | 0.0288 | 0.0991 | 0.0346 | 0.1834 | 0.0941
6 | 5.24% | 0.0276 | 0.0220 | 0.0344 | 0.0827 | 0.1995
9 | 3.66% | 0.0350 | 0.0300 | 0.0478 | 0.0514 | 0.2721
15 | 0.83% | 0.0532 | -0.0095 | 0.0845 | -0.0170 | 0.4229
Fig.8 illustrates the portfolio value trend when applying the R3L algorithm with different values of zeta, and Table 5 further shows the overall performance of the proposed strategy. It can be noticed that the change in portfolio value over time is similar for different values of zeta. At the end of the investment period, the portfolio value of the ETFs portfolio is greatest when zeta equals 3; from the Sharpe and Sortino ratio perspective, $\zeta = 3$ is also optimal. The portfolio value of the stock portfolio is greatest when zeta equals 4; from the perspective of the Sharpe ratio and Sortino ratio, $\zeta = 4$ is also optimal. Nevertheless, the VaR is the lowest when zeta equals 2 for both the ETFs and stock portfolios.
From the above analysis, it can be seen that a higher risk aversion coefficient does not necessarily lead to a smaller return and risk level, which contradicts classical portfolio theory. One possible explanation is that our algorithm can predict the stock price trend from historical data and thereby improve the portfolio's return while reducing its risk. To sum up, the algorithm proposed in this article can enhance portfolio returns and reduce risk.
[Fig. 8: portfolio value trend of the R3L algorithm for different values of zeta]
Table 5: Performance of the R3L strategy for different values of zeta

ETFs portfolio

zeta | TR | SD | SR1 | VaR | SR2 | AT
---|---|---|---|---|---|---
0.5 | 12.58% | 0.0267 | 0.1029 | 0.0341 | 0.1576 | 0.0991
1.0 | 5.23% | 0.0275 | 0.0453 | 0.0383 | 0.0703 | 0.1122
2.0 | 5.10% | 0.0270 | 0.0486 | 0.0395 | 0.0814 | 0.1141
3.0 | 5.26% | 0.0305 | 0.0521 | 0.0432 | 0.0854 | 0.1019
4.0 | 7.37% | 0.0264 | 0.0570 | 0.0401 | 0.0887 | 0.0955

Stock portfolio

zeta | TR | SD | SR1 | VaR | SR2 | AT
---|---|---|---|---|---|---
0.5 | 14.11% | 0.0297 | 0.0898 | 0.0372 | 0.1548 | 0.0979
1.0 | 10.53% | 0.0281 | 0.0649 | 0.0383 | 0.1169 | 0.1156
2.0 | 18.09% | 0.0253 | 0.1295 | 0.0344 | 0.2470 | 0.0788
3.0 | 22.66% | 0.0338 | 0.1236 | 0.0448 | 0.2623 | 0.0867
4.0 | 22.83% | 0.0351 | 0.1247 | 0.0445 | 0.2450 | 0.0835
Fig.9 illustrates the portfolio value trend when applying the R3L algorithm with different network structures. It can be observed that the changing trend of portfolio value is consistent across the network structures, and only the final cumulative value differs slightly. GRU and RNN obtain the maximum cumulative value for the ETFs and stock portfolios, respectively. Table 6 further shows the overall performance. We find that GRU performs best for the ETFs portfolio from the perspective of both the Sharpe ratio and the Sortino ratio; for the stock portfolio, RNN performs best from the perspective of the Sortino ratio, while GRU performs best from the perspective of the Sharpe ratio.
[Fig. 9: portfolio value trend of the R3L algorithm for different network structures]
Table 6: Performance of the R3L strategy for different network structures

ETFs portfolio

Network | TR | SD | SR1 | VaR | SR2 | AT
---|---|---|---|---|---|---
GRU | 17.82% | 0.0342 | 0.1089 | 0.0457 | 0.2118 | 0.0865
LSTM | 8.65% | 0.0337 | 0.0433 | 0.0531 | 0.0723 | 0.1060
RNN | 12.88% | 0.0335 | 0.0660 | 0.0455 | 0.1173 | 0.1093
CONV2D | 10.50% | 0.0321 | 0.0680 | 0.0355 | 0.1381 | 0.1258
CONV3D | 14.33% | 0.0352 | 0.0758 | 0.0526 | 0.1584 | 0.0985

Stock portfolio

Network | TR | SD | SR1 | VaR | SR2 | AT
---|---|---|---|---|---|---
GRU | 14.47% | 0.0298 | 0.0943 | 0.0369 | 0.1571 | 0.0849
LSTM | 7.34% | 0.0357 | 0.0594 | 0.0534 | 0.0917 | 0.1110
RNN | 14.75% | 0.0300 | 0.0907 | 0.0431 | 0.2078 | 0.1125
CONV2D | 3.76% | 0.0283 | 0.0377 | 0.0326 | 0.0840 | 0.1541
CONV3D | 13.67% | 0.0360 | 0.0710 | 0.0521 | 0.1460 | 0.0903
7 Conclusion and future work
We propose a novel deep reinforcement learning algorithm, called risk-return reinforcement learning (R3L), to derive a portfolio trading strategy. Compared with the existing literature, the main innovations of this paper are the following: firstly, we construct a portfolio optimization model which is solved by an improved deep reinforcement learning algorithm based on the actor-critic architecture; secondly, we realize short selling of the portfolio through a linear transformation; thirdly, we leverage the Ape-x algorithm to speed up the training process. Backtesting experiments demonstrate that the proposed R3L algorithm is superior to traditional benchmark strategies such as the buy-and-hold, sell-and-hold, random selection, and mean-variance strategies. In addition, although short selling allows investors to profit in a bear market, a larger maximum permitted short-selling ratio is not always better. Therefore, investors should choose the short selling parameter according to their investment objectives, asset types, and other factors to maximize portfolio return. Similarly, appropriate risk attitude parameters and network structures need to be chosen to optimize the portfolio's overall performance.
Future research can be carried out in the following directions. First, according to [42], VaR is not a coherent risk measure, so other risk measures, such as the conditional value at risk (CVaR), could be used to construct the portfolio optimization problem. Second, this paper assumes that the trading volume is small compared to the market size, so the trading behavior of a single agent does not affect the market environment and stock prices; however, trading behavior, even with a small volume, still has a subtle impact on the stock price and market environment, and how to model the market environment is one of the future research directions. Finally, the decision-making process of algorithmic trading can be divided into two parts: selecting the asset types and determining the optimal portfolio weights. Therefore, subsequent research could use hierarchical deep reinforcement learning (HDRL) to handle portfolio optimization problems.
Conflict of interest
The authors declare that they have no conflict of interest regarding the publication of this paper.
References
- Perold [1988] A. F. Perold, The implementation shortfall: Paper versus reality, Journal of Portfolio Management 14, 4 (1988).
- Almgren and Lorenz [2007] R. Almgren and J. Lorenz, Adaptive arrival price, Trading 2007, 59 (2007).
- Berkowitz et al. [1988] S. A. Berkowitz, D. E. Logue, and E. A. Noser Jr, The total cost of transactions on the nyse, The Journal of Finance 43, 97 (1988).
- Madhavan and Panchapagesan [2002] A. N. Madhavan and V. Panchapagesan, The first price of the day, The Journal of Portfolio Management 28, 101 (2002).
- Kolm and Maclin [2011] P. N. Kolm and L. Maclin, Algorithmic trading, optimal execution, and dynamic portfolios (2011).
- Kritzman [2006] M. Kritzman, Are optimizers error maximizers?, The Journal of Portfolio Management 32, 66 (2006).
- Hendershott and Riordan [2013] T. Hendershott and R. Riordan, Algorithmic trading and the market for liquidity, Journal of Financial and Quantitative Analysis 48, 1001 (2013).
- Eriksson and Swartling [2012] E. Eriksson and A. Swartling, in Human Work Interaction Design-HWID2012 (2012), 116–126.
- Jondeau and Rockinger [2003] E. Jondeau and M. Rockinger, Testing for differences in the tails of stock-market returns, Journal of Empirical Finance 10, 559 (2003).
- Tsiakas [2006] I. Tsiakas, Periodic stochastic volatility and fat tails, Journal of Financial Econometrics 4, 90 (2006).
- Franke [2009] R. Franke, Applying the method of simulated moments to estimate a small agent-based asset pricing model, Journal of Empirical Finance 16, 804 (2009).
- Park et al. [2020] H. Park, M. K. Sim, and D. G. Choi, An intelligent financial portfolio trading strategy using deep q-learning, Expert Systems with Applications 158, 113573 (2020).
- Almahdi and Yang [2019] S. Almahdi and S. Y. Yang, A constrained portfolio trading system using particle swarm algorithm and recurrent reinforcement learning, Expert Systems with Applications 130, 145 (2019).
- Markowitz [1952] H. Markowitz, Portfolio selection, The Journal of Finance 7, 77 (1952).
- Dabney et al. [2018] W. Dabney, M. Rowland, M. Bellemare, and R. Munos, in Proceedings of the AAAI Conference on Artificial Intelligence (2018), vol. 32.
- Betancourt and Chen [2021] C. Betancourt and W.-H. Chen, Deep reinforcement learning for portfolio management of markets with a dynamic number of assets, Expert Systems with Applications 164, 114002 (2021).
- Horgan et al. [2018] D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. Van Hasselt, and D. Silver, Distributed prioritized experience replay, arXiv preprint arXiv:1803.00933 (2018).
- Alimoradi and Kashan [2018] M. R. Alimoradi and A. H. Kashan, A league championship algorithm equipped with network structure and backward q-learning for extracting stock trading rules, Applied soft computing 68, 478 (2018).
- Bu and Cho [2018] S.-J. Bu and S.-B. Cho, in International Conference on Intelligent Data Engineering and Automated Learning (Springer, 2018), 468–480.
- Van Hasselt et al. [2016] H. Van Hasselt, A. Guez, and D. Silver, in Proceedings of the AAAI conference on artificial intelligence (2016), vol. 30.
- Chen and Gao [2019] L. Chen and Q. Gao, in 2019 IEEE 10th International Conference on Software Engineering and Service Science (ICSESS) (IEEE, 2019), 29–33.
- Huang [2018] C. Y. Huang, Financial trading as a game: A deep reinforcement learning approach, arXiv preprint arXiv:1807.02787 (2018).
- Lucarelli and Borrotti [2020] G. Lucarelli and M. Borrotti, A deep q-learning portfolio management framework for the cryptocurrency market, Neural Computing and Applications 32, 17229 (2020).
- Jeong and Kim [2019] G. Jeong and H. Y. Kim, Improving financial trading decisions using deep q-learning: Predicting the number of shares, action strategies, and transfer learning, Expert Systems with Applications 117, 125 (2019).
- Zhang et al. [2020] Z. Zhang, S. Zohren, and S. Roberts, Deep reinforcement learning for trading, The Journal of Financial Data Science 2, 25 (2020).
- Deng et al. [2016] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai, Deep direct reinforcement learning for financial signal representation and trading, IEEE transactions on neural networks and learning systems 28, 653 (2016).
- Wang et al. [2016] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, in International conference on machine learning (PMLR, 2016), 1995–2003.
- Wu et al. [2020] X. Wu, H. Chen, J. Wang, L. Troiano, V. Loia, and H. Fujita, Adaptive stock trading strategies with deep reinforcement learning methods, Information Sciences 538, 142 (2020).
- Vezeris et al. [2019] D. Vezeris, I. Karkanis, and T. Kyrgos, Adturtle: An advanced turtle trading system, Journal of Risk and Financial Management 12, 96 (2019).
- Liang et al. [2018] Z. Liang, H. Chen, J. Zhu, K. Jiang, and Y. Li, Adversarial deep reinforcement learning in portfolio management, arXiv preprint arXiv:1808.09940 (2018).
- Jiang et al. [2017] Z. Jiang, D. Xu, and J. Liang, A deep reinforcement learning framework for the financial portfolio management problem, arXiv preprint arXiv:1706.10059 (2017).
- Pendharkar and Cusatis [2018] P. C. Pendharkar and P. Cusatis, Trading financial indices with reinforcement learning agents, Expert Systems with Applications 103, 1 (2018).
- Schnaubelt [2022] M. Schnaubelt, Deep reinforcement learning for the optimal placement of cryptocurrency limit orders, European Journal of Operational Research 296, 993 (2022).
- Bertoluzzo and Corazza [2012] F. Bertoluzzo and M. Corazza, Testing different reinforcement learning configurations for financial trading: Introduction and applications, Procedia Economics and Finance 3, 68 (2012).
- Taghian et al. [2022] M. Taghian, A. Asadi, and R. Safabakhsh, Learning financial asset-specific trading rules via deep reinforcement learning, Expert Systems with Applications 195, 116523 (2022).
- Bellemare et al. [2017] M. G. Bellemare, W. Dabney, and R. Munos, in International Conference on Machine Learning (PMLR, 2017), 449–458.
- Lillicrap et al. [2015] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, Continuous control with deep reinforcement learning, arXiv preprint arXiv:1509.02971 (2015).
- Fujimoto et al. [2018] S. Fujimoto, H. Hoof, and D. Meger, in International conference on machine learning (PMLR, 2018), 1587–1596.
- Carta et al. [2021] S. Carta, A. Corriga, A. Ferreira, A. S. Podda, and D. R. Recupero, A multi-layer and multi-ensemble stock trader using deep learning and deep reinforcement learning, Applied Intelligence 51, 889 (2021).
- Barra et al. [2020] S. Barra, S. M. Carta, A. Corriga, A. S. Podda, and D. R. Recupero, Deep learning and time series-to-image encoding for financial forecasting, IEEE/CAA Journal of Automatica Sinica 7, 683 (2020).
- Théate and Ernst [2021] T. Théate and D. Ernst, An application of deep reinforcement learning to algorithmic trading, Expert Systems with Applications 173, 114632 (2021).
- Rockafellar et al. [2000] R. T. Rockafellar, S. Uryasev, et al., Optimization of conditional value-at-risk, Journal of risk 2, 21 (2000).