
\history

Received November 28, 2021, accepted December 27, 2021, date of current version December 27, 2021. Digital Object Identifier 10.1109/ACCESS.2021.3139510

\corresp

Corresponding author: Gabriel Borrageiro (e-mail: [email protected]).

Reinforcement Learning for Systematic FX Trading

GABRIEL BORRAGEIRO, NICK FIROOZYE, AND PAOLO BARUCCA
Department of Computer Science, University College London, Gower Street, London, WC1E 6BT
Abstract

We explore online inductive transfer learning, with a feature representation transfer from a radial basis function network formed of Gaussian mixture model hidden processing units to a direct, recurrent reinforcement learning agent. This agent is put to work in an experiment, trading the major spot market currency pairs, where we accurately account for transaction and funding costs. These sources of profit and loss, including the price trends that occur in the currency markets, are made available to the agent via a quadratic utility, and the agent learns to target a position directly. We improve upon earlier work by targeting a risk position in an online transfer learning context. Our agent achieves an annualised portfolio information ratio of 0.52 with a compound return of 9.3%, net of execution and funding cost, over a 7-year test set; this is despite forcing the model to trade at the close of the trading day at 5 pm EST, when trading costs are statistically the most expensive.

Index Terms:
policy gradients, recurrent reinforcement learning, online learning, transfer learning, financial time series

I Introduction

Forecasters of financial time series commonly make use of supervised learning. For example, [1] apply both parametric approaches such as nonlinear state-space models and non-parametric approaches such as local learning to nonlinear time series analysis. [2] applies learning algorithms to decision making with financial time series. He notes that the traditional approach in this domain is to train a model using a prediction criterion, such as minimising mean-square prediction error or maximising the likelihood of a conditional model of the dependent variable. He finds that with noisy time series, better results are obtained when the model is trained directly to maximise the financial criterion of interest, here gains and losses (including those due to transactions) incurred during trading.

In this spirit, we extend the earlier work of [3] and [4], where direct, recurrent reinforcement learning agents are put to work in financial trading strategies. Rather than optimising for an intermediate performance measure such as maximal forecast accuracy or minimal forecast error, which is still the traditional approach in this domain, we maximise a more direct performance measure such as quadratic economic utility. An advantage of the approach is that we can use the risk-adjusted returns of the trading strategy, execution cost and funding cost to influence the learning of the model and update model parameters accordingly.

Whereas the focus of [3] was on the use of the differential Sharpe ratio as a performance measure, we adopt the quadratic utility of [5]. This utility ameliorates an undesirable property of the Sharpe ratio, namely that it penalises a model that produces returns larger than $\frac{\mathbb{E}[r^{2}_{t}]}{\mathbb{E}[r_{t}]}$, that is, the ratio of the expectation of squared returns to the expectation of returns [6]. For this reason, along with the use of relatively weak features and shared backtest hyper-parameters, [4] obtained mixed results when experimenting with cash currency pairs. In contrast, our experiment with the major cash currency pairs sees our recurrent reinforcement learning trader achieve an annualised portfolio information ratio of 0.52 with a compound return of 9.3%, net of execution and funding cost, over a seven-year test set. This return is achieved despite forcing the model to trade at the close of the trading day at 5 pm EST, when trading costs are statistically the most expensive.

Aside from the different utility functions, we put these improved experiment results down to a combination of several factors. Firstly, we use more powerful feature engineering in the shape of radial basis function networks. The hidden processing units of these networks have means, covariances and structures that are determined by an unsupervised learning procedure for finite Gaussian mixture models [7]. The approach is a form of continual learning, explicitly inductive, feature representation transfer learning [8], where the knowledge of the mixture model is transferred to upstream models. Secondly, when optimising our utility function with respect to the recurrent reinforcement learner’s parameters, we do so sequentially online during the test set, using an extended Kalman filter optimisation procedure [9]. The earlier work uses less powerful offline batch gradient ascent methods. These methods cope less well with non-stationary financial time series.

[10] modelled the dynamics of financial assets as a jump-diffusion process, which is commonly used in financial econometrics. The jump-diffusion process implies that financial time series should exhibit small, continuous changes over time as well as occasional jumps. Given such nonstationarity, a sensible approach is to allow models to learn continuously.

We finish this section with a description of the layout of this paper. Section II provides preliminary introductions to transfer learning and reinforcement learning via policy gradients and ends with an overview of trading in the foreign exchange market. Section III introduces the experiment methods of this paper, including the targeting of financial risk positions with direct recurrent reinforcement and feature representation transfer via radial basis function networks. The section ends with a description of the baseline models used to compare the results of the marquis model. The marquis model is a feature representation transfer from a radial basis function network to a direct recurrent reinforcement learning agent and is shown visually in figure III-B.

Section IV details the design of the experiment that we conduct on daily sampled foreign exchange pairs. The data is obtained from Refinitiv. We evaluate performance using the annualised information ratio, which is computed on daily returns that are net of transaction and funding costs. The section concludes with a brief description of the hyper-parameters set for the various models. The experiment results are described in section V and are discussed in section VI. Concluding remarks are given in section VII.

II Preliminaries

This section introduces the policy gradient form of reinforcement learning and how it has been put to work empirically in quantitative finance, particularly with automated trading strategies. We finish the section with a short review of more recent work.

II-A Transfer Learning

Transfer learning refers to the machine learning paradigm in which an algorithm extracts knowledge from one or more application scenarios to help boost the learning performance in a target scenario [8]. Typically, traditional machine learning requires significant amounts of training data. Transfer learning copes better with data sparsity by looking at related learning domains where data is sufficient. Even in a big data scenario such as with streaming high-frequency data, transfer learning benefits by learning the adaptive statistical relationship of the predictors and the response. An increasing number of papers focus on online transfer learning [11, 12, 13]. Following Pan and Yang [14], we define transfer learning as:

Definition 1 (transfer learning).

Given a source domain $\mathcal{D}_{S}$ and learning task $\mathcal{T}_{S}$, a target domain $\mathcal{D}_{T}$ and learning task $\mathcal{T}_{T}$, transfer learning aims to help improve the learning of the target predictive function $f_{T}(\cdot)$ in $\mathcal{D}_{T}$ using the knowledge in $\mathcal{D}_{S}$ and $\mathcal{T}_{S}$, where $\mathcal{D}_{S}\neq\mathcal{D}_{T}$ or $\mathcal{T}_{S}\neq\mathcal{T}_{T}$.

In the context of this paper, the source domain $\mathcal{D}_{S}$ represents the feature space, which consists of the daily returns of the 36 currency pairs that are used in our experiment. The source learning task $\mathcal{T}_{S}$ is the unsupervised compression of this feature space into a clustered form that learns its intrinsic nature. The clusters are formed via Gaussian mixture models, and we transfer their output via radial basis function networks to the currency pairs that we wish to trade in the target domain $\mathcal{D}_{T}$. The target learning task $\mathcal{T}_{T}$ is to take financial risk positions in these currency pairs for economic utility maximisation via direct recurrent reinforcement learning.

II-B Policy Gradient Reinforcement Learning

Williams [15] was one of the first to introduce policy gradient methods in a reinforcement learning context. Whereas most reinforcement learning algorithms focus on action-value estimation, learning the value of actions and selecting them based on their estimated values, policy gradient methods learn a parameterised policy that can select actions without using a value function. Williams also introduced his reinforce algorithm

\Delta\boldsymbol{\theta}_{ij}=\eta_{ij}(r-b_{ij})\frac{\partial\ln\pi_{i}}{\partial\boldsymbol{\theta}_{ij}},

where $\boldsymbol{\theta}_{ij}$ is the model weight going from the $j$'th input to the $i$'th output, and $\boldsymbol{\theta}_{i}$ is the weight vector for the $i$'th hidden processing unit of a network of such units, whose goal it is to adapt in such a way as to maximise the scalar reward $r$. For the moment, we exclude the dependence on the time of the weight update to make the notation clearer. Furthermore, $\eta_{ij}$ is a learning rate, typically applied with gradient ascent, and $b_{ij}$ is a reinforcement baseline, conditionally independent of the model outputs $y_{i}$ given the network parameters $\boldsymbol{\theta}$ and inputs $\mathbf{x}_{i}$. The quantity $\partial\ln\pi_{i}/\partial\boldsymbol{\theta}_{ij}$ is known as the characteristic eligibility of $\boldsymbol{\theta}_{ij}$, where $\pi_{i}(y_{i}=c,\boldsymbol{\theta}_{i},\mathbf{x}_{i})$ is a probability mass function determining the value of $y_{i}$ as a function of the parameters of the unit and its input. Baseline subtraction $r-b_{ij}$ plays a vital role in reducing the variance of gradient estimators. Sugiyama [16] shows that the optimal baseline is given as

b^{*}=\frac{\mathbb{E}_{p(r|\boldsymbol{\theta})}\big[r_{t}\big\|\sum_{t=1}^{T}\nabla\ln\pi(a_{t}|s_{t},\boldsymbol{\theta})\big\|^{2}\big]}{\mathbb{E}_{p(r|\boldsymbol{\theta})}\big[\big\|\sum_{t=1}^{T}\nabla\ln\pi(a_{t}|s_{t},\boldsymbol{\theta})\big\|^{2}\big]},

where the policy function $\pi(a_{t}|s_{t},\boldsymbol{\theta})$ denotes the probability of taking action $a_{t}$ at time $t$ given state $s_{t}$, parameterised by $\boldsymbol{\theta}$. The expectation $\mathbb{E}_{p(r|\boldsymbol{\theta})}$ is taken over the probability of rewards given the model parameterisation.

The main result of Williams’ paper is

Theorem 1.

For any reinforce algorithm, the inner product of $\mathbb{E}[\Delta\boldsymbol{\theta}|\boldsymbol{\theta}]$ and $\nabla\mathbb{E}[r|\boldsymbol{\theta}]$ is non-negative, and if $\eta_{ij}>0$, then this inner product is zero if and only if $\nabla\mathbb{E}[r|\boldsymbol{\theta}]=0$. If $\eta_{ij}$ is independent of $i$ and $j$, then $\mathbb{E}[\Delta\boldsymbol{\theta}|\boldsymbol{\theta}]=\eta\nabla\mathbb{E}[r|\boldsymbol{\theta}]$.

This result relates $\nabla\mathbb{E}[r|\boldsymbol{\theta}]$, the gradient in weight space of the performance measure $\mathbb{E}[r|\boldsymbol{\theta}]$, to $\mathbb{E}[\Delta\boldsymbol{\theta}|\boldsymbol{\theta}]$, the average update vector in weight space. Thus for any reinforce algorithm, the average update vector in weight space lies in a direction for which this performance measure is increasing, and the quantity $(r-b_{ij})\,\partial\ln\pi_{i}/\partial\boldsymbol{\theta}_{ij}$ represents an unbiased estimate of $\partial\mathbb{E}[r|\boldsymbol{\theta}]/\partial\boldsymbol{\theta}_{ij}$.

Sutton and Barto [17] demonstrate an actor-critic version of a policy gradient model, where the actor references the learned policy and the critic refers to the learned value function, usually a state-value function. Denote the scalar performance measure as $J(\boldsymbol{\theta})$; the gradient ascent update takes the form

\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}+\eta\nabla J(\boldsymbol{\theta}).

With the one-step actor-critic policy gradient algorithm, one inserts a differentiable policy parameterisation $\pi(a|s,\boldsymbol{\theta})$ and a differentiable state-value function parameterisation $\hat{v}(s,\mathbf{w})$, and then one draws an action

a_{t}\sim\pi(\cdot|s_{t},\boldsymbol{\theta}),

taking action $a_{t}$ and observing a transition to state $s_{t+1}$ with reward $r_{t+1}$. Define

\delta_{t}=r_{t+1}+\gamma\hat{v}(s_{t+1},\mathbf{w}_{t})-\hat{v}(s_{t},\mathbf{w}_{t}),

where $0\ll\gamma\leq 1$ is a discount factor. The critic's weight vector is updated as follows

\mathbf{w}_{t}=\mathbf{w}_{t-1}+\eta_{\mathbf{w}}\delta_{t}\nabla_{\mathbf{w}}\hat{v}(s_{t},\mathbf{w}_{t}),

and finally, the actor's weight vector is updated as

\boldsymbol{\theta}_{t}=\boldsymbol{\theta}_{t-1}+\eta_{\boldsymbol{\theta}}\delta_{t}\nabla\ln\pi(a_{t}|s_{t},\boldsymbol{\theta}).

The actor-critic architecture uses temporal-difference learning combined with trial-and-error learning to improve the learned policy sequentially.
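As a concrete illustration, the following is a minimal sketch of the one-step actor-critic loop with a softmax policy and a linear state-value function; the environment interface (`env.reset`, `env.step` returning a state, reward and done flag) and the feature map `feat` are illustrative assumptions rather than anything specified in [17].

```python
import numpy as np

def one_step_actor_critic(env, feat, n_actions, gamma=0.99,
                          eta_w=0.01, eta_theta=0.001, episodes=100):
    """Minimal one-step actor-critic sketch with a linear v-hat and softmax policy."""
    d = feat(env.reset()).shape[0]
    theta = np.zeros((n_actions, d))   # policy parameters
    w = np.zeros(d)                    # state-value parameters
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            x = feat(s)
            logits = theta @ x
            pi = np.exp(logits - logits.max())
            pi /= pi.sum()                           # softmax policy pi(a|s, theta)
            a = np.random.choice(n_actions, p=pi)
            s_next, r, done = env.step(a)            # assumed environment interface
            x_next = feat(s_next)
            # TD error: delta_t = r_{t+1} + gamma * v(s_{t+1}) - v(s_t)
            delta = r + (0.0 if done else gamma * (w @ x_next)) - w @ x
            # critic update: w += eta_w * delta * grad_w v(s, w)
            w += eta_w * delta * x
            # actor update: theta += eta_theta * delta * grad log pi(a|s, theta)
            grad_log_pi = -np.outer(pi, x)
            grad_log_pi[a] += x
            theta += eta_theta * delta * grad_log_pi
            s = s_next
    return theta, w
```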

II-B1 Policy Gradient Methods in Financial Trading

Moody et al. [6] propose to train trading systems and portfolios by optimising objective functions that directly measure trading and investment performance. Rather than basing a trading system on forecasts or training via a supervised learning algorithm using labelled trading data, they train their systems using a direct, recurrent reinforcement learning algorithm, an example of the policy gradient method. The direct part refers to the fact that the model tries to target a position directly, and the model’s weights are adapted such that the performance measure is maximised. The performance function that they primarily consider is the differential Sharpe ratio. Denote the annualised Sharpe ratio [18] as

sr_{k}=252^{0.5}\times\frac{r_{k}-r_{f}}{s_{k}},

where $r_{k}$ is the return of the $k$'th strategy, with standard deviation $s_{k}$, and $r_{f}$ is the risk-free rate. For ease of explanation, we now remove the strategy index $k$ and replace it with a time index $t$. The differential Sharpe ratio is defined as

\frac{dsr_{t}}{d\tau}=\frac{b_{t-1}\Delta a_{t}-0.5a_{t-1}\Delta b_{t}}{(b_{t-1}-a_{t-1}^{2})^{3/2}}, (1)

where the quantities $a_{t}$ and $b_{t}$ are exponentially weighted estimates of the first and second moments of the reward $r_{t}$

a_{t}=a_{t-1}+\tau\Delta a_{t}=a_{t-1}+\tau(r_{t}-a_{t-1})
b_{t}=b_{t-1}+\tau\Delta b_{t}=b_{t-1}+\tau(r_{t}^{2}-b_{t-1}).

The exponential decay constant is $\tau\in(0,1]$. They consider a batch gradient ascent update for model parameters $\boldsymbol{\theta}$

\Delta\boldsymbol{\theta}_{T}=\eta\frac{dsr_{T}}{d\boldsymbol{\theta}},

where

\frac{dsr_{T}}{d\boldsymbol{\theta}}=\sum_{t=1}^{T}\frac{dsr_{T}}{dr_{t}}\frac{dr_{t}}{d\boldsymbol{\theta}}=\sum_{t=1}^{T}\Bigg\{\frac{b_{T}-a_{T}r_{t}}{(b_{T}-a_{T}^{2})^{3/2}}\Bigg\}\Bigg\{\frac{dr_{t}}{df_{t}}\frac{df_{t}}{d\boldsymbol{\theta}}+\frac{dr_{t}}{df_{t-1}}\frac{df_{t-1}}{d\boldsymbol{\theta}}\Bigg\}.

The reward

r_{t}=\Delta p_{t}f_{t-1}-\delta_{t}|\Delta f_{t}|

depends on the change in reference price $p_{t}$ from which a gross profit and loss are computed, the transaction cost $\delta_{t}$ and a differentiable position function of the model inputs and parameters, $f_{t}\triangleq f(\mathbf{x}_{t},\boldsymbol{\theta}_{t})$, which is in the range $-1\leq f_{t}\leq 1$.
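To make these quantities concrete, the sketch below computes the reward for a single step and the corresponding update of the exponentially weighted moments and differential Sharpe ratio; the function and variable names are our own, illustrative choices.

```python
def step_reward(delta_p, f_prev, f_curr, cost):
    """Reward r_t = dp_t * f_{t-1} - cost_t * |f_t - f_{t-1}| (no carry term in this section)."""
    return delta_p * f_prev - cost * abs(f_curr - f_prev)

def differential_sharpe(r, a_prev, b_prev, tau=0.01):
    """One-step update of the differential Sharpe ratio of Moody et al."""
    delta_a = r - a_prev
    delta_b = r ** 2 - b_prev
    a = a_prev + tau * delta_a          # EW first moment of rewards
    b = b_prev + tau * delta_b          # EW second moment of rewards
    denom = (b_prev - a_prev ** 2) ** 1.5
    dsr = (b_prev * delta_a - 0.5 * a_prev * delta_b) / denom if denom > 0 else 0.0
    return dsr, a, b
```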

Trading and portfolio management systems require prior decisions as input to properly consider the effect of transaction costs, market impact, and taxes. This temporal dependence on the system state requires reinforcement versions of standard recurrent learning algorithms. Moody et al. [6] present empirical results in controlled experiments that demonstrate the efficacy of some of their methods for optimising trading systems and portfolios. For a long/short trader, they find that maximising the differential Sharpe ratio yields more consistent results than maximising profits. Both methods outperform a trading system based on forecasts that minimise mean-square error. They find that portfolio trading agents trained to maximise the differential Sharpe ratio achieve better risk-adjusted returns than those trained to maximise profit. However, an undesirable property of the Sharpe ratio is that it penalises a model that produces returns larger than $\frac{\mathbb{E}[r^{2}]}{\mathbb{E}[r]}\approx\frac{b_{t}}{a_{t}}$, that is, the ratio of the expectation of squared returns to the expectation of returns, which is counter-intuitive to investors' notion of risk and reward.

Gold [4] extends Moody et al.’s [6] work and investigates high-frequency currency trading with neural networks trained via recurrent reinforcement learning. He compares the performance of linear networks with neural networks containing a single hidden layer and examines the impact of shared system hyper-parameters on performance. In general, he concludes that the trading systems may be effective but that the performance varies widely for different currency markets, and simple statistics of the markets cannot explain this variability.

He also finds that the linear recurrent reinforcement learners outperform the neural recurrent reinforcement learners in this application. Here, we suspect that the choice of inputs (past returns of the target) results in features with weak predictive power. As a result, the neural reinforcement learner struggles to make meaningful forecasts. In comparison, the linear recurrent reinforcement learner copes better with both noisy inputs and outputs, generating biased yet stable predictions. Gold also used shared hyper-parameters. Many of the currency pairs behave differently in terms of their price action. For example, US dollar crosses are usually momentum-driven. Cross-currencies, such as the Australian dollar versus the New Zealand dollar, tend to be mean-reverting in nature. Therefore, sharing hyper-parameters probably negatively impacts the ex-post performance here.

II-B2 More Recent Work

In terms of more recent work involving policy gradient methods in finance, Tamar et al. [19] discuss risk-sensitive policy gradient methods that augment the standard expected cost minimisation problem with a measure of variability in cost. They consider static and time-consistent dynamic risk measures that combine a standard sampling approach with convex programming. Their approach is actor-critic for dynamic risk measures and involves explicit approximation of value functions.

Luo et al. [20] build a novel reinforcement learning framework trader. They adopt an actor-critic algorithm called deep deterministic policy gradient to find the optimal policy. Their proposed algorithm uses convolutional neural networks and outperforms some baseline methods when experimenting with stock index futures. They also discuss the generalisation and implications of the proposed method for finance.

Zhang et al. [21] use deep reinforcement learning algorithms such as deep q-learning networks [22], neural policy gradients [23] and advantage actor-critic [24] to design trading strategies for continuous futures contracts. They use long short-term memory neural networks [25] to train both the actor and critic networks. Both discrete and continuous action spaces are considered, and volatility scaling is incorporated to create reward functions that scale trade positions based on market volatility. They show that their method outperforms various baseline models, delivering positive profits despite high transaction costs. Their experiments show that the proposed algorithms can follow prominent market trends without changing positions and scale down or hold through consolidation periods.

Azhikodan et al. [26] propose automated trading systems that use deep reinforcement learning, specifically a deep deterministic policy gradient-based neural network model that trades stocks to maximise the gain in asset value. They determine the need for an additional system for trend-following to work alongside the reinforcement learning algorithm. Thus they implement a sentiment analysis model using a recurrent convolutional neural network to predict the stock trend from financial news.

Ye et al. [27] address an optimal trade execution problem that involves limit order books. Here, the model must learn how best to execute a given block of shares at minimal cost or maximal return. To this end, they propose a deep reinforcement learning-based solution that uses a deterministic policy gradient framework. Experiments on three real market datasets show that the proposed approach significantly outperforms other methods such as a submit and leave policy, a q-learning algorithm [28] and a hybrid method that combines the Almgren-Chriss model [29] with reinforcement learning.

Aboussalah and Lee [30] explore policy gradient techniques for continuous action and multi-dimensional state spaces, applying a stacked deep dynamic recurrent reinforcement learning architecture to construct an optimal real-time portfolio. The algorithm adopts the Sharpe ratio as a utility function to learn the market conditions and rebalance the portfolio accordingly.

Betancourt and Chen [31] propose a novel portfolio management method using deep reinforcement learning on markets with a dynamic number of assets. Their model endeavours to learn the optimal inventory to hold whilst minimising transaction costs.

Lei et al. [32] acknowledge that algorithmic trading is an ongoing decision making problem, where the environment requires agents to learn feature representations from highly non-stationary and noisy financial time series, and decision making requires that agents explore the environment and simultaneously make correct decisions in an online manner without any supervised information. They propose to tackle both problems via a time-driven, feature-aware deep reinforcement learning model to improve financial signal representation learning and decision making.

II-C Foreign Exchange Trading

This section describes the foreign exchange market and the mechanics of the foreign exchange derivatives, which are central to the experimentation that we conduct in section IV. The global foreign exchange market sees transactions above 6 trillion US dollars traded daily. Figure II-C shows the breakdown of this turnover by instrument type and is extracted from the Bank of International Settlements Triennial Central Bank Survey, 2019.

\Figure

Average daily global foreign exchange market turnover in millions of US dollars, source: Bank of International Settlements.

FX transactions implicitly involve two currencies: the dominant or base currency is quoted conventionally on the left-hand side and the secondary or counter currency on the right-hand side. If foreign exchange positions are held overnight, the trader will earn the interest rate of the currency bought and pay the interest rate of the currency sold. The interest rates for specific maturities are determined in the inter-bank currency market and are heavily influenced by the base rates typically set by central banks. Foreign exchange trades settle two business days after the trade date by market convention unless otherwise specified.

Clients fund their positions by rolling them forward via tomorrow/next (tomnext) swaps. Tomnext is a short-term foreign exchange transaction where a currency pair is simultaneously bought and sold over two business days: tomorrow (in one business day) and the following day (two business days from today). The tomnext transaction allows traders to maintain their position without being forced to take physical delivery and is the convention applied by prime brokers to their clients on the inter-bank foreign exchange market. In order to determine this funding cost, one needs to compute the forward rates (prices). Forwards are agreements between two counterparties to exchange currencies at a predetermined rate on some future date.

Forward rates are calculated by adding forward points to a spot rate. These points reflect the interest rate differential between the two currencies being traded and the maturity of the trade. Forward points do not represent an expectation of the direction of a currency but rather the interest rate differential. Let $bid_{t}^{spot}$ denote the spot/cash currency pair rate at which price takers can sell at time $t$. Similarly, let $ask_{t}^{spot}$ denote the spot/cash currency pair rate at which price takers can buy at time $t$. The spot mid-rate is

mid_{t}^{spot}=0.5\times(bid_{t}^{spot}+ask_{t}^{spot}). (2)

Forward points are computed as follows

mid_{t}^{fpts}=mid_{t}^{spot}(e_{2}-e_{1})\frac{T}{360\phi},

where $e_{2}$ is the secondary interest rate, $e_{1}$ is the dominant interest rate, $T$ is the number of days till maturity, and $\phi$ is the tick size or pip value for the associated currency pair. Example forward points for GBPUSD are shown in figure II-C. GBP= is the Refinitiv information code (ric) for cash GBPUSD and GBPTND= is the ric for tomnext GBPUSD forward points. Note that the forward points are quoted as a bid/ask pair, reflecting the appropriate interest differential applied to sellers and buyers and the additional cost (spread) quoted by the foreign exchange forwards market maker to compensate them for their quoting risk. The tomnext outrights are computed as

\begin{split}bid_{t}^{tn}&=bid_{t}^{spot}+ask_{t}^{fpts}\phi\\ask_{t}^{tn}&=ask_{t}^{spot}+bid_{t}^{fpts}\phi.\end{split}

As an example of rolling a long GBPUSD position forward, the tomnext swap would involve selling GBPUSD at $bid_{t}^{spot}$ and repurchasing it at $ask_{t}^{tn}$. The cost of this roll is thus $notional\times(bid_{t}^{spot}-ask_{t}^{tn})$, where $notional$ denotes the size of the position taken by the trader. If a trader is short GBPUSD, then to roll the position forward, she would buy at $ask_{t}^{spot}$ and sell forward at $bid_{t}^{tn}$, with the funding cost being $notional\times(bid_{t}^{tn}-ask_{t}^{spot})$. This funding can be a profit as well as a loss. In addition, many currency market participants hold foreign exchange deliberately to capture the favourable interest rate differential between two currency pairs. This approach is known as the carry trade and is extremely popular with the retail public in Japan, where Yen interest rates have been historically low relative to other countries for quite some time.
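A small sketch of these conventions follows, computing the tomnext outrights and the profit or loss of rolling a position; the quotes used in the usage example are purely illustrative GBPUSD-style numbers.

```python
def tomnext_outrights(bid_spot, ask_spot, bid_fpts, ask_fpts, pip):
    """Tomnext outrights from spot quotes and forward points (quoted in pips)."""
    bid_tn = bid_spot + ask_fpts * pip
    ask_tn = ask_spot + bid_fpts * pip
    return bid_tn, ask_tn

def roll_pnl(notional, position, bid_spot, ask_spot, bid_tn, ask_tn):
    """Profit or loss of rolling a position overnight via a tomnext swap.
    position > 0: sell spot at the bid, buy back forward at the tomnext ask.
    position < 0: buy spot at the ask, sell forward at the tomnext bid."""
    if position > 0:
        return notional * (bid_spot - ask_tn)
    if position < 0:
        return notional * (bid_tn - ask_spot)
    return 0.0

# Illustrative numbers only.
bid_tn, ask_tn = tomnext_outrights(1.3450, 1.3452, -0.15, -0.10, 1e-4)
print(roll_pnl(1_000_000, 1, 1.3450, 1.3452, bid_tn, ask_tn))
```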

\Figure

Refinitiv GBPUSD forward rates.

III Experiment Methods

This section describes how our recurrent reinforcement learner targets a position directly. We also describe the baseline models that are used for comparison and contrast. We then explore online inductive transfer learning, with feature representation transfer from a radial basis function network to a direct, recurrent reinforcement learning agent. The radial basis function network consists of hidden processing units formed from Gaussian mixture model components. The recurrent reinforcement learning agent learns the desired risk position via the policy gradient paradigm. Finally, the agent is put to work trading the major spot market currency pairs.

III-A Targeting A Position With Direct Recurrent Reinforcement

Sharpe [5] discusses asset allocation as a function of expected utility maximisation, where the utility function may be more complex than that associated with mean-variance analysis. Denote the expected utility at time tt for a single portfolio constituent as

\upsilon_{t}=\mu_{t}-\frac{\lambda}{2}\sigma_{t}^{2}, (3)

where the expected return $\mu_{t}=\mathbb{E}[r_{t}]$ and the variance of returns $\sigma_{t}^{2}=\mathbb{E}[r_{t}^{2}]-\mathbb{E}[r_{t}]^{2}$ may be estimated in an online fashion with exponential decay, where as before $\tau$ is an exponential decay constant

\mu_{t}=\tau\mu_{t-1}+(1-\tau)r_{t}, (4)
\sigma_{t}^{2}=\tau\sigma_{t-1}^{2}+(1-\tau)(r_{t}-\mu_{t})^{2}. (5)

The risk appetite constant $\lambda>0$ can be set as a function of an investor's desired risk-adjusted return, as demonstrated by Grinold and Kahn [33]. The information ratio is a risk-adjusted differential reward measure, where the difference is taken between the model or strategy being evaluated and a baseline or benchmark strategy with expected returns $b_{t}=\mathbb{E}\big[r_{t}^{(b)}\big]$:

ir_{t}=252^{0.5}\times\frac{\mu_{t}-b_{t}}{\sigma_{t}}. (6)

The similarity to the Sharpe ratio is apparent. Setting $b_{t}=0$, substituting the non-annualised information ratio into the quadratic utility and differentiating with respect to the risk, we obtain a suitable value for the risk appetite parameter:

\begin{split}ir_{t}&=\frac{\mu_{t}}{\sigma_{t}}\\\upsilon_{t}&=ir_{t}\times\sigma_{t}-\frac{\lambda}{2}\sigma_{t}^{2}\\\frac{d\upsilon_{t}}{d\sigma_{t}}&=ir_{t}-\lambda\sigma_{t}=0\\\lambda&=\frac{ir_{t}}{\sigma_{t}}.\end{split} (7)
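A minimal sketch of the online moment estimates of equations 4 and 5, and the risk appetite they imply via equation 7, assuming a stream of daily net returns:

```python
import numpy as np

def risk_appetite(returns, tau=0.99):
    """Exponentially weighted moments (equations 4 and 5) and the risk appetite
    lambda = ir_t / sigma_t of equation 7, computed on a single pass over returns."""
    mu, var = 0.0, 1e-8
    for r in returns:
        mu = tau * mu + (1.0 - tau) * r                # equation 4
        var = tau * var + (1.0 - tau) * (r - mu) ** 2  # equation 5
    sigma = np.sqrt(var)
    ir = mu / sigma                                    # non-annualised information ratio
    lam = ir / sigma                                   # equation 7
    utility = mu - 0.5 * lam * var                     # equation 3
    return lam, utility
```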

The net returns whose expectation and variance we seek to learn are decomposed as

r_{t}=\Delta p_{t}f_{t-1}-\delta_{t}|\Delta f_{t}|+\kappa_{t}f_{t}, (8)

where $\Delta p_{t}$ is the change in reference price, typically a mid-price

\Delta p_{t}=0.5\times(bid_{t}+ask_{t}-bid_{t-1}-ask_{t-1}),

$\delta_{t}$ represents the execution cost for a price taker

\delta_{t}=\max[0.5\times(ask_{t}-bid_{t}),0],

$\kappa_{t}$ is the profit or loss of rolling the overnight foreign exchange position, the so-called 'carry' (see section II-C), and $f_{t}$ is the desired position learnt by the recurrent reinforcement learner

f_{t}=\tanh\big(\boldsymbol{\theta}_{t}^{T}\mathbf{x}_{t}\big). (9)

The model is maximally short when $f_{t}=-1$ and maximally long when $f_{t}=1$. The recurrent nature of the model occurs in the input feature space, where the previous position is fed to the model input

\mathbf{x}_{t}=[1,\phi_{1}(\mathbf{u}_{t}),...,\phi_{m}(\mathbf{u}_{t}),f_{t-1}]^{T}\in\mathbb{R}^{m+2}, (10)

and $\phi_{j}(\cdot)$ denotes a radial basis function hidden processing unit, in a network of $m$ such units, which takes as input a feature vector $\mathbf{u}_{t}$; see section III-B. The goal of our recurrent reinforcement learner is to maximise the utility in equation 3 by targeting the position in equation 9. To do this, one may apply an online stochastic gradient ascent update

\boldsymbol{\theta}_{t}=\boldsymbol{\theta}_{t-1}+\eta\nabla\upsilon_{t},\quad\nabla\upsilon_{t}\equiv\frac{d\upsilon_{t}}{d\boldsymbol{\theta}_{t}}.

Instead of a static learning rate η\eta, one may consider the Adam optimiser of Kingma and Ba [34], where an adaptive learning rate is applied. This adaptive learning rate is a function of the gradient expectation and variance. The weight update then takes the form

\mathbf{m}_{t}=\beta_{1}\mathbf{m}_{t-1}+(1-\beta_{1})\nabla\upsilon_{t}
\mathbf{v}_{t}=\beta_{2}\mathbf{v}_{t-1}+(1-\beta_{2})(\nabla\upsilon_{t})^{2}
\boldsymbol{\theta}_{t}=\boldsymbol{\theta}_{t-1}+\eta\frac{\hat{\mathbf{m}}_{t}}{\hat{\mathbf{v}}_{t}^{0.5}+\epsilon},

with $\hat{\mathbf{m}}_{t}=\mathbf{m}_{t}/(1-\beta_{1}^{t})$ and $\hat{\mathbf{v}}_{t}=\mathbf{v}_{t}/(1-\beta_{2}^{t})$ denoting bias-corrected versions of the expected gradient and gradient variance, respectively. $\beta_{1}$ and $\beta_{2}$ are exponential decay constants. In earlier work, Bottou [35] had considered approximating the Hessian of the performance measure with respect to the model weights as a function of gradient-only information. In practice, we find that Adam takes many iterations of model fitting to get the weights large enough to take a meaningful position via equation 9; this is not necessarily an Adam problem, but a result of the $\tanh$ position function taking a while to saturate. If the weights are too small, then the average position taken by the recurrent reinforcement learner will be small as well. Therefore, we settle on an extended Kalman filter [36, 9] gradient-based weight update, albeit modified for reinforcement learning in this context.

Require: $\alpha$, $\tau$
// $\alpha\geq 0$ is a Ridge penalty.
// $0\ll\tau\leq 1$ is an exponential decay factor.
Initialise: $\boldsymbol{\theta}=\mathbf{0}_{d}$, $\mathbf{P}=\mathbf{I}_{d}/\alpha$
// $\mathbf{0}_{d}$ is a zero vector in $\mathbb{R}^{d}$.
// $\mathbf{P}$ is the precision matrix in $\mathbb{R}^{d\times d}$.
Input: $\nabla\upsilon_{t}$
Output: $\boldsymbol{\theta}_{t}$
1: $z=1+\nabla\upsilon_{t}^{T}\mathbf{P}_{t-1}\nabla\upsilon_{t}/\tau$
2: $\mathbf{k}=\mathbf{P}_{t-1}\nabla\upsilon_{t}/(z\tau)$
3: $\boldsymbol{\theta}_{t}=\boldsymbol{\theta}_{t-1}+\mathbf{k}$
4: $\mathbf{P}_{t}=\mathbf{P}_{t-1}/\tau-\mathbf{k}\mathbf{k}^{T}z$
Algorithm 1: extended Kalman filter

In algorithm 1, $\mathbf{P}_{t}$ is an approximation to $[\nabla^{2}\upsilon_{t}]^{-1}$, the inverse Hessian of the utility function $\upsilon_{t}$ with respect to the model weights $\boldsymbol{\theta}_{t}$.
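A direct Python transcription of algorithm 1 might look as follows; the only state carried between calls is the weight vector and the precision matrix.

```python
import numpy as np

class EKFOptimiser:
    """Sequential weight update of algorithm 1: gradient ascent on the utility,
    preconditioned by P, an approximation to the inverse Hessian of the utility."""

    def __init__(self, dim, alpha=1.0, tau=0.99):
        self.theta = np.zeros(dim)          # model weights
        self.P = np.eye(dim) / alpha        # precision matrix, I_d / alpha
        self.tau = tau                      # exponential decay factor

    def step(self, grad):
        """grad is the gradient of the utility w.r.t. the weights, equation 11."""
        Pg = self.P @ grad / self.tau       # P_{t-1} grad / tau
        z = 1.0 + grad @ Pg                 # z = 1 + grad^T P_{t-1} grad / tau
        k = Pg / z                          # k = P_{t-1} grad / (z tau)
        self.theta = self.theta + k
        self.P = self.P / self.tau - np.outer(k, k) * z
        return self.theta
```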

We decompose the gradient of the utility function with respect to the recurrent reinforcement learner’s parameters as follows:

\begin{split}\nabla\upsilon_{t}&=\frac{d\upsilon_{t}}{dr_{t}}\bigg\{\frac{dr_{t}}{df_{t}}\frac{df_{t}}{d\boldsymbol{\theta}_{t}}+\frac{dr_{t}}{df_{t-1}}\frac{df_{t-1}}{d\boldsymbol{\theta}_{t-1}}\bigg\}\\&=\frac{d\upsilon_{t}}{dr_{t}}\Bigg\{\frac{dr_{t}}{df_{t}}\bigg\{\frac{\partial f_{t}}{\partial\boldsymbol{\theta}_{t}}+\frac{\partial f_{t}}{\partial f_{t-1}}\frac{\partial f_{t-1}}{\partial\boldsymbol{\theta}_{t-1}}\bigg\}\\&+\frac{dr_{t}}{df_{t-1}}\bigg\{\frac{\partial f_{t-1}}{\partial\boldsymbol{\theta}_{t-1}}+\frac{\partial f_{t-1}}{\partial f_{t-2}}\frac{\partial f_{t-2}}{\partial\boldsymbol{\theta}_{t-2}}\bigg\}\Bigg\}.\end{split} (11)

The constituent derivatives for the left half of equation 11 are:

\begin{split}\frac{d\upsilon_{t}}{dr_{t}}&=(1-\eta)[1-\lambda(r_{t}-\mu_{t})]\\\frac{dr_{t}}{df_{t}}&=-\delta_{t}\times sign(\Delta f_{t})+\kappa_{t}\times sign(f_{t})\\\frac{df_{t}}{d\boldsymbol{\theta}_{t}}&=\mathbf{x}_{t}[1-\tanh^{2}(\boldsymbol{\theta}_{t}^{T}\mathbf{x}_{t})]\\&+\boldsymbol{\theta}_{t,m+2}[1-\tanh^{2}(\boldsymbol{\theta}_{t}^{T}\mathbf{x}_{t})]\times\mathbf{x}_{t-1}[1-\tanh^{2}(\boldsymbol{\theta}_{t-1}^{T}\mathbf{x}_{t-1})].\end{split}
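Putting the pieces together, the sketch below evaluates the position of equation 9, the net reward of equation 8 and a first-order approximation to the recurrent gradient of equation 11. The function signature and the derivative with respect to the previous position are our own simplifying assumptions, intended as an outline rather than a reproduction of the full implementation.

```python
import numpy as np

def rrl_step(theta, x, f_prev, dfprev_dtheta,
             delta_p, cost, carry, mu, lam, eta=0.01):
    """One step of the direct recurrent reinforcement learner.
    x = [1, phi_1(u_t), ..., phi_m(u_t), f_prev]; theta[-1] multiplies f_{t-1}."""
    f = np.tanh(theta @ x)                                        # equation 9
    r = delta_p * f_prev - cost * abs(f - f_prev) + carry * f     # equation 8
    # df_t/dtheta with the recurrent term propagated through f_{t-1}
    sech2 = 1.0 - f ** 2
    df_dtheta = sech2 * (x + theta[-1] * dfprev_dtheta)
    # sensitivities of the reward to the current and previous positions
    dr_df = -cost * np.sign(f - f_prev) + carry * np.sign(f)
    dr_dfprev = delta_p + cost * np.sign(f - f_prev)   # own derivation from equation 8
    # sensitivity of the utility to the reward, as in the constituent derivatives above
    du_dr = (1.0 - eta) * (1.0 - lam * (r - mu))
    grad = du_dr * (dr_df * df_dtheta + dr_dfprev * dfprev_dtheta)
    return f, r, grad, df_dtheta
```

The returned gradient is what would be fed to the extended Kalman filter step of algorithm 1, and `df_dtheta` is carried forward as the next step's `dfprev_dtheta`.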

III-B Radial Basis Function Networks

In Borrageiro et al. [37], the authors show that online transfer learning via radial basis function networks provides a residual benefit in forecasting non-stationary time series. The residual benefit stems from the feature representation transfer of clustering algorithms. These algorithms are adapted sequentially, as are the supervised learners, which map the clustered feature space to the targets. The feature engineering that we use in this paper uses clusters formed of Gaussian mixture models. The network size is determined by the unsupervised learning procedure for finite mixture models described by Figueiredo and Jain [7]. We briefly describe the key ingredients of this meta-algorithm here.

The radial basis function network is a network of $m>0$ Gaussian basis functions

\phi_{j}(\mathbf{u})=\exp\left(-\frac{1}{2}(\mathbf{u}-\boldsymbol{\mu}_{j})^{T}\boldsymbol{\Sigma}_{j}^{-1}(\mathbf{u}-\boldsymbol{\mu}_{j})\right).

Here we learn the $j$'th mean $\boldsymbol{\mu}_{j}$ and covariance $\boldsymbol{\Sigma}_{j}$ through a Gaussian mixture model fitting procedure. Denote the probability density function of a $k$ component mixture as

p(\mathbf{u}|\boldsymbol{\theta})=\sum_{j=1}^{k}\pi_{j}p(\mathbf{u}|\boldsymbol{\theta}_{j})=\sum_{j=1}^{k}\pi_{j}\mathcal{N}(\mathbf{u}|\boldsymbol{\mu}_{j},\boldsymbol{\Sigma}_{j}),

where

\mathcal{N}(\mathbf{u}|\boldsymbol{\mu},\boldsymbol{\Sigma})=\frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}}\exp\bigg[-\frac{1}{2}(\mathbf{u}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{u}-\boldsymbol{\mu})\bigg], (12)

and the mixing weights satisfy $0\leq\pi_{j}\leq 1$, $\sum_{j=1}^{k}\pi_{j}=1$. The maximum likelihood estimate

\boldsymbol{\theta}_{ML}=\arg\max_{\boldsymbol{\theta}}\ln p(\mathbf{u}|\boldsymbol{\theta}),

and the Bayesian maximum a posteriori criterion

\boldsymbol{\theta}_{MAP}=\arg\max_{\boldsymbol{\theta}}\ln p(\mathbf{u}|\boldsymbol{\theta})+\ln p(\boldsymbol{\theta}),

cannot be found analytically. The standard way of estimating $\boldsymbol{\theta}_{ML}$ or $\boldsymbol{\theta}_{MAP}$ is the expectation-maximisation algorithm [38]. This iterative procedure is based on the interpretation of $\mathbf{u}$ as incomplete data. The missing part for finite mixtures is the set of labels $\mathcal{Z}=\{\mathbf{z}_{0},...,\mathbf{z}_{n}\}$, which accompany the training data $\mathbf{u}_{0},...,\mathbf{u}_{n}$, indicating which component produced each training vector. Following Murphy [39], let us define the complete data log-likelihood to be

\ell_{c}(\boldsymbol{\theta})=\sum_{i=1}^{n}\ln p(\mathbf{u}_{i},\mathbf{z}_{i}|\boldsymbol{\theta}),

which cannot be computed since $\mathbf{z}_{i}$ is unknown. Thus, let us define an auxiliary function

\mathcal{Q}(\boldsymbol{\theta},\boldsymbol{\theta}_{t-1})=\mathbb{E}[\ell_{c}(\boldsymbol{\theta})|\mathbf{u},\boldsymbol{\theta}_{t-1}],

where $t$ is the current time step. The expectation is taken with respect to the old parameters $\boldsymbol{\theta}_{t-1}$ and the observed data $\mathbf{u}$. Denote as $r_{ic}=p(z_{i}=c|\mathbf{u}_{i},\boldsymbol{\theta}_{t-1})$ cluster $c$'s responsibility for datum $i$. The expectation step has the following form

r_{ic}=\frac{\pi_{c}p(\mathbf{u}_{i}|\boldsymbol{\theta}_{c,t-1})}{\sum_{j=1}^{k}\pi_{j}p(\mathbf{u}_{i}|\boldsymbol{\theta}_{j,t-1})}.

The maximisation step optimises the auxiliary function $\mathcal{Q}$ with respect to $\boldsymbol{\theta}$

\boldsymbol{\theta}_{t}=\arg\max_{\boldsymbol{\theta}}\mathcal{Q}(\boldsymbol{\theta},\boldsymbol{\theta}_{t-1}).

The $c$'th mixing weight is estimated as

\pi_{c}=\frac{1}{n}\sum_{i=1}^{n}r_{ic}=\frac{r_{c}}{n}.

The parameter set $\boldsymbol{\theta}_{c}=\{\boldsymbol{\mu}_{c},\boldsymbol{\Sigma}_{c}\}$ is then

\begin{split}\boldsymbol{\mu}_{c}&=\frac{\sum_{i=1}^{n}r_{ic}\mathbf{u}_{i}}{r_{c}}\\\boldsymbol{\Sigma}_{c}&=\frac{\sum_{i=1}^{n}r_{ic}(\mathbf{u}_{i}-\boldsymbol{\mu}_{c})(\mathbf{u}_{i}-\boldsymbol{\mu}_{c})^{T}}{r_{c}}.\end{split}

As discussed by Figueiredo and Jain [7], expectation-maximisation is highly dependent on initialisation. They highlight several strategies to ameliorate this problem, such as multiple random starts with final selection based on the maximum likelihood of the mixture, or k-means based initialisation. However, the distinction between model-class selection and model estimation in mixture models is unclear. For example, a 3 component mixture in which one of the mixing probabilities is zero is indistinguishable from a 2 component mixture. They propose an unsupervised algorithm for learning a finite mixture model from multivariate data. Their approach is based on the philosophy of minimum message length encoding [40], where one aims to build a short code that facilitates a good data generation model. Their algorithm can select the number of components and, unlike the standard expectation-maximisation algorithm, does not require careful initialisation. The proposed method also avoids another drawback of expectation-maximisation for mixture fitting: the possibility of convergence toward a singular estimate at the boundary of the parameter space. Denote the optimal mixture parameter set

\boldsymbol{\theta}^{*}=\arg\min_{\boldsymbol{\theta}}\ell_{FJ}(\boldsymbol{\theta},\mathbf{u}),

where

\begin{split}\ell_{FJ}(\boldsymbol{\theta},\mathbf{u})&=\frac{n}{2}\sum_{j=1}^{k}\ln\bigg(\frac{n\pi_{j}}{12}\bigg)+\frac{k}{2}\ln\bigg(\frac{n}{12}\bigg)\\&+\frac{k(n+1)}{2}-\ln p(\mathbf{u}|\boldsymbol{\theta}).\end{split}

This leads to a modified maximisation step

\pi_{c}=\frac{\max\big\{0,\big(\sum_{i=1}^{n}r_{ic}\big)-\frac{n}{2}\big\}}{\sum_{j=1}^{k}\max\big\{0,\big(\sum_{i=1}^{n}r_{ij}\big)-\frac{n}{2}\big\}}\quad\text{for }c=1,2,...,k.

The maximisation step is identical to expectation-maximisation, except that the $c$'th parameter set $\boldsymbol{\theta}_{c}$ is only estimated when $\pi_{c}>0$, and $\boldsymbol{\theta}_{c}$ is discarded from $\boldsymbol{\theta}^{*}$ when $\pi_{c}=0$. A distinctive feature of the modified maximisation step is that it leads to component annihilation; this prevents the algorithm from approaching the boundary of the parameter space. In other words, if one of the mixtures is not supported by the data, it is annihilated.
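As a sketch of how a fitted mixture provides the hidden layer of the radial basis function network, the code below uses scikit-learn's GaussianMixture as a stand-in for the Figueiredo and Jain procedure (their component-annihilation logic is not reproduced) and evaluates the Gaussian basis functions defined at the start of this section.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_rbf_layer(U, n_components=10, seed=0):
    """Fit a Gaussian mixture to the feature matrix U (n_samples x n_features).
    Stand-in for the unsupervised finite mixture procedure of [7]."""
    return GaussianMixture(n_components=n_components,
                           covariance_type="full", random_state=seed).fit(U)

def rbf_features(gmm, u):
    """Evaluate phi_j(u) = exp(-0.5 (u - mu_j)^T Sigma_j^{-1} (u - mu_j)) per component."""
    phis = []
    for mu, prec in zip(gmm.means_, gmm.precisions_):
        d = u - mu
        phis.append(np.exp(-0.5 * d @ prec @ d))
    return np.array(phis)
```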

We finish the section by showing figure III-B, which provides a visual representation of the feature representation transfer from the radial basis function network to the recurrent reinforcement learning agent. The external input to the transfer learner, represented by the left-most black circles, is a vector of daily returns of the 36 currency pairs used in the experiment, detailed in section IV-A. The grey circles represent the radial basis function network hidden processing unit layer. In addition, we have a blue circle that represents the previously estimated position of the recurrent reinforcement learning agent. The outputs of this hidden layer are stored in a feature vector, as shown by equation 10. These outputs are fed into the recurrent reinforcement learning agent, which learns the position function using equation 9. The weights of the position function are fitted via the extended Kalman filter procedure of algorithm 1. The gradient vector fed into the extended Kalman filter is computed using equation 11. This output is fed back into the hidden layer in a recurrent manner, represented by the dotted blue line.

\Figure

Feature representation transfer from a radial basis function network to a recurrent reinforcement learning agent.

III-C Baseline Models

In order to assess the comparative strength of the model of section III-A, we employ two baseline models. The first is a momentum trader, which uses the sign of the next step ahead return forecast as a target position. This model is also a radial basis function network, except that here the feature representation transfer of the Gaussian mixture model clusters is made available to an exponentially weighted recursive least-squares supervised learner. A visual representation of the model is similar to figure III-B, without the recurrent position unit represented by the blue circle.

Require: $\alpha$, $\tau$
// $\alpha\geq 0$ is a Ridge penalty.
// $0\ll\tau\leq 1$ is an exponential decay factor.
Initialise: $\mathbf{w}=\mathbf{0}_{d}$, $\mathbf{P}=\mathbf{I}_{d}/\alpha$
// $\mathbf{0}_{d}$ is a zero vector in $\mathbb{R}^{d}$.
// $\mathbf{P}$ is the precision matrix in $\mathbb{R}^{d\times d}$.
Input: $\mathbf{x}_{t-1},\mathbf{x}_{t}\in\mathbb{R}^{d}$, $y_{t}$
// $y_{t}$ is the daily sampled return of the target.
Output: $\hat{y}_{t}$
1: $r=1+\mathbf{x}_{t-1}^{T}\mathbf{P}_{t-1}\mathbf{x}_{t-1}/\tau$
2: $\mathbf{k}=\mathbf{P}_{t-1}\mathbf{x}_{t-1}/(r\tau)$
3: $\mathbf{w}_{t}=\mathbf{w}_{t-1}+\mathbf{k}(y_{t}-\mathbf{w}_{t-1}^{T}\mathbf{x}_{t-1})$
4: $\mathbf{P}_{t}=\mathbf{P}_{t-1}/\tau-\mathbf{k}\mathbf{k}^{T}r$
5: $\mathbf{P}_{t}=\mathbf{P}_{t}\tau$ // variance stabilisation
6: $\hat{y}_{t}=\mathbf{w}_{t}^{T}\mathbf{x}_{t}$
Algorithm 2: exponentially weighted recursive least-squares

The exponentially weighted recursive least-squares fitting procedure is shown compactly in algorithm 2. The precision matrix $\mathbf{P}_{0}$ may be initialised to the identity matrix scaled by the inverse of the Ridge penalty, $\mathbf{I}_{d}\alpha^{-1}$, and the initial weights $\mathbf{w}_{0}$ are typically initialised to the zero vector. The discount factor $\tau$ is typically close to but less than 1. This particular model form is experimented with by Borrageiro et al. [37] in a multi-step horizon forecasting context.
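A Python transcription of algorithm 2 is sketched below; it mirrors the extended Kalman filter update, with the scalar prediction error of the supervised target driving the weight change.

```python
import numpy as np

class EWRLS:
    """Exponentially weighted recursive least-squares of algorithm 2."""

    def __init__(self, dim, alpha=1.0, tau=0.99):
        self.w = np.zeros(dim)              # regression weights
        self.P = np.eye(dim) / alpha        # precision matrix, I_d / alpha
        self.tau = tau                      # exponential decay factor

    def update(self, x_prev, y):
        """Update weights from the previous feature vector and the realised return y_t."""
        Px = self.P @ x_prev / self.tau
        r = 1.0 + x_prev @ Px
        k = Px / r
        self.w = self.w + k * (y - self.w @ x_prev)
        self.P = self.P / self.tau - np.outer(k, k) * r
        self.P = self.P * self.tau          # variance stabilisation
        return self.w

    def predict(self, x):
        return self.w @ x
```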

Our second baseline is the carry trader, which hopes to earn the positive overnight interest rate differential. Denoting the long and short carry as

\begin{split}\kappa_{t}^{long}&=bid_{t}^{spot}-ask_{t}^{tn}\\\kappa_{t}^{short}&=bid_{t}^{tn}-ask_{t}^{spot},\end{split}

where the superscript spot denotes the cash price, and the superscript tn denotes the tomorrow/next price, the position of the carry trader is

f_{t}^{carry}=\begin{cases}sign\big(\kappa_{t}^{long}-\kappa_{t}^{short}\big),&\text{if }\kappa_{t}^{long}\text{ or }\kappa_{t}^{short}>0\\0&\text{otherwise.}\end{cases}

In other words, the carry trader goes long the base currency if the base currency has an overnight interest rate higher than the counter currency. Equally, the carry trader sells the base currency short if the base currency has an overnight interest rate lower than the counter currency. Both long and short carry may be a cost rather than a profit, due to the bid/ask spread at which market makers quote tomnext swaps. Therefore we allow the carry trader to abstain from trading completely in such circumstances.
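The carry baseline can then be written compactly, assuming the quote fields defined in section II-C:

```python
import numpy as np

def carry_position(bid_spot, ask_spot, bid_tn, ask_tn):
    """Position of the carry baseline: long (short) the base currency if the
    long (short) carry is positive, flat if both sides are a cost."""
    carry_long = bid_spot - ask_tn
    carry_short = bid_tn - ask_spot
    if carry_long > 0 or carry_short > 0:
        return float(np.sign(carry_long - carry_short))
    return 0.0
```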

IV Experiment Design

In this section, we establish the design of the experiment, beginning with a description of the data we use and finishing up with a description of the performance evaluation criteria.

IV-A The Data

We obtain our experiment data from Refinitiv. We extract daily sampled data for 36 major cash foreign exchange pairs with available tomnext forward points and outrights. These foreign exchange pairs are listed in table I. Summary statistics of the distribution of the daily returns for these currency pairs are shown in table II. The dataset begins on 2010-12-07 and ends on 2021-10-21, a total of 2,840 observations per pair. Daily spot mid-price returns are constructed for each of these currency pairs. These are used as the features for the recurrent reinforcement learning agent and the exponentially weighted recursive least-squares momentum trader. The mid-price is defined in equation 2, and the return for the $k$'th pair is simply

ret_{t}^{k}=\frac{mid_{t}^{k}}{mid_{t-1}^{k}}-1\approx\ln\big(mid_{t}^{k}/mid_{t-1}^{k}\big).
TABLE I: The major foreign exchange pairs we use in our experiment, with Refinitiv information codes (rics).
ISO currency pair ric tn ric
AUDUSD AUD= AUDTN=
EURAUD EURAUD= EURAUDTN=
EURCHF EURCHF= EURCHFTN=
EURCZK EURCZK= EURCZKTN=
EURDKK EURDKK= EURDKKTN=
EURGBP EURGBP= EURGBPTN=
EURHUF EURHUF= EURHUFTN=
EURJPY EURJPY= EURJPYTN=
EURNOK EURNOK= EURNOKTN=
EURPLN EURPLN= EURPLNTN=
EURSEK EURSEK= EURSEKTN=
EURUSD EUR= EURTN=
GBPUSD GBP= GBPTN=
NZDUSD NZD= NZDTN=
USDCAD CAD= CADTN=
USDCHF CHF= CHFTN=
USDCNH CNH= CNHTN=
USDCZK CZK= CZKTN=
USDDKK DKK= DKKTN=
USDHKD HKD= HKDTN=
USDHUF HUF= HUFTN=
USDIDR IDR= IDRTN=
USDILS ILS= ILSTN=
USDINR INR= INRTN=
USDJPY JPY= JPYTN=
USDKRW KRW= KRWTN=
USDMXN MXN= MXNTN=
USDNOK NOK= NOKTN=
USDPLN PLN= PLNTN=
USDRUB RUB= RUBTN=
USDSEK SEK= SEKTN=
USDSGD SGD= SGDTN=
USDTHB THB= THBTN=
USDTRY TRY= TRYTN=
USDTWD TWD= TWDTN=
USDZAR ZAR= ZARTN=
TABLE II: A statistical summary of the daily returns of the major foreign exchange pairs we use in our experiment. The dataset begins 2010-12-07 and ends on 2021-10-21, a total of 2,840 observations per pair. The 25th, 50th and 75th percentiles of the returns distribution are shown along with the mean returns and their standard deviations.
mean std 25% 50% 75%
USDSGD 0.00002 0.00328 -0.00178 -0.00007 0.00177
USDHKD 0.00000 0.00034 -0.00010 0.00001 0.00010
USDIDR 0.00017 0.00394 -0.00107 0.00000 0.00150
USDTHB 0.00004 0.00300 -0.00160 0.00000 0.00165
USDINR 0.00019 0.00441 -0.00197 0.00000 0.00212
USDKRW 0.00003 0.00500 -0.00282 0.00000 0.00277
EURGBP 0.00001 0.00498 -0.00286 0.00000 0.00265
USDCAD 0.00008 0.00471 -0.00272 0.00007 0.00276
EURUSD -0.00003 0.00511 -0.00297 0.00000 0.00286
GBPUSD -0.00003 0.00546 -0.00302 0.00000 0.00298
USDJPY 0.00012 0.00544 -0.00250 0.00000 0.00290
USDILS -0.00003 0.00421 -0.00244 -0.00017 0.00226
EURHUF 0.00011 0.00449 -0.00227 0.00000 0.00245
USDZAR 0.00031 0.00985 -0.00581 -0.00005 0.00599
EURCZK 0.00001 0.00304 -0.00116 -0.00004 0.00106
AUDUSD -0.00008 0.00633 -0.00384 0.00002 0.00378
NZDUSD 0.00000 0.00673 -0.00389 -0.00007 0.00417
USDCHF -0.00001 0.00638 -0.00293 0.00005 0.00290
USDNOK 0.00014 0.00721 -0.00418 -0.00008 0.00384
USDSEK 0.00010 0.00640 -0.00373 -0.00000 0.00367
USDMXN 0.00020 0.00791 -0.00428 -0.00004 0.00429
USDDKK 0.00006 0.00510 -0.00286 0.00000 0.00294
USDPLN 0.00012 0.00730 -0.00405 0.00000 0.00401
USDHUF 0.00017 0.00756 -0.00401 0.00004 0.00426
USDTRY 0.00070 0.00913 -0.00379 0.00029 0.00466
USDRUB 0.00034 0.01036 -0.00428 0.00008 0.00457
USDCZK 0.00008 0.00643 -0.00346 0.00000 0.00346
EURSEK 0.00004 0.00406 -0.00242 0.00002 0.00234
EURDKK -0.00000 0.00019 -0.00009 0.00000 0.00009
EURNOK 0.00009 0.00546 -0.00280 -0.00006 0.00265
USDTWD -0.00002 0.00280 -0.00124 0.00000 0.00124
EURJPY 0.00008 0.00611 -0.00311 0.00011 0.00330
EURPLN 0.00006 0.00418 -0.00214 -0.00002 0.00217
EURCHF -0.00006 0.00529 -0.00147 -0.00006 0.00138
EURAUD 0.00007 0.00575 -0.00340 -0.00012 0.00314
USDCNH -0.00001 0.00241 -0.00103 -0.00003 0.00099

One of the challenges that the models will face in the experiment is that these daily data show the last known top of book spot and outright prices at the end of the trading day, 5 pm EST. The bid/ask spreads of these prices are statistically at their widest at this time. Therefore the execution and funding costs will be more expensive; this contrasts with a trader who can execute at a more liquid time, such as 2 pm GMT. If we try to use intraday data, say data sampled minutely, Refinitiv restricts us to 41 trading days, which is not a large sample. Figure II illustrates the challenge succinctly. It shows the relative intraday bid/ask spreads

spread_{t}^{spot}=\frac{ask_{t}^{spot}-bid_{t}^{spot}}{mid_{t}^{spot}},

for the 36 currency pairs that we experiment with. The data are sampled minutely over two months ending mid-October 2021. The global maximum bid/ask spread occurs precisely when Refinitiv samples the daily data.

\Figure

Relative intra-day bid/ask spreads for the 36 Refinitiv currency pairs that we experiment with.

IV-B Performance Evaluation Methods

We have a little over 11 years of daily data to use in our experiment. From these data, we construct daily returns for each of the 36 currency pairs, reserving the first third as a training set and the final two-thirds as a test set. The structure of the radial basis function networks of sub-section III-B is determined in the training set, with external input being the returns of the various currency pairs. The recurrent reinforcement learning agent is also fitted in the training set to each currency pair, explicitly learning the weights of the position function in equation 9 using the extended Kalman filter learning procedure of algorithm 1. Additionally, the momentum trader of sub-section III-C is fitted in the training set to each currency pair using algorithm 2. Both models continue to learn online during the test set. However, the carry trader baseline does not require any model fitting.

Performance is evaluated in the test set for each currency pair using the net profit and loss of equation 8. This reward, net of transaction and funding cost, is in price difference space. We convert to returns space by dividing by the mid-price computed using equation 2. These returns are accumulated to produce the results shown in figure IV and the middle sub-plots of figures V and V. In addition, the daily returns are described statistically in tables III and IV. In table III, the information ratio (ir) is computed using equation 6, with the baseline return set to $b_{t}=0$. In summary, we evaluate performance by considering the risk-adjusted daily returns generated by each model, net of transaction and funding costs.
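For completeness, the headline evaluation reduces to the following calculation on the stream of daily net returns, with the baseline return set to zero, matching equation 6:

```python
import numpy as np

def annualised_information_ratio(daily_net_returns, baseline=0.0):
    """Annualised information ratio of equation 6 on daily net returns."""
    active = np.asarray(daily_net_returns) - baseline
    return np.sqrt(252) * active.mean() / active.std(ddof=1)
```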

IV-C Hyper-parameters

The following hyper-parameters are set in the experiment:

  • $\tau=0.99$; this is the exponential decay constant of the moving moment equations 4, 5 and 8, the extended Kalman filter weight update of algorithm 1 and the exponentially weighted recursive least-squares of algorithm 2.

  • $\alpha=1$; this is the Ridge penalty of the extended Kalman filter weight update of algorithm 1 and the exponentially weighted recursive least-squares of algorithm 2.

  • $\lambda$, the risk appetite parameter of equation 3, is initially set to 1 and then updated by passing through the training data once and setting it via the procedure of equation 7.

V Experiment Results

Figure IV shows the accumulated returns for each strategy. The reinforcement learning agent is denoted as drl, the momentum trader is shown as mom and the carry trader is indicated as carry. The carry baseline performs poorly, reflecting the low interest rate differential environment since the 2008 financial crisis. Essentially, the funding that can be earned relative to execution cost is small. Figure V shows the direction of travel in central bank interest rates over the past 20 years. Central bank rates halved on average during the 2008 global financial crisis and have declined further since. In contrast, the momentum trader achieves the highest return, with an annual compound net return of 11.7% and an information ratio of 0.4. Additionally, the recurrent reinforcement learner achieves an annual compound net return of 9.3%, with an information ratio of 0.52. Its information ratio is higher because its standard deviation of daily portfolio returns is two-thirds of the momentum trader's. Table III summarises the net profit and loss returns statistics by strategy, with the distribution of the daily returns shown in figure IV. Table IV shows the funding or carry in returns space for each strategy. We can see that the carry baseline does indeed capture positive carry, although this return is not enough to offset the execution cost and the profit and loss associated with holding risk, which moves mainly with price trends and in opposition to the funding profit and loss. That funding moves opposite to price trends is expected. Central banks invariably increase overnight rates when currencies depreciate considerably, to make their currency more attractive and stem the tide of depreciation. The Turkish Lira and Russian Ruble are two cases in point. We see evidence in table IV that the recurrent reinforcement learner captures more carry relative to the momentum trader. This funding capture is expected as well, as the funding profit and loss enter equation 8 and are propagated through the derivative of the utility function with respect to the model weights, using equation 11.

\Figure[!t]()[width=0.99]{cbpol_2111.png} Stacked central bank interest rates in percentage points, data source: Bank for International Settlements.

TABLE III: Portfolio net profit and loss returns by strategy: the reinforcement learning agent is (drl), momentum trader is (mom) and carry trader is (carry).
drl mom carry
count 1888 1888 1888
mean 0.00104 0.00121 -0.002
std 0.032 0.048 0.052
min -0.141 -0.202 -0.344
25% -0.019 -0.028 -0.028
50% -0.000 -0.002 0.000
75% 0.019 0.028 0.026
max 0.245 0.423 0.200
sum 1.953 2.296 -4.328
ir 0.518 0.403 -0.701
TABLE IV: Portfolio funding profit and loss returns by strategy: the reinforcement learning agent is (drl), momentum trader is (mom) and carry trader is (carry).
drl mom carry
count 1888 1888 1888
mean -0.00030 -0.00050 0.00048
std 0.00019 0.00031 0.00036
min -0.00395 -0.00576 0.00007
25% -0.00035 -0.00059 0.00029
50% -0.00024 -0.00040 0.00035
75% -0.00019 -0.00032 0.00051
max 0.00153 0.00072 0.00518
sum -0.56226 -0.94769 0.90655
\Figure[!t]()[width=0.99]{cumul_pnl_dist.png} Cumulative daily returns for the reinforcement learning agent (pnl drl), momentum trader (pnl mom) and carry trader (pnl carry).

\Figure[!t]()[width=0.99]{drl_pnl_dist.png} Distribution of daily returns for the reinforcement learning agent (pnl drl), momentum trader (pnl mom) and carry trader (pnl carry).

VI Discussion

Both baselines make decisions using incomplete information. The momentum trader focuses on learning the foreign exchange trends but ignores the execution and funding costs, whereas the carry trader tries to earn funding but ignores execution costs and the price movements of the underlying currency pair. In contrast, the recurrent reinforcement learner optimises the desired position as a function of market moves and funding whilst minimising execution cost. To demonstrate that the recurrent reinforcement learner is indeed learning from these reward inputs, we compare the realised positions of a USDRUB trader when transaction costs and carry are removed (figure V) with those when transaction costs and carry are included (figure V). Without cost, the recurrent reinforcement learner broadly realises a long position (buying USD and selling RUB) as the Ruble depreciates over time. In contrast, when funding cost is accurately applied, the overnight interest rate differential is roughly 6%, and the recurrent reinforcement learner learns a short position (selling USD and buying RUB), capturing this positive carry. This positive carry is, however, not enough to offset the rapid depreciation of the Ruble.

How significant are these results? Grinold and Kahn [33] report the empirical information ratios of table V for US fund managers over the five years from January 2003 through December 2007. The data cover 338 equity mutual funds, 1,679 equity long-only institutional funds, 56 equity long-short institutional funds and 537 fixed-income mutual funds. Although now somewhat dated, the results indicate that our recurrent reinforcement learner, which trades at what is statistically the worst time of day in the foreign exchange market, achieves an information ratio at the 75th percentile of those achieved empirically by various passive and active fund managers within fixed income and equities. The momentum trader achieves an information ratio between the 50th and 75th percentiles. The information ratio is a measure of consistency and has a probabilistic interpretation: it measures the probability that a strategy will achieve positive residual returns in every period [33]. Equation 6 shows that the information ratio is the ratio of residual return to residual risk. Let us denote this residual return as the strategy's alpha:

\alpha_{t}=\mu_{t}-b_{t}.

The probability of realising a positive residual return is

Pr(\alpha_{t}>0)=\Phi(ir_{t}),

where \Phi(\cdot) denotes the cumulative normal distribution function. In this respect, we find that the recurrent reinforcement learner has a probability of positive residual return of 70%, and the momentum baseline a probability of positive residual return of 66%.
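These probabilities follow directly from evaluating the cumulative normal distribution at the information ratios of table III; for instance, using scipy (an illustrative choice of library):

from scipy.stats import norm

# Pr(alpha_t > 0) = Phi(ir_t), with the annualised information ratios of table III
for name, ir in [("drl", 0.518), ("mom", 0.403)]:
    print(f"{name}: Pr(alpha > 0) = {norm.cdf(ir):.2f}")
# drl: Pr(alpha > 0) = 0.70
# mom: Pr(alpha > 0) = 0.66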

TABLE V: Empirical information ratios, source: BlackRock.
percentile | equities, mutual funds | equities, long only | equities, long short | fixed income, institutional | average, both asset classes
90 | 1.04 | 0.77 | 1.17 | 0.96 | 0.99
75 | 0.64 | 0.42 | 0.57 | 0.50 | 0.53
50 | 0.20 | 0.02 | 0.25 | 0.01 | 0.12
25 | -0.21 | -0.38 | -0.22 | -0.45 | -0.32
10 | -0.62 | -0.77 | -0.58 | -0.90 | -0.72

In terms of future work, one might consider a multi-layer perceptron version of our recurrent reinforcement learner, or an echo state network [41] version of the model. One might also be able to improve the results further by applying a portfolio overlay. The utility function of equation 3 is readily treated as a portfolio problem,

\upsilon_{t}=\mathbf{h}^{T}\boldsymbol{\mu}_{t}-\frac{\lambda}{2}\mathbf{h}^{T}\boldsymbol{\Sigma}_{t}\mathbf{h},

where the optimal, unconstrained portfolio weights are obtained by differentiating the portfolio utility with respect to the weight vector, setting the derivative to zero and solving for \mathbf{h}:

𝐡=1λ𝚺t1𝝁t.\mathbf{h}^{*}=\frac{1}{\lambda}\boldsymbol{\Sigma}_{t}^{-1}\boldsymbol{\mu}_{t}.

Another approach is to treat portfolio selection as a policy gradient problem, where the policy of picking actions, or in this case portfolio constituents, is estimated via function approximation techniques.

\Figure[!t]()[width=0.99]{drl_USDRUB_without_cost.png} A USDRUB reinforcement learning agent trading without execution or funding cost.

\Figure[!t]()[width=0.99]{drl_USDRUB_with_cost.png} A USDRUB reinforcement learning agent trading with execution and funding cost.

VII Conclusion

We conduct a detailed experiment on the major cash foreign exchange pairs, accurately accounting for transaction and funding costs. These sources of profit and loss, including the price trends that occur in the currency markets, are made available to our recurrent reinforcement learner via a quadratic utility, and the agent learns to target a position directly. We improve upon earlier work by casting the problem of learning a risk position in an online learning context. This online learning occurs not only sequentially in time but also via transfer learning: the transfer takes the form of radial basis function hidden processing units, whose means, covariances and overall size are determined by an unsupervised learning procedure for finite Gaussian mixture models. The intrinsic nature of the feature space is thus learnt and made available to both the recurrent reinforcement learner and the baseline supervised-learning momentum trader.

The recurrent reinforcement learning trader achieves an annualised portfolio information ratio of 0.52 with a compound return of 9.3%, net of execution and funding cost, over a 7-year test set, despite being forced to trade at the close of the trading day at 5 pm EST, when trading costs are statistically the most expensive. The momentum baseline trader achieves a similar total return but a lower risk-adjusted return. The recurrent reinforcement learner nevertheless maintains an essential advantage: its weights can be adapted to reflect the different sources of profit and loss variation, including returns momentum, transaction costs and funding costs. We demonstrate this visually in figures V and V, where a USDRUB trading agent learns to target different positions that reflect trading in the absence or presence of cost.

\EOD

References

  • Tsay and Chen [2019] R. S. Tsay and R. Chen, Nonlinear time series analysis, ser. Wiley series in probability and statistics.   Wiley, 2019.
  • Bengio [1997] Y. Bengio, “Using a financial training criterion rather than a prediction criterion,” International Journal of Neural Systems, vol. 08, 8 1997.
  • Moody and Wu [1997] J. Moody and L. Wu, “Optimization of trading systems and portfolios,” in IEEE, 1997.
  • Gold [2003] C. Gold, “FX trading via recurrent reinforcement learning,” in IEEE.   IEEE, 2003.
  • Sharpe [2007] W. F. Sharpe, “Expected utility asset allocation,” Financial Analysts Journal, vol. 63, no. 5, pp. 18–30, 2007.
  • Moody et al. [1998] J. Moody, L. Wu, Y. Liao, and M. Saffell, “Performance functions and reinforcement learning for trading systems and portfolios,” Journal of Forecasting, vol. 17, pp. 441–470, 1998.
  • Figueiredo and Jain [2002] M. A. T. Figueiredo and A. K. Jain, “Unsupervised learning of finite mixture models,” IEEE transactions on pattern analysis and machine intelligence, vol. 24, pp. 381–396, 2002.
  • Yang et al. [2020] Q. Yang, Y. Zhang, W. Dai, and S. J. Pan, Transfer Learning.   Cambridge University Press, 2020.
  • Haykin [2001] S. Haykin, Kalman filtering and neural networks.   Wiley, 2001.
  • Merton [1976] R. C. Merton, “Option pricing when underlying stock returns are discontinuous,” Journal of financial economics, vol. 3, pp. 125–144, 1976.
  • Zhao et al. [2014] P. Zhao, S. C. H. Hoi, J. Wang, and B. Li, “Online transfer learning,” Artificial intelligence, vol. 216, pp. 76–102, 2014.
  • Salvalaio and de Oliveira Ramos [2019] B. K. Salvalaio and G. de Oliveira Ramos, “Self-adaptive appearance-based eye-tracking with online transfer learning,” in 2019 8th Brazilian Conference on Intelligent Systems (BRACIS).   IEEE, 2019, pp. 383–388.
  • Wang et al. [2020] X. Wang, X. Wang, and Z. Zeng, “A novel weight update rule of online transfer learning,” in 2020 12th International Conference on Advanced Computational Intelligence (ICACI).   IEEE, 2020, pp. 349–355.
  • Pan and Yang [2010] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, pp. 1345–1359, 2010.
  • Williams [1992a] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, 5 1992.
  • Sugiyama [2015] M. Sugiyama, Statistical Reinforcement Learning, 1st ed.   CRC Press, 2015.
  • Sutton and Barto [2018] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.   Cambridge, MA, USA: A Bradford Book, 2018.
  • Sharpe [1966] W. F. Sharpe, “Mutual fund performance,” The Journal of Business, vol. 39, 1 1966.
  • Tamar et al. [2017] A. Tamar, Y. Chow, M. Ghavamzadeh, and S. Mannor, “Sequential decision making with coherent risk,” IEEE transactions on automatic control, vol. 62, no. 7, pp. 3323–3338, 2017.
  • Luo et al. [2019] S. Luo, X. Lin, and Z. Zheng, “A novel CNN-DDPG based AI-trader: Performance and roles in business operations,” Transportation Research Part E: Logistics and Transportation Review, vol. 131, pp. 68–79, 2019.
  • Zhang et al. [2019] Z. Zhang, S. Zohren, and S. Roberts, “Deep reinforcement learning for trading,” in arXiv, 2019.
  • Mnih et al. [2013] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” in arXiv, 2013.
  • Silver et al. [2016] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, 1 2016.
  • Mnih et al. [2016] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in arXiv, 2016.
  • Hochreiter and Schmidhuber [1997] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • Azhikodan et al. [2019] A. R. Azhikodan, A. G. K. Bhat, and M. V. Jadhav, “Stock trading bot using deep reinforcement learning,” in Innovations in Computer Science and Engineering, H. S. Saini, R. Sayal, A. Govardhan, and R. Buyya, Eds.   Springer Singapore, 2019.
  • Ye et al. [2020] Z. Ye, W. Deng, S. Zhou, Y. Xu, and J. Guan, “Optimal trade execution based on deep deterministic policy gradient,” in Database Systems for Advanced Applications, ser. Lecture Notes in Computer Science.   Springer International Publishing, 2020, vol. 12112, pp. 638–654.
  • Watkins [1989] C. J. Watkins, “Learning from delayed rewards,” Ph.D. dissertation, King’s College, Cambridge, UK, 1989.
  • Almgren and Chriss [2001] R. Almgren and N. Chriss, “Optimal execution of portfolio transactions,” Journal of Risk, vol. 3, pp. 5–40, 2001.
  • Aboussalah and Lee [2020] A. M. Aboussalah and C. Lee, “Continuous control with stacked deep dynamic recurrent reinforcement learning for portfolio optimization,” Expert Systems with Applications, vol. 140, p. 112891, 2020.
  • Betancourt and Chen [2021] C. Betancourt and W.-H. Chen, “Deep reinforcement learning for portfolio management of markets with a dynamic number of assets,” Expert Systems with Applications, vol. 164, p. 114002, 2021.
  • Lei et al. [2020] K. Lei, B. Zhang, Y. Li, M. Yang, and Y. Shen, “Time-driven feature-aware jointly deep reinforcement learning for financial signal representation and algorithmic trading,” Expert Systems with Applications, vol. 140, p. 112872, 2020.
  • Grinold and Kahn [2019] R. Grinold and R. Kahn, Advances in Active Portfolio Management: New Developments in Quantitative Investing.   McGraw-Hill Education, 2019.
  • Kingma and Ba [2017] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in arXiv, 2017.
  • Bottou [2010] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010).   Paris, France: Springer, August 2010, pp. 177–187.
  • Williams [1992b] R. Williams, “Training recurrent networks using the extended Kalman filter,” in [Proceedings 1992] IJCNN International Joint Conference on Neural Networks, vol. 4, 1992, pp. 241–246.
  • Borrageiro et al. [2021] G. Borrageiro, N. Firoozye, and P. Barucca, “Online learning with radial basis function networks,” in arXiv, 2021.
  • Dempster et al. [1977] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society. Series B (Methodological), vol. 39, pp. 1–38, 1977.
  • Murphy [2012] K. P. Murphy, Machine learning a probabilistic perspective, ser. Adaptive Computation and Machine Learning.   MIT Press, 2012.
  • Wallace and Dowe [1999] C. S. Wallace and D. L. Dowe, “Minimum Message Length and Kolmogorov Complexity,” The Computer Journal, vol. 42, no. 4, pp. 270–283, 01 1999.
  • Yildiz et al. [2012] I. B. Yildiz, H. Jaeger, and S. J. Kiebel, “Re-visiting the echo state property,” Neural networks : the official journal of the International Neural Network Society, vol. 35, pp. 1–9, 2012.
[Uncaptioned image] Gabriel Borrageiro is currently a PhD student at University College London, Computer Science department, in the Financial Computing Group. His research interests include online learning, reinforcement learning, transfer learning, recurrent neural networks and financial time series. Gabriel obtained his executive MBA from Cass Business School, City University, London, in 2008. He also obtained a higher national diploma from Damelin College, South Africa, in 1999. Gabriel is also employed as a quantitative researcher at BlueCrest Capital.
[Uncaptioned image] Nick Firoozye is currently Honorary Reader, Computational Finance, in the Computer Science department at University College London and also part of the Financial Computing Group. He obtained his PhD and MS in mathematics at New York University and a BS in mathematics at Harvey Mudd College. He also works for Exos Bank in the systematic rates trading business.
[Uncaptioned image] Paolo Barucca is a lecturer at University College London, Computer Science department and also part of the Financial Computing Group. He is also editor in chief of the science dissemination project, La Scienza Coatta and scientific officer of the Blockchain Education Network. Paolo received his PhD in theoretical and mathematical physics from Sapienza Universita di Roma in 2015.