Generalization of partitioned Runge–Kutta methods for adjoint systems

Takeru Matsuda Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo, Japan Mathematical Informatics Collaboration Unit, RIKEN Center for Brain Science, Japan Yuto Miyatake Cybermedia Center, Osaka University, Japan

Abstract

This study computes the gradient of a function of numerical solutions of ordinary differential equations (ODEs) with respect to the initial condition. The adjoint method computes the gradient approximately by solving the corresponding adjoint system numerically. In this context, Sanz-Serna [SIAM Rev., 58 (2016), pp. 3–33] showed that when the initial value problem is solved by a Runge–Kutta (RK) method, the gradient can be exactly computed by applying an appropriate RK method to the adjoint system. Focusing on the case where the initial value problem is solved by a partitioned RK (PRK) method, this paper presents a numerical method, which can be seen as a generalization of PRK methods, for the adjoint system that gives the exact gradient.

keywords:

adjoint method , partitioned Runge–Kutta method , geometric integration

MSC:

[2010] 65K10 , 65L05 , 65L09 , 65P10

^mytitlenote^mytitlenotefootnotetext: Fully documented templates are available in the elsarticle package on CTAN.

1 Introduction

We consider a system of ordinary differential equations (ODEs) of the form

\displaystyle\frac{\mathrm{d}}{\mathrm{d}t}x(t;\theta)=f(x(t;\theta))

(1)

with the initial condition $x(0;\theta)=\theta\in\mathbb{R}^{d}$ , where $f:\mathbb{R}^{d}\to\mathbb{R}^{d}$ is assumed to be sufficiently smooth. We denote the numerical approximation at $t_{n}=nh$ by $x_{n}(\theta)\approx x(nh;\theta)$ , where $h$ is the step size. Let $C:\mathbb{R}^{d}\to\mathbb{R}$ be a differentiable function. This paper is concerned with the computation of $\nabla_{\theta}C(x_{n}(\theta))$ , i.e. the gradient of $C(x_{n}(\theta))$ with respect to the initial condition $\theta$ .

Calculating the gradient $\nabla_{\theta}C(x_{n}(\theta))$ is a fundamental task when solving optimization problems such as

\displaystyle\min_{\theta}\sum_{n=1}^{N}C(x_{n}(\theta)).

(2)

A simple method of obtaining an approximation for the gradient is to compare $C(x_{n}(\theta))$ with the numerical approximation corresponding to a perturbed initial condition. For example, $(C(x_{n}(\theta+\Delta e_{i}))-C(x_{n}(\theta)))/\Delta$ may be regarded as an approximation of the $i$ th component of the gradient, where $\Delta$ is a small scalar constant, and $e_{i}$ is the $i$ th column of the $d$ -dimensional identity matrix. However, this approach becomes computationally expensive when the dimensionality $d$ or the number of time steps $N$ is large. Further, the approximation accuracy could deteriorate significantly due to the discretization error. Alternatively, the adjoint method has been widely used in various fields such as variational data assimilation in meteorology and oceanography [4], inversion problems in seismology [5], optimal design in aerodynamics [6], and neural networks [2]. In this approach, the gradient is approximated by integrating the so-called adjoint system numerically. This approach is usually more efficient than the aforementioned approach using perturbed initial conditions; however, the approximation accuracy may still be limited when the collected discretization errors are large.

Recently, Sanz-Serna showed that the gradient $\nabla_{\theta}C(x_{n}(\theta))$ can be computed exactly when the original system is integrated by a Runge–Kutta (RK) method [14]: the intended exact gradient is obtained by applying an appropriate RK method to the adjoint system. Note that even if the numerical integration of the original system (1) is not sufficiently accurate, a sufficiently accurate approximation or the exact computation of the gradient $\nabla_{\theta}C(x_{n}(\theta))$ is often required. We provide two examples:

1.

The forward propagation of several deep neural networks is interpreted as a discretization of ODEs. For example, the Residual Network (ResNet), which is commonly used in pattern recognition tasks such as image classification, can be seen as an explicit Euler discretization [3]. Also, neural network architectures based on symplectic integrators for Hamiltonian systems have been developed recently, which avoid numerical instabilities [8]. Since the output of such neural networks is not the exact solution of ODE but its numerical approximation, their training is formulated as an optimization problem of the form (2). In other words, the backpropagation algorithm [7] is a special case of the adjoint method in this context.
2.

Let us consider solving the optimization problem of the form (2). If we apply the Newton method, we need to solve a linear system having the Hessian of the objective function with respect to the initial state $\theta$ as the coefficient matrix. When the linear system is solved by a Krylov subspace method such as the conjugate gradient (CG) method, the Hessian-vector multiplication needs to be computed. The adjoint method can be used to approximate the Hessian-vector multiplication [17, 18].¹¹1More precisely, the Hessian-vector multiplication is approximately computed by solving the so-called second-order adjoint system numerically. However, this approach usually results in applying the CG method to a non-symmetric linear system even though the exact Hessian is always symmetric, and consequently, its convergence is not always guaranteed. Symmetry can be ideally attained if the Hessian-vector multiplication is computed exactly [10]. We note that the need for computing the exact Hessian-vector multiplication also arises in the context of uncertainty quantification (see, e.g. [11, 16]).

To the best of authors’ knowledge, an algorithm that systematically computes the exact gradient has been developed only for the cases where the ODE system (1) is discretized by using RK methods; similar algorithms should be developed for other types of numerical methods. We note that while it may not be so difficult to construct an algorithm for a specific ODE solver, providing a recipe for a particular class of numerical methods is useful for practitioners. Among others, we focus on the class of partitioned RK (PRK) methods. A straightforward conjecture would be that the exact gradient is obtained by applying an appropriate PRK method to the adjoint system. Indeed, our previous report showed that this conjecture is true for the Störmer–Verlet method [13]. However, except for some special cases, this conjecture does not hold in general. In this study, we shall show that the exact gradient can be computed by a certain generalization of PRK methods.

In this paper, after reviewing the approach proposed by Sanz-Serna in Section 2, we show main results in Section 3. Several examples are provided in Section 4, and concluding remarks are given in Section 5.

2 Preliminaries

In this section, we review the adjoint method and the approach proposed by Sanz-Serna [14].

2.1 Adjoint method

Let $\overline{x}(t)$ be the solution to the system (1) for the perturbed initial condition $\overline{x}(0)=\theta+\varepsilon$ . By linearising the system (1) at $x(t)$ , we see that as $\|\varepsilon\|\to 0$ it follows that $\overline{x}(t)=x(t)+\delta(t)+\mathrm{o}(\|\varepsilon\|)$ , where $\delta(t)$ solves the variational system

\displaystyle\frac{\mathrm{d}}{\mathrm{d}t}\delta(t)=\nabla_{x}f(x(t))\delta(t).

(3)

The corresponding adjoint system, which is often introduced by using Lagrange multipliers, is given by

\displaystyle\frac{\mathrm{d}}{\mathrm{d}t}\lambda(t)=-\nabla_{x}f(x(t))^{\top}\lambda(t).

(4)

Here, $\nabla_{x}f(x(t))^{\top}$ is the transpose of the Jacobian $\nabla_{x}f(x(t))$ . For any solutions to (3) and (4), we see that $\frac{\mathrm{d}}{\mathrm{d}t}\lambda(t)^{\top}\delta(t)=0$ ; thus, it follows that $\lambda(t)^{\top}\delta(t)=\lambda(0)^{\top}\delta(0)$ for all $t>0$ , so in particular,

\displaystyle\lambda(t_{N})^{\top}\delta(t_{N})=\lambda(0)^{\top}\delta(0).

(5)

Because of the chain rule $\nabla_{\theta}C(x(t_{N};\theta))=\nabla_{\theta}x(t_{N};\theta)^{\top}\nabla_{x}C(x(t_{N};\theta))$ and the fact $\delta(t)=(\nabla_{\theta}x(t;\theta))\delta(0)$ , we have

\displaystyle\nabla_{x}C(x(t_{N};\theta))^{\top}\delta(t_{N})=\nabla_{x}C(x(t_{N};\theta))^{\top}(\nabla_{\theta}x(t_{N};\theta))\delta(0)=\nabla_{\theta}C(x(t_{N};\theta))^{\top}\delta(0).

(6)

By comparing (6) with (5), it is concluded that the solution to the adjoint system (4) with $\lambda(t_{N})=\nabla_{x}C(x(t_{N};\theta))$ satisfies $\lambda(0)=\nabla_{\theta}C(x(t_{N};\theta))$ . In other words, solving the adjoint system (4) backwardly with the final condition $\lambda(t_{N})=\nabla_{x}C(x(t_{N};\theta))$ gives $\lambda(0)=\nabla_{\theta}C(x(t_{N};\theta))$ .

Now, we compute the gradient $\nabla_{\theta}C(x_{N}(\theta))$ .²²2As a particular case of $\nabla_{\theta}C(x_{n}(\theta))$ , $n=1,\dots,N$ , here, we consider $\nabla_{\theta}C(x_{N}(\theta))$ . The above discussion indicates that approximating the solution to the adjoint system (4) with the final condition $\lambda(t_{N})=\nabla_{x}C(x_{N}(\theta))$ gives an approximation of $\nabla_{\theta}C(x_{N}(\theta))$ at $t=0$ . In general, the approximation accuracy depends on the accuracy of the numerical integrators for both the original system (1) and adjoint system (4).

2.2 Exact gradient calculation

In the following, we assume that $x_{n}(\theta)$ is the approximation obtained by applying an RK method to the original system (1). In this case, Sanz-Serna showed in [14] that the gradient $\nabla_{\theta}C(x_{N}(\theta))$ can be computed exactly (up to round-off in practical computation) by applying an appropriate RK method to the adjoint system (4). Below, we briefly review the procedure. The obvious argument $(\theta)$ will often be omitted.

Suppose that the original system (1) has been discretized by an $s$ -stage RK method characterized by $\{a_{ij}\}$ and $\{b_{i}\}$ :

$\displaystyle x_{n+1}$	$\displaystyle=x_{n}+h\sum_{i=1}^{s}b_{i}k_{n,i},$	(7)
$\displaystyle k_{n,i}$	$\displaystyle=f\left\lparen X_{n,i}\right\rparen,\quad i=1,\dots,s,$	(8)
$\displaystyle X_{n,i}$	$\displaystyle=x_{n}+h\sum_{j=1}^{s}a_{ij}k_{n,j},\quad i=1,\dots,s$	(9)

with the initial condition $x_{0}=\theta$ . The pair of $\{a_{ij}\}$ and $\{b_{i}\}$ will be simply denoted by $(a,b)$ . Correspondingly, the variational system (3) is discretized by the same RK method

$\displaystyle\delta_{n+1}$	$\displaystyle=\delta_{n}+h\sum_{i=1}^{s}b_{i}d_{n,i},$	(10)
$\displaystyle d_{n,i}$	$\displaystyle=\nabla_{x}f(X_{n,i})\Delta_{n,i},\quad i=1,\dots,s,$	(11)
$\displaystyle\Delta_{n,i}$	$\displaystyle=\delta_{n}+h\sum_{j=1}^{s}a_{ij}d_{n,j},\quad i=1,\dots,s$	(12)

so that $\delta_{n}=(\nabla_{\theta}x_{n}(\theta))\delta_{0}$ . We discretize the adjoint system (4) with an $s$ -stage RK method, which may differ from the RK method $(a,b)$ , characterized by the coefficients $(A,B)$ :

\displaystyle\begin{aligned} \lambda_{n+1}&=\lambda_{n}+h\sum_{i=1}^{s}B_{i}l_{n,i},\\ l_{n,i}&=-\nabla_{x}f(X_{n,i})^{\top}\Lambda_{n,i},\quad i=1,\dots,s,\\ \Lambda_{n,i}&=\lambda_{n}+h\sum_{j=1}^{s}A_{ij}l_{n,j},\quad i=1,\dots,s\end{aligned}

(13)

with the final condition $\lambda_{N}=\nabla_{x}C(x_{N}(\theta))$ .

Combining the chain rule $\nabla_{\theta}C(x_{N}(\theta))=\nabla_{\theta}x_{N}(\theta)^{\top}\nabla_{x}C(x_{N}(\theta))$ with $\delta_{n}=(\nabla_{\theta}x_{n}(\theta))\delta_{0}$ , we see that $\nabla_{x}C(x_{N}(\theta))^{\top}\delta_{N}=\nabla_{\theta}C(x_{N}(\theta))^{\top}\delta_{0}$ . Therefore, if the approximation of the solution to the adjoint system (4) satisfies $\lambda_{n}^{\top}\delta_{n}=\lambda_{0}^{\top}\delta_{0}$ , in particular $\lambda_{N}^{\top}\delta_{N}=\lambda_{0}^{\top}\delta_{0}$ , it is concluded that $\lambda_{0}=\nabla_{\theta}C(x_{N}(\theta))$ . The theory of geometric numerical integration [9] tells us that if an RK method is chosen for the adjoint system such that the pair of the RK methods for the original and adjoint systems satisfies the condition for a partitioned RK (PRK) method to be canonical, the property $\lambda_{N}^{\top}\delta_{N}=\lambda_{0}^{\top}\delta_{0}$ is guaranteed and the gradient $\nabla_{\theta}C(x_{N}(\theta))$ is exactly obtained as shown in Theorem 1. We note that the canonical numerical methods are well-known in the context of geometric numerical integration. For more details, we refer the reader to [9, Chapter VI] (for PRK methods, see also [1, 15]).

Theorem 1 ([14]).

Assume that the original system (1) is solved by an RK method $(a,b)$ , and the coefficients $(A,B)$ of the RK method for the adjoint system (4) satisfy

	$\displaystyle b_{i}=B_{i},\quad i=1,\dots,s,$		(14)
	$\displaystyle b_{i}A_{ij}+B_{j}a_{ji}=b_{i}B_{j},\quad i,j=1,\dots,s.$		(15)

Then, by solving the adjoint system with the final condition $\lambda_{N}(=\lambda(t_{N}))=\nabla_{x}C(x_{N}(\theta))$ , we obtain the exact gradient $\lambda_{0}=\nabla_{\theta}C(x_{N}(\theta))$ .

Remark 1.

The conditions in Theorem 1 indicate that $A_{ij}=b_{j}-(b_{j}/b_{i})a_{ji}$ , which makes sense only when every weight $b_{i}$ is nonzero. However, for some RK methods one or more of the weights $b_{i}$ vanish. In such cases, the above conditions cannot be used to find an appropriate RK method for the adjoint system. We refer the reader to Appendix in [14] for a workaround.

3 Partitioned Runge–Kutta methods

We now restrict our attention to the system of ODEs in the partitioned form

\displaystyle\frac{\mathrm{d}}{\mathrm{d}t}\begin{bmatrix}x^{[1]}\\ x^{[2]}\end{bmatrix}=\begin{bmatrix}f^{[1]}\left\lparen x^{[1]},x^{[2]}\right\rparen\\ f^{[2]}\left\lparen x^{[1]},x^{[2]}\right\rparen\end{bmatrix},\quad\begin{bmatrix}x^{[1]}(0)\\ x^{[2]}(0)\end{bmatrix}=\begin{bmatrix}\theta^{[1]}\\ \theta^{[2]}\end{bmatrix}.

(16)

Suppose that this system has been discretized by a PRK method characterized by the pairs $(a^{[1]},b^{[1]})$ and $(a^{[2]},b^{[2]})$ :

$\displaystyle\begin{bmatrix}\displaystyle x_{n+1}^{[1]}\\ x_{n+1}^{[2]}\end{bmatrix}$	$\displaystyle=\begin{bmatrix}x_{n}^{[1]}\\ x_{n}^{[2]}\end{bmatrix}+h\sum_{i=1}^{s}\begin{bmatrix}\displaystyle b_{i}^{[1]}k_{n,i}^{[1]}\\ \displaystyle b_{i}^{[2]}k_{n,i}^{[2]}\end{bmatrix},$	(17)
$\displaystyle\begin{bmatrix}k_{n,i}^{[1]}\\ k_{n,i}^{[2]}\end{bmatrix}$	$\displaystyle=\begin{bmatrix}f^{[1]}\left\lparen X_{n,i}^{[1]},X_{n,i}^{[2]}\right\rparen\\ f^{[2]}\left\lparen X_{n,i}^{[1]},X_{n,i}^{[2]}\right\rparen\end{bmatrix},\quad i=1,\dots,s,$	(18)
$\displaystyle\begin{bmatrix}X_{n,i}^{[1]}\\ X_{n,i}^{[2]}\end{bmatrix}$	$\displaystyle=\begin{bmatrix}x_{n}^{[1]}\\ x_{n}^{[2]}\end{bmatrix}+h\sum_{j=1}^{s}\begin{bmatrix}\displaystyle a_{ij}^{[1]}k_{n,j}^{[1]}\\ \displaystyle a_{ij}^{[2]}k_{n,j}^{[2]}\end{bmatrix},\quad i=1,\dots,s.$	(19)

For clarity, every component of $b^{[1]}$ and $b^{[2]}$ is assumed to be nonzero. We discretize the variational system expressed as

\displaystyle\frac{\mathrm{d}}{\mathrm{d}t}\begin{bmatrix}\delta^{[1]}\\ \delta^{[2]}\end{bmatrix}=\begin{bmatrix}\nabla_{x^{[1]}}f^{[1]}&\nabla_{x^{[2]}}f^{[1]}\\ \nabla_{x^{[1]}}f^{[2]}&\nabla_{x^{[2]}}f^{[2]}\end{bmatrix}\begin{bmatrix}\delta^{[1]}\\ \delta^{[2]}\end{bmatrix}

(20)

by the same PRK method


$\displaystyle\begin{bmatrix}\delta_{n+1}^{[1]}\\ \delta_{n+1}^{[2]}\end{bmatrix}$	$\displaystyle=\begin{bmatrix}\delta_{n}^{[1]}\\ \delta_{n}^{[2]}\end{bmatrix}+h\sum_{i=1}^{s}\begin{bmatrix}b_{i}^{[1]}m_{n,i}^{[1]}\\ b_{i}^{[2]}m_{n,i}^{[2]}\end{bmatrix},$	(21a)
$\displaystyle\begin{bmatrix}m_{n,i}^{[1]}\\ m_{n,i}^{[2]}\end{bmatrix}$	$\displaystyle=\begin{bmatrix}(\nabla_{x^{[1]}}f^{[1]}_{n,i})D_{n,i}^{[1]}+(\nabla_{x^{[2]}}f^{[1]}_{n,i})D_{n,i}^{[2]}\\ (\nabla_{x^{[1]}}f^{[2]}_{n,i})D_{n,i}^{[1]}+(\nabla_{x^{[2]}}f^{[2]}_{n,i})D_{n,i}^{[2]}\end{bmatrix},\quad i=1,\dots,s,$	(21b)
$\displaystyle\begin{bmatrix}D_{n,i}^{[1]}\\ D_{n,i}^{[2]}\end{bmatrix}$	$\displaystyle=\begin{bmatrix}\delta_{n}^{[1]}\\ \delta_{n}^{[2]}\end{bmatrix}+h\sum_{j=1}^{s}\begin{bmatrix}a_{ij}^{[1]}m_{n,j}^{[1]}\\ a_{ij}^{[2]}m_{n,j}^{[2]}\end{bmatrix},\quad i=1,\dots,s.$	(21c)

The corresponding adjoint system of (16) is written as

\displaystyle\frac{\mathrm{d}}{\mathrm{d}t}\begin{bmatrix}\lambda^{[1]}\\ \lambda^{[2]}\end{bmatrix}=-\begin{bmatrix}\nabla_{x^{[1]}}f^{[1]}(x^{[1]},x^{[2]})&\nabla_{x^{[2]}}f^{[1]}(x^{[1]},x^{[2]})\\ \nabla_{x^{[1]}}f^{[2]}(x^{[1]},x^{[2]})&\nabla_{x^{[2]}}f^{[2]}(x^{[1]},x^{[2]})\end{bmatrix}^{\top}\begin{bmatrix}\lambda^{[1]}\\ \lambda^{[2]}\end{bmatrix},

(22)

for which the following final condition

\displaystyle\begin{bmatrix}\lambda^{[1]}(t_{N})\\ \lambda^{[2]}(t_{N})\end{bmatrix}=\begin{bmatrix}\nabla_{x^{[1]}}C(x_{N}^{[1]},x_{N}^{[2]})\\ \nabla_{x^{[2]}}C(x_{N}^{[1]},x_{N}^{[2]})\end{bmatrix}

(23)

is imposed. We consider discretizing the adjoint system by the following formula


$\displaystyle\begin{bmatrix}\lambda^{[1]}_{n+1}\\ \lambda^{[2]}_{n+1}\end{bmatrix}$	$\displaystyle=\begin{bmatrix}\lambda^{[1]}_{n}\\ \lambda^{[2]}_{n}\end{bmatrix}-h\sum_{i=1}^{s}\begin{bmatrix}B_{i}^{[11]}l_{n,i}^{[11]}+B_{i}^{[12]}l_{n,i}^{[12]}\\ B_{i}^{[21]}l_{n,i}^{[21]}+B_{i}^{[22]}l_{n,i}^{[22]}\end{bmatrix},$	(24a)
$\displaystyle\begin{bmatrix}l_{n,i}^{[11]}&l_{n,i}^{[12]}\\ l_{n,i}^{[21]}&l_{n,i}^{[22]}\end{bmatrix}$	$\displaystyle=\begin{bmatrix}(\nabla_{x^{[1]}}f_{n,i}^{[1]})^{\top}\Lambda_{n,i}^{[1]}&(\nabla_{x^{[1]}}f_{n,i}^{[2]})^{\top}\Lambda_{n,i}^{[2]}\\ (\nabla_{x^{[2]}}f_{n,i}^{[1]})^{\top}\Lambda_{n,i}^{[1]}&(\nabla_{x^{[2]}}f_{n,i}^{[2]})^{\top}\Lambda_{n,i}^{[2]}\end{bmatrix},\quad i=1,\dots,s,$	(24b)
$\displaystyle\begin{bmatrix}\Lambda^{[1]}_{n,i}\\ \Lambda^{[2]}_{n,i}\end{bmatrix}$	$\displaystyle=\begin{bmatrix}\lambda^{[1]}_{n}\\ \lambda^{[2]}_{n}\end{bmatrix}-h\sum_{j=1}^{s}\begin{bmatrix}A_{ij}^{[11]}l_{n,j}^{[11]}+A_{ij}^{[12]}l_{n,j}^{[12]}\\ A_{ij}^{[21]}l_{n,j}^{[21]}+A_{ij}^{[22]}l_{n,j}^{[22]}\end{bmatrix},\quad i=1,\dots,s.$	(24c)

We note that the above formula ((24a), (24b), (24c)) does not necessarily belong to the class of PRK methods, and thus refer to the formula as a generalized PRK (GPRK) method.³³3This kind of generalization is only applicable for partitioned systems in which the vector field is decomposed into two parts. The formula falls into a PRK method only when the coefficients satisfy

\displaystyle B^{[11]}=B^{[12]},\quad B^{[21]}=B^{[22]},\quad A^{[11]}=A^{[12]},\quad A^{[21]}=A^{[22]}.

(25)

The following theorem shows that the gradient can be computed exactly by an appropriate GPRK method.

Theorem 2.

Assume that the original system (16) is solved by a PRK method characterized by $(a^{[1]},b^{[1]})$ and $(a^{[2]},b^{[2]})$ , and the coefficients of the GPRK method (24a)-(24c) for the adjoint system (22) satisfy

	$\displaystyle b_{i}^{[1]}=B_{i}^{[11]}=B_{i}^{[21]},\quad i=1,\dots,s,$		(26)
	$\displaystyle b_{i}^{[2]}=B_{i}^{[12]}=B_{i}^{[22]},\quad i=1,\dots,s,$		(27)
	$\displaystyle b_{i}^{[1]}A_{ij}^{[11]}+B_{j}^{[11]}a_{ji}^{[1]}=b_{i}^{[1]}B_{j}^{[11]},\quad i,j=1,\dots,s,$		(28)
	$\displaystyle b_{i}^{[1]}A_{ij}^{[12]}+B_{j}^{[12]}a_{ji}^{[1]}=b_{i}^{[1]}B_{j}^{[12]},\quad i,j=1,\dots,s,$		(29)
	$\displaystyle b_{i}^{[2]}A_{ij}^{[21]}+B_{j}^{[21]}a_{ji}^{[2]}=b_{i}^{[2]}B_{j}^{[21]},\quad i,j=1,\dots,s,$		(30)
	$\displaystyle b_{i}^{[2]}A_{ij}^{[22]}+B_{j}^{[22]}a_{ji}^{[2]}=b_{i}^{[2]}B_{j}^{[22]},\quad i,j=1,\dots,s.$		(31)

Then, by solving the adjoint system with the final condition (23), we obtain the exact gradient $\lambda_{0}^{[1]}=\nabla_{\theta^{[1]}}C(x_{N}^{[1]}(\theta),x_{N}^{[2]}(\theta))$ and $\lambda_{0}^{[2]}=\nabla_{\theta^{[2]}}C(x_{N}^{[1]}(\theta),x_{N}^{[2]}(\theta))$ .

Proof.

It suffices to show that $\lambda_{n+1}^{\top}\delta_{n+1}=\lambda_{n}^{\top}\delta_{n}$ . We see from (21a) and (24a) that

	$\displaystyle(\lambda_{n+1}^{[1]})^{\top}\delta_{n+1}^{[1]}+(\lambda_{n+1}^{[2]})^{\top}\delta_{n+1}^{[2]}$		(32)
	$\displaystyle\quad=\left\lparen\lambda_{n}^{[1]}-h\sum_{j=1}^{s}(B_{j}^{[11]}l_{n,j}^{[11]}+B_{j}^{[12]}l_{n,j}^{[12]})\right\rparen^{\top}\left\lparen\delta_{n}^{[1]}+h\sum_{i=1}^{s}b_{i}^{[1]}m_{n,i}^{[1]}\right\rparen$		(33)
	$\displaystyle\quad\phantom{=}+\left\lparen\lambda_{n}^{[2]}-h\sum_{j=1}^{s}(B_{j}^{[21]}l_{n,j}^{[21]}+B_{j}^{[22]}l_{n,j}^{[22]})\right\rparen^{\top}\left\lparen\delta_{n}^{[2]}+h\sum_{i=1}^{s}b_{i}^{[2]}m_{n,i}^{[2]}\right\rparen$		(34)
	$\displaystyle\quad=(\lambda_{n}^{[1]})^{\top}\delta_{n}^{[1]}+(\lambda_{n}^{[2]})^{\top}\delta_{n}^{[2]}$		(35)
	$\displaystyle\quad\phantom{=}+h\sum_{i=1}^{s}b_{i}^{[1]}(\lambda_{n}^{[1]})^{\top}m_{n,i}^{[1]}-h\sum_{j=1}^{s}(B_{j}^{[11]}l_{n,j}^{[11]}+B_{j}^{[12]}l_{n,j}^{[12]})^{\top}\delta_{n}^{[1]}-h^{2}\sum_{i,j=1}^{s}b_{i}^{[1]}(B_{j}^{[11]}l_{n,j}^{[11]}+B_{j}^{[12]}l_{n,j}^{[12]})^{\top}m_{n,i}^{[1]}$		(36)
	$\displaystyle\quad\phantom{=}+h\sum_{i=1}^{s}b_{i}^{[2]}(\lambda_{n}^{[2]})^{\top}m_{n,i}^{[2]}-h\sum_{j=1}^{s}(B_{j}^{[21]}l_{n,j}^{[21]}+B_{j}^{[22]}l_{n,j}^{[22]})^{\top}\delta_{n}^{[2]}-h^{2}\sum_{i,j=1}^{s}b_{i}^{[2]}(B_{j}^{[21]}l_{n,j}^{[21]}+B_{j}^{[22]}l_{n,j}^{[22]})^{\top}m_{n,i}^{[2]}.$		(37)

By substituting (21b), (21c), (24b) and (24c) to (37), we have

	$\displaystyle(\lambda_{n+1}^{[1]})^{\top}\delta_{n+1}^{[1]}+(\lambda_{n+1}^{[2]})^{\top}\delta_{n+1}^{[2]}$		(38)
	$\displaystyle\quad=(\lambda_{n}^{[1]})^{\top}\delta_{n}^{[1]}+(\lambda_{n}^{[2]})^{\top}\delta_{n}^{[2]}$		(39)
	$\displaystyle\quad\phantom{=}+h\sum_{i=1}^{s}(b_{i}^{[1]}-B_{i}^{[11]})(\lambda_{n,i}^{[1]})^{\top}(\nabla_{x^{[1]}}f_{n,i}^{[1]})D_{n,i}^{[1]}+h\sum_{i=1}^{s}(b_{i}^{[1]}-B_{i}^{[21]})(\lambda_{n,i}^{[1]})^{\top}(\nabla_{x^{[2]}}f_{n,i}^{[1]})D_{n,i}^{[2]}$		(40)
	$\displaystyle\quad\phantom{=}+h\sum_{i=1}^{s}(b_{i}^{[2]}-B_{i}^{[12]})(\lambda_{n,i}^{[2]})^{\top}(\nabla_{x^{[1]}}f_{n,i}^{[2]})D_{n,i}^{[1]}+h\sum_{i=1}^{s}(b_{i}^{[2]}-B_{i}^{[22]})(\lambda_{n,i}^{[2]})^{\top}(\nabla_{x^{[2]}}f_{n,i}^{[2]})D_{n,i}^{[2]}$		(41)
	$\displaystyle\quad\phantom{=}+h^{2}\sum_{i,j=1}^{s}(b_{i}^{[1]}A_{ij}^{[11]}+B_{j}^{[11]}a_{ji}^{[1]}-b_{i}^{[1]}B_{j}^{[11]})(l_{n,j}^{[11]})^{\top}m_{n,i}^{[1]}+h^{2}\sum_{i,j=1}^{s}(b_{i}^{[1]}A_{ij}^{[12]}+B_{j}^{[12]}a_{ji}^{[1]}-b_{i}^{[1]}B_{j}^{[12]})(l_{n,j}^{[12]})^{\top}m_{n,i}^{[1]}$		(42)
	$\displaystyle\quad\phantom{=}+h^{2}\sum_{i,j=1}^{s}(b_{i}^{[2]}A_{ij}^{[21]}+B_{j}^{[21]}a_{ji}^{[2]}-b_{i}^{[2]}B_{j}^{[21]})(l_{n,j}^{[21]})^{\top}m_{n,i}^{[2]}+h^{2}\sum_{i,j=1}^{s}(b_{i}^{[2]}A_{ij}^{[22]}+B_{j}^{[22]}a_{ji}^{[2]}-b_{i}^{[2]}B_{j}^{[22]})(l_{n,j}^{[22]})^{\top}m_{n,i}^{[2]}.$		(43)

Hence, if the conditions in Theorem 2 are satisfied, it follows that $\lambda_{n+1}^{\top}\delta_{n+1}=\lambda_{n}^{\top}\delta_{n}$ . ∎

Remark 2.

We note that if the coefficients of the PRK method applied to the system (16) satisfy $b^{[1]}=b^{[2]}$ , the GPRK method, which is uniquely determined from the conditions in Theorem 2, satisfy (25). Therefore, in this case, the GPRK method falls into the category of PRK methods.

4 Examples

We consider several classes of PRK methods and show corresponding GPRK methods that are uniquely determined from the conditions in Theorem 2.

Example 1 (Symplectic PRK methods for Hamiltonian systems).

For Hamiltonian systems, the coefficients of most symplectic PRK methods, such as the Störmer–Verlet method and the 3-stage Lobatto IIIA–IIIB pair, satisfy

	$\displaystyle b_{i}^{[1]}=b_{i}^{[2]},\quad i=1,\dots,s,$		(44)
	$\displaystyle b_{i}^{[1]}a_{ij}^{[2]}+b_{j}^{[2]}a_{ji}^{[1]}=b_{i}^{[1]}b_{j}^{[2]},\quad,i,j=1,\dots,s.$		(45)

In this case, the coefficients of the GPRK method satisfying the conditions in Theorem 2 are explicitly given by

\displaystyle B^{[11]}=B^{[21]}=B^{[12]}=B^{[22]}=b,\quad A^{[11]}=A^{[12]}=a^{[2]},\quad A^{[21]}=A^{[22]}=a^{[1]},

(46)

where $b:=b^{[1]}=b^{[2]}$ . The GPRK method reduces to a PRK method. We note that the GPRK method for the Störmer–Verlet method is consistent with the one which was presented in the context of inverse problem for ODEs [13], and essentially the same algorithm for a certain vector field $f$ can also be found in the context of deep neural networks [8].

Example 2 (Symplectic PRK methods for separable Hamiltonian systems).

For separable Hamiltonian systems, a PRK method is symplectic even if the first condition (44) is violated. We consider the case where $b^{[1]}\neq b^{[2]}$ . In this case, the GPRK method satisfying the conditions in Theorem 2 does not reduces to a PRK method. However, when the system (20) is separable, i.e. $f^{[1]}=f^{[1]}(x^{[2]})$ and $f^{[2]}=f^{[2]}(x^{[1]})$ , we see from (24b) that $l_{n,i}^{[11]}=l_{n,i}^{[22]}=0$ . Thus the coefficients $B^{[11]}$ , $B^{[22]}$ , $A^{[11]}$ and $A^{[22]}$ are no longer needed, and the GPRK method reduces to a PRK method. The remaining coefficients are explicitly given by

\displaystyle B^{[21]}=b^{[1]},\quad B^{[12]}=b^{[2]},\quad A^{[12]}=a^{[2]},\quad A^{[21]}=a^{[1]}.

(47)

Example 3 (Spatially partitioned embedded RK methods).

Spatially partitioned embedded RK methods are sometimes applied to a system of ODEs arising from discretizing partial differential equations [12]. The methods belong to a class of PRK methods with $a^{[1]}=a^{[2]}(=a)$ and $b^{[1]}\neq b^{[2]}$ . In this case, the GPRK method satisfying the conditions in Theorem 2 is no longer a PRK method. The coefficients are given by

	$\displaystyle B^{[11]}=B^{[21]}=b^{[1]},\quad B^{[12]}=B^{[22]}=b^{[2]},$		(48)
	$\displaystyle A_{ij}^{[11]}=b_{j}^{[1]}-\frac{b_{j}^{[1]}}{b_{i}^{[1]}}a_{ji},\quad A_{ij}^{[12]}=b_{j}^{[2]}-\frac{b_{j}^{[2]}}{b_{i}^{[1]}}a_{ji},\quad A_{ij}^{[21]}=b_{j}^{[1]}-\frac{b_{j}^{[1]}}{b_{i}^{[2]}}a_{ji},\quad A_{ij}^{[22]}=b_{j}^{[2]}-\frac{b_{j}^{[2]}}{b_{i}^{[2]}}a_{ji},\quad i,j=1,\dots,s.$		(49)

5 Concluding remarks

In this paper, we have shown that the gradient of a function of numerical solutions to ODEs with respect to the initial condition can be systematically and efficiently computed when the ODE is solved by a partitioned Runge–Kutta method. The key idea is to apply a certain generalization of partitioned Runge–Kutta methods to the corresponding adjoint system. The proposed method is applicable to, for example, estimation of initial condition or underlying system parameters of ODE models and training of deep neural networks.

It is an interesting problem to construct a similar recipe for other types of ODE solvers such as linear multistep methods and exponential integrators. We are now working on this issue.

Acknowledgement

This work was supported in part by JST ACT-I Grant Number JPMJPR18US, and JSPS Grants-in-Aid for Early-Career Scientists Grant Numbers 16K17550 and 19K20220.

References

Abia and Sanz-Serna [1993] L. Abia, J.M. Sanz-Serna, Partitioned Runge–Kutta methods for separable Hamiltonian problems, Math. Comp. 60 (1993) 617–634.
Chen et al. [2018a] R.T. Chen, Y. Rubanova, J. Bettencourt, D. Duvenaud, Neural ordinary differential equations, arXiv:1806.07366 (2018a).
Chen et al. [2018b] R.T.Q. Chen, Y. Rubanova, J. Bettencourt, D.K. Duvenaud, Neural ordinary differential equations, in: Advances in Neural Information Processing Systems 31, 2018b, pp. 6571–6583.
Dimet and Talagrand [1986] F.X.L. Dimet, O. Talagrand, Variational algorithms for analysis and assimilation of meteorological observations: theoretical aspects, Tellus A 38A (1986) 97–110.
Fichtner et al. [2006] A. Fichtner, H.P. Bunge, H. Igel, The adjoint method in seismology: I. theory, Phys. Earth Planet. In. 157 (2006) 86–104.
Giles [2000] N.A. Giles, Michael B.and Pierce, An introduction to the adjoint approach to design, Flow, Turbul. Combust. 65 (2000) 393–415.
Goodfellow et al. [2016] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016. http://www.deeplearningbook.org.
Haber and Ruthotto [2018] E. Haber, L. Ruthotto, Stable architectures for deep neural networks, Inverse Problems 34 (2018) 014004.
Hairer et al. [2006] E. Hairer, C. Lubich, G. Wanner, Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential quations, Springer-Verlag, Berlin, second edition, 2006.
Ito et al. [2019] S. Ito, T. Matsuda, Y. Miyatake, Adjoint-based exact hessian-vector multiplication using symplectic runge–kutta methods, arXiv:1910.06524 (2019).
Ito et al. [2016] S. Ito, H. Nagao, A. Yamanaka, Y. Tsukada, T. Koyama, M. Kano, J. Inoue, Data assimilation for massive autonomous systems based on a second-order adjoint method, Phys. Rev. E 94 (2016) 043307.
Ketcheson et al. [2013] D.I. Ketcheson, C.B. MacDonald, S.J. Ruuth, Spatially partitioned embedded Runge-Kutta methods, SIAM J. Numer. Anal. 51 (2013) 2887–2910.
Matsuda and Miyatake [2019] T. Matsuda, Y. Miyatake, Estimation of ordinary differential equation models with discretization error quantification, arXiv:1907.10565 (2019).
Sanz-Serna [2016] J.M. Sanz-Serna, Symplectic Runge–Kutta schemes for adjoint equations, automatic differentiation, optimal control, and more, SIAM Rev. 58 (2016) 3–33.
Sun [1993] G. Sun, Symplectic partitioned Runge–Kutta methods, J. Comput. Math. 11 (1993) 365–372.
Thacker [1989] W.C. Thacker, The role of the hessian matrix in fitting models to measurements, J. Geophys. Res. Oceans 94 (1989) 6177–6196.
Wang et al. [1998] Z. Wang, K. Droegemeier, L. White, The adjoint Newton algorithm for large-scale unconstrained optimization in meteorology applications, Comput. Optim. and Appl. 10 (1998) 283–320.
Wang et al. [1992] Z. Wang, I.M. Navon, F.X.L. Dimet, X. Zou, The second order adjoint analysis: Theory and applications, Meteor. Atmos. Phys. 50 (1992) 3–20.