Symbolic-Numeric Integration of Univariate Expressions based on Sparse Regression

Shahriar Iravanian
Emory University &Carl Julius Martensen
Otto-von-Guericke University
&Alessandro Cheli
University of Pisa
&Shashi Gowda
Massachusetts Institute of Technology
&Anand Jain
Julia Computing
&Yingbo Ma
Julia Computing
&Chris Rackauckas
Massachusetts Institute of Technology [email protected]

Abstract

Most computer algebra systems (CAS) support symbolic integration as core functionality. The majority of the integration packages use a combination of heuristic algebraic and rule-based (integration table) methods. In this paper, we present a hybrid (symbolic-numeric) methodology to calculate the indefinite integrals of univariate expressions. The primary motivation for this work is to add symbolic integration functionality to a modern CAS (the symbolic manipulation packages of SciML, the Scientific Machine Learning ecosystem of the Julia programming language), which is mainly designed toward numerical and machine learning applications and has a different set of features than traditional CAS. The symbolic part of our method is based on the combination of candidate terms generation (borrowed from the Homotopy operators theory) with rule-based expression transformations provided by the underlying CAS. The numeric part is based on sparse-regression, a component of Sparse Identification of Nonlinear Dynamics (SINDy) technique. We show that this system can solve a large variety of common integration problems using only a few dozen basic integration rules.

1 Introduction

Symbolic integration is a core competency of most Computer Algebra Systems (CAS) and has numerous applications [1, 2, 3]. Since the advent of symbolic integration software in the 1960s [4, 5], the majority of the integration packages fall on a spectrum between algebraic methods - based on a combination of heuristics such as derivative-divide and integration-by-part methods to the Risch algorithm - and rule-based methods, incorporating thousands of specific integration rules (e.g., RUBI)[6, 7, 8]. In this paper, we develop a symbolic-numeric (hybrid) integration method, which is closer in spirit to the heuristic methods but takes a detour in numerical computation to simplify the intermediate steps.

JuliaSymbolics is a subset organization of SciML (an open source organization maintaining a collection of hundreds of scientific computing packages for the Julia programming language). Historically, SciML began as a group of ordinary differential equation (ODE) solvers [9]. Symbolic computation was added as an Embedded Domain Specific Language (eDSL) to ease the definition of ODE systems and to aid with automatic calculation of Jacobian and Hessian of such systems. With the expansion of the SciML ecosystem, the purely symbolic routines were decoupled from the ODE solvers and were collected under JuliaSymbolics [10]. Considering its history and origin, JuliaSymbolics is geared toward symbolic differentiation and numerical integration but lacks direct symbolic integration capabilities. Furthermore, SciML has grown to cover recent advances in scientific machine learning. As part of our method, we utilize one of the these new packages (DataDrivenDiffEq.jl) that implements data-driven differential equation structural estimation and identification [11].

SymbolicNumericIntegration.jl is an attempt to build a symbolic integration package by combining the symbolic manipulation and automatic differentiation capabilities of JuliaSymbolics with numerical routines such as sparse regression provided by DataDrivenDiffEq.jl.

2 The Overview of the Method

This section informally introduces the symbolic-numeric integration algorithm by a simple example. We want to find $\int f(x)\,dx$ , where $f:\mathbb{C}\rightarrow\mathbb{C}=x\sin x$ . We can integrate $f$ using the method of indeterminate coefficients. The main idea is to write the solution, i.e., $S=\int x\sin x\,dx$ , as a sum of multiple possible terms with unknown coefficients,

S=\sum_{i}q_{i}\theta_{i}(x)\,,

(1)

where $q_{i}\in\mathbb{C}$ , for $i=1\cdots|T|$ , are the unknown coefficients and $T=\{\theta_{i}(x)\}$ , where $\theta_{i}:\mathbb{C}\rightarrow\mathbb{C}$ , is the set of reasonable candidate terms (ansatz). For our example, a reasonable set of terms is $T=\{x,\sin x,\cos x,x\sin x,x\cos x\}$ . Of course, we need a better method to find $T$ than saying it should be a reasonable set! We will discuss this problem is details later (Section 3.2), but for now assume that an oracle provides $T$ . We have

S=q_{1}x+q_{2}\sin x+q_{3}\cos x+q_{4}x\sin x+q_{5}x\cos x\,.

(2)

Differentiating with respect to $x$ ,

S^{\prime}=q_{1}+(q_{4}-q_{3})\sin x+(q_{2}+q_{5})\cos x-q_{5}x\sin x+q_{4}x\cos x\,.

(3)

By definition, $S=\int f\,dx$ ; therefore, $S^{\prime}=f=x\sin x$ . We obtain the following linear system,

\begin{array}[]{ll}q_{1}=0\\ q_{4}-q_{3}=0\\ q_{2}+q_{5}=0\\ -q_{5}=1\\ q_{4}=0\end{array}

(4)

The solution to the linear the system is $q_{5}=-1$ , $q_{2}=1$ , and $q_{1}=q_{3}=q_{4}=0$ . Therefore,

S=\int x\sin x\,dx=\cos x-x\cos x\,.

(5)

As it should.

Note that the preceding calculations were essentially all symbolic. However, numerical computation becomes necessary to avoid relying on JuliaSymbolics to convert expressions into unique canonical forms. Identities like $\sin^{2}x+\cos^{2}x=1$ may be correctly applied in this case, but in general, according to Richardson’s theorem, the problem of finding canonical forms of transcendental expressions is undecided [12]. Another reason for using numerical computation is that the list of candidates may not be (and usually is not) linearly independent. Finding a linearly-independent subset of a set of expressions is facilitated using numerical methods (Section 4.3).

3 Symbolic Computation

3.1 The Main Integration Algorithm

The main symbolic-numeric integration algorithm is summarized in Algorithm 1. Let $x$ be the independent variable. For this paper, we assume that the input function to be integrated, $f:\mathbb{C}\rightarrow\mathbb{C}$ , is a univariate expression of $x$ . Additionally, we assume that $f$ is well defined in a closed subset of the complex plane with only isolated poles.

The candidates generation algorithm (Section 3.2) produces a list of generator expressions, $G_{0},G_{1},G_{2},\cdots$ (by applying Equations 12, 13 and 15), where $G_{i}=\sum_{k}a_{k}\theta_{k}(x)$ , $a_{k}\in\mathbb{C}$ , $\theta_{k}:\mathbb{C}\rightarrow\mathbb{C}$ , and $\theta_{k}$ does not have a constant leading coefficient. Each $G_{i}$ is converted to a set of candidates, $T_{i}=\{\theta_{k}\}$ . According to Eq. 15, $T_{i}\subset T_{i+1}$ . We can ignore the constant leading coefficients because the final coefficients are calculated by the numerical part of the algorithm. Therefore, we can express $\theta_{k}$ s as equivalence classes of expressions where $f(x)\sim g(x)$ iff $f/g$ is a non-zero constant. Using this convention, we may write $(\sin x)^{\prime}=\cos x$ and $(\cos x)^{\prime}=\sin x$ .

Algorithm 1 Symbolic-Numeric Integration

0: A univariate expression

f(x)

with constant coefficients in

\mathbb{C}

L

, a user-defined parameter that determines the number of generators to try

0: A symbolic expression

y

, the antiderivative of

f

, such that

y^{\prime}=f

1: for

k=0\dots L-1

G_{k}\leftarrow

the

k

th candidate generator {Eqs. 12, 13, and 15}

T_{k}\leftarrow

the candidate terms of

G_{k}

y\leftarrow

the anti-derivative of

f

using

T_{k}

{Algorithm 2}

5: if

y^{\prime}

is equal to

f

then

6: return

y

7: end if

8: end for

9: return no answer

For each $G_{i}$ and the corresponding $T_{i}$ , Algorithm 1 calls Algorithm 2 to find the anti-derivative of the input. If there is an acceptable solution, it returns; otherwise, Algorithm 1 fetches the next generator, $G_{i+1}$ , and tests the resulting candidate set, $T_{i+1}$ . If no solution is found after trying $G_{L-1}$ , it returns with a failure message. $L$ is a parameter specified by the user.

Algorithm 2 The Numerical part of Symbolic-Numeric Integration

0: A univariate expression

f(x)

with constant coefficients in

\mathbb{C}

0: A set of candidate terms

T=\{\theta_{1},\theta_{2},\cdots,\theta_{n}\}

0: A symbolic expression

y

, the antiderivative of

f

, such that

y^{\prime}=f

1: for

i\leftarrow 1\dots n

x_{i}\leftarrow

a random complex numbers inside the open disk

\mathbb{D}_{d}

3: move

x_{i}

toward the poles of

f

\mathbf{b}_{i}\leftarrow f(x_{i})

{

\mathbf{b}

is a vector of

n

elements}

5: for

j\leftarrow 1\dots n

\mathbf{A}_{ij}\leftarrow\theta^{\prime}_{j}(x_{i})

{

\mathbf{A}

is an

n

-by-

n

matrix and

\theta^{\prime}_{j}

is the derivative of

\theta_{j}

with respect to

x

}

7: end for

8: end for

9: Find

R\subset\{1,2,\dots,n\}

, the set of linearly-dependent rows of

\mathbf{A}

10: for

i\in R

11: remove the

i

th row and column from

\mathbf{A}

12: remove the

i

th row from

\mathbf{b}

13: end for

14: solve

\min_{q}\lVert\mathbf{A}\mathbf{q}-\mathbf{b}\rVert_{2}^{2}+\lambda\lVert\mathbf{q}\rVert_{2}

for

\mathbf{q}

(sparse regression)

15:

y\leftarrow\sum_{j}q_{j}\theta_{j}

{

\mathbf{A}

uses

\theta^{\prime}

, but

y

is based on

\theta

}

16: return y

Algorithm 2 generates $n$ test points $x_{i}$ in complex plane, where $n$ is the number of the candidates, to create an $n$ -by- $n$ matrix $\mathbf{A}$ and an $n$ -element vector $\mathbf{b}$ filled, respectively, with the values of the derivatives of the candidate terms and the input function at the test points.

As discussed above, a potential complication at this stage is that the rows of $\mathbf{A}$ may be linearly-dependent. We remedy this problem by finding the linearly-dependent rows and removing them from $\mathbf{A}$ and $\mathbf{b}$ (Section 4.3).

Using full-rank $\mathbf{A}$ and $\mathbf{b}$ , we find $\mathbf{q}$ such that $\mathbf{A}\mathbf{q}=\mathbf{b}$ . As is discussed in Section 4.4, we cast the problem as an optimization task and use sparse regression to calculate $\mathbf{q}$ . Finally, we put everything together and generate $y=\sum_{j}q_{j}\theta_{j}$ , the candidate anti-derivative of the input.

3.2 Candidates Generation

Candidate generation is the core of the Symbolic-Numeric integration algorithm. First, we describe an algorithm for a subset of expressions amenable to simple treatment and then expand to the general algorithm.

One key observation is that the form of the anti-derivative of some functions is similar to their derivative forms. For example, $\int\cos x\,dx=\sin x=-(\cos x)^{\prime}$ . The underlying reason is that these functions can be defined in term of the polynomials of the exponential function and its inverse, i.e., $f\in\mathbb{C}[e^{x},e^{-x}]$ , $(e^{x})^{\prime}=e^{x}$ , and $(e^{-x})^{\prime}=-e^{-x}$ ; therefore $\mathbb{C}[e^{x},e^{-x}]$ is closed under differentiation and integration. The functions with this property include the non-negative integer powers of $\sin x$ , $\cos x$ , $\sinh x$ , $\cosh x$ , and $e^{x}$ . Let’s define

f(x)=(e^{x})^{p_{1}}(\sin x)^{p_{2}}(\cos x)^{p_{3}}(\sinh x)^{p_{4}}(\cosh x)^{p_{5}}\,,

(6)

where $p_{k}\in\mathbb{Z}^{+}$ . For these class of functions, we can obtain the candidate terms with repetitive differentiation. For example, let $f(x)=e^{x}\sin x$ . We start with $T=\emptyset$ ( $T$ is the set of candidates), differentiate $f$ , and add the resulting terms to $T$ (ignoring the constant coefficients):

f^{\prime}=e^{x}\cos x+e^{x}\sin x\Rightarrow T=\{e^{x}\cos x,e^{x}\sin x\}\,.

(7)

Next, we need to differentiate each term in $T$ that was not previously processed. The first term, $e^{x}\cos x$ , gives back $f$ and the second term is equal to $f$ . Therefore, $T=\{e^{x}\cos x,e^{x}\sin x\}$ is the final answer, consistent with $\int e^{x}\sin x\,dx=(e^{x}\sin x-e^{x}\cos x)/2$ . The main idea is repetitive differentiation of the terms until we run out of new terms to process. Essentially, we are generating a closure of the input expression based on differentiation.

Every $f\in\mathbb{C}[e^{x},e^{-x}]$ has a finite number of candidate terms; therefore, the preceding algorithm always terminates. We prove this for functions defined according to Eq. 6. After differentiation, each resulting term is also in the form of Eq. 6 and $p_{1}$ remains the same in all of them. In addition, $p_{2}+p_{3}$ remain the same (one power of $\sin x$ converts to $\cos x$ or vice versa). Similarly, $p_{4}+p_{5}$ is a constant. Therefore, the sum of the powers, $p=\sum_{k}p_{k}$ , is a constant. Since for each $p$ there are only a finite number of possible expressions, we run out of new terms in a finite number of steps and the algorithm terminates.

In general, most integrands are not in $\mathbb{C}[e^{x},e^{-x}]$ and candidates cannot all be found by just repetitive differentiation; however, repetitive differentiation is still the backbone of the general algorithm. The essence of candidate generation is integration by parts,

\int uv^{\prime}\,dx=uv-\int u^{\prime}v\,dx\,

(8)

followed by repetitive differentiation. This process or something equivalent resides at the core of most symbolic integration packages. However, when applying this algorithm to complex expressions, multiple problems arise. At each step, there can be more than one possible split of the integrand into $u$ and $v^{\prime}$ . It is unknown a priori which split will result in the correct solution. Therefore, we need to explore all possibilities. Doing this naively can generate an exponential number of candidates. In fact, there is no guarantee that the process even terminates. Realizing that many intermediate expressions are shared among different paths, we can manage the number of candidates by keeping track of the intermediate results, similar to how memoization reduces the complexity of recursive algorithms. A top-down recursive algorithm with memoization usually has an equivalent bottom-up dynamic programming dual. This is the algorithmic basis of the approach discussed below.

A more systematic way to manage the intermediate expressions generated by repetitive integration by parts is provided by the continuous homotopy operators, which automate integration by parts and are useful in integrating exact expressions composed of multiple functions of independent variables and finding conservation laws in the system of differential equations [13, 14, 15]. Here, our goal is to borrow some of the machinery of the homotopy operators methodology to enhance the generation of candidate expressions. We continue with a brief and incomplete discussion of the homotopy operators technique.

Let $\mathbf{u}=(u^{1},u^{2},\cdots,u^{N})$ be a vector of $N$ dependent variables of $x$ . Let $f(\mathbf{u})$ be a function of $\mathbf{u}$ and its partial derivatives with respect to $x$ . We define $u_{0x}^{i}=u^{i}$ , $u_{x}^{i}=\partial{u^{i}}/\partial{x}$ and $u^{i}_{kx}=\partial^{k}{u^{i}}/\partial{x}^{k}$ . The total derivative operator is defined as

\mathcal{D}_{x}f=\frac{\partial{f}}{\partial{x}}+\sum_{j=1}^{N}\sum_{k=0}^{M^{j}}u^{j}_{(k+1)x}\frac{\partial{f}}{\partial{u^{j}_{kx}}}\,

(9)

where $M^{j}$ is the maximum order of the partial derivatives of $u^{j}$ present in $f$ . The key to the homotopy operators technique is a method to invert $\mathcal{D}_{x}f$ , namely $\int f\,dx=\mathcal{D}_{x}^{-1}f=\mathcal{H}_{\mathbf{u}}f$ , where

\mathcal{H}_{\mathbf{u}}f=\int_{\lambda_{0}}^{1}\left(\sum_{j=1}^{N}\mathcal{I}_{u^{j}}f\right)[\lambda\mathbf{u}]\frac{d\lambda}{\lambda}

(10)

is the one-dimensional homotopy operator, $\lambda$ is a dummy integration variable, and the notation $X[\lambda\mathbf{u}]$ means replacing $u^{i}$ with $\lambda u^{i}$ , $u_{x}^{i}$ with $\lambda u_{x}^{i}$ , and the same for other variables and partial derivatives in $X$ . Moreover, $\mathcal{I}_{u^{j}}$ is defined as

\mathcal{I}_{u^{j}}f=\sum_{k=1}^{M^{j}}\left(\sum_{i=0}^{k-1}u^{j}_{ix}(-\mathcal{D}_{x})^{k-(i+1)}\right)\frac{\partial{f}}{\partial{u^{j}_{kx}}}\,.

(11)

For the proof of these equations, refer to [13, 15]. However, we can make intuitive sense of these equations. Eq 11 has two main roles. First, it replaces $u_{kx}$ (removed by the partial differentiation operator) with $u_{(k-1)x}$ . This action is similar to integration by part, and, in fact, its proof is based on integration by parts. Second, it performs repetitive differentiation (by the sub-expression containing the powers of $\mathcal{D}_{x}$ ), reminiscent of the method discussed above for functions in $\mathbb{C}[e^{x},e^{-x}]$ . Finally, the definite integral in Eq 10 on variable $\lambda$ calculates the coefficients of the candidates.

In this paper, we use sparse regression to find the coefficients; therefore, we do not need Eq 10. Our focus is on Eq 11, but we cannot use it in its current form because the expected integrands are generally not in the form of $f(\mathbf{u})$ . The key is to rewrite the integrand, $f(x)$ , is a usable form,

\boxed{f(x)=\prod_{i}^{N}u_{i}(x)^{n_{i}}\,,\,\text{where }\,u_{i}(x)=g_{i}(v_{i}(x))\,,}

(12)

$g_{i}(x)$ is a function that can be easily integrated (see Section 3.3), and $n_{i}\in\mathbb{Z}^{+}$ . For example, if $f(x)=\sin^{2}(x^{2}-1)$ , then $f=u_{1}^{2}$ for $u_{1}=\sin(v_{1})$ and $v_{1}=x^{2}-1$ .

To find candidates, we follow the same process as in Eq 11 but customized for expressions conformant to Eq 12. However, we cannot simply apply Eq 11 because, for a general non-exact expression, $M^{j}$ may not be known or even bounded. Instead, we unroll the outer summation in Eq 11 to generate an ordered list of candidate generators, $G_{0},G_{1},G_{2},\cdots$ . We generate $G_{0}$ by integrating each $u_{i}$ in turn. We integrate the first factor of $f$ , i.e. $u_{1}^{n_{1}}$ , by multiplying $f$ by $v_{1}^{\prime}/v_{1}^{\prime}$ and split $f$ into $u_{1}v_{1}^{\prime}$ and $f/(u_{1}v_{1}^{\prime})$ . Ignoring the constant factors, the second part can be written as $(v_{1}^{\prime})^{-1}\partial{f}/\partial{u_{1}}$ . Considering that $u=g(v)$ , we have $\int uv^{\prime}\,dx=\int g(v)v^{\prime}\,dx=\int g(v)\,dv$ (remember that $g$ is chosen to be easy to integrate). Therefore,

\boxed{G_{0}=\sum_{i=1}^{N}\left(1+\int g_{i}(v)\,dv\right)_{v\leftarrow v_{i}}\left(1+v_{i}^{-1}\frac{\partial{f}}{\partial{u_{i}}}\right)\,.}

(13)

In $\left(1+\int g_{i}(v)\,dv\right)$ , 1 represents the constant of integration. For the term on the right, 1 is not strictly necessary but improves performance by moving terms from $G_{i>0}$ to $G_{0}$ . Next, we generate $G_{1},G_{2},\cdots$ by repetitive differentiation similar to Eq 11. In addition, we introduce the powers of $x$ into the results by integration-by-parts, assuming $v^{\prime}=1$ and $v=x$ in Eq 8,

\int f\,dx=\int(x)^{\prime}f\,dx=xf-\int xf^{\prime}\,dx\,

(14)

Putting all together,

\boxed{G_{i+1}=(1+x)\left(\mathbf{1}+\mathcal{D}_{x}\right)G_{i}\,.}

(15)

Let’s apply the algorithm to $f(x)=\cot^{4}x$ . Decomposing $f$ based on Eq 12, we have $u=\cot v$ , $v=x$ , and $f=u^{4}$ . Considering that $\int\cot v\,dv=\log(\sin v)$ , we get $G_{0}=1+\log(\sin x)+\cot^{3}x+\log(\sin x)\cot^{3}x$ , which does not solve the integral. Next, we move to $G_{1}$ by using Eq 15. This time, we have

G_{1}=1+\log(\sin x)+\cot^{3}x+\log(\sin x)\cot^{3}x+x+\cot x+\cdots\text{(14 terms)}\,,

(16)

Feeding $f$ and the candidate set derived from $G_{1}$ to Algorithm 2, we solve the integral,

\int\cot^{4}x\,dx=x+\cot x-\frac{1}{3}\cot^{3}x\,.

(17)

3.3 Rule-Based Integration

Converting integrands to a form compatible with Eq. 12 and finding $\int g_{i}(v)\,dv$ in Eq. 13 are facilitated by the term-rewriting and rule definition functionality of JuliaSymbolics, which includes a layered architectural design [16] to decouple term rewriting from other features, and leverages the Julia’s metaprogramming features that have been inherited from other languages in the Lisp family.

On the lowest level, TermInterface.jl and Metatheory.jl packages [17, 10, 16] provide an efficient expression rewriting engine and rule definition eDSL. Rules are written as regular Julia expressions, thus parsed into S-expressions and then efficiently compiled to callable functions that perform pattern matching and support advanced conditional selection [18, 10, 16]. JuliaSymbolics also supports the definition of equational (bidirectional) rewrite rules that can be executed through an e-graph rewriting engine (adopted from [19]), performing equality saturation.

We use JuliaSymbolics’s rewriting capabilities to find the integrals of $g(x)$ . Nearly fifty rules are sufficient to cover elementary functions (exponential, logarithmic, trigonometric, hyperbolic, and their inverses). The following code snippets show a few example rules.

  @rule integrate(^(~x, -1)) => log(~x)
  @rule integrate(^(~x, ~k)) => ^(~x, ~k+1)
  @rule integrate(^(sin(~x), ~k::is_neg)) =>
         integrate(^(csc(~x), -~k))
  @rule integrate(csc(~x)) =>
         log(sin(~x)^-1 - cos(~x)*sin(~x)^-1)

4 Numerical Computation

4.1 Simplifying Rational Expressions

Processing rational expressions is one of the first steps in symbolic integration with a long and mathematically rich history (e.g., Hermite’s and Harowitz’ methods). In this paper, to stay with the general theme of our method, we use a symbolic-numeric (hybrid) algorithm to factor rational expressions based on finding the roots of the denominator (which is assumed to be a polynomial). Our method is broadly similar to the one proposed in [20]; however, to find the coefficients of the terms, we use sparse regression method (see below) instead of the method of residues.

4.2 Generating Random Points and Moving Toward the Poles

In line 2 of Algorithm 2, $n$ random number $x_{i}$ (test points) are generated in, $\mathbb{D}_{d}$ , an open disk of radius $d$ centered at the origin ( $d$ is a parameter provided by the user). Therefore, $\mathbb{D}_{d}$ should contain the majority of the unique poles of $f(x)$ . Note that a function like $\sec(x)$ has an infinity number of poles and $\mathbb{D}_{d}$ should contain at least one of them. For the sparse regression to work properly, test points need to be near the poles of $f(x)$ . We use a variant of the Newton-Raphson method to facilitate probing the poles. The standard Newton-Raphson method to find zeros of $f(x)$ is the iteration $x^{j+1}=x^{j}-f(x^{j})/f^{\prime}(x^{j})$ , where superscript depicts an order index and not a power. The zero-finding algorithm can be transformed to a pole-finding one by changing the minus sign to plus. Therefore,

x_{i}^{j+1}=x_{i}^{j}+\frac{f(x_{i}^{j})}{f^{\prime}(x_{i}^{j})}\,,

(18)

is used by Algorithm 2 on line 4 to move the test points toward the poles.

4.3 Finding Linearly-Independent Subsets of the Candidates

Matrix $\mathbf{A}$ in Algorithm 2 can be rank-deficient. In order to find acceptable regression solutions, we need to remove linearly-dependent columns from $\mathbf{A}$ . This task is achieved by using pivoted QR algorithm [21]. In short, $\mathbf{A}$ is decomposed to $\mathbf{A}=\mathbf{Q}\mathbf{R}\mathbf{P}^{T}$ , where $\mathbf{Q}$ is an orthogonal rotation matrix, $\mathbf{R}$ is an upper triangular matrix, and $\mathbf{P}$ is the permutation matrix. We find the columns linearly dependent on other columns by locating small values on the diagonal of $\mathbf{R}$ (here, small means having an absolute value less than a parameter $\epsilon$ , usually taken to be $10^{-6}$ ) and convert $\mathbf{P}$ to the set $R$ , used by Algorithm 2, that lists indices of linearly-dependent columns.

4.4 Sparse Regression

In Algorithm 2, we need to find $\mathbf{q}$ such that $\mathbf{A}\mathbf{q}=\mathbf{b}$ . If $\mathbf{A}$ is low dimensional, this can be done simply as $\mathbf{q}=\mathbf{A}^{-1}\mathbf{b}$ (by construction, $\mathbf{A}$ is a square matrix). However, this process has the drawback of tending to use all the terms, even those with numerically small coefficients, which obscures the results and differs from the expected answer to integration problems. We need a parsimonious (sparse) model such that $\mathbf{q}$ has the minimum number of non-zero elements while still solves $\mathbf{A}\mathbf{q}=\mathbf{b}$ with an acceptable accuracy. We can achieve this by recasting the problem as an optimization problem to solve

\min_{q}\lVert\mathbf{A}\mathbf{q}-\mathbf{b}\rVert_{2}^{2}+\lambda\lVert\mathbf{q}\rVert_{i}\,,

(19)

for $\mathbf{q}$ , where $i\in\{1,2~{}\}$ , and $\lambda$ is a regularization parameter. In this paper, we use the sequential thresholded least-squares (STLSQ) algorithm, which is a component of Sparse Identification of Nonlinear Dynamics (SINDy) technique [22] and uses $\ell_{2}$ -norm. This method has been chosen due to robust behaviour within the scope of many problems related to SINDy. However, other sparse regression algorithms, e.g., the least absolute shrinkage and selection operation (LASSO) using $\ell_{1}$ -norm [23], Elastic-Net or SR3 are also possible candidates. The sparse regression code is provided by DataDrivenDiffEq.jl [11].

5 Results

SymbolicNumericIntegration.jl is available at https://github.com/SciML/SymbolicNumericIntegration.jl. We tested the performance of the software using a collection of test integrals provided by RUBI based on classic calculus textbooks [8] (Table 1).

Table 1: Performance of SymbolicNumericIntegration.jl

Source	Success	Failure	Total
Apostle [24]	123	26	149
Hearn	191	38	229
Stewart [25]	220	77	297
Timofeev [26]	136	126	262

We showcase the strengths and weaknesses of the symbolic-numeric integration algorithm with two examples (both from Apostle [24]). The algorithm can successfully solve the following integral,

\int\frac{\log x}{x\sqrt{1+\log x}}\,dx=\frac{2}{3}x\log x\sqrt{1+\log x}-\frac{4}{3}\sqrt{1+\log x}\,.

(20)

The reason for the success is that we can solve Eq. 20 after substitution $u=1+\log x$ , which transforms the integral to easily solvable $\int(u-1)/\sqrt{u}\,du$ . However, the integration algorithm does not explicitly apply the substitution. Instead, the way that the candidate generation algorithm (section 3.2) creates new candidates implicitly covers the substitution. On the other hand, the algorithm fails to solve the following simple integral,

\int\frac{1}{1+2\cos x}\,dx=\frac{\log(\tan x/2+\sqrt{3})-\log(\tan x/2-\sqrt{3})}{\sqrt{3}}

(21)

Here, there is no simple substitution that can solve the integral. The standard solution is obtained by expressing $\cos x$ in term of the tangent of half-angle and then solve the resulting rational expression. While it is easy to explicitly add this procedure to the algorithm, it is not an automatic consequence of the candidate generation algorithm based on homotopy operators. However, the algorithm can solve the following two variants of the problem,

\int\frac{\sin x}{1+2\cos x}\,dx=-\frac{1}{2}\log(1+2\cos x)\,,

(22)

because now the substitution $u=1+2\cos x$ works, and,

\int\frac{1}{1+\cos x}\,dx=\frac{\sin x}{1+\cos x}\,,

(23)

since $\sin x/(1+\cos x)=\tan(x/2)$ and the algorithm can implicitly apply the tangent of half-angle trick.

6 Conclusions

SymbolicNumericIntegration.jl provides a proof-of-concept that symbolic-numeric integration, based on a combination of symbolic candidates generation and numerical sparse regression, is able to solve many standard integration problems. Symbolic integration systems are tightly coupled to the CAS they are build on. JuliaSymbolics has strong simplification, rule-based rewriting, differentiation and code generation capabilities – all useful for the scientific machine learning workflow it is designed to support. We add the much needed feature of symbolic integration by building on the same strengths of this CAS. On the other hand, polynomials with rational or integer coefficients ( $\mathbb{Z}[x]$ and $\mathbb{Q}[x]$ ), which are the workhorse of most traditional CAS, currently lack robust algorithmic support. Therefore, SymbolicNumericIntegration.jl employs the strengths of JuliaSymbolics, as used in Eqs. 12, 13, and 15, and covers for its shortcomings by using numerical methods. We believe that the features of strong auto-differentiation and numerical routines combined with a relatively weak classic symbolic algebra are not unique to SciML ecosystem and are shared by many recent systems gearing toward machine learning and expect that the symbolic-numeric methodology developed in this paper can be applicable to these other systems.

Acknowledgements

This material is based upon work supported by the National Science Foundation under grant no. OAC-1835443, grant no. SII-2029670, grant no. ECCS-2029670, grant no. OAC-2103804, and grant no. PHY-2021825. We also gratefully acknowledge the U.S. Agency for International Development through Penn State for grant no. S002283-USAID. The information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award Number DE-AR0001211 and DE-AR0001222. We also gratefully acknowledge the U.S. Agency for International Development through Penn State for grant no. S002283-USAID. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof. This material was supported by The Research Council of Norway and Equinor ASA through Research Council project "308817 - Digital wells for optimal production and drainage". Research was sponsored by the United States Air Force Research Laboratory and the United States Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

References

[1] Manuel Bronstein. Symbolic Integration I. Springer-Verlag, 2005.
[2] Edmund A. Lamagna. Computer algebra concepts and techniques. CRC Press, 2018.
[3] Jurgen Gerhard. Modern computer algebra: Third edition. Cambridge University Press, 2011.
[4] SlagleJames R. A Heuristic Program that Solves Symbolic Integration Problems in Freshman Calculus. Journal of the ACM (JACM), 10(4):507–520, oct 1963.
[5] Joel Moses. Symbolic Integration: The Stormy Decade. pages 427–440.
[6] Robert H. Risch. The Problem of Integration in Finite Terms. Transactions of the American Mathematical Society, 139:167, may 1969.
[7] Aaron Meurer, Christopher P. Smith, Mateusz Paprocki, Ondřej Čertík, Sergey B. Kirpichev, Matthew Rocklin, AMiT Kumar, Sergiu Ivanov, Jason K. Moore, Sartaj Singh, Thilina Rathnayake, Sean Vig, Brian E. Granger, Richard P. Muller, Francesco Bonazzi, Harsh Gupta, Shivam Vats, Fredrik Johansson, Fabian Pedregosa, Matthew J. Curry, Andy R. Terrel, Štěpán Roučka, Ashutosh Saboo, Isuru Fernando, Sumith Kulal, Robert Cimrman, and Anthony Scopatz. SymPy: symbolic computing in Python. PeerJ Computer Science, 3(1):e103, jan 2017.
[8] Albert Rich, Patrick Scheibe, and Nasser Abbasi. Rule-based integration: An extensive system of symbolic integration rules. Journal of Open Source Software, 3(32):1073.
[9] Christopher Rackauckas and Qing Nie. Differentialequations.jl–a performant and feature-rich ecosystem for solving differential equations in julia. Journal of Open Research Software, 5(1), 2017.
[10] Shashi Gowda, Yingbo Ma, Alessandro Cheli, Maja Gwozdz, Viral B Shah, Alan Edelman, and Christopher Rackauckas. High-performance symbolic-numerics via multiple dispatch. arXiv preprint arXiv:2105.03949, 2021.
[11] JuliusMartensen, Christopher Rackauckas, et al. Datadrivendiffeq.jl, July 2021.
[12] Daniel Richardson. Some undecidable problems involving elementary functions of a real variable. Journal of Symbolic Logic, 33:514–520, 1 1969.
[13] W. Hereman, B. Deconinck, and L. D. Poole. Continuous and Discrete Homotopy Operators: A Theoretical Approach made Concrete. oct 2005.
[14] Bernard Deconinck and Michael Nivala. Symbolic integration using homotopy methods. Mathematics and Computers in Simulation, 80(4):825–836, dec 2009.
[15] Douglas Poole and Willy Hereman. The homotopy operator method for symbolic integration by parts and inversion of divergences with applications. https://doi.org/10.1080/00036810903208155, 89(4):433–455, apr 2010.
[16] Alessandro Cheli and Christopher Rackauckas. Automated code optimization with e-graphs, 2021.
[17] Alessandro Cheli. Metatheory.jl: Fast and elegant algebraic computation in julia with extensible equality saturation. Journal of Open Source Software, 6(59):3078, 2021.
[18] Yingbo Ma, Shashi Gowda, Ranjan Anantharaman, Chris Laughman, Viral Shah, and Chris Rackauckas. Modelingtoolkit: A composable graph transformation system for equation-based modeling, 2021.
[19] Max Willsey, Chandrakana Nandi, Yisu Remy Wang, Oliver Flatt, Zachary Tatlock, and Pavel Panchekha. Egg: Fast and extensible equality saturation. Proc. ACM Program. Lang., 5(POPL), jan 2021.
[20] NodaMatu-Tarow and MiyahiroEiichi. A hybrid approach for the integration of a rational function. Journal of Computational and Applied Mathematics, 40(3):259–268, jul 1992.
[21] G. H. Golub and C. F. Van Loan. Matrix computations. Johns Hopkins University Press, 2013.
[22] Steven L. Brunton, Joshua L. Proctor, J. Nathan Kutz, and William Bialek. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences of the United States of America, 113(15):3932–3937, apr 2016.
[23] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58:267–288, 1 1996.
[24] Tom M. Apostol. Calculus, Vol. 1: One-Variable Calculus, with an Introduction to Linear Algebra. J. Wiley, 1967.
[25] J. Stewart and R.S. Andre. Calculus. Mathematics Series. Brooks/Cole, 1991.
[26] A. F. Timofeev. Integration of Functions. 1948.