Deep Learning Approximation of Diffeomorphisms via Linear-Control Systems
Abstract.
In this paper we propose a Deep Learning architecture to approximate diffeomorphisms diffeotopic to the identity. We consider a control system of the form , with linear dependence on the controls, and we use the corresponding flow to approximate the action of a diffeomorphism on a compact ensemble of points. Despite the simplicity of the control system, it has recently been shown that a Universal Approximation Property holds. The problem of minimizing the sum of the training error and of a regularizing term induces a gradient flow in the space of admissible controls. A possible training procedure for the discrete-time neural network consists in projecting the gradient flow onto a finite-dimensional subspace of the admissible controls. An alternative approach relies on an iterative method based on the Pontryagin Maximum Principle for the numerical resolution of Optimal Control problems. Here the maximization of the Hamiltonian can be carried out with an extremely low computational effort, owing to the linear dependence of the system on the control variables. Finally, we use tools from Γ-convergence to provide an estimate of the expected generalization error.
Keywords
Deep Learning, Linear-Control System, Γ-convergence, Gradient Flow, Pontryagin Maximum Principle.
MSC2020 classification
Primary: 49M05, 49M25. Secondary: 68T07, 49J15.
1. Introduction
Residual Neural Networks (ResNets) were originally introduced in [18] in order to overcome some issues related to the training process of traditional Deep Learning architectures. Indeed, it had been observed that, as the number of layers in non-residual architectures increases, the learning of the parameters is affected by the vanishing of the gradients (see, e.g., [7]) and the accuracy of the network rapidly saturates (see [17]). ResNets can be represented as the composition of non-linear mappings
where represents the depth of the Neural Network and, for every , the building blocks are of the form
(1.1) |
where is a non-linear activation function that acts component-wise, and and are the parameters to be learned. In contrast, we recall that in non-residual architectures , the building blocks usually have the form
for . In some recent contributions [14, 19, 16], ResNets have been studied in the framework of Mathematical Control Theory. The bridge between ResNets and Control Theory was independently discovered in [14] and [16], where it was observed that each function defined as in (1.1) can be interpreted as an Explicit Euler discretization of the control system
(1.2) |
where and are the control variables. Since then, Control Theory has been fruitfully applied to the theoretical understanding of ResNets. In [24] a Universal Approximation result for the flow generated by (1.2) was established under suitable assumptions on the activation function . In [19] and [8] the problem of learning an unknown mapping was formulated as an Optimal Control problem, and the Pontryagin Maximum Principle was employed in order to learn the optimal parameters. In [10] the mean-field limit of (1.2) was considered, and a training algorithm was proposed based on the discretization of the necessary optimality conditions for an Optimal Control problem in the space of probability measures. The Maximum Principle for Optimal Control of probability measures was first introduced in [21], and it has recently been independently rediscovered in [9]. In this paper, rather than using tools from Control Theory to study properties of existing ResNets, we propose an architecture inspired by theoretical observations on control systems with linear dependence on the control variables. As a matter of fact, the building blocks of the ResNets that we shall construct depend linearly on the parameters, namely they have the form
where is a non-linear function of the input, and is the vector of the parameters at the -th layer. The starting points of this paper are the controllability results proved in [4, 5], where the authors considered a control system of the form
(1.3) |
where are smooth and bounded vector fields on , and is the control. We immediately observe that (1.3) has a simpler structure than (1.2), since it depends linearly on the control variables. Despite this apparent simplicity, the flows associated to (1.3) are capable of remarkable approximation results. Given a control , let be the flow obtained by evolving (1.3) on the time interval using the control . In [5] a condition on the controlled vector fields, called the Lie Algebra Strong Approximating Property, was formulated. On the one hand, this condition guarantees exact controllability on finite ensembles, i.e., for every , for every such that , and for every injective mapping , there exists a control such that for every . On the other hand, this property is also a sufficient condition for a -approximate controllability result in the space of diffeomorphisms. More precisely, given a diffeomorphism diffeotopic to the identity, and given a compact set , for every there exists a control such that . The aim of this paper is to use this machinery to provide implementable algorithms for the approximation of diffeomorphisms diffeotopic to the identity. More precisely, we discretize the control system (1.3) on the evolution interval using the Explicit Euler scheme and the uniformly distributed time-nodes , and we obtain the ResNet represented by the composition , where, for every , is of the form
(1.4) |
where represents an initial input of the network. In this construction, the points represent the approximations at the time-nodes of the trajectory that solves (1.3) with Cauchy datum . We insist on the fact that in our discrete-time model we assume that the controls are piecewise constant on the time intervals . For this reason, when we derive a ResNet with hidden layers, we deal with building blocks (i.e., ) and with -dimensional parameters . Moreover, when we evaluate the ResNet at a point , we end up with input/output variables .
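The discretization above can be sketched in a few lines of code. The concrete vector fields, the dimension of the state, and the controls in the block below are illustrative assumptions, not the ones used in the paper; the sketch only shows how piecewise-constant controls turn the Explicit Euler scheme into a forward pass of the ResNet (1.4).

```python
import numpy as np

# Hypothetical choice of controlled vector fields F_1, ..., F_k on R^2;
# the paper leaves the concrete fields generic, so these are assumptions.
def make_fields():
    return [
        lambda x: np.array([np.tanh(x[0]), 0.0]),
        lambda x: np.array([0.0, np.tanh(x[1])]),
        lambda x: np.tanh(x[::-1]),
    ]

def resnet_forward(x0, controls, h, fields):
    """One forward pass of the ResNet (1.4): the Explicit Euler steps
    x_{j+1} = x_j + h * sum_i u_i[j] * F_i(x_j) applied over M layers,
    with the controls u[j] constant on each time sub-interval."""
    x = np.asarray(x0, dtype=float)
    for u_layer in controls:          # u_layer has one weight per field
        drift = sum(u * F(x) for u, F in zip(u_layer, fields))
        x = x + h * drift
    return x
```

Note that the number of parameters per layer equals the number of controlled fields, in accordance with the linear dependence on the parameters emphasized above.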
As anticipated before, we use the ResNet (1.4) for the task of a data-driven reconstruction of diffeomorphisms. The idea is that we observe the action of an unknown diffeomorphism on a finite ensemble of points with . Using these data, we try to approximate the mapping . Assuming that the controlled vector fields satisfy the Lie Algebra Strong Approximating Property, a first natural attempt consists in exploiting the exact controllability for finite ensembles. Indeed, this ensures that we can find a control that achieves a null training error, i.e., the corresponding flow satisfies for every . Unfortunately, when the number of observations grows, this training strategy cannot guarantee any theoretical upper bound for the generalization error. Indeed, as detailed in Section 4, in the case where is not itself a flow of the control system (1.3), if we pursue the null training error strategy, then the sequence of controls will be unbounded in the -norm. Since the Lipschitz constant of a flow generated by (1.3) is related to the -norm of the corresponding control, it follows that we cannot bound from above the Lipschitz constants of the approximating diffeomorphisms . Therefore, even though for every we achieve perfect matching on the training dataset , we have no control over the generalization error made outside the training points. This is an example of overfitting, a well-known phenomenon in the practice of Machine Learning.
In order to overcome this issue, when constructing the approximating diffeomorphism we need to find a balance between the training error and the -norm of the control. For this reason we consider the non-linear functional defined on the space of admissible controls as follows:
(1.5) |
where is a non-negative function such that , and is a fixed parameter that tunes the squared -norm regularization. A training strategy alternative to the one described above consists in looking for a minimizer of , and taking as an approximation of . Under the assumption that the training dataset is sampled from a compactly-supported Borel probability measure on , we prove a Γ-convergence result. Namely, we show that, as the size of the training dataset tends to infinity, the problem of minimizing converges to the problem of minimizing the functional , defined as
(1.6) |
Using this Γ-convergence result, we deduce the following upper bound for the expected testing error:
(1.7) |
where is a minimizer of , and are constants such that and as , and is the Wasserstein distance between and the empirical measure
Recalling that as , inequality (1.7) ensures that we can achieve arbitrarily small expected testing error by choosing small enough and having a large enough training dataset. Indeed, assuming that we can choose the size of the dataset, for every we consider such that , and then we take such that . We insist on the fact that the estimate in (1.7) holds a priori with respect to the choice of a testing dataset and the corresponding computation of the generalization error. We report that the minimization of the functional (1.6) plays a crucial role in [10].
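To make the objects entering the regularized cost (1.5) concrete, the following sketch evaluates a discretized version of it: a mean training error over the dataset plus a penalty on the squared norm of the piecewise-constant control. The fields, the quadratic choice of loss, and the Riemann-sum discretization of the norm are assumptions made only for this illustration.

```python
import numpy as np

# Hypothetical controlled fields on R^2, assumed for this sketch only.
FIELDS = [lambda x: np.array([np.tanh(x[1]), 0.0]),
          lambda x: np.array([0.0, np.tanh(x[0])])]

def euler_flow(x0, u, h):
    """Propagate a point through the Explicit Euler network (1.4)."""
    x = np.asarray(x0, dtype=float)
    for uj in u:
        x = x + h * sum(c * F(x) for c, F in zip(uj, FIELDS))
    return x

def cost(u, xs, ys, beta):
    """Discretized cost: mean training error over the dataset plus
    (beta/2) times a Riemann sum approximating the squared norm of u,
    with the quadratic loss a(v) = |v|^2 taken as a concrete choice."""
    h = 1.0 / len(u)
    train = np.mean([np.sum((euler_flow(x, u, h) - y) ** 2)
                     for x, y in zip(xs, ys)])
    reg = 0.5 * beta * h * np.sum(u ** 2)
    return train + reg
```

With the zero control the flow is the identity, so the cost vanishes whenever the dataset is a fixed point of the map, reflecting the two competing terms described in the text.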
On the basis of the previous theoretical results,
we propose two possible training
algorithms to approximate the unknown
diffeomorphism , having observed its action
on an ensemble .
The first one consists in introducing a
finite-dimensional subspace ,
and in projecting onto the gradient flow
induced by in the space of admissible
controls. Here we make use of
the gradient flow formulation
for optimal control problems derived in [23].
The second approach is based on the Pontryagin
Maximum Principle, and it is similar to the one followed
in [19], [8], [10]
for control system (1.2).
We use the iterative method proposed in
[22] for the numerical resolution
of Optimal Control problems.
The main advantage of dealing with a
linear-control system is that the
maximization of the Hamiltonian
(which is a key-step for the controls update)
can be carried out with a low computational effort.
In contrast, in [19] a non-linear
optimization software was employed
to manage this task, while in
[10] the authors wrote the maximization
as a root-finding problem and solved it with
Brent’s method.
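The computational advantage claimed above can be made explicit. Assuming a quadratic penalty on the control with weight β (the paper's precise running cost is left generic here, so this is only a sketch), the Hamiltonian of a linear-control system is strictly concave in the control, and its maximizer is available in closed form:

```latex
% Hamiltonian with an assumed quadratic control penalty of weight \beta > 0:
H(x, p, u) \;=\; \Big\langle p,\, \sum_{i=1}^{k} u_i F_i(x) \Big\rangle \;-\; \frac{\beta}{2}\, |u|^2 .
% Since H is strictly concave in u, setting \partial H / \partial u_i = 0
% yields the maximizer explicitly, with no inner optimization loop:
u_i^{\star} \;=\; \frac{1}{\beta}\, \big\langle p,\, F_i(x) \big\rangle ,
\qquad i = 1, \dots, k .
```

This closed-form update is what makes the maximization step of PMP-based iterations cheap, in contrast with the non-linear dependence in (1.2), where the same step must be delegated to a numerical solver.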
The paper is organized as follows.
In Section 2 we establish the
notations and we prove preliminary results regarding
the flow generated by control system
(1.3).
In Section 3 we recall
some results contained in [4] and [5]
concerning exact and approximate controllability
of ensembles.
In Section 4 we explain why
the “null training error strategy” is not suitable
for the approximation purpose, and we outline
the alternative strategy based on the resolution
of an Optimal Control problem.
In Section 5 we prove a
Γ-convergence result that holds when the
size of the training dataset tends to infinity.
As a byproduct, we obtain the upper bound on the
expected generalization error reported in
(1.7).
In Section 6 we describe the
training algorithm based on the projection of
the gradient flow induced by the cost .
In Section 7 we propose the
training procedure based on the Pontryagin
Maximum Principle.
In Section 8 we test numerically
the algorithms by approximating a diffeomorphism
in the plane.
2. Notations and Preliminary Results
In this paper we consider control systems of the form
(2.1) |
where the controlled vector fields satisfy the following assumption.
Assumption 1.
The vector fields are smooth and bounded in , together with their derivatives of each order. (With minor adjustments, the hypothesis of boundedness of the vector fields can be slightly relaxed by requiring a sub-linear growth condition; we preferred to prove the results for bounded vector fields in order to avoid technicalities.)
In particular, this implies that there exists such that
(2.2) |
The space of admissible controls is , endowed with the usual Hilbert space structure. Using Assumption 1, the classical Carathéodory Theorem guarantees that, for every and for every , the Cauchy problem
(2.3) |
has a unique solution . Hence, for every , we can define the flow as follows:
(2.4) |
where solves (2.3). We recall that is a diffeomorphism, since it is smooth and invertible.
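As a purely illustrative numerical counterpart of (2.3)-(2.4), the sketch below integrates the Cauchy problem with a classical fourth-order Runge-Kutta scheme, taking the controls piecewise constant on a uniform grid of the time interval. The two controlled fields are hypothetical choices, picked so that they are smooth and bounded as in Assumption 1.

```python
import numpy as np

# Assumed controlled fields (bounded, smooth), for illustration only.
FIELDS = [lambda x: np.array([np.sin(x[1]), 0.0]),
          lambda x: np.array([0.0, np.sin(x[0])])]

def flow(x0, u, substeps=20):
    """Approximate Phi^u(x0): integrate (2.3) on [0, 1] with RK4.
    u has shape (M, k): one constant control value per field on each
    of the M sub-intervals of the time horizon."""
    x = np.asarray(x0, dtype=float)
    M = len(u)
    h = 1.0 / (M * substeps)
    for j in range(M):
        def rhs(y, uj=u[j]):
            return sum(c * F(y) for c, F in zip(uj, FIELDS))
        for _ in range(substeps):      # RK4 steps within the sub-interval
            k1 = rhs(x)
            k2 = rhs(x + 0.5 * h * k1)
            k3 = rhs(x + 0.5 * h * k2)
            k4 = rhs(x + h * k3)
            x = x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return x
```

For the zero control the map reduces to the identity, consistently with the definition of the flow.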
In the paper we will make extensive use of the weak topology of . We recall that a sequence is weakly convergent to an element if, for every , we have that
and we write . In the following result we consider a sequence of solutions of (2.3), corresponding to a weakly convergent sequence of controls.
Lemma 2.1.
Proof.
Since the sequence is weakly convergent, it is bounded in . This fact and Assumption 1 imply that there exists a compact set such that for every and for every . Moreover, being bounded, we deduce that
for some constant . Then it follows that the sequence is bounded in , and, as a matter of fact, that it is pre-compact with respect to the weak topology of . Therefore, there exists a subsequence and such that as . Using the compact inclusion , we deduce that . In particular, we obtain that . On the other hand, we have that , since . Recalling that and that implies , we obtain that
i.e., . Since the reasoning above is valid for every possible -weak limiting point of , we deduce that the whole sequence is -weakly convergent to . Using again the compact inclusion , we deduce the thesis. ∎
We now prove an estimate of the Lipschitz constant of the flow for .
Lemma 2.2.
For every admissible control , let be the corresponding flow defined in (2.4). Then is Lipschitz-continuous with constant
(2.5) |
where depends only on the controlled fields and on the dimension of the ambient space.
Proof.
For every we have that , where solves the linearized system
where is the controlled trajectory that solves (2.3), and for every the expression denotes the Jacobian of the field . Therefore, using (2.2), we deduce that
Recalling that and that
we deduce that
where is a constant that depends only on and on the dimension of the ambient space. Finally this shows that
proving the thesis. ∎
The next result regards the convergence of the flows corresponding to a weakly convergent sequence .
Proposition 2.3.
Given a sequence such that as with respect to the weak topology of , then the flows converge to uniformly on compact sets.
Proof.
Let be a compact set. For every let us define , and . Being weakly convergent, the sequence is bounded in . This fact and Assumption 1 imply that there exists a compact set such that any solution of (2.3) with initial datum in is entirely contained in . In particular, we deduce that for every , i.e., the sequence is equi-bounded. Moreover, using the boundedness of and Lemma 2.2, we obtain that is an equi-continuous family of diffeomorphisms. By the Ascoli-Arzelà Theorem, it follows that is pre-compact with respect to the topology. On the other hand, Lemma 2.1 implies that for every as . This guarantees that the set of limiting points with respect to the topology of the sequence is reduced to the single element . ∎
3. Ensemble Controllability
In this section we recall the most important results regarding the issue of ensemble controllability. For the proofs and the statements in full generality we refer the reader to [4] and [5]. We begin with the definition of ensemble in . In this section we will further assume that , which is the most interesting case.
Definition 1.
Given a compact set , an ensemble of points in is an injective and continuous map . We denote by the space of ensembles.
Remark 3.1.
If , then an ensemble can simply be thought of as an injective map from to , or, equivalently, as an element of , where
We define . Given a vector field , we define its -fold vector field as , for every . Finally, we introduce the notation to denote the space of ensembles of with elements.
We give the notion of reachable ensemble.
Definition 2.
The ensemble is reachable from the ensemble if there exists an admissible control such that its corresponding flow defined in (2.4) satisfies:
We can equivalently say that can be steered to .
Definition 3.
Control system (2.1) is exactly controllable from if every is reachable from . The system is globally exactly controllable if it is exactly controllable from every .
We recall the definition of Lie algebra generated by a system of vector fields. Given the vector fields , the linear space is defined as
(3.1) |
where denotes the Lie bracket between , smooth vector fields of . In the case of finite ensembles, i.e., when , we can provide a sufficient condition for controllability. The proof reduces to the classical Chow-Rashevsky theorem (see, e.g., the textbook [1]).
Theorem 3.2.
Let be a system of vector fields on . Given , if for every the system of M-fold vector fields is bracket generating at , i.e.,
(3.2) |
then the control system (2.1) is globally exactly controllable on .
For a finite ensemble, the global exact controllability holds for a generic system of vector fields.
Proposition 3.3.
For every , and sufficiently large, the set of -tuples of vector fields such that system (2.1) is globally exactly controllable on is residual with respect to the Whitney -topology.
We recall that a set is said to be residual if it is the complement of a countable union of nowhere dense sets. Proposition 3.3 means that, given any -tuple of vector fields, the corresponding control system (2.1) can be made globally exactly controllable in by means of an arbitrarily small perturbation of the fields in the -topology.
When dealing with infinite ensembles, the notions of “exactly reachable” and “exactly controllable” are too strong. However, they can be replaced by their respective -approximate versions.
Definition 4.
The ensemble is -approximately reachable from the ensemble if for every there exists an admissible control such that its corresponding flow defined in (2.4) satisfies:
(3.3) |
We can equivalently say that can be -approximately steered to .
Definition 5.
Control system (2.1) is -approximately controllable from if every is -approximately reachable from . The system is globally -approximately controllable if it is -approximately controllable from every .
Remark 3.4.
Let us further assume that the compact set has positive Lebesgue measure, and that it is equipped with a finite and positive measure , absolutely continuous w.r.t. the Lebesgue measure. Then, the distance between the target ensemble and the approximating ensemble can be quantified using the -norm:
and we can equivalently formulate the notion of -approximate controllability. In general, given a non-negative continuous function such that , we can express the approximation error as
(3.4) |
In Section 5 we will consider an integral penalization term of this form. It is important to observe that, if is -approximately reachable from , then (3.4) can be made arbitrarily small.
Before stating the next result we introduce some notations. Given a vector field and a compact set , we define
Then we set
We now formulate the assumption required for the approximability result.
Assumption 2.
The system of vector fields satisfies the Lie algebra strong approximating property, i.e., there exists such that, for every -regular vector field and for each compact set there exists such that
(3.5) |
The next result establishes a Universal Approximating Property for flows.
Theorem 3.5.
We recall that is diffeotopic to the identity if and only if there exists a family of diffeomorphisms smoothly depending on such that and . In this case, can be seen as the flow generated by the non-autonomous vector field , where
From Theorem 3.5 we can deduce a -approximate reachability result for infinite ensembles.
Corollary 3.6.
Let be diffeotopic, i.e., there exists a diffeomorphism diffeotopic to the identity such that . Then is -approximately reachable from .
Remark 3.7.
If a system of vector fields satisfies Assumption 2, then, for every and for every , the Lie bracket generating condition (3.2) is automatically satisfied (see [5, Theorem 4.3]). This means that Assumption 2 guarantees global exact controllability in , and -approximate reachability for infinite diffeotopic ensembles.
We conclude this section by exhibiting a system of vector fields in that meets Assumptions 1 and 2.
Theorem 3.8.
Remark 3.9.
If we consider the vector fields defined in (3.7), then Theorem 3.5 and Theorem 3.8 guarantee that the flows generated by the corresponding linear-control system can approximate, on compact sets, diffeomorphisms diffeotopic to the identity. From a theoretical viewpoint, this approximation result cannot be strengthened by enlarging the family of controlled vector fields, since the flows produced by any controlled dynamical system are themselves diffeotopic to the identity. On the other hand, in view of the discretization of the dynamics and the consequent construction of the ResNet, it could be useful to enrich the system of the controlled fields. As suggested by Assumption 2, the expressivity of the linear-control system is more directly related to the space , rather than to the family itself. However, as we are going to see, “reproducing” the flow of a field that belongs to can be expensive. Let us consider an evolution step-size and let us choose two of the controlled vector fields, say , and let us assume that . Let us denote by the flows obtained by evolving for an amount of time equal to . Then, using for instance the computations in [1, Subsection 2.4.7], for every compact we obtain that
as . The previous computation shows that, in order to approximate the effect of evolving the vector field for an amount of time equal to , we need to evolve the fields for a total amount of time . If represents the discretization step-size used to derive the ResNet (1.4) from the linear-control system (2.1) on the interval , then we have that , where is the number of layers of the ResNet. The argument above suggests that we need to use layers of the network to “replicate” the effect of evolving for the amount of time (note that when ). This observation provides an insight for the practical choice of the system of controlled fields. In the first place, the system should meet Assumption 2. If the ResNet obtained from the discretization of the system does not seem to be expressive enough, one should consider enlarging the family of the controlled fields, for example by including some elements of (or, more generally, of ). We insist on the fact that this procedure increases the width of the network, since the larger the number of fields in the control system, the larger the number of parameters per layer in the ResNet.
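The composition of flows invoked above can be checked numerically. In the hypothetical example below, the two linear fields F(x, y) = (y, 0) and G(x, y) = (0, x) on the plane have Lie bracket [F, G](x, y) = (-x, y); composing their time-s flows in the order F, G, -F, -G reproduces the time-s² flow of the bracket up to an O(s³) error, as in the expansion quoted from [1, Subsection 2.4.7].

```python
import numpy as np

# Assumed fields, chosen so that the bracket is easy to write down:
F = lambda z: np.array([z[1], 0.0])       # F(x, y) = (y, 0)
G = lambda z: np.array([0.0, z[0]])       # G(x, y) = (0, x)
BRACKET = lambda z: np.array([-z[0], z[1]])   # [F, G](x, y) = (-x, y)

def rk4_flow(field, z0, t, steps=200):
    """High-accuracy RK4 integration of the flow of a field for time t."""
    z, h = np.asarray(z0, dtype=float), t / steps
    for _ in range(steps):
        k1 = field(z)
        k2 = field(z + 0.5 * h * k1)
        k3 = field(z + 0.5 * h * k2)
        k4 = field(z + h * k3)
        z = z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return z

def commutator_of_flows(z0, s):
    """Follow F, then G, then -F, then -G, each for time s:
    the result is z0 + s^2 [F, G](z0) + O(s^3)."""
    z = rk4_flow(F, z0, s)
    z = rk4_flow(G, z, s)
    z = rk4_flow(lambda y: -F(y), z, s)
    return rk4_flow(lambda y: -G(y), z, s)
```

With s = 0.1 the four composed flows land within O(s³) of the point obtained by evolving the bracket for time s² = 0.01, illustrating why "replicating" a bracket direction consumes several layers of the network.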
4. Approximation of Diffeomorphisms: Robust Strategy
In this section we introduce the central problem of the paper, i.e., the training of control system (2.1) in order to approximate an unknown diffeomorphism diffeotopic to the identity. A typical situation that may arise in applications is that we want to approximate on a compact set , having observed the action of on a finite number of training points . Our aim is to formulate a strategy that is robust with respect to the size of the training dataset, and for which we can give upper bounds for the generalization error. In order to obtain a higher and higher degree of approximation, we may think of triangulating the compact set with an increasing number of nodes where we can evaluate the unknown map . Using the language introduced in Section 3, we have that, for every , we may understand a triangulation of with nodes as an ensemble . After evaluating at the nodes, we obtain the target ensemble as .
If the vector fields that define control system (2.1) meet Assumptions 1 and 2, then Theorem 3.2 and Remark 3.7 may suggest a first natural attempt to design an approximation strategy. Indeed, for every , we can exactly steer the ensemble to the ensemble with an admissible control . Hence, we can choose the flow defined in (2.4) as an approximation of on , achieving a null training error. Assume that, for every , the corresponding triangulation is a -approximation of the set , i.e., for every there exists such that , where . Then, for every , we can give the following estimate for the generalization error:
where are respectively the Lipschitz constants of and . Assuming (as it is natural to do) that as , the strategy of achieving zero training error works if, for example, the Lipschitz constants of the approximating flows remain bounded. This in turn would follow if the sequence of controls were bounded in -norm. However, as we are going to see in the following proposition, in general this is not the case. Let us define
the space of flows restricted to obtained via (2.4) with admissible controls, and let be the space of diffeomorphisms diffeotopic to the identity restricted to . Theorem 3.5 guarantees that, for every ,
Proposition 4.1.
Given a diffeomorphism and an approximating sequence such that on , the sequence of controls is unbounded in the -norm.
Proof.
By contradiction, let be a bounded sequence in . Then, we can extract a subsequence weakly convergent to . By virtue of Proposition 2.3, we have that on . However, since on , we deduce that on , but this contradicts the hypothesis . ∎
The previous result sheds light on a weakness of the approximation strategy described above. Indeed, the main drawback is that we have no bounds on the norm of the controls , and therefore, even though the triangulation of is fine, we cannot give an a priori estimate of the testing error. We point out that, in the different framework of simultaneous control of several systems, a similar situation was described in [3].
4.1. Approximation via Optimal Control
In order to avoid the issues described above, we propose a training strategy based on the solution of an Optimal Control problem with a regularization term penalizing the -norm of the control. Namely, given a set of training points , we consider the nonlinear functional defined as follows:
(4.1) |
where is a smooth loss function such that and , and is a fixed parameter. The functional is composed of two competing terms: the first aims at achieving a low mean training error, while the second penalizes the -norm of the control. In this framework, it is worth assuming that the compact set is equipped with a Borel probability measure . In this way, we can give higher weight to the regions in where we need more accuracy in the approximation. As done before, for every we understand the training dataset as an ensemble . Moreover, we associate to it the discrete probability measure defined as
(4.2) |
and we can equivalently express the mean training error as
From now on, when considering datasets growing in size, we make the following assumption on the sequence of probability measures .
Assumption 3.
There exists a Borel probability measure supported in the compact set such that the sequence is weakly convergent to , i.e.,
(4.3) |
for every bounded continuous function . Moreover, we ask that is supported in for every .
Remark 4.2.
The request of Assumption 3 is rather natural. Indeed, if the elements of the ensembles are sampled using the probability distribution associated to the compact set , it follows from the law of large numbers that (4.3) holds. On the other hand, since we ask that all the ensembles are contained in the compact set , we have that the sequence of probability measures is tight. Therefore, by virtue of Prokhorov's Theorem, is sequentially weakly pre-compact (for details see, e.g., [12]).
Remark 4.3.
When , if for every is a -approximation of such that as , then the corresponding sequence of probability measures is weakly convergent to , where denotes the Lebesgue measure in .
In the next result we prove that the functional attains its minimum for every . Moreover, we can give an upper bound for the -norm of the minimizers of that does not depend on .
Proposition 4.4.
Proof.
The existence of minimizers follows from the direct method of the Calculus of Variations, since for every the functional is coercive and lower semi-continuous with respect to the weak topology of . Indeed, recalling that , we have that
and this shows that is coercive, for every . As regards the lower semi-continuity, let us consider a sequence such that as . Using Proposition 2.3 and the Dominated Convergence Theorem, for every we have that
Finally, recalling that
we deduce that for every the functional is lower semi-continuous.
We now prove the uniform estimate on the norm of minimizers. For every , let be a minimizer of . If we consider , recalling that , we have that
Using the fact that are uniformly compactly supported, we deduce that there exists such that
Therefore, recalling that , we have that
for every . This concludes the proof. ∎
The previous result suggests, as a training strategy, to look for a minimizer of the functional . In the next section we investigate the properties of the functionals using the tools of Γ-convergence.
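The minimization just proposed can be illustrated with a deliberately naive sketch: plain gradient descent, with finite-difference gradients, on a discretized version of the regularized cost. The fields, data, and step sizes are assumptions made for this toy example; the actual algorithms studied in the paper (projected gradient flow and PMP-based iterations) are developed in Sections 6 and 7.

```python
import numpy as np

# Hypothetical controlled fields on R^2, assumed for this sketch only.
FIELDS = [lambda x: np.array([np.tanh(x[1]), 0.0]),
          lambda x: np.array([0.0, np.tanh(x[0])])]

def euler_flow(x0, u, h):
    x = np.asarray(x0, dtype=float)
    for uj in u:
        x = x + h * sum(c * F(x) for c, F in zip(uj, FIELDS))
    return x

def J(u, xs, ys, beta):
    """Discretized cost: mean quadratic training error plus a Riemann-sum
    penalty on the squared norm of the piecewise-constant control."""
    h = 1.0 / len(u)
    train = np.mean([np.sum((euler_flow(x, u, h) - y) ** 2)
                     for x, y in zip(xs, ys)])
    return train + 0.5 * beta * h * np.sum(u ** 2)

def train(xs, ys, M=5, beta=1e-3, lr=0.5, iters=200, eps=1e-6):
    """Gradient descent on J, with central finite-difference gradients."""
    u = np.zeros((M, len(FIELDS)))
    for _ in range(iters):
        grad = np.zeros_like(u)
        for idx in np.ndindex(u.shape):
            du = np.zeros_like(u)
            du[idx] = eps
            grad[idx] = (J(u + du, xs, ys, beta)
                         - J(u - du, xs, ys, beta)) / (2 * eps)
        u -= lr * grad
    return u
```

The regularization weight beta plays exactly the balancing role discussed above: it keeps the norm of the control bounded while the training error is driven down.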
5. Ensembles growing in size and Γ-convergence
In this section we study the limiting problem when the size of the training dataset tends to infinity. The main fact is that a Γ-convergence result holds. Roughly speaking, this means that increasing the size of the training dataset does not make the problem harder, at least from a theoretical viewpoint. For a thorough introduction to the subject, the reader is referred to [13].
For every , let be an ensemble of points in the compact set , and let us consider the discrete probability measure defined in (4.2). For every we consider the functional defined as follows:
(5.1) |
The machinery of Γ-convergence requires the domain on which the functionals are defined to be equipped with a metrizable topology. Recalling that the weak topology of is metrizable only on bounded sets, we need to properly restrict the functionals. For every , we set
Using Proposition 4.4 we can choose , so that
for every . With this choice we restrict the minimization problem to a bounded subset of , without losing any minimizer. We define . We recall the definition of Γ-convergence.
Definition 1.
Let us define the functional as follows:
(5.4) |
where the probability measure is the weak limit of the sequence . Using the same argument as in the proof of Proposition 4.4, we can prove that attains minimum and that
with . As before, we define .
Theorem 5.1.
Proof.
Let us prove the -condition (5.3). Let us fix and, for every , let us define . Then, recalling that the measures are weakly convergent to , we have that
This proves (5.3).
We now prove the -condition. Let be weakly convergent to . Using Proposition 2.3 we have that with respect to the topology. Moreover, there exists a compact set such that
Using the fact that is uniformly continuous on , we deduce that
(5.5) |
Using (5.5) and the weak convergence of to , we deduce that
Recalling that the norm is lower semi-continuous with respect to the weak convergence, we have that condition (5.2) is satisfied. ∎
Remark 5.2.
Using the equi-coercivity of the functionals and [13, Corollary 7.20], we deduce that
(5.6) |
and that any cluster point of a sequence of minimizers is a minimizer of . Let us assume that a sub-sequence as . Using Proposition 2.3 and the Dominated Convergence Theorem we deduce that
(5.7) |
where we stress that is a minimizer of . Combining (5.6) and (5.7) we obtain that
Since as , the previous equation implies that the subsequence converges to also with respect to the strong topology of . This argument shows that any sequence of minimizers is pre-compact with respect to the strong topology of .
We can establish an asymptotic upper bound for the mean training error. Let us define
(5.8) |
As suggested by the notation, the value of depends strongly on the positive parameter that tunes the -regularization. Given a sequence of minimizers of the functionals , from (5.7) we deduce that
(5.9) |
In the next result we show that under the hypotheses of Theorem 3.5 can be made arbitrarily small with a proper choice of .
Proposition 5.3.
Proof.
5.1. An estimate of the generalization error
We now discuss an estimate of the expected generalization error based on the observation of the mean training error, similar to the one established in [20] for the control system (1.2). A similar estimate was obtained also in [10]. Assumption 3 implies that the Wasserstein distance as (for details, see [6, Proposition 7.1.5]). We recall that, if are Borel probability measures on , then
For every let us introduce such that and , and
Let us consider an admissible control , and let be the corresponding flow. If the testing samples are generated using the probability distribution , then the expected generalization error that we commit by approximating with is
On the other hand, we recall that the corresponding training error is expressed by
Hence we can compute
Then for every we have
where , and are the Lipschitz constants of , and , respectively. The last inequality finally yields
(5.12)
for every .
Remark 5.4.
We observe that the estimate (5.12) does not involve any testing dataset. In other words, in principle we can use (5.12) to provide an upper bound on the expected generalization error without needing to compute the mismatch between and on a testing dataset. In practice, while we can directly measure the first quantity on the right-hand side of (5.12), the second term could be challenging to estimate. Indeed, while we can easily approximate the quantity (for instance by means of (2.5)), we may have no access to the distance . This is the case, for instance, when the measure used to sample the training dataset is unknown.
When we consider the flow corresponding to a minimizer of the functional , we can further refine (5.12). Indeed, combining Proposition 4.4 and Lemma 2.2, we deduce that is uniformly bounded with respect to by a constant . Provided that is large enough, from (5.12) and (5.9) we obtain that
(5.13)
The previous inequality shows how we can achieve an arbitrarily small expected generalization error, provided that the vector fields of control system (2.1) satisfy Assumption 2 and that the size of the training dataset can be chosen arbitrarily large. First, using Proposition 5.3, we set the tuning parameter so that the quantity is as small as desired. Then, we consider a training dataset large enough to guarantee that the second term on the right-hand side of (5.13) is of the same order as .
Remark 5.5.
Given , Proposition 5.3 guarantees the existence of such that if . The expression of obtained in the proof of Proposition 5.3 is given in terms of the norm of a control such that (5.11) holds. In [4], where Theorem 3.5 is proved, the construction of an admissible control whose flow approximates the target diffeomorphism with assigned precision is described. However, the control produced with this procedure is, in general, far from having minimal -norm, and as a matter of fact the corresponding might be smaller than necessary. Unfortunately, at the moment we cannot provide a more practical rule for the computation of .
Remark 5.6.
As observed in Remark 5.4 for (5.12), the estimate (5.13) of the expected generalization error likewise holds a priori, independently of the choice of a testing dataset. Moreover, if the size of the training dataset is fixed and cannot be enlarged, in principle one could choose the regularization parameter by minimizing the right-hand side of (5.13). However, in practice this may be very complicated, since we have no direct access to the function .
6. Training the Network: Projected Gradient Flow
In this section we propose a training algorithm based on a gradient flow formulation. We recall that we want to express the approximation of a diffeomorphism diffeotopic to the identity through a proper composition , where, for every the functions are of the form
(6.1)
and play the role of inner layers of the ResNet. We recall that the building blocks (6.1) are obtained by discretizing the linear-control system (2.1) with the explicit Euler scheme on the evolution interval , using a constant time-step . In Section 4 and Section 5 we observed that the minimization of the non-linear functional defined in (4.1) provides a possible strategy for approximating using a flow generated by control system (2.1). The idea that underlies this section consists in projecting the gradient flow induced in by onto a finite-dimensional subspace . Using this gradient flow, we obtain a procedure to learn the parameters that define the inner layers.
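As a concrete, purely illustrative sketch (the specific controlled fields and control dimensions are assumptions here), the composition of building blocks obtained by the explicit Euler discretization of a linear-control system can be written as:

```python
import numpy as np

def resnet_flow(x0, controls, fields, h):
    """Forward pass of the ResNet: explicit Euler steps of the linear-control
    system dx/dt = sum_i u_i(t) F_i(x) with piecewise-constant controls.

    x0       : (d,) initial point
    controls : (N, m) control values, one row per inner layer
    fields   : list of m callables F_i : R^d -> R^d
    h        : constant time-step (e.g. 1/N on the unit interval)
    """
    x = np.array(x0, dtype=float)
    for u in controls:                          # one Euler step per layer
        drift = sum(ui * F(x) for ui, F in zip(u, fields))
        x = x + h * drift
    return x
```

For instance, with the single field F(x) = x and unit control, the flow at time 1 approximates multiplication by e, consistently with the exact flow of the linear ODE.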
A gradient flow equation for optimal control problems with end-point cost was derived in [23], where a convergence result was also established. In the particular case of the functional defined in (4.1) and associated to control system (2.1), from [23, Remark 3.5] it follows that the differential can be represented as an element of as follows:
(6.2)
For every and a.e. , the functions and satisfy
(6.3)
and
(6.4)
A remarkable aspect that can be exploited in the numerical implementation is that the solutions of (6.3) and (6.4) can be computed in parallel with respect to the elements of the training ensemble, i.e., with respect to the index . In this section we propose a Finite-Elements approach to derive a method for training the ResNet (6.1). The idea is to replace the infinite-dimensional space with a proper finite-dimensional subspace . A natural choice could be to consider piecewise constant functions, i.e.,
(6.5)
The orthogonal projection onto , is given by
(6.6)
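The projection (6.6) replaces each control component, on every discretization interval, by its mean value there. A minimal sketch for a scalar control, with a Riemann-sum average standing in for the exact integral (the sampling resolution is an assumption of this sketch):

```python
import numpy as np

def project_piecewise_constant(g, nodes, samples=64):
    """L^2-orthogonal projection of a scalar control g : [0,1] -> R onto the
    space of functions constant on each interval [t_k, t_{k+1}]: on each
    interval, the projection equals the mean value of g there."""
    vals = []
    for a, b in zip(nodes[:-1], nodes[1:]):
        ts = np.linspace(a, b, samples)
        vals.append(g(ts).mean())        # approximate interval average of g
    return np.array(vals)
```

For example, projecting g(t) = t onto two intervals of [0, 1] returns the midpoint values 0.25 and 0.75.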
Hence, the gradient flow of the functional
can be projected onto the finite-dimensional space :
(6.7)
With a discretization-in-time of (6.7), we obtain the following implementable iterative method for the approximate minimization of :
(i) Consider , and, for every , compute an approximation of at the nodes using a proper numerical scheme for ODEs;
(ii) Using the approximations of obtained at step (i), compute the approximation of at the nodes using (6.2);
(iii) Compute , i.e., the approximation of using (6.6), the approximation of obtained at step (ii) and a proper numerical integration scheme;
(iv) Update the control using an explicit Euler discretization of (6.7) with a proper step-size :
Finally go to step (i).
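Steps (i)-(iv) above, together with a backtracking acceptance rule, can be condensed into a generic descent loop; here `cost` and `grad` are placeholders for the discretized functional and its projected gradient (both are assumptions of this sketch, not the paper's exact Algorithm 1):

```python
import numpy as np

def projected_gradient_descent(U, cost, grad, eta=1.0, iters=100):
    """Explicit Euler discretization of the projected gradient flow (6.7):
    repeatedly assemble the projected gradient and update the control,
    halving the step-size until an Armijo-type sufficient decrease holds."""
    J = cost(U)
    for _ in range(iters):
        G = grad(U)
        # backtracking line search: shrink eta until the cost decreases enough
        while cost(U - eta * G) > J - 0.5 * eta * np.sum(G * G):
            eta *= 0.5
            if eta < 1e-12:
                return U                 # step-size collapsed: stop
        U = U - eta * G
        J = cost(U)
    return U
```

On a simple quadratic cost the iteration drives the control to the minimizer, which is the behavior the convergence result of [23] guarantees for the continuous-time flow.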
The step-size can be chosen using backtracking line search. In Algorithm 1 we present the implementation that we used for the numerical simulations. We use the following notations:
• denotes an element of . Namely, represents the value of the control corresponding to the vector field in the time interval . The same notations are used for , which represents the approximation of .
• For every , denotes the approximated solution of (6.3) with Cauchy datum . Namely, represents the approximation at the node . We use to denote the array containing the approximations of all trajectories at all nodes. Moreover, we observe that we have for every and for every , where are the maps introduced in (6.1).
• For every , denotes the approximated solution of (6.4) with Cauchy datum . Namely, represents the approximation at the node . We use to denote the array containing the approximations of all co-vectors at all nodes.
Remark 6.1.
We briefly comment on Algorithm 1. Both (6.3) and (6.4) are discretized using the explicit Euler method. However, since (6.4) is solved backward in time, the update rule
is implicit in the variable , and it requires the solution of a linear system, namely
(6.8)
as done in line 16. We recall that should be understood as row vectors.
In line 21 we used the trapezoidal rule to approximate
This choice seems natural since the function inside the integral is available at the nodes and , for every .
Remark 6.2.
It is interesting to compare the update of the controls prescribed by Algorithm 1 with the standard backpropagation of the gradients. Given the parameters , for every let be the -th building block obtained by using in (6.1), and let be the corresponding composition. For every , the quantity represents the training error relative to the point in the dataset. A direct computation shows that
where we used the notation . The last identity can be rewritten as
where are recursively defined as follows
(6.9)
Therefore, from (6.9) we deduce that the backpropagation is related to the discretization of (6.4) via the implicit Euler scheme:
However, since the Cauchy problem (6.4) is solved backward in time, the previous expression is explicit in .
Remark 6.3.
In Algorithm 1 all the for loops that involve the index variable of the elements of the ensemble (i.e., the counter ) can be carried out in parallel. This is an important fact when dealing with large datasets, since these loops are involved in the time-consuming procedures that solve (6.3) and (6.4). However, in order to update the controls (lines 20-23), we need to wait for the conclusion of these computations and to recollect the results.
Remark 6.4.
In Algorithm 1 at each iteration we use the entire dataset to update the controls. However, it is possible to consider mini-batches, i.e., to sample at the beginning of each iteration a subset of the training points of size . This practice is very common in Deep Learning, where it is known as stochastic gradient descent (see, e.g., [15, Section 5.9]). We stress that, in the case of stochastic mini-batches, the cost used to accept/reject the update should be computed with respect to the reduced dataset, and not with respect to the full one. More precisely, we should compare the quantity
before and after the update of the controls.
7. Training the Network: Maximum Principle
In this section we study the same problem as in Section 6, i.e., we want to define an implementable algorithm to learn the parameters that define the functions (6.1), which play the role of inner layers in our ResNet.
As in the previous section, we approach the training problem by minimizing the functional . However, in this case, the minimization procedure is based on the Pontryagin Maximum Principle. We state below the Maximum Principle for our particular Optimal Control problem. For a detailed and general presentation of the topic the reader is referred to the textbook [1].
Theorem 7.1.
Let be an admissible control that minimizes the functional defined in (5.1). Let be the Hamiltonian function defined as follows:
(7.1)
where we set and . Then there exists an absolutely continuous function such that the following conditions hold:
• For every the curve satisfies
(7.2)
• For every the curve satisfies
(7.3)
• For every , the following condition is satisfied:
(7.4)
Remark 7.2.
In Theorem 7.1 we stated the Pontryagin Maximum Principle for normal extremals only. This is due to the fact that the optimal control problem concerning the minimization of does not admit abnormal extremals.
An iterative method based on the Maximum Principle for the numerical resolution of Optimal Control problems was proposed in [11]. The idea of training a deep neural network with a numerical method designed for optimal control problems is already present in [19] and in [8], where the control system is of the form (1.1). In [19] the authors introduced a stabilization of the iterative method described in [11], while in [8] the authors considered symplectic discretizations of (7.2)-(7.3). Finally, in [10] the authors used the particle method to solve numerically the Mean Field Pontryagin equations, and employed a root-finding procedure to maximize the Hamiltonian. The iterative method that we use in this section was proposed in [22], and once again it is a stabilization of a method described in [11]. The procedure consists of the following steps:
(a) Consider a control and compute using (7.2);
(b) Assign and compute solving backward (7.3);
(c)
(d) If , then update and go to Step (b). If , then decrease the parameter , and go to Step (c).
The algorithm as described above is not implementable, since the maximization in Step (c) requires the availability of . However, in practice, the maximization of the extended Hamiltonian in (7.5) and the computation of the updated trajectory take place in subsequent steps. Namely, after introducing the finite dimensional subspace as in Section 6, for every we plug into (7.5) the approximations of and at the node . The value of the control in the interval is set as the value where the maximum of (7.5) is attained, and we use it to compute in the node .
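The interleaved maximization/trajectory-update of the implementable version can be sketched as a single forward sweep; the maximization routine, the covector array `P`, and the data layout are all assumptions of this sketch, not the paper's exact Algorithm 2:

```python
import numpy as np

def pmp_forward_sweep(x0, P, U_old, fields, maximize, h):
    """One forward sweep of the implementable PMP iteration: on each interval
    the control is set to the maximizer of the (augmented) Hamiltonian,
    evaluated at the current approximations of state and covector, and the
    trajectory is then advanced by one explicit Euler step with that control."""
    N = len(U_old)
    X = np.zeros((N + 1,) + np.shape(x0))
    X[0] = x0
    U = np.zeros_like(U_old)
    for k in range(N):
        U[k] = maximize(P[k], X[k], U_old[k])       # pointwise maximization
        drift = sum(u * F(X[k]) for u, F in zip(U[k], fields))
        X[k + 1] = X[k] + h * drift                 # Euler step with the new control
    return X, U
```

Note that, as remarked below for Algorithm 2, the sweep over the nodes is inherently sequential: the new control at node k is needed before the state at node k+1 can be computed.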
Remark 7.3.
In [22] the authors proved that the cost of the optimal control problem is decreasing along the sequence of controls produced by the method, provided that is small enough. Actually, this monotonicity result is valid only for the “ideal” continuous-time procedure, and not for the discrete-time implementable algorithm. A similar issue is observed for the iterative algorithm proposed in [19]. In any case, the condition in Step (d) is useful in the discrete-time case to adaptively adjust .
We report in Algorithm 2 the implementation of the procedure described above. We use the same notations as in Section 6.
Remark 7.4.
In lines 22–24 of Algorithm 2 we introduced a corrective term for the covectors. This is due to the fact that in [22] only problems without end-point cost were considered.
The maximization problem at line 25 can be solved with an explicit formula. Indeed, we have that
We stress that the existence of such a simple expression is due to the fact that control system (2.1) is linear in the control variables.
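For a control-affine system with quadratic penalties the maximizer is indeed explicit. In the hedged sketch below, β is a quadratic regularization weight and 1/(2ρ) the weight of the Sakawa-Shindo proximal term; both weights, and the exact form of the augmented Hamiltonian, are assumptions of this illustration:

```python
import numpy as np

def maximize_augmented_hamiltonian(p, x, fields, beta, rho, u_old):
    """Maximize u -> <p, sum_i u_i F_i(x)> - (beta/2)|u|^2 - |u - u_old|^2/(2*rho)
    in closed form. The first-order condition
        <p, F_i(x)> - beta*u_i - (u_i - u_old_i)/rho = 0
    is linear in u precisely because the control system is linear in the
    control variables, so no iterative inner solver is needed."""
    a = np.array([p @ F(x) for F in fields])   # coefficients <p, F_i(x)>
    return (a + u_old / rho) / (beta + 1.0 / rho)
```

This is why the maximization step carries an extremely low computational cost: it reduces to evaluating the fields once and solving a diagonal linear equation componentwise.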
Remark 7.5.
We observe that the computation of the approximate solutions of (7.3) can be carried out in parallel with respect to the index variable of the elements of the ensembles (see lines 13–18 of Algorithm 2). However, when solving (7.2) in order to update the trajectories, the for loop on the elements of the ensemble is nested into the for loop on the discretization nodes (see lines 21–29 of Algorithm 2).
8. Numerical Experiments
In this section we describe the numerical experiments involving the approximation of an unknown flow by means of Algorithm 1 and Algorithm 2. We implemented the codes in Matlab and we ran them on a laptop with 16 GB of RAM and a 6-core 2.20 GHz processor. Since we consider as ambient space, Theorem 3.5 and Theorem 3.8 guarantee that the linear-control system associated to the fields
is capable of approximating on compact sets diffeomorphisms that are diffeotopic to the identity. However, it looks natural to include in the set of the controlled vector fields also the following ones:
Indeed, with this choice, we observe that the corresponding control system
(8.1)
is capable of reproducing exactly non-autonomous vector fields that are linear in the state variables . Moreover, the discretization of the control system (8.1) on the evolution interval with step-size gives rise to a ResNet with layers, whose building blocks have the form:
(8.2)
and each of them has parameters.
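Assuming the controlled fields are the two constant fields and the four elementary linear fields on the plane (a guess consistent with the claim that (8.1) reproduces exactly non-autonomous affine vector fields), one building block of the form (8.2) can be sketched as:

```python
import numpy as np

def affine_fields_2d():
    """Hypothetical controlled fields on R^2: the constant fields e1, e2 and
    the linear fields x -> E_ij x, with E_ij the elementary 2x2 matrices.
    With these six fields, sum_i u_i F_i(x) = A x + b spans all affine fields."""
    fields = [lambda x: np.array([1.0, 0.0]), lambda x: np.array([0.0, 1.0])]
    for i in range(2):
        for j in range(2):
            E = np.zeros((2, 2)); E[i, j] = 1.0
            fields.append(lambda x, E=E: E @ x)     # bind E per iteration
    return fields

def building_block(x, u, h, fields=None):
    """One ResNet layer of the form (8.2): x + h * sum_i u_i F_i(x)."""
    fields = fields or affine_fields_2d()
    return x + h * sum(ui * F(x) for ui, F in zip(u, fields))
```

Under this assumption each layer carries one control per field, matching the parameter count per building block mentioned above.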
8.1. Approximation of a diffeomorphism
We consider the following diffeomorphism :
the rotation centered at the origin and with angle , and the translation prescribed by the vector . Finally, we set
(8.3)
We generate the training dataset with points by constructing a uniform grid in the square centered at the origin and with side of length . In Figure 1 we report the training dataset and its image through .
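The construction of the dataset can be sketched as follows; the rotation angle `theta`, the translation vector `v`, and the composition order of the rotation and the translation are left as free parameters, since the paper's specific values are not reproduced here:

```python
import numpy as np

def make_dataset(n_side, side, theta, v):
    """Uniform grid of n_side**2 points in the square [-side/2, side/2]^2,
    together with its image under the diffeomorphism
    Psi(x) = R(theta) x + v  (rotation about the origin, then translation)."""
    g = np.linspace(-side / 2, side / 2, n_side)
    X = np.array([[a, b] for a in g for b in g])
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    Y = X @ R.T + v                      # apply Psi row-wise
    return X, Y
```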
If we denote by the probability measure that charges uniformly the square and if we set , we obtain the following estimate
that can be used to compute the a priori estimate of the generalization error provided by (5.12). We use the Lipschitz loss function
and we look for a minimizer of
(8.4)
where is the regularization hyper-parameter. In the training phase we use the same dataset for Algorithm 1 and Algorithm 2, and in both cases the initial guess of the control is . Finally, the testing dataset has been generated by randomly sampling points using , the uniform probability measure on the square. The value of the hyper-parameter is set equal to . We first tried to approximate the diffeomorphism using , resulting in inner layers. Hence, recalling that each building-block (8.2) has parameters, the corresponding ResNet has in total parameters. We tested different values of , and we set . The results obtained by Algorithm 1 and Algorithm 2 are reported in Table 1 and Table 2, respectively. We observe that in both algorithms the Lipschitz constant of the produced diffeomorphism grows as the hyper-parameter gets smaller, consistently with the theoretical intuition. As regards the testing error, we observe that it always remains reasonably close to the corresponding training error.
We report in Figure 2 the image of the approximation that achieves the best training and testing errors, namely Algorithm 1 with . As we may observe, the prediction is quite unsatisfactory, both on the training and on the testing datasets. Finally, we report that the formula (5.12) correctly provides an upper bound to the testing error, even though it is too pessimistic to be of practical use.
In order to improve the quality of the approximation, a natural attempt consists in increasing the depth of the ResNet. Therefore, we repeated the experiments setting , which corresponds to layers. Recalling that the ResNet under examination has parameters per layer, the architecture has globally weights. The results are reported in Table 3 and Table 4. Unfortunately, despite doubling the depth of the ResNet, we do not observe any relevant improvement in the training or in the testing error. Following the idea explained in Remark 3.9, instead of further increasing the number of layers, we enlarge the family of controlled vector fields in the control system associated to the ResNet.
8.2. Enlarged family of controlled fields
Using the ideas expressed in Remark 3.9, we enrich the family of the controlled fields. In particular, in addition to the fields considered above, we include the following ones:
Therefore, the resulting linear-control system on the time interval has the form
while the building blocks of the corresponding ResNet have the following expression:
(8.5)
(8.6)
for , where is the discretization step-size and is the number of layers of the ResNet. We observe that each building block has parameters.
As before, we set , and we considered , resulting in a ResNet with layers and with total number of weights equal to . We used the same training dataset as above, namely the grid of points and the corresponding image through depicted in Figure 1. We trained the network using both Algorithm 1 and Algorithm 2. The results are collected in Table 5 and Table 6, respectively. Once again, we observe that the Lipschitz constant of the approximating diffeomorphisms grows as is reduced. In this case, with both algorithms, the training and testing errors are much lower compared with the best case of the ResNet (8.2). We stress that in the present case the ResNet (8.5)-(8.6) has in total parameters divided into layers, and it outperforms the ResNet (8.2) with parameters divided into layers. We report in Figure 3 the approximation produced by Algorithm 1 with . In this case the approximation provided is very satisfactory, and we observe that it is better in the area where more observations are available. Finally, also in this case the estimate on the expected generalization error (5.12) provides an upper bound for the testing error, but at the current stage it is too coarse to be of practical use.
Conclusions
In this paper we have derived a Deep Learning architecture for a Residual Neural Network starting from a control system. Even though this approach has already been undertaken in some recent works (see [14], [16], [19], [8], [10]), the original aspect of our contribution lies in the fact that the control system considered here is linear in the control variables. In spite of this apparent simplicity, the flows generated by these control systems (under a suitable assumption on the controlled vector fields) can approximate with arbitrary precision on compact sets any diffeomorphism diffeotopic to the identity (see [4], [5]). Moreover, the linearity in the control variables simplifies the implementation and reduces the computational effort required by the training procedure. Indeed, when the parameters are learned via a Maximum-Principle approach, the maximization of the Hamiltonian function is typically highly time-consuming if the control system is non-linear in the control variables. In this paper we propose two training procedures. The first one consists in projecting the gradient flow associated to a proper Optimal Control problem onto a finite-dimensional space. The second procedure relies on an iterative method, based on the Pontryagin Maximum Principle, for the solution of Optimal Control problems. We have tested the learning algorithms on a diffeomorphism approximation problem in , obtaining interesting results. We plan to investigate in future work the performance of this kind of Neural Network in other learning tasks, such as classification.
Acknowledgments
The Author acknowledges partial support from INDAM-GNAMPA. The Author wants to thank Prof. A. Agrachev and Prof. A. Sarychev for their encouragement and for the helpful discussions. The Author is grateful to an anonymous referee for the invaluable suggestions that contributed to improving the quality of the paper.
References
- [1] A.A. Agrachev, Y.L. Sachkov. Control Theory from the Geometric Viewpoint. Springer-Verlag Berlin Heidelberg (2004).
- [2] A.A. Agrachev, D. Barilari, U. Boscain. A Comprehensive Introduction to Sub-Riemannian Geometry. Cambridge University Press (2019).
- [3] A.A. Agrachev, Y.M. Baryshnikov, A.V. Sarychev. Ensemble controllability by Lie algebraic methods. ESAIM: Cont., Opt. and Calc. Var., 22:921–938 (2016).
- [4] A.A. Agrachev, A.V. Sarychev. Control in the Spaces of Ensembles of Points. SIAM J. Control Optim., 58(3):1579–1596 (2020).
- [5] A.A. Agrachev, A.V. Sarychev. Control on the Manifolds of Mappings with a View to the Deep Learning. J. Dyn. Control Syst., (2021).
- [6] L. Ambrosio, N. Gigli, G. Savaré. Gradient flows in metric spaces and in the space of probability measures. Springer Science & Business Media (2008).
- [7] Y. Bengio, P. Simard, P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw., 5(2):157–166 (1994).
- [8] M. Benning, E. Celledoni, M.J. Erhardt, B. Owren, C.B. Schönlieb. Deep learning as optimal control problems: Models and numerical methods. J. Comput. Dyn., 6(2):171–198 (2019).
- [9] M. Bongini, M. Fornasier, F. Rossi, F. Solombrino. Mean-field Pontryagin Maximum Principle, J. Optim. Theory Appl., 175(1):1–38 (2017).
- [10] B. Bonnet, C. Cipriani, M. Fornasier, H. Huang. A measure theoretical approach to the Mean-field Maximum Principle for training NeurODEs. Preprint arXiv:2107.08707 (2021).
- [11] F.L. Chernousko, A.A. Lyubushin. Method of successive approximations for solution of optimal control problems. Optim. Control Appl. Methods, 3(2):101-114 (1982).
- [12] E. Çinlar. Probability and Stochastics. Springer-Verlag, New York (2010).
- [13] G. Dal Maso. An Introduction to -convergence. Birkhäuser, Boston, MA (1993).
- [14] W. E. A Proposal on Machine Learning via Dynamical Systems. Commun. Math. Stat., 5(1):1–11 (2017).
- [15] I. Goodfellow, Y. Bengio, A. Courville. Deep Learning. MIT Press (2016).
- [16] E. Haber, L. Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1) (2017).
- [17] K. He, J. Sun. Convolutional neural networks at constrained time cost. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5353–5360 (2015).
- [18] K. He, X. Zhang, S. Ren, J. Sun. Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778 (2016).
- [19] Q. Li, L. Chen, C. Tai, W. E. Maximum Principle Based Algorithms for Deep Learning. J. Mach. Learn. Res., 18(1):5998–6026 (2017).
- [20] M. Marchi, B. Gharesifard, P. Tabuada. Training deep residual networks for uniform approximation guarantees. PMLR, 144:677–688 (2021).
- [21] A.V. Pukhlikov. Optimal Control of Distributions. Comput. Math. Model., 15:223–256 (2004).
- [22] Y. Sakawa, Y. Shindo. On global convergence of an algorithm for optimal control. IEEE Trans. Automat. Contr., 25(6):1149–1153 (1980).
- [23] A. Scagliotti. A gradient flow equation for optimal control problems with end-point cost. J. Dyn. Control Syst., to appear (2022).
- [24] P. Tabuada, B. Gharesifard. Universal Approximation Power of Deep Neural Networks via Nonlinear Control Theory. Preprint arXiv:2007.06007 (2020).
- [25] M. Thorpe, Y. van Gennip. Deep Limits of Residual Neural Networks. Preprint arXiv:1810.11741 (2018).
(A. Scagliotti) Scuola Internazionale Superiore di
Studi Avanzati, Trieste, Italy.
Email address: [email protected]