Emulating Quantum Dynamics with Neural Networks via Knowledge Distillation

Yu Yao, Department of Physics and Astronomy, University of Southern California, 920 Bloom Walk, Los Angeles, CA 90089, USA
Chao Cao, Department of Physics and Astronomy, University of Southern California, 920 Bloom Walk, Los Angeles, CA 90089, USA
Stephan Haas, Department of Physics and Astronomy, University of Southern California, 920 Bloom Walk, Los Angeles, CA 90089, USA
Mahak Agarwal, Department of Computer Science, University of Southern California, 941 Bloom Walk, Los Angeles, California 90089, USA
Divyam Khanna, Department of Computer Science, University of Southern California, 941 Bloom Walk, Los Angeles, California 90089, USA
Marcin Abram (mjarbam@usc.edu), Department of Physics and Astronomy, University of Southern California, 920 Bloom Walk, Los Angeles, CA 90089, USA; Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Marina del Rey, CA 90292, USA

Abstract

High-fidelity quantum dynamics emulators can be used to predict the time evolution of complex physical systems. Here, we introduce an efficient training framework for constructing machine learning-based emulators. Our approach is based on the idea of knowledge distillation and uses elements of curriculum learning. It works by constructing a set of simple but rich-in-physics training examples (a curriculum). These examples are used by the emulator to learn the general rules describing the time evolution of a quantum system (knowledge distillation). The goal is not only to obtain high-quality predictions, but also to examine how the emulator learns the physics of the underlying problem. This allows us to discover new facts about the physical system, detect symmetries, and measure the relative importance of the contributing physical processes. We illustrate this approach by training an artificial neural network to predict the time evolution of quantum wave packets propagating through a potential landscape. We focus on the question of how the emulator learns the rules of quantum dynamics from the curriculum of simple training examples and to what extent it can generalize the acquired knowledge to solve more challenging cases.

Neural networks can be viewed as sophisticated pattern recognition methods, capable of constructing non-linear mappings between a specified set of input data and a set of target outputs ¹. Their strength comes from the fact that even with just one hidden layer they can be trained to approximate any finite, Borel-measurable function ². However, with increasing complexity of the target function, the number of required neurons in the hidden layer becomes prohibitively large ³. Therefore, it is practical to train multi-layer networks, which are much more efficient in that regard ⁴. Lower layers of such deep architectures can learn efficient representations of the input data ⁵, whereas the upper layers model higher-level concepts and solve the final classification or regression task ⁶.

Neural networks, with their variety of architectures ⁷, have already been shown to be effective tools in medical ⁸, business ⁹ and scientific applications, including physics in general ¹⁰, as well as material ¹¹ and quantum science ¹² in particular. Many standard applications of machine learning methods can be reduced to either classification or regression tasks ¹³. In the latter case, they can serve as powerful interpolation tools ^{14, 15}. However, out-of-domain predictions are typically challenging ^{16, 17}. While in many applications machine learning models come as theory-agnostic tools ^{10, 18}, there also exist families of physics-informed models ¹⁹ that implicitly incorporate domain knowledge about the studied system, e.g., by imposing a set of constraints that preserve conservation laws or symmetries. In the scientific context, machine learning can also be used to enable concept discoveries. This can be accomplished either directly, by constructing a machine learning system capable of answering scientific questions ²⁰, or indirectly, e.g., by interpreting already trained models ^{21, 22}.

In this work, we focus on the application of machine learning to the emulation of quantum dynamics. The goal is to examine the following idea: Can we train a neural network using easily generated but physics-rich examples, and then apply the extracted knowledge to solve more complex cases that were not represented explicitly during the training?

Throughout the paper, our central focus will be on the following three aspects: knowledge extraction, generalization capability, and model interpretability. To extract knowledge, we use the concept of curriculum learning ²³. Namely, we construct a training set that allows a neural network to effectively learn the basic rules governing the physics of the quantum system. This procedure can also be viewed as knowledge distillation ²⁴ from a physically informed simulator, responsible for constructing the training examples, to an auxiliary network that learns from the prepared curriculum. This idea is rooted in the concept of teacher-network frameworks ²⁵, whereby a smaller (and simpler) machine learning model is trained to approximate a larger, more complex system. However, whereas in the original formulation this technique was primarily used to reduce the final model complexity ²⁶, i.e., to decrease the inference time and to reduce the overall computational and storage requirements, here we have another motivation. We want to promote the ability of the machine learning model to generalize (to make out-of-domain predictions). In other words, the goal here is to train on examples that are easy to construct, and then make predictions for cases that could potentially be difficult to simulate in a direct way. Finally, we want to observe how our machine learning-based emulator learns the essential physics from the physically informed simulator. As we will show, by doing this, we can gain additional insights into the nature of the underlying problem, discover symmetries, and measure the relative importance of the contributing features.

In this work, as a proof of concept, we focus on the quantum dynamics of one-dimensional systems. While the problem is fairly easy to simulate in a traditional way ²⁷, it also exhibits several non-trivial properties, such as wave function interference, scattering, and tunnelling. Additionally, the emulator must learn to preserve the wave function normalization and must correctly interpret the real and imaginary parts of the input. Another practical advantage of this problem formulation is that we can easily scale the difficulty of the task by analyzing potential landscapes of various complexity.

The underlying motivation for focusing on quantum dynamics emulation tools is their use in simulating quantum systems and their role in the design process of quantum devices, such as qubits and sensors ²⁸. Specifically, modeling devices that are embedded in an environment requires challenging predictions of open quantum system dynamics ^{29, 30}. Such simulations are inherently difficult on classical computers ^{31, 32}. The reason is that direct calculations can only be performed for fairly small systems, as the limiting factor is the exponentially growing dimension of their Hilbert spaces ³³. Consequently, new tools that offer efficient and high-fidelity approximation of quantum dynamics can help the science community to model larger and more complex systems.

In terms of related work, machine learning methods have recently been successfully used to solve many-electron Schrödinger equations ^{34, 35}. However, in contrast to our work, their focus was not on the quantum dynamics, but on finding equilibrium quantum states in electronic and molecular systems. Machine learning methods can also be used to solve partial differential equations (PDEs). In that context, some recent studies have been based on finite-dimensional approaches ^{36, 37}, neural finite element ^{38, 39, 40}, and Fourier neural operator methods ^{41, 42}. However, in most of these approaches, the trained emulators can only generalize to a specific distribution of initial conditions. Consequently, they do not generalize in the space of the PDE parameters, and therefore they need to be re-trained for each new scenario. Machine learning was also used to emulate classical fluid dynamics ⁴³. However, in those cases the focus was placed on accelerating large-scale, classical simulations. In contrast, here we focus on quantum systems. Additionally, our training framework differs from a typical supervised setting that is often primarily concerned with in-domain predictions. We aim to extract knowledge from a curriculum of simple examples, and then generalize to more complex scenarios. Therefore, our focus will be on training methods that facilitate generalization to (near) out-of-domain cases.

Figure 1: Framework for Training a Machine Learning-Based Quantum Dynamics Emulator. Illustration of how a machine learning-based emulator extracts knowledge from a physically informed simulator. a, First, we sample initial conditions, according to which we generate a set of physical simulations. In this case, it is a set of time steps depicting Gaussian modulated quantum wave packets propagating through a system with a single rectangular potential barrier. b, Next, by selecting small space and time slices (illustrated by the sliding red window), we construct individual training examples. c, We sample from the space of all possible training examples to create a training curriculum. The idea is to select a diverse set of examples that illustrate all significant physical processes. In our case, these are quantum wave dispersion, tunnelling, scattering, and interference. d, Finally, the machine learning-based emulator learns from the prepared curriculum of training examples. We test the emulator by selecting novel cases and by recursively forecasting the next time steps of the system evolution.

Results

The Training Framework

In Figure 1, we show our proposed training framework for preparing machine learning-based quantum dynamics emulators. The main idea is based on the concepts of knowledge distillation ²⁴ and curriculum learning ²³. However, instead of extracting information from a larger machine learning model, our target is a simple, physically informed simulator. In detail, the framework consists of the following steps. First, the simulator samples the initial conditions and generates time trajectories describing the evolution of the physical system of interest (cf. Fig. 1a). Next, we construct training examples from these recorded simulations (cf. Fig. 1b). We select a diverse set of the training examples, making sure that all important phenomena of interest are represented (cf. Fig. 1c). This balanced curriculum of examples is subsequently used as the input for the machine learning model. We train the model, and then we validate it using novel examples. When testing the model, we include cases that were not directly represented during the training, to measure whether the model can combine and generalize the acquired knowledge (cf. Fig. 1d).

Training and Testing Procedure

We illustrate our approach by training a neural network-based emulator to predict the quantum dynamics of a one-dimensional system. To demonstrate that the emulator can extract knowledge from simple examples and generalize it in a non-trivial way, we restrict the physical simulator to cases with only a single rectangular potential barrier. From these simple examples, the emulator needs to learn the basic properties of wave function propagation, namely dispersion, scattering, tunneling, and quantum wave interference. Next, we test whether the emulator can predict the time evolution of wave packets in a more general case, e.g., for packets of different shapes propagating through an arbitrarily complex potential landscape.

In detail, a single training simulation depicts the propagation of a Gaussian modulated wave packet. The initial conditions consist of the center-of-mass position ( $X_{0}$ ), the spread ( $S_{0}$ ), and the energy ( $E_{0}$ ) of the packet. Due to the translational symmetry of the problem, we can assume, without any loss of generality, that the rectangular barrier is located at the center of the system. Consequently, the environment is fully described by two numbers: the height ( $H_{b}$ ) and the width ( $W_{b}$ ) of the rectangular barrier (cf. Fig. 1a; see also more details in the Supplementary Information).
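
As an illustration of how one such configuration could be set up, the following minimal sketch builds a Gaussian modulated packet and a centered rectangular barrier on a grid. Several choices here are assumptions rather than details taken from the text: atomic units with $\hbar=m=1$ (so that the carrier wave number is $\sqrt{2E_{0}}$), the interpretation of $S_{0}$ as the standard deviation of $|\psi|^{2}$, the grid spacing inferred from 7 a.u. = 71 pixels, and the function names `gaussian_packet` and `rectangular_barrier`.

```python
import cmath
import math

def gaussian_packet(n_x, dx, x0, s0, e0):
    """Return a normalized Gaussian-modulated wave packet on a 1D grid.

    Assumptions (not from the paper): atomic units (hbar = m = 1), so the
    carrier wave number is k0 = sqrt(2 * e0), and s0 is the standard
    deviation of |psi|^2.
    """
    k0 = math.sqrt(2.0 * e0)
    psi = []
    for i in range(n_x):
        x = i * dx
        envelope = math.exp(-((x - x0) ** 2) / (4.0 * s0 ** 2))
        psi.append(envelope * cmath.exp(1j * k0 * x))
    # Normalize so that the discretized integral of |psi|^2 equals one.
    norm = math.sqrt(sum(abs(p) ** 2 for p in psi) * dx)
    return [p / norm for p in psi]

def rectangular_barrier(n_x, dx, h_b, w_b):
    """Return the potential of a rectangular barrier centered in the box."""
    center = 0.5 * n_x * dx
    return [h_b if abs(i * dx - center) <= 0.5 * w_b else 0.0
            for i in range(n_x)]

# Illustrative values only; the grid spacing is inferred from 7 a.u. = 71 pixels.
N_X, DX = 1024, 7.0 / 71.0
psi0 = gaussian_packet(N_X, DX, x0=30.0, s0=2.0, e0=4.0)
potential = rectangular_barrier(N_X, DX, h_b=3.0, w_b=7.0)
total_probability = sum(abs(p) ** 2 for p in psi0) * DX
barrier_pixels = sum(1 for v in potential if v > 0.0)
```

With these illustrative settings, the packet is normalized on the grid and the barrier occupies 71 pixels, which matches the pixel width quoted for the training barrier.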

The Architecture of the Quantum Dynamics Emulator

The architecture of our machine learning emulator is depicted in Figure 2a. We start by stacking four time steps of the original simulation data. To make the input independent of the size of the emulated system, we use windows of fixed width (represented by red rectangles in Fig. 2b). Each windowed chunk of data contains local information about the real and the imaginary part of the wave function as well as local information about the potential landscape. Altogether, the data can be represented in the form of three channels (similar to the RGB channels used to represent visual data), as depicted on the left side of Fig. 2a. Next, we feed the input data to our neural network. The first few layers transform each time step into a hidden vector (represented graphically in our diagram by the red array). The role of the dense layer in our architecture is to allow the network to mix the information between different spatial points (non-local operation) and between channels (both local and non-local operations). In the next step, the hidden vectors pass through a set of gated recurrent units (GRU) ⁴⁴. These units are responsible for extracting time-dependent information from the data. In the last step, we reconstruct the wave function (the real and the imaginary part, represented by two channels, as depicted on the right side of Fig. 2a).
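
The windowed, three-channel input described above can be sketched as follows. This is only an illustration of the data layout: the channel order (real part, imaginary part, potential), the function name `extract_window`, and the wrap-around indexing at the grid edges are assumptions.

```python
def extract_window(frames, potential, center, width=23):
    """Build one emulator input of shape (time, width, 3).

    frames    : list of time steps, each a list of complex psi values
    potential : list of real potential values on the same grid
    center    : grid index at the middle of the window
    Indices wrap around, mimicking periodic boundary conditions.
    """
    n_x = len(potential)
    half = width // 2
    window = []
    for psi in frames:
        row = []
        for offset in range(-half, half + 1):
            j = (center + offset) % n_x
            # Assumed channel order: Re(psi), Im(psi), potential.
            row.append((psi[j].real, psi[j].imag, potential[j]))
        window.append(row)
    return window

# Toy data: four time steps on a 1024-pixel grid (values are placeholders).
n_x = 1024
frames = [[complex(t, i) for i in range(n_x)] for t in range(4)]
potential = [0.0] * n_x
sample = extract_window(frames, potential, center=5)
shape = (len(sample), len(sample[0]), len(sample[0][0]))
```

Here `shape` is `(4, 23, 3)`, i.e., four stacked time steps, a 23-pixel window, and three channels.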

The Training Procedure

We generated a training data set with combinations of 189 different sets of initial Gaussian wave packets and 14 different rectangular potential barriers (in total, 2646 configurations). We kept the width of the barrier fixed at $W_{b}=7\,a.u.$ (71 pixels), the size of the environment at $N_{x}=1024$ pixels (with periodic boundary conditions enabled), and the window width at $W=23$ pixels.

We trained the emulator for five epochs using the AdamW⁴⁵ optimizer with the MSE loss function as the training objective. For more details regarding the architecture of our neural network-based emulator, the training procedure, and the composition of our training sets, see the Methods section.

Quantum Dynamics Emulation with Neural Networks

We present the results of the emulation and the comparison with the ground truth in Figs. 2d–g. To show that the emulator can generalize to out-of-domain situations, we have chosen two challenging shapes for the potential barrier: a step pyramid and a smooth half-circle. Since the emulator was trained only on a single rectangular barrier of a fixed width (much wider than the width of the step in the pyramid), those testing cases cannot be reduced to any examples that were seen during the training. Instead, to make valid predictions, the neural network must recombine the acquired knowledge in a non-trivial way.

As we can see in Figs. 2d–g, in both cases the predictions of our emulator match the ground truth well. It is worth noting that the emulator makes its predictions in a recurrent manner (by using the predictions of the previous steps as the input of the next step). Therefore, it is expected that predicting the evolution over hundreds of steps will cause some error accumulation. The fact that the error is negligible even after $350$ steps speaks to the quality of the individual (step-by-step) predictions. This result also indicates that long-term predictions of the dynamics are possible in our framework.
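
The recurrent forecasting loop can be sketched as below. The trained network is replaced by a hypothetical stand-in (`toy_step`, a linear extrapolation from the last two frames), so the snippet illustrates only how predictions are fed back as inputs, not the emulator itself.

```python
from collections import deque

def rollout(step_fn, initial_frames, n_steps):
    """Recursively forecast n_steps frames.

    step_fn        : callable mapping the list of the most recent frames
                     to the next frame (stands in for the trained emulator)
    initial_frames : the first few ground-truth frames used as a seed
    """
    history = deque(initial_frames, maxlen=len(initial_frames))
    predictions = []
    for _ in range(n_steps):
        next_frame = step_fn(list(history))
        predictions.append(next_frame)
        history.append(next_frame)  # the oldest frame is dropped automatically
    return predictions

# Toy stand-in for the emulator: linear extrapolation from the last two frames.
def toy_step(frames):
    return [2.0 * b - a for a, b in zip(frames[-2], frames[-1])]

seed = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
forecast = rollout(toy_step, seed, n_steps=350)
```

After the first four seed frames, every input to `step_fn` consists partly or entirely of earlier predictions, which is why per-step errors can accumulate over a long rollout.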

In conclusion, the proposed neural network emulator successfully learns the classical aspects of the wave dynamics, such as dispersion and interference. It also captures the more complex quantum phenomena, such as quantum tunneling. To further show that the emulator can generalize the acquired knowledge to make both in- and out-of-distribution predictions, we hand-designed a test data set with 12 freely dispersing cases, 11 rectangular barriers (with randomly chosen width and height), and 14 more challenging test cases depicting both multiple and irregularly shaped barriers (in total, 37 test instances). In all cases, the results were satisfactory, confirming the ability of the emulator to generalize to novel (and notably, more challenging) situations (see the detailed results in Figs. S3 and S4 in the Supplementary Information).

Architecture Justification

In this section, we aim to provide some justification for the specific architecture of the emulator that we introduced in the previous sections. To verify whether all the introduced complexity is necessary, we ask whether there exists a simpler machine learning model capable of learning quantum dynamics with a similar accuracy. To this end, we compare our recurrent network-based approach with other popular network architectures. For consistency, in each case we use the same window-based scheme (to keep the input constant), while emulating the wave packet propagation for 400 time steps. When comparing with the ground truth, we measured the average mean absolute error (MAE) and the average normalized correlation $\mathcal{C}$ .

Table 1: Neural Network Architecture Comparison. Performance comparison for different architectures of our machine learning-based emulator. As a metric, we used the mean absolute error $\overline{|\epsilon|}$ (less is better) and a normalized correlation $\mathcal{C}$ (closer to $1$ is better), both averaged over all spatial grid points, all time steps, and all available test cases.

Model	Parameters	$\langle\overline{|\epsilon|}\rangle$	$\langle\mathcal{C}\rangle$
Linear	3,220	0.0366	0.1597
Dense	27,163	0.0411	0.5729
Conv	28,889	0.1467	0.3667
GRU	40,204	0.0051	0.9953

The results are presented in Table 1, where the values represent the average performance over all our 37 test cases. For the comparison, we used three other models: a linear model, a densely connected feedforward model, and a convolutional model (for a detailed description of each test case and each network architecture, see the Methods section). As is evident from our results, the proposed architecture (utilizing gated recurrent units, GRU) outperforms all other, simpler architectures by a large margin.
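
For readers who want to reproduce metrics of this kind, the sketch below computes a mean absolute error and a normalized correlation for a single pair of frames. The exact definitions used for Table 1 are given in the Methods section and are not reproduced here; in particular, defining $\mathcal{C}$ as the magnitude of the normalized overlap between the predicted and the true wave function is an assumption of this example.

```python
import math

def mean_absolute_error(pred, truth):
    """Average of |pred - truth| over all grid points of one frame."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)

def normalized_correlation(pred, truth):
    """Magnitude of the normalized overlap between two complex frames.

    Assumed definition: |<truth|pred>| / (||truth|| * ||pred||), which is 1
    for frames that agree up to a global complex factor and 0 for
    orthogonal frames.
    """
    overlap = sum(t.conjugate() * p for p, t in zip(pred, truth))
    norm_p = math.sqrt(sum(abs(p) ** 2 for p in pred))
    norm_t = math.sqrt(sum(abs(t) ** 2 for t in truth))
    return abs(overlap) / (norm_p * norm_t)

truth = [1 + 0j, 0 + 1j, -1 + 0j, 0 - 1j]
shifted = [0 + 0j, 1 + 0j, 0 + 0j, 1 + 0j]
mae_same = mean_absolute_error(truth, truth)
corr_same = normalized_correlation(truth, truth)
corr_orthogonal = normalized_correlation(shifted, truth)
```

Averaging these per-frame values over all time steps and test cases would give table-style summary numbers, under the stated assumptions.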

We present a detailed comparison for one of our test cases in Figure 3. It is evident that the recurrent architecture provides the best results, whereas the simpler architectures fail to capture the complex long-time evolution of the wave packets. Notably, the failure of each simpler model can be used to justify a different aspect of our final design. For example, the failure of the linear model might indicate the importance of the nonlinear activation functions included in our final model. The convolutional architecture captures quantum wave dispersion, but does not correctly capture the interaction between the waves and the potential barrier. This might suggest that, to correctly capture the tunneling and scattering phenomena, we must mix the information not only between different spatial points but also between different channels. The dense architecture is able to capture both the reflection and the tunneling phenomena, but yields significant errors compared to the recurrent model. This indicates how important the temporal dimension is, something that recurrent architectures are designed to exploit, as they are capable of (selectively) storing the memory of the previous steps in their internal states and retrieving it when needed ^{44, 46}.

Generalization Capability

The usefulness of a neural network is mainly determined by its generalization capability. In this section, we provide a systematic analysis of both the in-domain and the out-of-domain predictive performance of our emulator.

In the training process, the raw emulation data is broken up into small windows, and those windows are re-sampled to build the curriculum. One of the reasons for doing so was to balance cases featuring different distinct phenomena, e.g., free propagation vs. tunneling through the potential barrier. During the training, we artificially restricted our training instances to those featuring Gaussian packets and single rectangular potential barriers. This allows us to later test the potential for out-of-distribution generalization, namely whether our emulator can correctly handle packets of different modulation or barriers of complex shape.

The barrier used in our training instances had a variable height, but a fixed width of $7\,a.u.$ Notably, the width of the window that we used to cut chunks of the input data for our neural network was $2.25\,a.u.$, i.e., it was smaller than the width of the potential barrier itself. As a result, the network was exposed during the training to three distinct situations: (1) freely dispersing quantum waves in zero potential; (2) quantum wave propagation with non-zero constant potential; (3) quantum wave propagation with a potential step from zero to a constant value of the potential (or vice versa). While it is obvious that our emulator should handle rectangular potential barriers wider than the width of the window, since such situations are analogous to those already encountered in the training process, it is not at all obvious that the same should happen with barriers of a smaller width, let alone with barriers of non-rectangular shapes. However, as was already shown in Fig. 2, those cases do not present a challenge for our emulator.
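
The three situations listed above, and the need to balance them, can be illustrated with a small sketch. The labels, the function names, and the sampling scheme (equal counts per class, drawn with replacement) are assumptions made for this example; the actual composition of the curriculum is described in the Methods section.

```python
import random

def classify_window(potential_window):
    """Label a window by the local potential it contains."""
    values = set(potential_window)
    if values == {0.0}:
        return "free"            # (1) zero potential everywhere
    if len(values) == 1:
        return "constant"        # (2) non-zero constant potential
    return "step"                # (3) potential edge inside the window

def balanced_sample(windows, per_class, seed=0):
    """Draw the same number of windows from every class (with replacement)."""
    rng = random.Random(seed)
    groups = {}
    for w in windows:
        groups.setdefault(classify_window(w), []).append(w)
    return {label: [rng.choice(group) for _ in range(per_class)]
            for label, group in groups.items()}

# Slide a 23-pixel window over a 71-pixel barrier on a 1024-pixel grid
# (windows that would wrap around the periodic boundary are skipped here).
potential = [3.0 if 477 <= i <= 547 else 0.0 for i in range(1024)]
windows = [potential[i:i + 23] for i in range(1024 - 22)]
counts = {}
for w in windows:
    label = classify_window(w)
    counts[label] = counts.get(label, 0) + 1
curriculum = balanced_sample(windows, per_class=100)
```

In this toy setting, 909 of the 1002 windows contain no barrier at all, while only 49 lie fully inside the barrier and 44 contain an edge, so uniform sampling would be dominated by free propagation.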

In Figure 4, we present a more systematic study of this phenomenon. We have evaluated how the accuracy of the emulation is affected when changing the following parameters: (a) the initial wave packet spread, (b) the initial wave packet energy, (c) the rectangular barrier width, and (d) the rectangular barrier height. In the experiments, we alter one of the parameters while holding all other parameters at a constant value. We report the average correlation calculated across all the spatial grid points and all the emulated time steps. As we can see, the performance of the emulator remains effectively unchanged for all the tested values of the initial wave spread and the rectangular barrier width. This demonstrates that the network generalizes well with respect to those parameters. Remarkably, the performance does not deteriorate even when the barrier width becomes smaller than the size of the window, i.e., $2.5\,a.u.$ This explains why our emulator was able to correctly emulate the evolution in the presence of continuously shaped barriers (which, due to the discrete nature of our input, can be seen as a collection of adjacent 1-pixel-wide rectangular barriers).

With respect to the value of the initial wave packet energy, the correlation curve has a “ $\cap$ ” shape, indicating that the accuracy deteriorates as we leave the energy region sampled during the training stage. However, this is as much a limitation of the way we represent the data and the environment as it is a failure to generalize. A lower energy of the packet means that the evolution of the quantum wave is slower, and we have to predict more steps of the simulation to cover the same spatial distance as for wave packets with a higher energy. Due to the recurrent nature of our emulation, this means larger error accumulation. At the other end, when the energy is high, the wave function changes rapidly within each discretized time unit. This means a larger potential for error each time we predict the next time step. A similar situation arises when the barrier height increases. A higher barrier means more dramatic changes to the values of the wave function between two consecutive time steps. Both effects can be mitigated by increasing either the spatial or the temporal resolution of our emulation.
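
Both ends of this trade-off can be made explicit with a rough estimate, assuming a freely propagating packet in atomic units ($\hbar=m=1$). The symbols $v_{g}$ (group velocity), $L$ (distance to be covered), $\Delta t$ (emulation time step), $N_{\text{steps}}$, and $\Delta\varphi$ (phase advance of the carrier wave per step) are introduced here only for illustration.

```latex
v_g = \sqrt{2E_0}, \qquad
N_{\text{steps}} = \frac{L}{v_g\,\Delta t} \propto \frac{1}{\sqrt{E_0}}, \qquad
\Delta\varphi = E_0\,\Delta t .
```

At low $E_{0}$ the number of recursive steps needed to cover a fixed distance grows, whereas at high $E_{0}$ the phase change per step grows, so each individual prediction becomes harder.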

Model Interpretability Through Input Feature Attribution

In this section, to gain some insight into how our emulator makes its predictions, we measure the feature attribution (the contribution of each data input to the final prediction). We employ the direct gradient method. Namely, for a given target pixel in the output window, we calculate its gradients with respect to every input value. We follow the intuition that a larger gradient is associated with a higher importance of the measured input ⁴⁷. Since all the target values are complex numbers, this method gives us two sets of values, one for the real and one for the imaginary part of the target pixel,

$$\frac{\partial\Re\left(\psi_{\text{target}}\right)}{\partial\{\ldots\}}\quad\text{and}\quad\frac{\partial\Im\left(\psi_{\text{target}}\right)}{\partial\{\ldots\}}.\tag{1}$$

In Figure 5, we show a sample of the results. Namely, we selected one test example, and as the target we took the central pixel of the predicted frame. Next, we calculated the gradient with respect to each input value. In this concrete example, we can see that the importance of the input pixels increases with each consecutive time step. However, exactly how much the early steps matter depends on the region where the predictions are made. Far from the potential barrier (cf. Fig. 5a), the gradient measured with respect to the pixels of the first two steps (no. 176 and 177) is close to zero, indicating that those steps carry relatively little importance. However, when predicting the evolution of the wave function in the vicinity of the potential barrier (cf. Fig. 5c), the importance of the third-from-last step (no. 177) increases noticeably. This indicates that when predicting free propagation, the network simply performs a linear extrapolation based on the latest two time steps. However, when predicting tunneling and scattering, the emulator reaches beyond the last two time steps for additional information, likely due to the non-linear character of those predictions.

Discussion

Interpreting neural networks creates interesting possibilities. Namely, by analyzing how machine learning models make their predictions, we can learn new facts about the underlying physical problem. As an example, using the direct gradient method, despite its simplicity and known limitations⁴⁸, we can discover interesting relations between the input parameters. By analyzing several examples like those depicted in Fig. 5, we can observe a strong relationship between the averaged direct gradients of $\Re(\psi_{\text{target}})$ and $\Im(\psi_{\text{target}})$ , namely,

$$\frac{\partial\Re(\psi_{\text{target}})}{\partial\Re(\psi_{\text{input}})}=\frac{\partial\Im(\psi_{\text{target}})}{\partial\Im(\psi_{\text{input}})},\quad\text{and}\tag{2}$$
$$\frac{\partial\Re(\psi_{\text{target}})}{\partial\Im(\psi_{\text{input}})}=-\frac{\partial\Im(\psi_{\text{target}})}{\partial\Re(\psi_{\text{input}})}.\tag{3}$$

This result can be used to (re)discover the existence of the Cauchy-Riemann relations⁴⁹. Notably, this happens despite the fact that we have not imposed any prior knowledge of wave dynamics or complex analysis during the training procedure. Our model was able to learn the correct relation directly from the presented training examples.
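
A small numerical check shows why relations of this form are expected whenever an output pixel depends on an input pixel through a complex-linear map, as is the case for a linear evolution step. In the sketch below, the trained network is replaced by a hypothetical one-pixel stand-in (`toy_emulator`, with an arbitrary coefficient), and the gradients are estimated by central finite differences rather than by backpropagation.

```python
def toy_emulator(z):
    """Stand-in for one output pixel: a complex-linear map of one input pixel.

    A linear evolution step multiplies amplitudes by complex numbers; the
    coefficient below is arbitrary and purely illustrative.
    """
    return (0.8 - 0.6j) * z

def direct_gradients(f, z, h=1e-6):
    """Central finite differences of Re(f) and Im(f) w.r.t. Re(z) and Im(z)."""
    d_re = (f(z + h) - f(z - h)) / (2 * h)            # perturb Re(z)
    d_im = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)  # perturb Im(z)
    return {
        "dRe/dRe": d_re.real, "dIm/dRe": d_re.imag,
        "dRe/dIm": d_im.real, "dIm/dIm": d_im.imag,
    }

g = direct_gradients(toy_emulator, 0.3 + 0.2j)
eq2_residual = abs(g["dRe/dRe"] - g["dIm/dIm"])   # Eq. (2)
eq3_residual = abs(g["dRe/dIm"] + g["dIm/dRe"])   # Eq. (3)
```

Both residuals vanish up to floating-point error for this stand-in; applying the same check to a trained emulator would require replacing `toy_emulator` with the network's prediction for a chosen target pixel.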

This trend of looking at machine learning algorithms as something more than just black-box systems capable of extracting patterns from data or solving given classification and regression problems opens new possibilities ^{50, 51}. As is evident in our example, we can use machine learning to gain better insight into the studied problem. There is an increasing number of papers exploring this direction. For example, V. Bapst et al., in their article⁵², demonstrated how to train a graph neural network to predict the time evolution of a glass system. By measuring the contributions to the prediction from particles located on subsequent “shells” around the target particle, they were able to estimate the effective correlation length of the interactions. They were also able to show qualitatively that when the system approaches the glass transition, the effective cut-off distance for the particle-particle interactions increases rapidly, a phenomenon that is experimentally observed in glass systems. As another example, Lee et al., in their work⁵³, trained support vector machines to predict the preferred phase type for multi-principal element alloys. Next, by interpreting the trained models, they were able to discover phase-specific relations between the chemical composition of the alloys and their experimentally observed phases. Knowing what influences the phase formation can be important both in the scientific context, as it increases our understanding and can help us refine our theoretical models, and from the manufacturing perspective, as it enables the synthesis of materials with desired mechanical properties.

Another topic discussed in our paper is the ability of machine learning models to generalize to out-of-distribution examples. For completeness, it is important to discuss what we mean by this term. We train our model on simple examples generated by a simulator restricted to a specific region of the parameter space. Next, we want to generalize in that parameter space. Namely, our goal is to predict the time evolution of the system for parameters that are outside of the range used during the training. This is in contrast to the typical understanding of generalization as the ability to handle novel (unseen during the training) input instances that nevertheless come from the same distribution as the training examples. While our parameter space is relatively low-dimensional (we have just a couple of variables describing the initial conditions of the system), this is not the case for the ambient space of the inputs (each input instance is a tensor of size $4\times 23\times 3$ , that is, $276$ dimensions). This high dimensionality of the ambient space has profound implications. Balestriero, Pesenti and LeCun, in their work⁵⁴, have shown that in high-dimensional spaces all predictions are effectively extrapolations, not interpolations. The logic is that the number of training examples required for a prediction to be seen as an interpolation grows at least exponentially with the dimension of the input. Effectively, for any dimension above a few dozen, the training points become very sparse. Therefore, any point encountered during the testing phase is likely to be far (at least along some dimensions) from any point observed during the training. As a result, almost every testing point is located outside of the convex hull defined by the training examples, and thus predicting those points is nothing other than extrapolation.
In our case, however, we are interested in the ability to extrapolate in the original parameter space that controls our physically informed simulator, not in the ambient space of the neural network input. Since our parameter space has a much lower dimension (just below 10), we can distinguish cases that require interpolation from those that require extrapolation. As an example, if our model was trained on the propagation of wave packets with the spread $S_{0}\in\{1,1.5,2,2.5,3,3.5,4\}$ , then predicting the evolution of a packet with spread $2.7$ would be an interpolation, and predicting the evolution for the spread $5$ would be an extrapolation. In Fig. 4, we have shown the performance of our emulator in both the interpolation and the extrapolation regimes. We have also discussed that the emulator can generalize outside the explored region of the parameter space, at least along some dimensions.
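
The distinction drawn in this example can be written down in a few lines. The sketch treats a single parameter and uses the range of sampled training values as the boundary; the function name `regime` is hypothetical.

```python
def regime(value, training_values):
    """Classify a query parameter relative to the sampled training range."""
    lo, hi = min(training_values), max(training_values)
    return "interpolation" if lo <= value <= hi else "extrapolation"

TRAINING_SPREADS = [1, 1.5, 2, 2.5, 3, 3.5, 4]
regime_inside = regime(2.7, TRAINING_SPREADS)   # spread between sampled values
regime_outside = regime(5, TRAINING_SPREADS)    # spread beyond the sampled range
```

For several parameters, the same test could be applied to each dimension separately, with a query counting as an extrapolation as soon as one parameter leaves its training range.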

There are other works discussing generalization to out-of-domain cases^{55, 56}. However, in those works the question of generalizability was asked mostly in the context of out-of-distribution detection, with applications to fraud detection⁵⁷, general novelty detection⁵⁸, or as a means of increasing the trustworthiness of machine learning systems⁵⁹. In this paper, we explored these concepts in the context of obtaining robust emulators of physical systems.

Conclusions

We have designed a training framework that allows us to extract knowledge from a restricted, physically informed simulator. Furthermore, we have demonstrated how the interpretation of machine learning models can lead to scientific concept discovery. We also showed that it is possible for the network to train on simple examples and generalize to more complex cases.

In future work we will optimize training schemes that maximize the generalization capability. Moreover, we will apply this methodology to physically challenging systems, e.g., including interactions. Furthermore, we will implement more sophisticated interpretability techniques and increase the robustness of the emulator by utilizing known symmetries, e.g., by constructing physically informed neural networks. Finally, we would like to pursue an iterative framework, whereby any time a new symmetry is discovered, it is implemented as a system constraint, repeating this procedure for a full characterization of the underlying physical system.

References

[1] Bahri, Y. et al. Statistical mechanics of deep learning. Annual Review of Condensed Matter Physics 11, 501–528 (2020). URL https://doi.org/10.1146/annurev-conmatphys-031119-050745.
[2] Hornik, K., Stinchcombe, M. & White, H. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359–366 (1989). URL https://doi.org/10.1016/0893-6080(89)90020-8.
[3] Delalleau, O. & Bengio, Y. Shallow vs. deep sum-product networks. In Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F. & Weinberger, K. Q. (eds.) Advances in Neural Information Processing Systems, vol. 24 (Curran Associates, Inc., 2011). URL https://proceedings.neurips.cc/paper/2011/file/8e6b42f1644ecb1327dc03ab345e618b-Paper.pdf.
[4] Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J. & Ganguli, S. Exponential expressivity in deep neural networks through transient chaos. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I. & Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 29 (Curran Associates, Inc., 2016). URL https://proceedings.neurips.cc/paper/2016/file/148510031349642de5ca0c544f31b2ef-Paper.pdf.
[5] LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015). URL https://doi.org/10.1038/nature14539.
[6] Bau, D. et al. Understanding the role of individual units in a deep neural network. Proceedings of the National Academy of Sciences 117, 30071–30078 (2020). URL https://doi.org/10.1073/pnas.1907375117.
[7] Sengupta, S. et al. A review of deep learning with special emphasis on architectures, applications and recent trends. Knowledge-Based Systems 194, 105596 (2020). URL https://doi.org/10.1016/j.knosys.2020.105596.
[8] Jiang, F. et al. Artificial intelligence in healthcare: past, present and future. Stroke and Vascular Neurology 2, 230–243 (2017). URL https://doi.org/10.1136/svn-2017-000101.
[9] House of Lords: Select Committee on Artificial Intelligence. AI in the UK: Ready, Willing and Able? Report of Session 2017-19. No. 100 in HL paper (Dandy Booksellers Limited, 2018). URL https://books.google.com/books?id=sNyMtgEACAAJ.
[10] Carleo, G. et al. Machine learning and the physical sciences. Rev. Mod. Phys. 91, 045002 (2019). URL https://link.aps.org/doi/10.1103/RevModPhys.91.045002.
[11] Schmidt, J., Marques, M. R. G., Botti, S. & Marques, M. A. L. Recent advances and applications of machine learning in solid-state materials science. npj Computational Materials 5 (2019). URL https://doi.org/10.1038/s41524-019-0221-0.
[12] Carrasquilla, J. Machine learning for quantum matter. Advances in Physics: X 5, 1797528 (2020). URL https://doi.org/10.1080/23746149.2020.1797528.
[13] Alzubaidi, L. et al. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. Journal of Big Data 8 (2021). URL https://doi.org/10.1186/s40537-021-00444-8.
[14] Feldman, V. & Zhang, C. What neural networks memorize and why: Discovering the long tail via influence estimation. arXiv 2008.03703 (2020). URL https://arxiv.org/abs/2008.03703.
[15] Chai, X. et al. Deep learning for irregularly and regularly missing data reconstruction. Scientific Reports 10 (2020). URL https://doi.org/10.1038/s41598-020-59801-x.
[16] Haley, P. & Soloway, D. Extrapolation limitations of multilayer feedforward neural networks. In Proceedings of the IJCNN International Joint Conference on Neural Networks (IEEE, 1992). URL https://doi.org/10.1109/ijcnn.1992.227294.
[17] Xu, K. et al. How neural networks extrapolate: From feedforward to graph neural networks. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 (OpenReview.net, 2021). URL https://openreview.net/forum?id=UH-cmocLJC.
[18] Han, J., Jentzen, A. & E, W. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences 115, 8505–8510 (2018). URL https://doi.org/10.1073/pnas.1718942115.
[19] Karniadakis, G. E. et al. Physics-informed machine learning. Nature Reviews Physics 3, 422–440 (2021). URL https://doi.org/10.1038/s42254-021-00314-5.
[20] Iten, R., Metger, T., Wilming, H., del Rio, L. & Renner, R. Discovering physical concepts with neural networks. Physical Review Letters 124 (2020). URL https://doi.org/10.1103/physrevlett.124.010508.
[21] Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R. & Yu, B. Definitions, methods, and applications in interpretable machine learning. Proceedings of the National Academy of Sciences 116, 22071–22080 (2019). URL https://doi.org/10.1073/pnas.1900654116.
[22] Joshi, G., Walambe, R. & Kotecha, K. A review on explainability in multimodal deep neural nets. IEEE Access 9, 59800–59821 (2021). URL https://doi.org/10.1109/access.2021.3070212.
[23] Bengio, Y., Louradour, J., Collobert, R. & Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning - ICML '09 (ACM Press, 2009). URL https://doi.org/10.1145/1553374.1553380.
[24] Gou, J., Yu, B., Maybank, S. J. & Tao, D. Knowledge distillation: A survey. International Journal of Computer Vision 129, 1789–1819 (2021). URL https://doi.org/10.1007/s11263-021-01453-z.
[25] Romero, A. et al. Fitnets: Hints for thin deep nets. In Bengio, Y. & LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015). URL http://arxiv.org/abs/1412.6550.
[26] Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop (2015). URL http://arxiv.org/abs/1503.02531.
[27] Figueiras, E., Olivieri, D., Paredes, A. & Michinel, H. An open source virtual laboratory for the schrödinger equation. European Journal of Physics 39, 055802 (2018). URL https://doi.org/10.1088/1361-6404/aac999.
[28] Meyerov, I., Liniov, A., Ivanchenko, M. & Denisov, S. Simulating quantum dynamics: Evolution of algorithms in the hpc context. arXiv 2005.04681 (2020). URL https://arxiv.org/abs/2005.04681.
[29] Candia, R. D., Pedernales, J. S., del Campo, A., Solano, E. & Casanova, J. Quantum simulation of dissipative processes without reservoir engineering. Scientific Reports 5 (2015). URL https://doi.org/10.1038/srep09981.
[30] Luchnikov, I., Vintskevich, S., Ouerdane, H. & Filippov, S. Simulation complexity of open quantum dynamics: Connection with tensor networks. Physical Review Letters 122 (2019). URL https://doi.org/10.1103/physrevlett.122.160401.
[31] Loh, E. Y. et al. Sign problem in the numerical simulation of many-electron systems. Physical Review B 41, 9301–9307 (1990). URL https://doi.org/10.1103/physrevb.41.9301.
[32] Prosen, T. & Žnidarič, M. Is the efficiency of classical simulations of quantum dynamics related to integrability? Physical Review E 75 (2007). URL https://doi.org/10.1103/physreve.75.015202.
[33] Breuer, H.-P. & Petruccione, F. The Theory of Open Quantum Systems (Oxford University Press, 2007). URL https://doi.org/10.1093/acprof:oso/9780199213900.001.0001.
[34] Pfau, D., Spencer, J. S., Matthews, A. G. D. G. & Foulkes, W. M. C. Ab initio solution of the many-electron schrödinger equation with deep neural networks. Physical Review Research 2 (2020). URL https://doi.org/10.1103/physrevresearch.2.033429.
[35] Hermann, J., Schätzle, Z. & Noé, F. Deep-neural-network solution of the electronic schrödinger equation. Nature Chemistry 12, 891–897 (2020). URL https://doi.org/10.1038/s41557-020-0544-y.
[36] Zhu, Y. & Zabaras, N. Bayesian deep convolutional encoder–decoder networks for surrogate modeling and uncertainty quantification. Journal of Computational Physics 366, 415–447 (2018). URL http://dx.doi.org/10.1016/j.jcp.2018.04.018.
[37] Bhatnagar, S., Afshar, Y., Pan, S., Duraisamy, K. & Kaushik, S. Prediction of aerodynamic flow fields using convolutional neural networks. Computational Mechanics 64, 525–545 (2019). URL http://dx.doi.org/10.1007/s00466-019-01740-0.
[38] Smith, J. D., Azizzadenesheli, K. & Ross, Z. E. Eikonet: Solving the eikonal equation with deep neural networks. IEEE Trans. Geosci. Remote. Sens. 59, 10685–10696 (2021). URL https://doi.org/10.1109/TGRS.2020.3039165.
[39] Raissi, M., Perdikaris, P. & Karniadakis, G. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics 378, 686–707 (2019). URL https://doi.org/10.1016/j.jcp.2018.10.045.
[40] Raissi, M. Deep hidden physics models: Deep learning of nonlinear partial differential equations. Journal of Machine Learning Research 19, 25:1–25:24 (2018). URL http://jmlr.org/papers/v19/18-046.html.
[41] Lu, L., Jin, P. & Karniadakis, G. E. DeepONet: Learning nonlinear operators for identifying differential equations based on the universal approximation theorem of operators. arXiv 1910.03193 (2020). URL https://arxiv.org/abs/1910.03193.
[42] Li, Z. et al. Neural operator: Graph kernel network for partial differential equations. arXiv 2003.03485 (2020). URL https://arxiv.org/abs/2003.03485.
[43] Sanchez-Gonzalez, A. et al. Learning to simulate complex physics with graph networks. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, vol. 119 of Proceedings of Machine Learning Research, 8459–8468 (PMLR, 2020). URL http://proceedings.mlr.press/v119/sanchez-gonzalez20a.html.
[44] Cho, K. et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Moschitti, A., Pang, B. & Daelemans, W. (eds.) Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, 1724–1734 (ACL, 2014). URL https://doi.org/10.3115/v1/d14-1179.
[45] Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019 (OpenReview.net, 2019). URL https://openreview.net/forum?id=Bkg6RiCqY7.
[46] Gers, F. A., Schmidhuber, J. & Cummins, F. A. Learning to forget: Continual prediction with LSTM. Neural Comput. 12, 2451–2471 (2000). URL https://doi.org/10.1162/089976600300015015.
[47] Simonyan, K., Vedaldi, A. & Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Bengio, Y. & LeCun, Y. (eds.) 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Workshop Track Proceedings (2014). URL http://arxiv.org/abs/1312.6034.
[48] Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. In Precup, D. & Teh, Y. W. (eds.) Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, vol. 70 of Proceedings of Machine Learning Research, 3319–3328 (PMLR, 2017). URL http://proceedings.mlr.press/v70/sundararajan17a.html.
[49] Riemann, B. Grundlagen für eine allgemeine Theorie der Funktionen einer veränderlichen komplexen Grösse (1851). In Weber, H. (ed.) Riemann’s gesammelte math. Werke (in German), 3–48 (Dover, 1953).
[50] Zdeborová, L. New tool in the box. Nature Physics 13, 420–421 (2017). URL https://doi.org/10.1038/nphys4053.
[51] Zdeborová, L. Understanding deep learning is also a job for physicists. Nature Physics 16, 602–604 (2020). URL https://doi.org/10.1038/s41567-020-0929-2.
[52] Bapst, V. et al. Unveiling the predictive power of static structure in glassy systems. Nature Physics 16, 448–454 (2020). URL https://doi.org/10.1038/s41567-020-0842-8.
[53] Lee, K., Ayyasamy, M. V., Delsa, P., Hartnett, T. Q. & Balachandran, P. V. Phase classification of multi-principal element alloys via interpretable machine learning. npj Computational Materials 8 (2022). URL https://doi.org/10.1038/s41524-022-00704-y.
[54] Balestriero, R., Pesenti, J. & LeCun, Y. Learning in high dimension always amounts to extrapolation. arXiv 2110.09485 (2021). URL https://arxiv.org/abs/2110.09485.
[55] Wald, Y., Feder, A., Greenfeld, D. & Shalit, U. On calibration and out-of-domain generalization. arXiv 2102.10395 (2021). URL https://arxiv.org/abs/2102.10395.
[56] Wang, J., Lan, C., Liu, C., Ouyang, Y. & Qin, T. Generalizing to unseen domains: A survey on domain generalization. In Zhou, Z. (ed.) Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, 4627–4635 (ijcai.org, 2021). URL https://doi.org/10.24963/ijcai.2021/628.
[57] Pang, G., Shen, C., Cao, L. & van den Hengel, A. Deep learning for anomaly detection: A review. ACM Comput. Surv. 54, 38:1–38:38 (2021). URL https://doi.org/10.1145/3439950.
[58] Yang, J., Zhou, K., Li, Y. & Liu, Z. Generalized out-of-distribution detection: A survey. arXiv 2110.11334 (2021). URL https://arxiv.org/abs/2110.11334.
[59] Liu, W., Wang, X., Owens, J. D. & Li, Y. Energy-based out-of-distribution detection. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. & Lin, H. (eds.) Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual (2020). URL https://proceedings.neurips.cc/paper/2020/hash/f5496252609c43eb8a3d147ab9b9c006-Abstract.html.
[60] Nakano, A., Vashishta, P. & Kalia, R. K. Massively parallel algorithms for computational nanoelectronics based on quantum molecular dynamics. Computer Physics Communications 83, 181–196 (1994). URL https://www.sciencedirect.com/science/article/pii/0010465594900477.
[61] Bengio, Y., Simard, P. Y. & Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Networks 5, 157–166 (1994). URL https://doi.org/10.1109/72.279181.
[62] Pascanu, R., Mikolov, T. & Bengio, Y. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, vol. 28 of JMLR Workshop and Conference Proceedings, 1310–1318 (JMLR.org, 2013). URL http://proceedings.mlr.press/v28/pascanu13.html.
[63] Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. In Bengio, Y. & LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015). URL http://arxiv.org/abs/1412.6980.

Methods

Problem Formulation

We consider the one-dimensional time-dependent Schrödinger equation in atomic units,

i\frac{\partial}{\partial t}\psi(x,t)=\mathcal{H}\psi(x,t),

(4)

where $x\in[0,L_{x}),t\in[0,T)$ , and the Hamiltonian operator $\mathcal{H}$ is defined as

\mathcal{H}=T+V=-\frac{1}{2}\frac{\partial^{2}}{\partial x^{2}}+V(x),

(5)

with $T$ and $V$ being the kinetic and potential energy operators, respectively.

We represent the quantum waves as complex-valued functions, $\psi(x,t)\in\mathbb{C}$ , on an $N_{x}\times N_{t}$ mesh grid, and we impose periodic boundary conditions: $\psi(x+L_{x},t)=\psi(x,t)$ , where $L_{x}$ is the size of our one-dimensional environment.

We use an equal mesh spacing. With $\Delta x=L_{x}/N_{x}$ and $\Delta t=T/N_{t}$ , the spatial and temporal coordinates become $x_{i}=i\Delta x$ and $t_{j}=j\Delta t$ , with $i\in\{0,1,\ldots,N_{x}-1\}$ and $j\in\{0,1,\ldots,N_{t}-1\}$ , respectively. For brevity, we denote the discrete wave function values as $\psi_{i}^{j}=\psi(x_{i},t_{j})$ , and the discrete potential steps as $v_{i}=V(x_{i})$ .

Given the discretizations described above, we consider the neural network-based emulator as a parameterized map from an input constructed from $H$ consecutive time steps, $\{t_{j},t_{j+1},\ldots,t_{j+H-1}\}$ , to an output that corresponds to the next time step, $t_{j+H}$ . To make the input and the output of the neural network independent of the size of the system, we construct slices of a fixed width $W$ (we call them windows throughout the text). Namely, the portion of the input representing the values of the quantum wave at time step $t_{j}$ can be written as

\bm{\omega}_{i}^{j}=\left(\psi_{i-\lfloor W/2\rfloor}^{j},\psi_{i-\lfloor W/2\rfloor+1}^{j},\ldots,\psi_{i}^{j},\ldots,\psi_{i+\lfloor(W-1)/2\rfloor-1}^{j},\psi_{i+\lfloor(W-1)/2\rfloor}^{j}\right).

(6)

Similarly, a portion of the input representing the potential landscape at the same time step, can be written as

\bm{\nu}_{i}=\left(v_{i-\lfloor W/2\rfloor},v_{i-\lfloor W/2\rfloor+1},\ldots,v_{i},\ldots,v_{i+\lfloor(W-1)/2\rfloor-1},v_{i+\lfloor(W-1)/2\rfloor}\right).

(7)

As shown in Fig. 2, the raw simulation data can be decomposed into a series of windows. The proposed neural network-based emulator maps each input data window to an output data window; the output windows can subsequently be recombined to form a complete time step. More specifically, we look for a network that learns the mapping

f_{\bm{\Theta}}:\begin{Bmatrix}\bm{\omega}_{i}^{j},\bm{\omega}_{i}^{j+1},\ldots,\bm{\omega}_{i}^{j+H-1},\bm{\nu}_{i}\end{Bmatrix}\rightarrow\begin{Bmatrix}\bm{\omega}_{i}^{j+H}\end{Bmatrix}.

(8)
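The window construction of Eqs. 6 to 8 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the helper names are ours, the data are plain Python lists, and indices are wrapped with the periodic boundary condition stated above.

```python
def window_indices(i, W, Nx):
    """Grid indices covered by the window centred at i (Eqs. 6 and 7).

    The window spans floor(W/2) points to the left of i and floor((W-1)/2)
    points to the right; indices wrap around periodically.
    """
    left = W // 2
    right = (W - 1) // 2
    return [(i + k) % Nx for k in range(-left, right + 1)]


def extract_window(values, i, W):
    """Slice one window of width W, centred at grid index i, from a full row."""
    return [values[m] for m in window_indices(i, W, len(values))]


def network_input(frames, potential, i, j, H, W):
    """Assemble the left-hand side of Eq. 8: H wave windows and one potential window."""
    omegas = [extract_window(frames[j + h], i, W) for h in range(H)]
    nu = extract_window(potential, i, W)
    return omegas, nu


idx = window_indices(0, 23, 1024)
print(len(idx), idx[0], idx[11], idx[-1])  # 23 1013 0 11
```

For every $W$, odd or even, the window contains exactly $W$ points, since $\lfloor W/2\rfloor+\lfloor(W-1)/2\rfloor+1=W$.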

During the training, we try to find parameters $\bm{\Theta}$ of the map $f_{\bm{\Theta}}$ , that minimize the training objective function (a loss function $\mathcal{L}$ ) on each window output, defined as

\mathcal{L}=\mbox{MSE}_{i}^{j}=\frac{1}{W}\left\|\bm{\hat{\omega}}_{i}^{j+H}-\bm{\omega}_{i}^{j+H}\right\|^{2},

(9)

where $\bm{\hat{\omega}}_{i}^{j+H}$ denotes the predicted values of the quantum wave at the time step $t_{j+H}$ in the considered window, and $\bm{{\omega}}_{i}^{j+H}$ denotes the ground truth values.

Prediction using the window based scheme

To make the emulator independent of the environment size, we predict only the local evolution of the wave packet $\bm{\omega}_{i}^{j}$ within a given window (cf. Eq. 6). However, to recreate the entire wave function $\psi(x,t_{j})$ for a given time step $j$ , we need to assemble the individual predictions from different, overlapping windows. Let us denote the value of $\psi_{m}^{j}$ predicted from window $\bm{\omega}_{m+k}^{j}$ as $\psi_{m}^{j}\left(\bm{\omega}_{m+k}^{j}\right)$ . We average those predictions, giving a larger weight to predictions for which the grid point $x_{m}$ is located closer to the center of the window. For the weights, we use a Gaussian modulation,

\langle\psi_{m}^{j}\rangle=\sum_{k=-\lfloor(W-1)/2\rfloor}^{\lfloor W/2\rfloor}A\,\exp\!{\left(-\frac{k^{2}}{2\delta^{2}}\right)}\psi_{m}^{j}\left(\bm{\omega}_{m+k}^{j}\right),

(10)

where $A=1/\sum_{k=-\lfloor(W-1)/2\rfloor}^{\lfloor W/2\rfloor}\exp\!{\left(-k^{2}\middle/2\delta^{2}\right)}$ is the normalization constant, and $\delta$ denotes the Gaussian averaging spread.

A complete prediction of the next time step is the collection of all predicted points $[\langle\psi_{0}^{j}\rangle,\langle\psi_{1}^{j}\rangle,\ldots,\langle\psi_{N_{x}-1}^{j}\rangle]$ . All the intermediate predictions $\psi_{m}^{j}\left(\bm{\omega}_{m+k}^{j}\right)$ can be obtained independently; therefore, the entire algorithm is easily parallelizable (e.g., by performing batch inference on a modern GPU).
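The Gaussian-weighted recombination of Eq. 10 can be sketched as follows. The sketch assumes that each window prediction is stored as a list of $W$ complex values ordered as in Eq. 6 and indexed by the window center; the function names and the value of $\delta$ used in the demonstration are illustrative assumptions, not values taken from the paper.

```python
import math


def gaussian_weights(W, delta):
    """Normalized weights A * exp(-k^2 / (2 delta^2)) for the offsets k in Eq. 10."""
    ks = range(-((W - 1) // 2), W // 2 + 1)
    raw = [math.exp(-k * k / (2.0 * delta * delta)) for k in ks]
    norm = sum(raw)
    return {k: w / norm for k, w in zip(ks, raw)}


def recombine(window_predictions, W, delta):
    """Average overlapping window predictions into one full time step.

    window_predictions[c] holds the W values predicted by the window centred
    at grid index c. In that window, the grid point m = c - k sits at list
    position floor(W/2) - k.
    """
    Nx = len(window_predictions)
    weights = gaussian_weights(W, delta)
    left = W // 2
    psi = []
    for m in range(Nx):
        total = 0j
        for k, w in weights.items():
            c = (m + k) % Nx  # periodic wrap of the window centre
            total += w * window_predictions[c][left - k]
        psi.append(total)
    return psi


# Consistency check: if every window reproduces the truth, so does the average.
Nx, W, delta = 16, 5, 1.0
truth = [complex(math.cos(0.4 * m), math.sin(0.4 * m)) for m in range(Nx)]
perfect = [[truth[(c + o) % Nx] for o in range(-(W // 2), (W - 1) // 2 + 1)]
           for c in range(Nx)]
recombined = recombine(perfect, W, delta)
print(max(abs(a - b) for a, b in zip(recombined, truth)))  # round-off level
```

Because the weights sum to one, the averaging leaves exact predictions unchanged and only smooths out disagreements between overlapping windows.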

Predicting multiple steps of evolution

We predict the next time step of the system evolution in a recurrent manner, i.e., the predictions for the previous time steps are used to form the input for the current time step. By iteratively predicting subsequent time steps, we can obtain a sequence of snapshots of arbitrary length that portrays the evolution of the wave function.

As an initial condition, the trained model is given the first four time steps of the system evolution.
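The recurrent scheme can be sketched as a simple loop. Here `step_fn` is a hypothetical stand-in for the full one-step predictor (windowing, network inference, and recombination); the toy predictor in the usage example is a linear extrapolation chosen only to make the loop testable.

```python
def rollout(step_fn, initial_frames, n_steps, history=4):
    """Iteratively extend a sequence of frames by feeding predictions back in.

    step_fn maps the last `history` frames to the next frame. The first
    `history` frames must come from the ground-truth evolution.
    """
    frames = list(initial_frames)
    if len(frames) < history:
        raise ValueError("need at least `history` initial frames")
    for _ in range(n_steps):
        frames.append(step_fn(frames[-history:]))
    return frames


def toy_step(last_frames):
    """Toy predictor (assumption): linear extrapolation from the last two frames."""
    return [2 * a - b for a, b in zip(last_frames[-1], last_frames[-2])]


frames = rollout(toy_step, [[0], [1], [2], [3]], n_steps=3)
print(len(frames), frames[-1])  # 7 [6]
```

Since each predicted frame becomes part of the next input, prediction errors can accumulate over long rollouts.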

Data generation and processing

Our data generation implements a two-step procedure. First, we generate raw simulation data that represents the state of the system. Then we slice them into smaller windows to feed into the neural network.

Raw simulation.

Table 2: Raw Simulation Parameters. Parameters describing the spatial and temporal discretization of the wave function evolution ( $N_{x}$ , $\Delta t$ ), the environment together with a centrally-located rectangular potential barrier ( $L_{x}$ , $H_{b}$ , $W_{b}$ ), and the initial state of the wave packet at time $t_{0}$ ( $X_{0}$ , $S_{0}$ , $E_{0}$ ).

Parameter	Description
$N_{x}$ , $L_{x}$	The wave function is discretized on a regular mesh with spacing $\Delta x=L_{x}/N_{x}$ , where $N_{x}$ is the number of mesh points and $L_{x}$ is the length of the simulation box.
$\Delta t$	Time discretization unit, i.e., the time between two consecutive discrete time steps.
$X_{0}$ , $S_{0}$ , $E_{0}$	Center-of-mass, spread and energy of the initial Gaussian-modulated wave packet, respectively.
$H_{b}$ , $W_{b}$	Height and width of the rectangular barrier centered in the simulation box, respectively.

We generate raw simulation data of wave propagation discretized on a regular mesh with spacing $\Delta x$ using a space splitting method⁶⁰. We simulate the wave propagation over a rectangular potential barrier centered in the simulation box. The important parameters controlling the simulations are listed in Tab. 2. Specifically, we set $L_{x}=100\,a.u.$ , $N_{x}=1024$ , and $W_{b}=7.0\,a.u.$ At $t=0$ , the wave packet has a Gaussian modulation,

\psi_{t=0}(x)=C\exp{\left(-\frac{(x-X_{0})^{2}}{4S_{0}^{2}}\right)}\,\exp{\left(i\sqrt{2E_{0}}x\right)}.

(11)

The simulations run with an internal time step of $\Delta t_{int}=0.0005\,a.u.$ We run each simulation for $100{,}000$ steps, saving a snapshot every 200 steps (thus $\Delta t=0.1\,a.u.$ ). As a result, we get $(N_{t}=500)\times(N_{x}=1024)$ records of the wave function evolution. In the case of free propagation, each pixel is represented by two values, the real and the imaginary part of the wave function. In the simulations that include a potential barrier, we add a third channel to each pixel to encode the value of the potential at each point.
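The initial state of Eq. 11 and the snapshot bookkeeping can be reproduced with a short sketch. It assumes that the constant $C$ normalizes the packet to unit total probability on the mesh, which the text does not state explicitly; the values of $X_{0}$, $S_{0}$ and $E_{0}$ in the example are one combination from the training grid.

```python
import cmath
import math


def gaussian_packet(Nx, Lx, X0, S0, E0):
    """Discretized initial wave packet of Eq. 11 on a regular mesh.

    Assumption: C is chosen so that sum(|psi|^2) * dx = 1.
    """
    dx = Lx / Nx
    k0 = math.sqrt(2.0 * E0)  # wave number for energy E0 in atomic units
    psi = [cmath.exp(-((i * dx - X0) ** 2) / (4.0 * S0 ** 2) + 1j * k0 * i * dx)
           for i in range(Nx)]
    norm = math.sqrt(sum(abs(p) ** 2 for p in psi) * dx)
    return [p / norm for p in psi]


# Snapshot bookkeeping quoted in the text.
internal_steps, save_every, dt_internal = 100_000, 200, 0.0005
n_snapshots = internal_steps // save_every  # number of saved time steps, N_t
dt_snapshot = save_every * dt_internal      # time between snapshots, Delta t

Nx, Lx = 1024, 100.0
psi0 = gaussian_packet(Nx, Lx, X0=40.0, S0=2.0, E0=5.0)
peak = max(range(Nx), key=lambda i: abs(psi0[i]))
print(n_snapshots, peak)  # 500 410
```

The peak index 410 is the mesh point closest to $X_{0}=40$ for $\Delta x=100/1024$.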

Data windowing.

To make the input of the neural network independent of the environment size, we slice the recorded raw simulation data into windows of size $(H+1)\times W\times C$ . Specifically, $W=23$ is the spatial window size, $C=2$ or $3$ is the number of information channels, and $H+1=5$ is the number of consecutive time steps. The first $H$ time steps are used to construct the training input. The last time step in each window is used to construct the training target.

Data processing.

We adopted a few techniques to efficiently process the training data.

1. To reduce the correlations between training samples, we apply spatial and temporal sampling to intentionally skip some windows. We set the spatial sampling ratio to 0.1 and the temporal ratio to 0.9.
2. We assign a higher sampling probability to data windows that overlap with the central potential barrier. This helps us balance the “hard” and “easy” cases in the training set.
3. We apply the periodic boundary condition.
4. We renormalize the values in the channels. Specifically, the potential values are rescaled to [0, 1] to improve the training stability (e.g., to avoid the exploding gradient problem^{61, 62}).
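The processing steps listed above can be illustrated with the following sketch. Several details are assumptions made for the example and are not specified in the text: the ratios are interpreted as independent keep-probabilities, the barrier windows are favored by a multiplicative factor `boost`, and the potential is rescaled with a global minimum and maximum.

```python
import random


def rescale(values, vmin, vmax):
    """Min-max rescaling of potential values to the interval [0, 1]."""
    span = vmax - vmin
    return [(v - vmin) / span for v in values]


def overlaps_barrier(center, W, Nx, barrier_idx):
    """True if the periodic window centred at `center` touches the barrier."""
    left, right = W // 2, (W - 1) // 2
    covered = {(center + k) % Nx for k in range(-left, right + 1)}
    return not covered.isdisjoint(barrier_idx)


def sample_windows(Nt, Nx, W, barrier_idx, spatial_ratio=0.1,
                   temporal_ratio=0.9, boost=3.0, seed=0):
    """Randomly keep (time, position) window anchors.

    Assumptions: a time step is kept with probability temporal_ratio, a window
    with probability spatial_ratio, multiplied by `boost` near the barrier.
    """
    rng = random.Random(seed)
    kept = []
    for j in range(Nt):
        if rng.random() >= temporal_ratio:
            continue
        for i in range(Nx):
            p = spatial_ratio
            if overlaps_barrier(i, W, Nx, barrier_idx):
                p = min(1.0, p * boost)
            if rng.random() < p:
                kept.append((j, i))
    return kept


Nx, Lx, Wb, W = 1024, 100.0, 7.0, 23
dx = Lx / Nx
# Mesh points inside a barrier of width Wb centred in the box.
barrier_idx = {i for i in range(Nx) if abs(i * dx - Lx / 2) <= Wb / 2}
kept = sample_windows(20, Nx, W, barrier_idx)
near = sum(overlaps_barrier(i, W, Nx, barrier_idx) for _, i in kept)
print(len(kept), near / len(kept))
```

Raising `boost` increases the share of barrier windows among the kept samples, which is the balancing effect described in item 2.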

Description of the neural network architectures

The input of the neural network has dimension $4\times 23\times 3$ (four time steps, window size $23$ , three channels). First the input tensor is reshaped into a $4\times 69$ tensor, and then connected to a time-distributed dense layer (a dense layer that is applied independently to each time step) with the ReLU activation functions. We treat the size of the dense layer, $K$ , as a tunable hyper-parameter. We pass the resulting $4\times K$ tensor to a gated recurrent unit, GRU ⁴⁴ with the ReLU activation as well. The output of that step is a single tensor of size $K$ . We connect it to the output layer of size $46$ , with a linear activation (see again the schematic representation of the network in Fig. 2a).

In addition to the architecture described above, we used the following three benchmark network architectures:

1. Linear model. For this model, we used a simplified input. Instead of four time-steps, the linear model takes a single time-step (thus, the input tensor shape is $1\times 23\times 3$ ). We connect the input directly to the output layer of size $46$ . There are no hidden layers. All activations are linear.
2. Fully dense model. For the input we take the standard four time-steps (represented by a $4\times 23\times 3$ tensor). First, the input is reshaped into a $4\times 69$ tensor. Next, each time step of the input goes through a time-distributed dense layer of size $K$ with the ReLU activation. The output of the time-distributed layer is subsequently flattened and connected to a regular dense layer of size $K$ , also with the ReLU activation. The resulting tensor is finally connected to the output layer of size $46$ with a linear activation.
3. Convolutional model. We use the same input as above, four time-steps represented as $4\times 23\times 3$ tensors. We apply a 1-dimensional convolution with a filter size of $4$ and with the ReLU activation. The purpose of that layer is to mix the original channels (encoding the real and imaginary part of the wave packet as well as the values of the potential) from different temporal points. The resulting tensor has shape $1\times 23\times F$ , where $F$ is the number of the convolution filters. We took $F=\lfloor K/4\rfloor$ . The resulting tensor is subsequently flattened and connected to a hidden layer with $K$ neurons, and with the ReLU activation. The last layer is the output of size $46$ with a linear activation.

The output of every model has size $46$ . In the last step, the output is reshaped to form a $23\times 2$ tensor, with the second dimension encoding the real and imaginary part of the wave function.
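The two reshaping steps that frame every architecture, from $4\times 23\times 3$ to $4\times 69$ at the input and from $46$ to $23\times 2$ at the output, can be sketched with nested lists. The sketch assumes row-major ordering, so that the real and imaginary parts of one grid point are adjacent in the flat output; the actual memory layout used by the authors is not stated in the text.

```python
def flatten_time_steps(x):
    """Reshape an (H, W, C) nested list into (H, W*C), one row per time step."""
    return [[value for point in frame for value in point] for frame in x]


def to_complex_window(output, W=23):
    """Reshape a flat output of size 2W into W complex values.

    Assumption: row-major (W, 2) layout, i.e. output[2*i] is the real part
    and output[2*i + 1] the imaginary part of grid point i.
    """
    if len(output) != 2 * W:
        raise ValueError("expected an output of size 2 * W")
    return [complex(output[2 * i], output[2 * i + 1]) for i in range(W)]


x = [[[0.0, 0.0, 0.0] for _ in range(23)] for _ in range(4)]
flat = flatten_time_steps(x)
print(len(flat), len(flat[0]))  # 4 69

window = to_complex_window(list(range(46)))
print(window[0], window[22])  # (1j) (44+45j)
```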

Training details

We trained the neural networks using the AdamW optimizer⁴⁵ with the mean squared error (MSE) loss function. We used a linearly decayed learning rate with a warm-up. More information regarding the training procedure is included in the Supplementary Information.
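A linearly decayed learning rate with a warm-up can be written as a small function of the training step. The numbers in the example (peak rate, warm-up length, total steps) are placeholders chosen for illustration and are not the values used in the paper.

```python
def learning_rate(step, total_steps, warmup_steps, peak_lr):
    """Linear warm-up from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Hypothetical schedule: 100 warm-up steps out of 1100 in total.
schedule = [learning_rate(s, 1100, 100, 1e-3) for s in (0, 50, 100, 600, 1100)]
print(schedule)
```

The schedule peaks exactly at the end of the warm-up and reaches zero at the final step.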

Model evaluation

To evaluate the model’s performance, we compare the predicted evolution of the system with the ground truth, using the following two metrics:

1. Mean absolute error (MAE), calculated for each time step,

$\overline{|\epsilon|}=\sum_{i=1}^{N_{x}}\frac{{|\hat{\psi_{i}}-\psi_{i}|}}{N_{x}},$ (12)

where $N_{x}$ is the number of spatial points, while $\hat{\psi_{i}}$ and $\psi_{i}$ are the predicted and the ground truth wave function values, respectively, evaluated at the i-th discrete position, $x_{i}$ . MAE remains low for a good model. However, a consistently low MAE is not sufficient to identify a good model, as a model that constantly outputs zero or near-zero values might result in a relatively low MAE in some cases (e.g., when the ground truth is a wave packet concentrated only in a small volume of a larger space). This limitation of MAE leads us to introduce another metric, namely, the normalized correlation.
2. Normalized correlation per time step:

$\mathcal{C}=\frac{\sum_{i=1}^{N_{x}}{\hat{\psi_{i}}^{*}\psi_{i}}}{|\bm{\hat{\psi}}||\bm{\psi}|},$ (13)

where $N_{x}$ , $\hat{\psi_{i}}$ and $\psi_{i}$ are defined in the same way as above. The symbol ^∗ represents the complex conjugate. Normalized correlation treats the predicted and true wave functions as two vectors, i.e., $\bm{\hat{\psi}}$ and $\bm{\psi}$ . Thus, normalized correlation can be understood as the angular similarity between the two wave function vectors.

For an overall performance evaluation, we calculate the average over a number of consecutive time steps, denoted by $\langle\overline{|\epsilon|}\rangle$ and $\langle\mathcal{C}\rangle$ , respectively.
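Both metrics can be sketched directly from Eqs. 12 and 13. Returning zero correlation when one of the vectors vanishes is a convention adopted here for the example; the text does not specify how that case is handled, nor whether the magnitude or the real part of the complex value $\mathcal{C}$ is reported.

```python
import math


def mean_absolute_error(pred, true):
    """Eq. 12: mean of |pred_i - true_i| over all grid points."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def normalized_correlation(pred, true):
    """Eq. 13: complex inner product divided by the product of the norms."""
    inner = sum(p.conjugate() * t for p, t in zip(pred, true))
    norm_p = math.sqrt(sum(abs(p) ** 2 for p in pred))
    norm_t = math.sqrt(sum(abs(t) ** 2 for t in true))
    if norm_p == 0.0 or norm_t == 0.0:
        return 0.0  # convention (assumption): undefined correlation counts as 0
    return inner / (norm_p * norm_t)


true = [0j, 0j, 3 + 4j, 0j]    # localized "packet"
zeros = [0j, 0j, 0j, 0j]       # model that always outputs zero
rotated = [1j * t for t in true]  # exact shape, wrong global phase

print(mean_absolute_error(zeros, true))            # 1.25
print(normalized_correlation(zeros, true))         # 0.0
print(abs(normalized_correlation(rotated, true)))  # 1.0
```

The all-zero prediction illustrates the limitation discussed above: its MAE shrinks as the box grows around a localized packet, while its correlation stays at zero.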

Data Availability

Sample data can be generated using the scripts available in our GitHub repository, https://github.com/yaoyu-33/quantum_dynamics_machine_learning.

The full training data set as well as all validation and test cases are available from the corresponding author upon reasonable request.

Code Availability

Source code for training and evaluating the machine learning models is available at https://github.com/yaoyu-33/quantum_dynamics_machine_learning.

Acknowledgments

The authors acknowledge the Center for Advanced Research Computing (CARC) at the University of Southern California for providing computing resources that have contributed to the research results reported within this publication. URL: https://carc.usc.edu.

We would like to thank Prof. Oleg V. Prezhdo and Prof. Aiichiro Nakano for helpful discussions during the early stages of the project.

Author Contributions

The study was planned by M.Ab. and S.H. The manuscript was prepared by Y.Y., C.C., S.H., and M.Ab. The data set construction was done by Y.Y. and C.C. The machine learning studies were performed by both Y.Y. and C.C. Result validation and code optimization were performed by M.Ag., D.K., and M.Ab. All authors discussed the results, wrote, and commented on the manuscript.

Competing Interests

The authors declare no competing interests.

Supplementary Information

Choice of Examples

One of the goals of our work was to demonstrate the following possibility: it is feasible to emulate certain physical processes using machine learning models that have been trained solely on restricted, simple examples. In other words, our aim was to address the particular challenge of training machine learning models in settings where generating training examples is difficult or expensive, with the exception of some specific discrete cases. To deliver a clear proof of concept, we have deliberately chosen a simple and well-known problem: one-dimensional quantum dynamics. To train the model, we have used only examples concerning a single rectangular potential barrier and a single wave packet with Gaussian modulation. To quantify the extent to which the model generalizes, we have designed a number of test cases consisting of various potential barrier shapes and compositions.

To that extent, we have demonstrated that it is possible to emulate the dynamics of quantum mechanical systems without explicit knowledge of the Schrödinger equation or of any physical laws in general. In contrast to previous studies, we focused on eliciting what neural networks need in order to render faithful emulations and what the neural network learns during the training.

Details on the Training Data Set

We generated training examples by evolving a single Gaussian-modulated wave packet. The wave packet can be described with three parameters $X_{0}$ , $S_{0}$ , and $E_{0}$ , which denote the center-of-mass position, the spread, and the energy, respectively.

During our training, we considered two training regimes. The first one includes examples of free propagation (no potential barrier), while the second is concerned with propagation through an environment with a single rectangular barrier. If present, we constrain the barrier to be located at the center of the simulation box. Thus, the potential can be defined by two quantities, $H_{b}$ and $W_{b}$ , which are the height and width of the barrier, respectively.

Below, we show the values of parameters used when generating the training examples.

Emulation of freely dispersing wave packets:

$X_{0}=10.0,\ 40.0,\ 70.0\,\ a.u.$
$S_{0}=1.0,\ 1.5,\ 2.0,\ 2.5,\ 3.0,\ 3.5,\ 4.0\,\ a.u.$
$E_{0}=1.0,\ 2.0,\ 3.0,\ 4.0,\ 5.0,\ 6.0,\ 7.0,\ 8.0,\ 9.0\,\ a.u.$

Quantum wave emulations with potential barrier being present:

$X_{0}=10.0,\ 40.0,\ 70.0\,\ a.u.$
$S_{0}=1.0,\ 1.5,\ 2.0,\ 2.5,\ 3.0,\ 3.5,\ 4.0\,\ a.u.$
$E_{0}=1.0,\ 2.0,\ 3.0,\ 4.0,\ 5.0,\ 6.0,\ 7.0,\ 8.0,\ 9.0\,\ a.u.$
$H_{b}=1.0,\ 2.0,\ 3.0,\ 4.0,\ 5.0,\ 6.0,\ 7.0,\ 8.0,\ 9.0,\ 10.0,\ 11.0,\ 12.0,\ 13.0,\ 14.0\,\ a.u.$
$W_{b}=7.0\,\ a.u.$
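The parameter lists above can be expanded into the full set of raw simulations with a Cartesian product. The sketch below only enumerates the combinations; the variable names are illustrative.

```python
from itertools import product

X0 = [10.0, 40.0, 70.0]
S0 = [1.0 + 0.5 * i for i in range(7)]   # 1.0, 1.5, ..., 4.0
E0 = [float(e) for e in range(1, 10)]    # 1.0, 2.0, ..., 9.0
HB = [float(h) for h in range(1, 15)]    # 1.0, 2.0, ..., 14.0
WB = [7.0]

free_cases = list(product(X0, S0, E0))
barrier_cases = list(product(X0, S0, E0, HB, WB))
print(len(free_cases), len(barrier_cases))
```

The two counts, 189 and 2646, agree with the number of raw simulations listed in Tab. 4.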

In Figure S1, we show in detail the input structure used for training the neural network-based emulator. The red rectangle indicates the cutting window, which produces slices of data from the recorded simulations. Those slices are then reassembled into a time series that constitutes the input of the neural network.

In Figure S2, we present the purpose of the input sampling. We assign a higher probability to data windows that overlap with the central potential barrier. The motivation is that these cases are inherently harder to learn than examples depicting free propagation. By tuning the sampling probability, we can easily control the ratio between the “no-potential” and “with-potential” windows, which helps to balance the training data set.
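A minimal sketch of such a cutting window is given below. The array layout (time, space, channel), the choice of the following time step as the target, and the random stand-in data are assumptions made for illustration; the defaults $H=4$ and $W=23$ follow Tab. 4.

```python
import numpy as np

def cut_window(psi, potential, t, p, H=4, W=23):
    """Cut one training sample from a recorded simulation.

    psi: complex array of shape (T, N); potential: real array of shape (N,).
    Returns inputs of shape (H, W, 3) holding the real part, the imaginary
    part, and the potential, and a target of shape (W, 2) at time t + H.
    The layout is an assumption made for this sketch.
    """
    sl = slice(p, p + W)
    hist = psi[t:t + H, sl]
    pot = np.broadcast_to(potential[sl], (H, W))
    inputs = np.stack([hist.real, hist.imag, pot], axis=-1)
    nxt = psi[t + H, sl]
    target = np.stack([nxt.real, nxt.imag], axis=-1)
    return inputs, target

# Stand-in data in place of a recorded simulation
rng = np.random.default_rng(0)
psi = rng.normal(size=(50, 200)) + 1j * rng.normal(size=(50, 200))
potential = np.zeros(200)
potential[90:110] = 5.0

inputs, target = cut_window(psi, potential, t=10, p=85)
```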
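The weighting can be sketched as follows. The grid size, the barrier position, and the `boost` factor are hypothetical values chosen for the example, not the ones used to build the training set.

```python
import numpy as np

def window_probabilities(n_grid, W, barrier_lo, barrier_hi, boost=4.0):
    """Sampling probability for each window start index.

    Windows that overlap the barrier, which occupies grid indices
    [barrier_lo, barrier_hi), receive `boost` times the base weight.
    """
    starts = np.arange(n_grid - W + 1)
    overlaps = (starts < barrier_hi) & (starts + W > barrier_lo)
    weights = np.where(overlaps, boost, 1.0)
    return starts, overlaps, weights / weights.sum()

starts, overlaps, probs = window_probabilities(200, 23, 90, 110, boost=4.0)

rng = np.random.default_rng(0)
sampled = rng.choice(starts, size=10_000, p=probs)
frac_with_potential = overlaps[sampled].mean()   # empirical share of barrier windows
expected = probs[overlaps].sum()                 # share implied by the weights
```

Changing `boost` shifts the ratio between the two kinds of windows without regenerating any simulation.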

Hand-Designed Test Data Sets

Freely Dispersing Test Cases.

To test the performance of our emulator in the freely dispersing regime, we have generated $12$ random test cases. The parameters were randomly selected from $X_{0}\in(10.0,90.0)$ , $S_{0}\in(0.5,9.0)$ , and $E_{0}\in(0.0,9.0)$ . The particular test cases used in this study are depicted in Fig. S3.

Test Cases with Potential Barriers.

To test the performance in the more general case, we hand-designed a test data set that includes 11 random single-rectangular barriers, 2 double-rectangular barriers, 1 triple-rectangular barrier, 7 irregular barriers, 2 quadratic potentials, and 2 rectangular wells. All the test cases are depicted in Fig. S4.

Non-Gaussian Wave Packets

In Figure S5, we show that the emulator can not only handle different potential landscapes, but can also generalize to different modulations of the emulated wave packet. The performance for the triangle-shaped packet is on par with that for the Gaussian-modulated packet, whereas the performance for the square-shaped packet is noticeably worse. This can be understood by comparing the Fourier decompositions of the differently shaped signals. To accurately reproduce a square signal, we need to include a number of higher harmonics. We know (cf. Fig. 4 in the main text) that when the energy is too high, the performance of the emulator deteriorates. In comparison, the higher harmonics (which correspond to higher energies of plane waves) are less crucial when reproducing a triangle signal with the same average accuracy as in the case of a rectangular-shaped signal. This might explain the observed differences in Fig. S5.
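The Fourier argument can be checked numerically with a short sketch that compares the fraction of spectral power above a cutoff wave number for the three envelope shapes. The grid, the envelope widths, and the cutoff `k_cut` are hypothetical and serve only to illustrate the ordering.

```python
import numpy as np

x = np.linspace(-50.0, 50.0, 2048, endpoint=False)
half_width = 5.0

square = (np.abs(x) <= half_width).astype(float)
triangle = np.clip(1.0 - np.abs(x) / half_width, 0.0, None)
gauss = np.exp(-x ** 2 / (2.0 * (half_width / 2.0) ** 2))

def high_k_fraction(envelope, k_cut):
    """Fraction of spectral power carried by wave numbers above k_cut."""
    spectrum = np.abs(np.fft.rfft(envelope)) ** 2
    k = 2.0 * np.pi * np.fft.rfftfreq(envelope.size, d=x[1] - x[0])
    return spectrum[k > k_cut].sum() / spectrum.sum()

k_cut = 2.0
fractions = {
    "square": high_k_fraction(square, k_cut),
    "triangle": high_k_fraction(triangle, k_cut),
    "gaussian": high_k_fraction(gauss, k_cut),
}
for name, value in fractions.items():
    print(name, value)
```

The square envelope keeps the largest share of its power at high wave numbers, the triangle envelope a much smaller one, and the Gaussian envelope the smallest, in line with the ordering of the emulation quality described above.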

Freely Dispersing Wave Packets Case

In this section, we present the performance benchmark for our neural network emulator in the case of freely dispersing quantum wave packets. As there is no potential, the input data window contains only two channels, representing the real and imaginary parts of the wave function. For the freely-dispersing-wave-packet example, we use a training data set generated by evolving 189 different Gaussian wave packets, each with a different combination of $X_{0}$ , $S_{0}$ and $E_{0}$ . The machine learning models were trained for three epochs using the AdamW optimizer⁴⁵ with the MSE loss function.

Figure S6a shows the results of the emulation for all four tested architectures. All four models, shown in the top row, successfully reproduce the simulation results, with only minor errors. We find that the emulation errors accumulate as time evolves. Such error accumulation is inevitable given the recurrent nature of the predictions. The errors from the convolutional model are the most significant, but still within the limits of acceptable performance.

Figures S6b and S6c show that all tested architectures maintain low MAEs and high correlations, even after 400 time steps, indicating that they achieve accurate and stable performance. We report the MAE and normalized correlations, averaged over all time steps and test samples, in Table 3.

It is noteworthy that the baseline linear model achieves nearly perfect predictive power, as long as no potential barrier is considered (this would not be the case if a potential barrier were present; see the results in the main text, e.g., the performance comparison presented in Fig. 3). The reason is that the time evolution of the wave function over a small interval $\Delta t$ can be expressed as $\psi(t+\Delta t)=\exp\left(-iT_{x}\Delta t\right)\psi(t)+O\left([\Delta t]^{3}\right)$ , where the kinetic operator $T_{x}$ is represented by a tridiagonal matrix. Therefore, with respect to the leading terms, two consecutive time steps of the wave function evolution are linearly connected, and hence a simple linear architecture can be trained to reproduce the simulations, given a sufficiently small time step $\Delta t$ .
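The linearity argument can be illustrated with a small numerical sketch. Here $T_{x}$ is assumed to be the standard second-order finite-difference kinetic operator in atomic units, and the grid spacing, time step, and number of grid points are hypothetical.

```python
import numpy as np

n, dx, dt = 64, 0.5, 0.01   # hypothetical grid and time step

# Tridiagonal kinetic operator T = -(1/2) d^2/dx^2 (finite differences)
main = np.full(n, 1.0 / dx ** 2)
off = np.full(n - 1, -0.5 / dx ** 2)
T = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)

# exp(-i T dt) through the eigen-decomposition of the real symmetric T
evals, evecs = np.linalg.eigh(T)
U = (evecs * np.exp(-1j * evals * dt)) @ evecs.T

rng = np.random.default_rng(0)
psi1 = rng.normal(size=n) + 1j * rng.normal(size=n)
psi2 = rng.normal(size=n) + 1j * rng.normal(size=n)
a, b = 0.3 - 0.2j, 1.1 + 0.7j

# One step acts as a fixed linear map on any superposition
lhs = U @ (a * psi1 + b * psi2)
rhs = a * (U @ psi1) + b * (U @ psi2)
linearity_error = np.max(np.abs(lhs - rhs))
unitarity_error = np.max(np.abs(U.conj().T @ U - np.eye(n)))

# Largest propagator element more than five grid sites off the diagonal
i, j = np.indices(U.shape)
far_weight = np.max(np.abs(U[np.abs(i - j) > 5]))
```

For a small time step the propagator is a fixed matrix that is concentrated near the diagonal, which is why a linear model acting on a local data window can reproduce the free evolution.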

Table 3: Benchmark using Freely Dispersing Quantum Wave Packets. Performance comparison for different architectures of our machine learning-based emulator in the case of freely dispersing quantum wave packets. As metrics, we used the mean absolute error $\overline{|\epsilon|}$ (less is better) and a normalized correlation $\mathcal{C}$ (closer to $1$ is better), both averaged over all spatial grid points, all time steps, and all available test cases.

Model	Parameters	$\langle\overline{|\epsilon|}\rangle$	$\langle\mathcal{C}\rangle$
Linear	2,162	0.0031	0.9827
Dense	12,834	0.0019	0.9952
Conv	19,280	0.0030	0.9966
GRU	19,412	0.0014	0.9984
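For orientation, the two metrics reported in Table 3 could be computed along the following lines. The exact definitions are given in the main text; the forms used here (modulus of the complex difference for the error, normalized overlap for the correlation) are assumptions made for this sketch.

```python
import numpy as np

def mean_absolute_error(pred, true):
    """Mean modulus of the point-wise difference (assumed definition)."""
    return np.mean(np.abs(pred - true))

def normalized_correlation(pred, true):
    """Modulus of the overlap divided by both norms (assumed definition)."""
    overlap = np.abs(np.vdot(true, pred))
    return overlap / (np.linalg.norm(pred) * np.linalg.norm(true))

# Toy example: a reference packet and a slightly perturbed prediction
x = np.linspace(-10.0, 10.0, 201)
true = np.exp(-x ** 2) * np.exp(2j * x)
pred = true + 0.01

mae = mean_absolute_error(pred, true)
corr = normalized_correlation(pred, true)
```

Under these definitions the correlation is insensitive to an overall rescaling of the prediction, while the mean absolute error is not.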

Training details and hyper-parameters tuning

All models were trained using the AdamW optimizer⁴⁵ with the mean squared error (MSE) loss function. We selected an initial learning rate of $10^{-3}$ , with the decay rates for the first- and second-moment estimates set to $\beta_{1}=0.9$ and $\beta_{2}=0.99$ , respectively. We applied a learning rate scheduler: one percent of the total training steps was used as warm-up steps with a linearly increasing learning rate, after which the learning rate decays linearly to $10^{-6}$ . We provide the detailed values of the training parameters and the model hyper-parameters in Tab. 4.
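The schedule described above can be written as a plain function of the training step. This is a sketch with illustrative names, independent of any particular deep learning framework.

```python
def learning_rate(step, total_steps, peak=1e-3, final=1e-6, warmup_fraction=0.01):
    """Linear warm-up from 0 to `peak`, then linear decay to `final`."""
    warmup_steps = max(1, int(round(warmup_fraction * total_steps)))
    if step < warmup_steps:
        return peak * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return peak + (final - peak) * progress

total = 10_000   # hypothetical number of training steps
schedule = [learning_rate(s, total) for s in range(total + 1)]
```

With 10,000 steps, the first 100 steps form the warm-up, the peak is reached at step 100, and the rate reaches $10^{-6}$ at the final step.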

Table 4: Values of training parameters and hyper-parameters.

	Without potential	With potential
Number of raw simulations	189	2646
Input time steps (H)	4	4
Data window size (W)	23	23
Hidden size (K)	46	69
Training epochs	3	5
Optimizer	AdamW ^{45, 63}	AdamW
Learning rate scheduler	$0\rightarrow 10^{-3}$ (warm-up)	$0\rightarrow 10^{-3}$ (warm-up)
	$10^{-3}\rightarrow 10^{-6}$ (linear decay)	$10^{-3}\rightarrow 10^{-6}$ (linear decay)
Weight decay rate	1	1
Gradient clipping ⁶²	1.0	1.0

To determine the optimal values of the hyper-parameters, we performed a search over the number of input time steps, the window size, and the hidden-layer size. We present the results in Fig. S7.

Figure S7a shows that the optimal depth of the history is just two time steps, which is consistent with the analysis of the results presented in Fig. 5 in the main text. Notably, even with just one time step, our model is still able to achieve relatively good average performance. However, this high average score is deceptive, as it is inflated by the fact that a single time step is enough to predict the evolution far from the potential barrier (cf. results presented in Fig. S6). To properly account for scattering and tunnelling, a history longer than one time step is needed (as evident in the results presented in Fig. 3 in the main text).

In Figure S7b, we observe that a window size of 23 gives the best results. We want to remark that this parameter is closely tied to the sampling strategy employed during the data generation process.

In Figure S7c, our model shows the lowest error for hidden sizes of 69 and 92. The error slightly increases beyond a hidden size of 115. We expect that models with a larger hidden size should in principle approximate the target function better; however, they might also be more prone to overfitting, and therefore might require more precise optimization of the training parameters to fully leverage their potential.

Average Direct Gradients

In Figure 5 in the main text, we presented exemplary results of the direct gradient measurements. Interestingly, when repeating the calculations for other positions of the input window, we consistently obtained similar values of the direct gradient. To analyze this phenomenon, we measured the average value of the direct gradients for 200 randomly sampled data windows. The results are depicted in Fig. S8. There is a clearly visible pattern that might indicate a steady and almost linear relationship between the input wave functions and the predicted value. As presented in the previous section, in the absence of any potential barrier, the target values are linearly related to the input from the previous step. Apparently, introducing a potential does not change this linear mapping by much (compare the relatively small values of the standard deviation to the size of the bars). The situation is different when we instead consider the sensitivity to the values of the potential. We find that the direct gradients vary greatly over different window positions, compared to the average values. This might be an indicator that the mapping from potential values to the target wave functions is a more complex, non-linear function, which is also consistent with the interpretation of the results presented in Figs. 3 and S6, as discussed above.
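The diagnostic can be illustrated on toy models. The sketch below uses central finite differences in place of automatic differentiation, and two hypothetical stand-in models (one linear, one non-linear) in place of the trained emulator. It shows how the standard deviation of the gradients over sampled windows separates a linear mapping from a non-linear one.

```python
import numpy as np

def direct_gradient(model, x, eps=1e-5):
    """Central finite-difference gradient of a scalar output w.r.t. each input."""
    grad = np.zeros_like(x)
    for idx in range(x.size):
        step = np.zeros_like(x)
        step[idx] = eps
        grad[idx] = (model(x + step) - model(x - step)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
w = rng.normal(size=8)

def linear_model(x):
    """Stand-in for a linear input-to-output mapping."""
    return float(w @ x)

def nonlinear_model(x):
    """Stand-in for a non-linear input-to-output mapping."""
    return float(np.tanh(w @ x))

windows = rng.normal(size=(200, 8))   # 200 randomly sampled toy "data windows"
stats = {}
for name, model in [("linear", linear_model), ("nonlinear", nonlinear_model)]:
    grads = np.array([direct_gradient(model, x) for x in windows])
    stats[name] = (grads.mean(axis=0), grads.std(axis=0))
```

For the linear stand-in the gradients are identical for every window, so their standard deviation vanishes, whereas for the non-linear stand-in they depend on the window and the standard deviation is sizable.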