This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Efficient Generation of Molecular Clusters with Dual-Scale Equivariant Flow Matching

Akshay Subramanian Shuhui Qu Cheol Woo Park Massachusetts Institute of Technology Samsung Display America Lab Samsung Display America Lab [email protected] Sulin Liu Janghwan Lee Rafael Gómez-Bombarelli Massachusetts Institute of Technology Samsung Display America Lab Massachusetts Institute of Technology [email protected]
Abstract

Amorphous molecular solids offer a promising alternative to inorganic semiconductors, owing to their mechanical flexibility and solution processability. The packing structure of these materials plays a crucial role in determining their electronic and transport properties, which are key to enhancing the efficiency of devices like organic solar cells (OSCs). However, obtaining these optoelectronic properties computationally requires molecular dynamics (MD) simulations to generate a conformational ensemble, a process that can be computationally expensive due to the large system sizes involved. Recent advances have focused on using generative models, particularly flow-based models as Boltzmann generators, to improve the efficiency of MD sampling. In this work, we developed a dual-scale flow matching method that separates training and inference into coarse-grained and all-atom stages and enhances both the accuracy and efficiency of standard flow matching samplers. We demonstrate the effectiveness of this method on a dataset of Y6 molecular clusters obtained through MD simulations, and we benchmark its efficiency and accuracy against single-scale flow matching methods.

1 Introduction

Amorphous molecular solids are disordered organic systems with several applications in the field of organic optoelectronics. Their mechanical flexibility and solution processability make them attractive alternatives to inorganic semiconductors [22, 14, 19]. Different molecules tend to pack differently in amorphous solids, which lead to different network extensivity, electronic properties, and transport rates [13]. The influence of packing structure on optoelectronic properties has led to significant interest in computationally simulating these materials with molecular dynamics (MD) simulations [13, 6] to enhance our understanding of the correlations between structural design and corresponding packing and optoelectronic properties. This can inform better design strategies for the next generation of organic electronics devices.

Having the ability to generate amorphous solids in a computationally efficient manner can enable faster sampling from the multi-molecule Boltzmann distribution, as well as property-guided generation of optimally packed configurations. There have been several works developing efficient methods to sample from Boltzmann distributions or their emulated (non re-weighted) versions [17, 11]. Scaling up to larger systems (such as amorphous clusters) is however computationally expensive if generation is to be performed at the all-atom (AA) level. In the field of protein generation, coarse-grained (CG) representations are used to collapse amino acids into single beads, which drastically reduces system size while still retaining higher-level structural features. There have been works that generate proteins in their coarse-grained representations, and other works that back-map coarse-grained proteins into their all-atom structure [12, 4, 24, 10].

In this paper, we develop a unified dual-scale flow matching method that improves accuracy of all-atom flow matching methods by separating training and inference into different stages for coarse-grained and all-atom resolutions. Moreover, we show that by performing a majority of the inference compute at the coarse-grained resolution, one can reduce the time taken for inference integration without sacrificing accuracy, paving the way for more efficient generation of larger systems. As a demonstrative example, in this work, we used a dataset of amorphously packed clusters consisting of 5 Y6 [25] molecules that we obtained from MD simulations. As the coarse-grained representation for our experiments, we pre-defined a mapping scheme that collapses the 505 atoms system into 65 beads (see Figure 1(a)). With gains in efficiency obtained from this method, we plan to scale up to larger systems in the future that are more representative of thin films in devices, as well as test other choices of coarse-graining mappings.

2 Background

Conditional Flow Matching (CFM). This is a new paradigm of training continuous normalizing flows (CNFs) efficiently in a simulation free manner [3, 2, 15]. An ODE parameterized by a learnable vector field vθ(𝐱,t)v_{\theta}(\mathbf{x},t) transforms an easy-to-sample prior distribution p0p_{0}, into the more complex target distribution p1p_{1}. Parameters θ\theta are learned through the regression objective

CFM=𝔼tU[0,1],𝐳p0(𝐱𝟎)p1(𝐱𝟏),𝐱pt(𝐱|𝐳)||vθ(𝐱,t)u(𝐱|𝐳)||2,\mathcal{L}_{CFM}=\mathbb{E}_{t\sim U[0,1],\mathbf{z}\sim p_{0}(\mathbf{x_{0}})p_{1}(\mathbf{x_{1}}),\mathbf{x}\sim p_{t}(\mathbf{x}|\mathbf{z})}||v_{\theta}(\mathbf{x},t)-u(\mathbf{x}|\mathbf{z})||^{2}, (1)

We chose the probability path definition to be pt(𝐱|𝐳)=𝒩(𝐱|t𝐱𝟏+(1t)𝐱𝟎,σ2)p_{t}(\mathbf{x}|\mathbf{z})=\mathcal{N}(\mathbf{x}|t\mathbf{x_{1}}+(1-t)\mathbf{x_{0}},\sigma^{2}) , which gives u(𝐱|𝐳)=𝐱𝟏𝐱𝟎u(\mathbf{x}|\mathbf{z})=\mathbf{x_{1}}-\mathbf{x_{0}} as the conditional vector field. Similar to work by Stark et al. [20], we trained vθv_{\theta} to predict 𝐱𝟏\mathbf{x_{1}} rather than 𝐱𝟏𝐱𝟎\mathbf{x_{1}}-\mathbf{x_{0}}, and parameterized vθv_{\theta} by SE(3) equivariant refinement tensor field network (TFN) [23] layers. We present a comparison against other parameterizations in Section A.1.

Harmonic Prior. Besides the standard gaussian prior, another common choice for prior in molecule generation tasks is the harmonic prior. The harmonic prior was introduced in Jing et al. [9] and used within the flow-matching framework by Stark et al. [20] as a more chemically plausible prior. The prior distribution p0(𝐱𝟎)p_{0}(\mathbf{x_{0}}) is interpreted as a Boltzmann distribution with a quadratic potential energy function, i.e., p0(𝐱𝟎)eE(𝐱)=e𝐱T𝐇𝐱p_{0}(\mathbf{x_{0}})\propto e^{-E(\mathbf{x})}=e^{-\mathbf{x}^{T}\mathbf{H}\mathbf{x}}, where 𝐇\mathbf{H} is chosen such that E(𝐱)=Σi,j𝐱i𝐱j2E(\mathbf{x})=\Sigma_{i,j\in\mathcal{E}}||\mathbf{x}_{i}-\mathbf{x}_{j}||^{2}. We need to have the bond connectivity information \mathcal{E} a-priori to sample from the harmonic prior.

3 Methods

3.1 Model

Dual-Scale CFM. This method decomposes the task of predicting all-atom coordinates 𝐱𝟏\mathbf{x_{1}}, into two sequential steps of predicting coarse-grained coordinates 𝐜𝟏M×3\mathbf{c_{1}}\in\mathbb{R}^{M\times 3}, and predicting all-atom coordinates 𝐱𝟏N×3\mathbf{x_{1}}\in\mathbb{R}^{N\times 3}, where MM and NN are dimensions of coarse-grained and all-atom coordinates respectively. Two separate vector field networks (VFNs) are learned: 1) vθ(𝐜,t)v_{\theta}(\mathbf{c},t) to generate coarse-grained bead positions 𝐜𝟏\mathbf{c_{1}}, and 2) vϕ(𝐱,t|𝐜^𝟏)v_{\phi}(\mathbf{x},t|\mathbf{\hat{c}_{1}}) to generate all-atom positions 𝐱𝟏\mathbf{x_{1}} conditioned on predicted bead positions 𝐜^𝟏\mathbf{\hat{c}_{1}}. Separate priors p0(𝐜𝟎)p_{0}(\mathbf{c_{0}}) and p0(𝐱𝟎)p_{0}(\mathbf{x_{0}}) are used for the two flows since they act on different input dimensionalities.

Training and Inference. Both flows are trained using Eq. 1 with the relevant distributions and vector fields being substituted in. To be precise, the two objectives are:

CG=𝔼tU[0,1],𝐳p0(𝐜𝟎)p1(𝐜𝟏),𝐜pt(𝐜|𝐳)||vθ(𝐜,t)u(𝐜|𝐳)||2,\mathcal{L}_{CG}=\mathbb{E}_{t\sim U[0,1],\mathbf{z}\sim p_{0}(\mathbf{c_{0}})p_{1}(\mathbf{c_{1}}),\mathbf{c}\sim p_{t}(\mathbf{c}|\mathbf{z})}||v_{\theta}(\mathbf{c},t)-u(\mathbf{c}|\mathbf{z})||^{2}, (2)
AA=𝔼tU[0,1],𝐳p0(𝐱𝟎)p1(𝐱𝟏),𝐱pt(𝐱|𝐳)||vϕ(𝐱,t|𝐜𝟏)u(𝐱|𝐳)||2,\mathcal{L}_{AA}=\mathbb{E}_{t\sim U[0,1],\mathbf{z}\sim p_{0}(\mathbf{x_{0}})p_{1}(\mathbf{x_{1}}),\mathbf{x}\sim p_{t}(\mathbf{x}|\mathbf{z})}||v_{\phi}(\mathbf{x},t|\mathbf{c_{1}})-u(\mathbf{x}|\mathbf{z})||^{2}, (3)

vϕv_{\phi} was trained using the ground-truth coarse-grained coordinates 𝐜𝟏\mathbf{c_{1}} as the condition instead of predicted coordinates 𝐜^𝟏\mathbf{\hat{c}_{1}}, which decoupled the two flows. Inference was performed using an Euler ODE solver to integrate from t=0t=0 to 11 for each of the two flows. For the coarse-grained flow, the bead positions were obtained with 𝐜^𝟏=vθ(𝐜,t)\mathbf{\hat{c}_{1}}=v_{\theta}(\mathbf{c},t), and an integration timestep was performed with 𝐜𝐭+𝚫𝐭=𝐜𝐭+Δt(𝐜^𝟏𝐜𝟎)\mathbf{c_{t+\Delta t}}=\mathbf{c_{t}}+\Delta t(\mathbf{\hat{c}_{1}}-\mathbf{c_{0}}). Similarly, for the all-atom flow, 𝐱^𝟏=vϕ(𝐱,t|𝐜^𝟏)\mathbf{\hat{x}_{1}}=v_{\phi}(\mathbf{x},t|\mathbf{\hat{c}_{1}}), and 𝐱𝐭+𝚫𝐭=𝐱𝐭+Δt(𝐱^𝟏𝐱𝟎)\mathbf{x_{t+\Delta t}}=\mathbf{x_{t}}+\Delta t(\mathbf{\hat{x}_{1}}-\mathbf{x_{0}}) were used. A total of 40 integration steps were taken to go from coarse-grained prior sample 𝐜𝟎\mathbf{c_{0}} to all-atom target 𝐱𝟏\mathbf{x_{1}}.

Refer to caption
Figure 1: Coarse-graining mapping scheme, and influence of CG:AA ratio on metrics. (a) Our coarse-graining mapping scheme for an individual Y6 molecule consisting of 13 beads identified with different colors. (b) JSD bond lengths, angles, and inference time on test data generation as a function of CG:AA ratio. As we increased the ratio, we observed a noticeable decreasing trend in inference time with negligible change in JSD values.

3.2 Graph Representation

Besides cartesian coordinates (𝐜/𝐱\mathbf{c}/\mathbf{x}), the vector field predictors vθ/ϕv_{\theta/\phi} also take atom features 𝐟𝐂𝐆/𝐀𝐀\mathbf{f^{CG/AA}} and bond features 𝐛𝐂𝐆/𝐀𝐀\mathbf{b^{CG/AA}} as input. More details on features and their data-types are provided in Table 2.

We pre-defined a mapping BB between node indices of all-atom atoms and coarse-grained beads, based on the basic chemical intuition that atoms present within rings have restricted degrees of freedom. The scheme is shown in Figure 1(a). We needed to define a means of collapsing the atom features 𝐟𝐀𝐀\mathbf{f^{AA}} and bond features 𝐛𝐀𝐀\mathbf{b^{AA}} into corresponding features 𝐟𝐂𝐆\mathbf{f^{CG}} and 𝐛𝐂𝐆\mathbf{b^{CG}} of coarse-grained beads so that this information was not completely lost to the flow vθv_{\theta}.

Atom Features. We aggregated atom features with 𝐟𝐢𝐂𝐆=jB(i)𝐟𝐣𝐀𝐀\mathbf{f^{CG}_{i}}=\bigoplus_{j\in B(i)}\mathbf{f^{AA}_{j}}, where \bigoplus denotes an arbitrary aggregation operator. In our case we used either the elementwise Mean or OR operations depending on the data-type of the atomic feature. We did not include String atom features as input to vθv_{\theta}. We assigned the cartesian coordinates of beads to be the averaged coordinates of all atoms present within the bead, and included them as equivariant node features to vθv_{\theta}.

Bond Features. We assigned bond connectivity to beads based on connectivity of atoms within the beads. If any two atoms belonging to different beads had a bond connecting them in the all-atom structure, we created a bond between the two beads with the corresponding bond features.

3.3 Dataset

We followed the approach developed in Kupgan et al. [13] to simulate an amorphous morphology for the molecules. We first performed geometry optimization of the Y6 molecular structure using the method described in [21]. Then we used PACKMOL [16] to pack 5 structures in a cubic box at a low density (\sim 0.1 g/cm3). The OPLS-AA force field [18] was used via the LigParGen server [5]. The system was equilibrated for 30 ns at 650 K, cooled to room temperature (300 K) at 10 K/ns, and subjected to a 30 ns production run at 300 K. All MD simulations were performed under the NPT ensemble at 1 atm with a timestep of 2 fs, using GROMACS 2023 [1]. We saved 301 frames from the production run and used a random 80:10:10 split into train, validation and test datasets. We unwrapped the molecular structures to remove periodic boundary conditions (PBC) and also removed hydrogen atoms. Each resulting frame contained 505 atoms in total in the all-atom representation.

4 Results and Discussion

We evaluated generated samples on 3 metrics (shown in Table 1). To evaluate distributional matching capabilities, we measured Jensen Shannon Divergence (JSD) between distributions of generated and MD distributions for bond lengths and angles. We evaluated efficiency through the average time taken per integration step during inference on one NVIDIA A100 GPU. We noticed that the dual-scale flow matching method improved upon a single-scale flow matching method (with Gaussian/Harmonic priors) by 15-25% on bond length and angle JSDs, while also decreasing the inference time by \sim85%.

We also tested the influence of the ratio of integration steps taken in coarse-grained and all-atom resolutions, on the 3 metrics to see how much inference time can be saved without sacrificing accuracy (shown in Figure 1(b)). We denote the ratio of number of CG and AA integration steps by CG:AA. It is interesting to see that even for large CG:AA ratios, we significantly decreased inference time with negligible sacrifice to JSD performance.

Table 1: Distribution and inference time metrics. Comparison of single-scale CFM with Gaussian and Harmonic priors, and dual-scale CFM with Gaussian prior. We used 2 distributional metrics (on bond lengths and bond angles) that quantify the Jensen Shannon divergence (JSD) between generated and MD distributions on the test dataset. Inference time is the average time taken per integration step on one NVIDIA A100 GPU.
Model Bond Lengths (JSD) Bond Angles (JSD) Inference Time (s)
Single Scale 0.6563 0.6316 0.2949
Single Scale Harmonic 0.6298 0.6066 0.3039
Dual Scale 30:10 (Ours) 0.5472 0.4610 0.0496

The preliminary results are encouraging, opening up several avenues of further exploration for the future. We plan to scale to larger systems (\sim200 Y6 molecules) which is more representative of active layers in devices, and also test our approach on different chemistries and system sizes. Secondly, while we pre-defined a coarse-graining mapping scheme in this paper, we plan to test other schemes in the future to identify correlations between the chosen schemes and performance. Finally, we plan on expanding to other metrics that are also typically used in Boltzmann generators and conformer generation such as effective sample size (ESS), average minimum RMSD (AMR) and coverage [11, 7, 8].

5 Conclusions

In this work, we developed a dual-scale flow matching method that improves upon single-scale flow matching-based samplers in accuracy as well as inference efficiency, as demonstrated on a dataset of amorphous Y6 clusters. Moreover, it is interesting to see that we can push the ratio of CG:AA during inference integration to large values without sacrificing the prediction accuracy, while boosting inference speed. While we tested on clusters of size 5 in this paper, we plan to expand to larger clusters in the future whose packing properties are more representative of active layers in optoelectronic devices. We also plan on testing the influence of coarse-graining mapping scheme on performance since we tested a single pre-defined scheme in this work.

References

  • Abraham et al. [2015] M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, and E. Lindahl. Gromacs: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX, 1:19–25, 2015.
  • Albergo and Vanden-Eijnden [2022] M. S. Albergo and E. Vanden-Eijnden. Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571, 2022.
  • Albergo et al. [2023] M. S. Albergo, N. M. Boffi, and E. Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023.
  • Charron et al. [2023] N. E. Charron, F. Musil, A. Guljas, Y. Chen, K. Bonneau, A. S. Pasos-Trejo, J. Venturin, D. Gusew, I. Zaporozhets, A. Krämer, et al. Navigating protein landscapes with a machine-learned transferable coarse-grained model. arXiv preprint arXiv:2310.18278, 2023.
  • Dodda et al. [2017] L. S. Dodda, I. Cabeza de Vaca, J. Tirado-Rives, and W. L. Jorgensen. Ligpargen web server: an automatic opls-aa parameter generator for organic ligands. Nucleic acids research, 45(W1):W331–W336, 2017.
  • Fu et al. [2023] Y. Fu, T. H. Lee, Y.-C. Chin, R. A. Pacalaj, C. Labanti, S. Y. Park, Y. Dong, H. W. Cho, J. Y. Kim, D. Minami, et al. Molecular orientation-dependent energetic shifts in solution-processed non-fullerene acceptors and their impact on organic photovoltaic performance. Nature Communications, 14(1):1870, 2023.
  • Ganea et al. [2021] O. Ganea, L. Pattanaik, C. Coley, R. Barzilay, K. Jensen, W. Green, and T. Jaakkola. Geomol: Torsional geometric generation of molecular 3d conformer ensembles. Advances in Neural Information Processing Systems, 34:13757–13769, 2021.
  • [8] B. Jing, H. Stark, T. Jaakkola, and B. Berger. Generative modeling of molecular dynamics trajectories. In ICML’24 Workshop ML for Life and Material Science: From Theory to Industry Applications.
  • Jing et al. [2023] B. Jing, E. Erives, P. Pao-Huang, G. Corso, B. Berger, and T. Jaakkola. Eigenfold: Generative protein structure prediction with diffusion models. arXiv preprint arXiv:2304.02198, 2023.
  • Jing et al. [2024] B. Jing, B. Berger, and T. Jaakkola. Alphafold meets flow matching for generating protein ensembles. arXiv preprint arXiv:2402.04845, 2024.
  • Klein and Noé [2024] L. Klein and F. Noé. Transferable boltzmann generators. arXiv preprint arXiv:2406.14426, 2024.
  • Kohler et al. [2023] J. Kohler, Y. Chen, A. Kramer, C. Clementi, and F. Noé. Flow-matching: Efficient coarse-graining of molecular dynamics without forces. Journal of Chemical Theory and Computation, 19(3):942–952, 2023.
  • Kupgan et al. [2021] G. Kupgan, X. Chen, and J.-L. Bredas. Molecular packing of non-fullerene acceptors for organic solar cells: distinctive local morphology in y6 vs. itic derivatives. Materials Today Advances, 11:100154, 2021.
  • Li et al. [2022] Y. Li, W. Huang, D. Zhao, L. Wang, Z. Jiao, Q. Huang, P. Wang, M. Sun, and G. Yuan. Recent Progress in Organic Solar Cells: A Review on Materials from Acceptor to Donor. Molecules, 27(6):1800, 2022.
  • Lipman et al. [2022] Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
  • Martínez et al. [2009] L. Martínez, R. Andrade, E. G. Birgin, and J. M. Martínez. Packmol: A package for building initial configurations for molecular dynamics simulations. Journal of Computational Chemistry, 30(13):2157–2164, 2009.
  • Noé et al. [2019] F. Noé, S. Olsson, J. Köhler, and H. Wu. Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science, 365(6457):eaaw1147, 2019.
  • Robertson et al. [2015] M. J. Robertson, J. Tirado-Rives, and W. L. Jorgensen. Improved peptide and protein torsional energetics with the opls-aa force field. Journal of chemical theory and computation, 11(7):3499–3509, 2015.
  • Shan et al. [2022] T. Shan, X. Hou, X. Yin, and X. Guo. Organic photodiodes: device engineering and applications. Frontiers of Optoelectronics, 15(1):49, 2022.
  • [20] H. Stark, B. Jing, R. Barzilay, and T. Jaakkola. Harmonic self-conditioned flow matching for joint multi-ligand docking and binding site design. In Forty-first International Conference on Machine Learning.
  • Subramanian et al. [2023] A. Subramanian, K. P. Greenman, A. Gervaix, T. Yang, and R. Gómez-Bombarelli. Automated patent extraction powers generative modeling in focused chemical spaces. Digital Discovery, 2(4):1006–1015, 2023.
  • Sun et al. [2022] L. Sun, K. Fukuda, and T. Someya. Recent progress in solution-processed flexible organic photovoltaics. npj Flexible Electronics, 6(1):89, 2022.
  • Thomas et al. [2018] N. Thomas, T. Smidt, S. Kearnes, L. Yang, L. Li, K. Kohlhoff, and P. Riley. Tensor field networks: Rotation-and translation-equivariant neural networks for 3d point clouds. arXiv preprint arXiv:1802.08219, 2018.
  • Yang and Gómez-Bombarelli [2023] S. Yang and R. Gómez-Bombarelli. Chemically transferable generative backmapping of coarse-grained proteins. arXiv preprint arXiv:2303.01569, 2023.
  • Yuan et al. [2019] J. Yuan, Y. Zhang, L. Zhou, G. Zhang, H.-L. Yip, T.-K. Lau, X. Lu, C. Zhu, H. Peng, P. A. Johnson, et al. Single-junction organic solar cell with over 15% efficiency using fused-ring acceptor with electron-deficient core. Joule, 3(4):1140–1151, 2019.

Appendix A Appendix / supplemental material

A.1 Choice of VFN architecture

Our choice of VFN architecture was based on a comparison of various architectures (shown in Table 2). We observed that using TFN as the VFN achieved best JSD performance across bond lengths and angles, as well as across gaussian and harmonic prior choices.

Table 2: Bond length and angle distributions comparison. JSD comparison between TFN, EGNN, and Attentive FP with Gaussian and Harmonic priors across the two collective variables (CV).
CV Prior TFN EGNN Attentive FP
Bond Length Gaussian 0.656 0.791 0.959
Harmonic 0.629 0.781 0.894
Bond Angle Gaussian 0.631 0.732 0.908
Harmonic 0.606 0.770 0.868

A.2 Graph Featurization

Table 3: Atom and Bond Features. Features extracted for atoms and bonds, along with their respective data types used for featurization. Same as was used by Stark et al. [20].
Feature Name Datatype
atomic_num Integer
chirality String
degree Integer
numring Integer
implicit_valence Integer
formal_charge Integer
numH Integer
hybridization String
is_aromatic Boolean
is_in_ring5 Boolean
is_in_ring6 Boolean
bond_type String
bond_stereo String
is_conjugated Boolean