Efficient Generation of Molecular Clusters with Dual-Scale Equivariant Flow Matching
Abstract
Amorphous molecular solids offer a promising alternative to inorganic semiconductors, owing to their mechanical flexibility and solution processability. The packing structure of these materials plays a crucial role in determining their electronic and transport properties, which are key to enhancing the efficiency of devices like organic solar cells (OSCs). However, obtaining these optoelectronic properties computationally requires molecular dynamics (MD) simulations to generate a conformational ensemble, a process that can be computationally expensive due to the large system sizes involved. Recent advances have focused on using generative models, particularly flow-based models as Boltzmann generators, to improve the efficiency of MD sampling. In this work, we developed a dual-scale flow matching method that separates training and inference into coarse-grained and all-atom stages and enhances both the accuracy and efficiency of standard flow matching samplers. We demonstrate the effectiveness of this method on a dataset of Y6 molecular clusters obtained through MD simulations, and we benchmark its efficiency and accuracy against single-scale flow matching methods.
1 Introduction
Amorphous molecular solids are disordered organic systems with several applications in the field of organic optoelectronics. Their mechanical flexibility and solution processability make them attractive alternatives to inorganic semiconductors [22, 14, 19]. Different molecules tend to pack differently in amorphous solids, which lead to different network extensivity, electronic properties, and transport rates [13]. The influence of packing structure on optoelectronic properties has led to significant interest in computationally simulating these materials with molecular dynamics (MD) simulations [13, 6] to enhance our understanding of the correlations between structural design and corresponding packing and optoelectronic properties. This can inform better design strategies for the next generation of organic electronics devices.
Having the ability to generate amorphous solids in a computationally efficient manner can enable faster sampling from the multi-molecule Boltzmann distribution, as well as property-guided generation of optimally packed configurations. There have been several works developing efficient methods to sample from Boltzmann distributions or their emulated (non re-weighted) versions [17, 11]. Scaling up to larger systems (such as amorphous clusters) is however computationally expensive if generation is to be performed at the all-atom (AA) level. In the field of protein generation, coarse-grained (CG) representations are used to collapse amino acids into single beads, which drastically reduces system size while still retaining higher-level structural features. There have been works that generate proteins in their coarse-grained representations, and other works that back-map coarse-grained proteins into their all-atom structure [12, 4, 24, 10].
In this paper, we develop a unified dual-scale flow matching method that improves accuracy of all-atom flow matching methods by separating training and inference into different stages for coarse-grained and all-atom resolutions. Moreover, we show that by performing a majority of the inference compute at the coarse-grained resolution, one can reduce the time taken for inference integration without sacrificing accuracy, paving the way for more efficient generation of larger systems. As a demonstrative example, in this work, we used a dataset of amorphously packed clusters consisting of 5 Y6 [25] molecules that we obtained from MD simulations. As the coarse-grained representation for our experiments, we pre-defined a mapping scheme that collapses the 505 atoms system into 65 beads (see Figure 1(a)). With gains in efficiency obtained from this method, we plan to scale up to larger systems in the future that are more representative of thin films in devices, as well as test other choices of coarse-graining mappings.
2 Background
Conditional Flow Matching (CFM). This is a new paradigm of training continuous normalizing flows (CNFs) efficiently in a simulation free manner [3, 2, 15]. An ODE parameterized by a learnable vector field transforms an easy-to-sample prior distribution , into the more complex target distribution . Parameters are learned through the regression objective
(1) |
We chose the probability path definition to be , which gives as the conditional vector field. Similar to work by Stark et al. [20], we trained to predict rather than , and parameterized by SE(3) equivariant refinement tensor field network (TFN) [23] layers. We present a comparison against other parameterizations in Section A.1.
Harmonic Prior. Besides the standard gaussian prior, another common choice for prior in molecule generation tasks is the harmonic prior. The harmonic prior was introduced in Jing et al. [9] and used within the flow-matching framework by Stark et al. [20] as a more chemically plausible prior. The prior distribution is interpreted as a Boltzmann distribution with a quadratic potential energy function, i.e., , where is chosen such that . We need to have the bond connectivity information a-priori to sample from the harmonic prior.
3 Methods
3.1 Model
Dual-Scale CFM. This method decomposes the task of predicting all-atom coordinates , into two sequential steps of predicting coarse-grained coordinates , and predicting all-atom coordinates , where and are dimensions of coarse-grained and all-atom coordinates respectively. Two separate vector field networks (VFNs) are learned: 1) to generate coarse-grained bead positions , and 2) to generate all-atom positions conditioned on predicted bead positions . Separate priors and are used for the two flows since they act on different input dimensionalities.
Training and Inference. Both flows are trained using Eq. 1 with the relevant distributions and vector fields being substituted in. To be precise, the two objectives are:
(2) |
(3) |
was trained using the ground-truth coarse-grained coordinates as the condition instead of predicted coordinates , which decoupled the two flows. Inference was performed using an Euler ODE solver to integrate from to for each of the two flows. For the coarse-grained flow, the bead positions were obtained with , and an integration timestep was performed with . Similarly, for the all-atom flow, , and were used. A total of 40 integration steps were taken to go from coarse-grained prior sample to all-atom target .

3.2 Graph Representation
Besides cartesian coordinates (), the vector field predictors also take atom features and bond features as input. More details on features and their data-types are provided in Table 2.
We pre-defined a mapping between node indices of all-atom atoms and coarse-grained beads, based on the basic chemical intuition that atoms present within rings have restricted degrees of freedom. The scheme is shown in Figure 1(a). We needed to define a means of collapsing the atom features and bond features into corresponding features and of coarse-grained beads so that this information was not completely lost to the flow .
Atom Features. We aggregated atom features with , where denotes an arbitrary aggregation operator. In our case we used either the elementwise Mean or OR operations depending on the data-type of the atomic feature. We did not include String atom features as input to . We assigned the cartesian coordinates of beads to be the averaged coordinates of all atoms present within the bead, and included them as equivariant node features to .
Bond Features. We assigned bond connectivity to beads based on connectivity of atoms within the beads. If any two atoms belonging to different beads had a bond connecting them in the all-atom structure, we created a bond between the two beads with the corresponding bond features.
3.3 Dataset
We followed the approach developed in Kupgan et al. [13] to simulate an amorphous morphology for the molecules. We first performed geometry optimization of the Y6 molecular structure using the method described in [21]. Then we used PACKMOL [16] to pack 5 structures in a cubic box at a low density ( 0.1 g/cm3). The OPLS-AA force field [18] was used via the LigParGen server [5]. The system was equilibrated for 30 ns at 650 K, cooled to room temperature (300 K) at 10 K/ns, and subjected to a 30 ns production run at 300 K. All MD simulations were performed under the NPT ensemble at 1 atm with a timestep of 2 fs, using GROMACS 2023 [1]. We saved 301 frames from the production run and used a random 80:10:10 split into train, validation and test datasets. We unwrapped the molecular structures to remove periodic boundary conditions (PBC) and also removed hydrogen atoms. Each resulting frame contained 505 atoms in total in the all-atom representation.
4 Results and Discussion
We evaluated generated samples on 3 metrics (shown in Table 1). To evaluate distributional matching capabilities, we measured Jensen Shannon Divergence (JSD) between distributions of generated and MD distributions for bond lengths and angles. We evaluated efficiency through the average time taken per integration step during inference on one NVIDIA A100 GPU. We noticed that the dual-scale flow matching method improved upon a single-scale flow matching method (with Gaussian/Harmonic priors) by 15-25% on bond length and angle JSDs, while also decreasing the inference time by 85%.
We also tested the influence of the ratio of integration steps taken in coarse-grained and all-atom resolutions, on the 3 metrics to see how much inference time can be saved without sacrificing accuracy (shown in Figure 1(b)). We denote the ratio of number of CG and AA integration steps by CG:AA. It is interesting to see that even for large CG:AA ratios, we significantly decreased inference time with negligible sacrifice to JSD performance.
Model | Bond Lengths (JSD) | Bond Angles (JSD) | Inference Time (s) |
---|---|---|---|
Single Scale | 0.6563 | 0.6316 | 0.2949 |
Single Scale Harmonic | 0.6298 | 0.6066 | 0.3039 |
Dual Scale 30:10 (Ours) | 0.5472 | 0.4610 | 0.0496 |
The preliminary results are encouraging, opening up several avenues of further exploration for the future. We plan to scale to larger systems (200 Y6 molecules) which is more representative of active layers in devices, and also test our approach on different chemistries and system sizes. Secondly, while we pre-defined a coarse-graining mapping scheme in this paper, we plan to test other schemes in the future to identify correlations between the chosen schemes and performance. Finally, we plan on expanding to other metrics that are also typically used in Boltzmann generators and conformer generation such as effective sample size (ESS), average minimum RMSD (AMR) and coverage [11, 7, 8].
5 Conclusions
In this work, we developed a dual-scale flow matching method that improves upon single-scale flow matching-based samplers in accuracy as well as inference efficiency, as demonstrated on a dataset of amorphous Y6 clusters. Moreover, it is interesting to see that we can push the ratio of CG:AA during inference integration to large values without sacrificing the prediction accuracy, while boosting inference speed. While we tested on clusters of size 5 in this paper, we plan to expand to larger clusters in the future whose packing properties are more representative of active layers in optoelectronic devices. We also plan on testing the influence of coarse-graining mapping scheme on performance since we tested a single pre-defined scheme in this work.
References
- Abraham et al. [2015] M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, and E. Lindahl. Gromacs: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX, 1:19–25, 2015.
- Albergo and Vanden-Eijnden [2022] M. S. Albergo and E. Vanden-Eijnden. Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571, 2022.
- Albergo et al. [2023] M. S. Albergo, N. M. Boffi, and E. Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023.
- Charron et al. [2023] N. E. Charron, F. Musil, A. Guljas, Y. Chen, K. Bonneau, A. S. Pasos-Trejo, J. Venturin, D. Gusew, I. Zaporozhets, A. Krämer, et al. Navigating protein landscapes with a machine-learned transferable coarse-grained model. arXiv preprint arXiv:2310.18278, 2023.
- Dodda et al. [2017] L. S. Dodda, I. Cabeza de Vaca, J. Tirado-Rives, and W. L. Jorgensen. Ligpargen web server: an automatic opls-aa parameter generator for organic ligands. Nucleic acids research, 45(W1):W331–W336, 2017.
- Fu et al. [2023] Y. Fu, T. H. Lee, Y.-C. Chin, R. A. Pacalaj, C. Labanti, S. Y. Park, Y. Dong, H. W. Cho, J. Y. Kim, D. Minami, et al. Molecular orientation-dependent energetic shifts in solution-processed non-fullerene acceptors and their impact on organic photovoltaic performance. Nature Communications, 14(1):1870, 2023.
- Ganea et al. [2021] O. Ganea, L. Pattanaik, C. Coley, R. Barzilay, K. Jensen, W. Green, and T. Jaakkola. Geomol: Torsional geometric generation of molecular 3d conformer ensembles. Advances in Neural Information Processing Systems, 34:13757–13769, 2021.
- [8] B. Jing, H. Stark, T. Jaakkola, and B. Berger. Generative modeling of molecular dynamics trajectories. In ICML’24 Workshop ML for Life and Material Science: From Theory to Industry Applications.
- Jing et al. [2023] B. Jing, E. Erives, P. Pao-Huang, G. Corso, B. Berger, and T. Jaakkola. Eigenfold: Generative protein structure prediction with diffusion models. arXiv preprint arXiv:2304.02198, 2023.
- Jing et al. [2024] B. Jing, B. Berger, and T. Jaakkola. Alphafold meets flow matching for generating protein ensembles. arXiv preprint arXiv:2402.04845, 2024.
- Klein and Noé [2024] L. Klein and F. Noé. Transferable boltzmann generators. arXiv preprint arXiv:2406.14426, 2024.
- Kohler et al. [2023] J. Kohler, Y. Chen, A. Kramer, C. Clementi, and F. Noé. Flow-matching: Efficient coarse-graining of molecular dynamics without forces. Journal of Chemical Theory and Computation, 19(3):942–952, 2023.
- Kupgan et al. [2021] G. Kupgan, X. Chen, and J.-L. Bredas. Molecular packing of non-fullerene acceptors for organic solar cells: distinctive local morphology in y6 vs. itic derivatives. Materials Today Advances, 11:100154, 2021.
- Li et al. [2022] Y. Li, W. Huang, D. Zhao, L. Wang, Z. Jiao, Q. Huang, P. Wang, M. Sun, and G. Yuan. Recent Progress in Organic Solar Cells: A Review on Materials from Acceptor to Donor. Molecules, 27(6):1800, 2022.
- Lipman et al. [2022] Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
- Martínez et al. [2009] L. Martínez, R. Andrade, E. G. Birgin, and J. M. Martínez. Packmol: A package for building initial configurations for molecular dynamics simulations. Journal of Computational Chemistry, 30(13):2157–2164, 2009.
- Noé et al. [2019] F. Noé, S. Olsson, J. Köhler, and H. Wu. Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science, 365(6457):eaaw1147, 2019.
- Robertson et al. [2015] M. J. Robertson, J. Tirado-Rives, and W. L. Jorgensen. Improved peptide and protein torsional energetics with the opls-aa force field. Journal of chemical theory and computation, 11(7):3499–3509, 2015.
- Shan et al. [2022] T. Shan, X. Hou, X. Yin, and X. Guo. Organic photodiodes: device engineering and applications. Frontiers of Optoelectronics, 15(1):49, 2022.
- [20] H. Stark, B. Jing, R. Barzilay, and T. Jaakkola. Harmonic self-conditioned flow matching for joint multi-ligand docking and binding site design. In Forty-first International Conference on Machine Learning.
- Subramanian et al. [2023] A. Subramanian, K. P. Greenman, A. Gervaix, T. Yang, and R. Gómez-Bombarelli. Automated patent extraction powers generative modeling in focused chemical spaces. Digital Discovery, 2(4):1006–1015, 2023.
- Sun et al. [2022] L. Sun, K. Fukuda, and T. Someya. Recent progress in solution-processed flexible organic photovoltaics. npj Flexible Electronics, 6(1):89, 2022.
- Thomas et al. [2018] N. Thomas, T. Smidt, S. Kearnes, L. Yang, L. Li, K. Kohlhoff, and P. Riley. Tensor field networks: Rotation-and translation-equivariant neural networks for 3d point clouds. arXiv preprint arXiv:1802.08219, 2018.
- Yang and Gómez-Bombarelli [2023] S. Yang and R. Gómez-Bombarelli. Chemically transferable generative backmapping of coarse-grained proteins. arXiv preprint arXiv:2303.01569, 2023.
- Yuan et al. [2019] J. Yuan, Y. Zhang, L. Zhou, G. Zhang, H.-L. Yip, T.-K. Lau, X. Lu, C. Zhu, H. Peng, P. A. Johnson, et al. Single-junction organic solar cell with over 15% efficiency using fused-ring acceptor with electron-deficient core. Joule, 3(4):1140–1151, 2019.
Appendix A Appendix / supplemental material
A.1 Choice of VFN architecture
Our choice of VFN architecture was based on a comparison of various architectures (shown in Table 2). We observed that using TFN as the VFN achieved best JSD performance across bond lengths and angles, as well as across gaussian and harmonic prior choices.
CV | Prior | TFN | EGNN | Attentive FP |
---|---|---|---|---|
Bond Length | Gaussian | 0.656 | 0.791 | 0.959 |
Harmonic | 0.629 | 0.781 | 0.894 | |
Bond Angle | Gaussian | 0.631 | 0.732 | 0.908 |
Harmonic | 0.606 | 0.770 | 0.868 |
A.2 Graph Featurization
Feature Name | Datatype |
---|---|
atomic_num | Integer |
chirality | String |
degree | Integer |
numring | Integer |
implicit_valence | Integer |
formal_charge | Integer |
numH | Integer |
hybridization | String |
is_aromatic | Boolean |
is_in_ring5 | Boolean |
is_in_ring6 | Boolean |
bond_type | String |
bond_stereo | String |
is_conjugated | Boolean |