Symmetry From Scratch:
Group Equivariance as a Supervised Learning Task
Abstract
For machine learning datasets with symmetries, the standard way to remain compatible with symmetry-breaking has been to relax equivariant architectural constraints, engineering extra weights to differentiate the symmetries of interest. However, this process becomes increasingly over-engineered as models are hardwired towards specific symmetries and asymmetries through a particular set of equivariant basis functions. In this work, we introduce symmetry-cloning, a method for inducing equivariance in machine learning models. We show that general machine learning architectures (i.e., MLPs) can learn symmetries directly, as a supervised learning task, from group-equivariant architectures and can retain or break the learned symmetry for downstream tasks. This simple formulation enables models with group-agnostic architectures to capture the inductive bias of group-equivariant architectures.
1 Introduction
Equivariance has been crucial to the success of machine learning on systems that respect symmetry. From the translational invariance used in CNNs for image classification to the permutation invariance used in GNNs, their success can be attributed to higher sample efficiency and robustness toward distributional shifts. Nevertheless, enforcing equivariance under the correct problem setting is essential for good performance. For example, real-life tasks and physical systems may exhibit a lower level of symmetry than the equivariant model enforces, due to noise or external sources; in such contexts, identifying asymmetries becomes essential for correctly generalizing to real-life distributions, and applying equivariant models can become too restrictive, capping the model's performance and leading to underfitting. Therefore, a new objective is to account for symmetries and leverage the inductive bias therein while preserving the ability to account for symmetry-breaking.
Many works (Elsayed et al. 2020; Kaba and Ravanbakhsh 2024; Wang, Walters, and Yu 2022) have focused on adapting equivariant models to account for symmetry-breaking by relaxing known equivariant architectures. This work introduces a much simpler approach to the problem. Through only supervised learning and existing equivariant models, one can maintain the expressivity of a universal approximator, learn the symmetry enforced by the equivariant model, and proceed with any real-life task that may contain symmetry-breaking data samples, all without having to design intricate architectures tailored to the symmetries of the data distribution.
We summarize our key contributions as follows:
- We provide empirical evidence that universal function approximators can learn symmetries through supervised learning.
- We introduce a simple and novel method for modelling symmetric and symmetry-breaking systems.
- We perform a preliminary set of experiments over different symmetry groups and model architectures to validate the generality of our claims.
1.1 Scope of Work
We present foundational work on a proof-of-concept training scheme that allows unconstrained, group-agnostic models to learn equivariance directly from equivariant architectures. While this result could shed new light on model distillation (Hinton, Vinyals, and Dean 2015), model extraction (Tramèr et al. 2016), or even further our understanding of neural network training dynamics, in this work we investigate the efficacy of learned symmetries as an initial weight condition for group-agnostic models on downstream tasks. Additionally, as a proof-of-concept, we focus only on feature extraction for images, i.e., 2D signals over the discrete grid, using group-equivariant convolutions.
1.2 Background
For 2D planar signals, current works that model symmetry-breaking with equivariant models are either fixed MLP layers that comply with relaxed equivariance constraints (Kaba and Ravanbakhsh 2024) or constrained to group convolutions (Cohen and Welling 2016a) with steerable filters (Cohen and Welling 2016b) on arbitrarily chosen equivariant bases (Wang, Walters, and Yu 2022). We show that the types of equivariance enforced by group convolution architectures can actually be learned directly, via supervised learning, by a more general class of group-agnostic architectures (fig. 1), i.e., general MLPs (though our methodology easily extends to transformers and other widely used architectures), and once trained, these models can handle tasks involving both symmetric and symmetry-breaking data samples.
Group Equivariant Convolutions
Most 2D perception tasks have translational symmetry. Consider a discrete linear system for signal processing: the direct consequence of imposing translational equivariance is that any output of the system is the convolution of the input signal with the system's impulse response (i.e., the filter of a convolutional layer). Given the objective of improved performance on perception tasks, the design decision of layered convolutional neural networks (Krizhevsky, Sutskever, and Hinton 2012) seems only natural.
Building on the success of translational equivariance, group convolutions (GCNNs) (Cohen and Welling 2016a) were introduced to make CNNs equivariant not only under translations but also under groups that include finite rotations and reflections.
$$[f \star \psi](g) = \sum_{y \in \mathbb{Z}^2} f(y)\,\psi(g^{-1}y) \qquad (1)$$
$$[f \star \psi](g) = \sum_{h \in G} f(h)\,\psi(g^{-1}h) \qquad (2)$$
Here, the notion of performing inner products with filters under all possible translational transformations is generalized to all possible group transformations. Through a lifting convolution (eq. 1), we first lift signals over pixel space $\mathbb{Z}^2$ to signals over the group $G$ (i.e., a semi-direct product of all translations and roto-reflections), then follow up with convolutions performed over $G$ (eq. 2). The critical observation is that while the convolution operation enforces translational equivariance, roto-reflectional equivariance can be achieved with a stack of transformed filters, so that enough information about the transformation is retained over group space (Cohen and Welling 2016a).
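To make the stacked-filter construction concrete, the following PyTorch-style sketch lifts a single-channel image to the group $p4$ by correlating it with the four rotated copies of one filter. It is a minimal illustration of eq. 1 under our own naming and shape assumptions (e.g., `p4_lifting_conv`), not the implementation used in the experiments.

```python
import torch
import torch.nn.functional as F

def p4_lifting_conv(image, kernel):
    """Lift a (N, 1, H, W) image to p4 with a single (1, 1, k, k) filter."""
    # One copy of the filter per 90-degree rotation in C4.
    rotated = [torch.rot90(kernel, r, dims=(-2, -1)) for r in range(4)]
    maps = [F.conv2d(image, w, padding=kernel.shape[-1] // 2) for w in rotated]
    # Output lives on the group p4: one spatial map per rotation channel.
    return torch.stack(maps, dim=2)  # (N, 1, 4, H, W)

# Rotating the input rotates each map and cyclically shifts the four
# rotation channels, which is the equivariance property described above.
```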
Vanilla group convolution can only handle the semi-direct product of translations and discrete groups, but with steerable filters (Cohen and Welling 2016b; Weiler, Hamprecht, and Storath 2018) group convolution can be made to accommodate groups with infinitely many elements (e.g., the circle group $SO(2)$), effectively expanding CNNs to be equivariant under all isometries of the plane (Weiler and Cesa 2021).
The feature maps of the original group convolution can be seen as coefficients that describe the signal in a basis chosen such that each axis is associated with a group element. To represent such functions over infinite groups, we leverage the fact that any representation of a compact group can be written as a direct sum of the group's irreducible representations. Representing the feature maps in functional form, we can band-limit the signal and thus represent each feature map with finitely many Fourier coefficients, and recover the entire group representation via feature vector fields.
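As a small illustration of why band-limiting suffices, the sketch below rotates a band-limited function on the circle purely by acting on its Fourier coefficients: each frequency (an irreducible representation of $SO(2)$) is multiplied by a phase. The sampling size, the test signal, and storing only non-negative frequencies are illustrative assumptions.

```python
import numpy as np

def rotate_band_limited(coeffs, angle):
    # Rotating f(theta) to f(theta - angle) multiplies the n-th Fourier
    # coefficient by exp(-i * n * angle): each frequency transforms
    # independently under SO(2).
    n = np.arange(len(coeffs))
    return coeffs * np.exp(-1j * n * angle)

theta = np.linspace(0, 2 * np.pi, 64, endpoint=False)
coeffs = np.fft.rfft(np.cos(3 * theta))            # band-limited signal on the circle
rotated = np.fft.irfft(rotate_band_limited(coeffs, np.pi / 2), n=64)
# `rotated` matches cos(3 * (theta - pi/2)) up to numerical precision.
```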
Group Equivariance as a Supervised Learning Task
While group-equivariant convolutions are, in theory, capable of perfect equivariance on planar symmetry groups, in practice planar signals are sampled from a pixel grid, discretization occurs, equivariance becomes approximate, and complications arise. For example, when designing a steerable basis for continuous rotations, the band limit is chosen arbitrarily (up to aliasing) and the Gaussian radial profiles are fixed; it is not unlikely that a more carefully engineered basis would be better suited to the learning task. As such, we postulate that there is room for more expressive models to find better group representations while accounting for noise and asymmetries in real-life tasks.
In our supervised learning formulation, we show that a group-agnostic model can learn the approximate equivariance of group convolutions through input-output observations alone, without enforcing any hard constraints on the model's structure or training process; we denote this process symmetry-cloning. Consequently, any symmetries learned by the group-agnostic model from group convolutions can be further optimized end-to-end directly on the task at hand. However, as there is no strong theory on the convergence of such supervised learning, each combination of symmetry group and class of group-agnostic model must be tested separately. We start this effort by demonstrating symmetry-cloning and its efficacy on downstream tasks with MLPs of different levels of complexity, on the translational and discrete rotational groups.
1.3 Related Works
In the more theoretical line of work, tuning higher-symmetry models into lower-symmetry models generally involves some parameter-sharing process. From this perspective, Ravanbakhsh, Schneider, and Poczos (2017) explored designing model parameters to reflect equivariance under discrete group actions. Shakerinava et al. (2024) introduced a weight-sharing regularization scheme, defining a loss that directly promotes weight-sharing to encourage symmetries in parameter space for machine learning in low-data regimes.
In direct relation to our experimental results, which focus on MLP architectures, is the work of Finzi, Welling, and Wilson (2021) on equivariant MLPs (EMLPs). They showed that EMLP layers can be constructed by decomposing arbitrary matrix groups (i.e., discrete groups and Lie groups) into their generators: the structure of a single MLP layer equivariant under any such group is the solution of the equivariance constraint over the finite set of generators. While their EMLP architecture is restricted to stacks of EMLP layers with equivariant non-linearities, we show that more general MLP-based architectures, such as the MLP-Mixer (Tolstikhin et al. 2021) (Section 2), can also learn group symmetries in a supervised learning context.
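For intuition, the following NumPy sketch solves the linear equivariance constraint described above for a given set of generator representations. It is a didactic illustration of the constraint, not the EMLP authors' implementation; the function name and tolerance are our own assumptions.

```python
import numpy as np

def equivariant_linear_basis(reps_in, reps_out, tol=1e-8):
    """A linear map W is equivariant iff rho_out(g) W = W rho_in(g) for the
    group generators g; return a basis of all W satisfying this constraint."""
    d_out, d_in = reps_out[0].shape[0], reps_in[0].shape[0]
    columns = []
    for idx in range(d_out * d_in):
        W = np.zeros((d_out, d_in))
        W.flat[idx] = 1.0                      # canonical basis matrix
        col = [r_out @ W - W @ r_in for r_in, r_out in zip(reps_in, reps_out)]
        columns.append(np.concatenate([c.ravel() for c in col]))
    constraint = np.stack(columns, axis=1)      # linear operator acting on vec(W)
    _, s, vt = np.linalg.svd(constraint)
    # Null space of the stacked constraints = space of equivariant maps.
    return [vt[i].reshape(d_out, d_in) for i in range(vt.shape[0]) if s[i] < tol]

# Example: equivariant 2x2 maps for the C4 generator (90-degree rotation R);
# the recovered basis spans {aI + bR}, the commutant of a planar rotation.
r = np.array([[0.0, -1.0], [1.0, 0.0]])
basis = equivariant_linear_basis([r], [r])
```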
Given that the ground-up approach to equivariant architectures is laborious and computationally expensive, many have worked more broadly on allowing group-agnostic, general-purpose architectures to be part of a pipeline that, as a whole and in some limit, can be considered group-equivariant. Notably, frame averaging, introduced by Puny et al. (2022), and probabilistic symmetrization, by Kim et al. (2024), both leverage group averaging to convert group-agnostic universal approximators into group-equivariant approximators, while Kaba et al. (2023) focused on mapping all samples to a canonical orientation with a learnable canonicalization function. However, for theoretical guarantees, these works often constrain their scope to a single group and do not consider symmetry-breaking.
In other works more specific to symmetry-breaking, the angle of attack has been to engineer modifications or add extra weights that reintroduce dependence on group transformations which were previously symmetries of the model. For example, Kaba and Ravanbakhsh (2024) defined a relaxed equivariance constraint that modifies the original construction of an EMLP to handle symmetry-breaking. For group convolutions, Elsayed et al. (2020) tested the practicality of relaxing spatial invariance with a linear combination of a basis set of filter banks, and Wang, Walters, and Yu (2022) generalized the idea to arbitrary groups with the construction of relaxed group convolutions, reintroducing symmetry-breaking dependence on specific pairs of feature-map signals and group transformations.
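A minimal sketch in the spirit of these relaxed group convolutions is shown below: each $C_4$ group element mixes a small learnable filter bank with its own coefficients, so exact rotational weight-sharing (and hence exact equivariance) can be broken. The class name, bank size, and tensor shapes are our own illustrative assumptions, not the cited authors' code.

```python
import torch
import torch.nn.functional as F

class RelaxedC4Conv(torch.nn.Module):
    """Single-channel C4 group convolution with per-element filter mixing."""
    def __init__(self, k=3, bank=2):
        super().__init__()
        self.bank = torch.nn.Parameter(torch.randn(bank, 1, 1, k, k))
        # One mixing vector per rotation; equal weights recover strict sharing.
        self.alpha = torch.nn.Parameter(torch.ones(4, bank) / bank)

    def forward(self, x):  # x: (N, 1, H, W)
        outs = []
        for r in range(4):
            w = (self.alpha[r, :, None, None, None, None] * self.bank).sum(0)
            w = torch.rot90(w, r, dims=(-2, -1))       # element-specific rotated filter
            outs.append(F.conv2d(x, w, padding=w.shape[-1] // 2))
        return torch.stack(outs, dim=2)                 # (N, 1, 4, H, W)
```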
2 Methods
2.1 Symmetry-Cloning
We propose to use supervised learning to learn group equivariance on a model that is not inherently equivariant. Let $\rho$ be a representation of a group $G$ and $f_\theta$ be a $G$-equivariant neural network parameterized by $\theta$, such that for all $g \in G$ and inputs $x$:
$$f_\theta(\rho_{\text{in}}(g)\, x) = \rho_{\text{out}}(g)\, f_\theta(x) \qquad (3)$$
We train a model $h_\phi$ to become approximately equivariant through supervised learning on a dataset $\mathcal{D} = \{(x_i, y_i)\}$, where the $x_i$ are sampled inputs and $y_i = f_\theta(x_i)$. We call this process symmetry-cloning (alg. 1).
Input: $f_\theta$ (a $G$-equivariant model), $h_\phi$ (a group-agnostic model)
Output: $G$-cloned model $h_\phi$
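A minimal sketch of the symmetry-cloning loop (alg. 1) is given below, assuming the teacher is a frozen $G$-equivariant network and the targets are its outputs on randomly sampled inputs. The function name, optimizer, loss, and hyperparameters are illustrative choices rather than the exact training setup used in the experiments.

```python
import torch

def symmetry_clone(teacher, student, sample_input, steps=10_000, lr=1e-3):
    """Fit a group-agnostic `student` to the input-output behaviour of a
    G-equivariant `teacher`; `sample_input` returns a batch of random inputs."""
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)
    teacher.eval()
    for _ in range(steps):
        x = sample_input()                      # draw random inputs
        with torch.no_grad():
            y = teacher(x)                      # equivariant targets f_theta(x)
        loss = torch.nn.functional.mse_loss(student(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return student                              # approximately G-equivariant
```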
2.2 Benchmarking Tasks
To demonstrate the merits of symmetry-cloning, we limit $\rho_{\text{out}}$ to be the trivial representation (i.e., group-invariant outputs) for classification tasks and introduce a benchmark consisting of a symmetric task and a symmetry-breaking task. By comparing the performance of group-agnostic models, $G$-cloned models, and group-equivariant models on both tasks, we demonstrate the effectiveness of our method in cloning the symmetries of the groups $\mathbb{Z}^2$ and $C_4$, the groups of 2D translations and cyclic rotations by 90 degrees (clockwise), respectively:
$$\mathbb{Z}^2 = \{(t_x, t_y) \mid t_x, t_y \in \mathbb{Z}\} \qquad (4)$$
$$C_4 = \{e, r, r^2, r^3\}, \quad r = \text{rotation by } 90^\circ \qquad (5)$$
We run the benchmarking tasks with $\mathbb{Z}^2$ and $C_4$ group transformations, respectively, on the well-studied MNIST handwritten digit dataset (fig. 2). In the symmetric task, we test for symmetries baked into the model by evaluating on transformed samples not encountered during training, while in the symmetry-breaking task, we evaluate the model's ability to differentiate between certain symmetries.
More specifically, the two tasks are specified as follows:
2.3 Group-agnostic Models
Building towards increasingly general architectures, we start with a simple case of symmetry-cloning: cloning a single-channel convolutional layer with a 3x3 kernel and cloning a single-channel group-convolutional layer with a 4x3x3 kernel.
- 9-block mlp2cnn: We observe that the convolution operation, when unrolled as a single left matrix multiplication with the image, displays a block Toeplitz pattern (see the sketch after this list). Therefore, the most straightforward group-agnostic architecture is one that learns the permutation matrix which, along with the kernel parameters, combines to reconstruct the Toeplitz matrix (fig. 3). Note that in this case, the MLP layer is constrained to have as many linear components as there are kernel parameters.
Figure 3: Simple MLP layer component with architecture matching the number of convolution kernel parameters.
- approx-mlp2cnn: We relax the constraint requiring the architecture to have as many linear components as there are kernel parameters (i.e., having 7, 8, or 10 instead); we add an additional embedding layer followed by a projection to the number of linear components involved. In effect, this allows the kernel parameters to act more as inputs to the group-agnostic model, as depicted in fig. 1.
- 9-block mlp2gcnn, approx-mlp2gcnn: In much the same fashion as mlp2cnn, we use four stacked mlp2cnns to clone a single group-equivariant convolution layer with a lifting filter of four channels; the approx-mlp2gcnn layers now have four additional projection heads.
In both the mlp2cnn and mlp2gcnn cases, we can use the symmetry-cloned MLP layers analogously to how one would stack a full CNN, apply the appropriate pooling, and add a classification head to produce an MNIST classifier. Abusing notation, we also refer to the symmetry-cloned classifiers by their layer names.
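The sketch below, referenced in the 9-block mlp2cnn item above, builds the block Toeplitz matrix that a valid 3x3 convolution unrolls into; this is the target structure against which a cloned MLP's weight matrix can be compared. The function and variable names are illustrative assumptions.

```python
import numpy as np

def conv_as_matrix(kernel, h, w):
    """Unroll a (k x k) cross-correlation (valid padding, stride 1) over an
    (h x w) image into the block Toeplitz matrix that performs the same
    operation as a single left matrix multiplication."""
    k = kernel.shape[0]
    out_h, out_w = h - k + 1, w - k + 1
    M = np.zeros((out_h * out_w, h * w))
    for i in range(out_h):
        for j in range(out_w):
            row = i * out_w + j
            for di in range(k):
                for dj in range(k):
                    # each kernel parameter repeats along a diagonal band
                    M[row, (i + di) * w + (j + dj)] = kernel[di, dj]
    return M

# Sanity check against a flattened image (shapes are illustrative).
kernel = np.random.randn(3, 3)
image = np.random.randn(8, 8)
out = conv_as_matrix(kernel, 8, 8) @ image.reshape(-1)
```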
Eventually, we wish to apply symmetry-cloning to less constrained architectures and to groups with infinitely many elements. The current procedure of building entire classifiers by stacking symmetry-cloned layers becomes prohibitively expensive as the cloning architectures become more general. Nevertheless, even without applying it to the benchmarking tasks, we demonstrate that a more practical architecture, the MLP-Mixer, can also be symmetry-cloned.
- mlpmixer2cnn: We apply a first MLP-Mixer layer to encode the concatenated input and kernel parameters, followed by a second to decode the convolutional output. As the architecture is much more general, sufficiently large MLP-Mixers should be able to learn entire CNNs directly, but we leave that for future work.
- mlpmixer2scnn: Instead of regular group convolutions, we extend symmetry-cloning to a more general class of steerable CNNs, equivariant under continuous rotations, albeit on a small input size, and show that the MLP-Mixer can roughly capture the equivariant nature of a steerable CNN.
2.4 KL Weight Regularization
When training symmetry-cloned classifiers on the benchmarking tasks, to better leverage the learned symmetries, some regularization is needed to ensure that the weights do not deviate too quickly from the learned initialization $\phi_0$. Therefore, following Jaques et al. (2017), we use a KL constraint to prevent the weight distribution from drifting too far, too quickly, away from the equivariant initialization:
$$\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda\, D_{\mathrm{KL}}\!\left( q(\phi) \,\|\, q(\phi_0) \right) \qquad (6)$$
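A minimal sketch of how such a constraint can be implemented is shown below, under the assumption that each weight parameterizes an isotropic Gaussian of fixed variance, in which case the KL term reduces to a scaled squared distance to the cloned initialization. This is one reading of the regularizer, not necessarily the exact formulation used; the function name and `sigma` are illustrative.

```python
import torch

def kl_to_initialization(model, init_state, sigma=0.1):
    """KL(N(param, s^2) || N(init, s^2)) summed over parameters, which
    reduces to ||param - init||^2 / (2 s^2) for a fixed variance s^2."""
    penalty = 0.0
    for name, param in model.named_parameters():
        penalty = penalty + ((param - init_state[name]) ** 2).sum() / (2 * sigma ** 2)
    return penalty

# Training usage (lambda_kl is an assumed hyperparameter):
# loss = task_loss + lambda_kl * kl_to_initialization(model, cloned_weights)
```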
3 Results and Discussion
For symmetry-cloning, in the simplest case of 9-block mlp2cnn, we can compare the learned parameter matrix with the exact Toeplitz matrix unrolled from convolution (fig.4). However, as the cloning model architectures become less constrained with added layers and non-linearity, it becomes harder to unroll and compare with convolutions. So, instead, we show the equivariance of the cloned models via feature map comparisons. For mlp2cnn (fig.5) and mlp2gcnn models (fig.6), symmetry-cloning works exceptionally well, even when the exact Toeplitz correspondence is broken. The performance drop is barely noticeable, and the output feature maps visibly maintain equivariance. This dramatically contrasts with the feature mapping of an MLP layer that has not gone through symmetry cloning.
We notice a significant increase in the computation time required for symmetry-cloning an MLP-Mixer model. As it becomes nearly prohibitively expensive to train, we revert to much smaller input sizes; nevertheless, it can be done (fig. 7). When training the cloned mlpmixer2scnn, the issue is more pronounced, but the feature maps still resemble those of the steerable CNN, exhibiting signs of equivariance (fig. 8).
We also present results from all benchmarking tasks for mlp2cnn and mlp2gcnn, compared to their target CNN/GCNN architectures and uncloned MLP counterparts (table 1). Freeze denotes that we freeze all mlp2cnn/gcnn layers and train only the kernel parameters that are input to them, effectively using the mlp2cnn/gcnn layer as an approximate CNN/GCNN layer. Unfreeze denotes that all weights are trainable while applying the KL regularization term. These preliminary results demonstrate that symmetry-cloned mlp2cnn/gcnn models can learn both symmetric and symmetry-breaking downstream tasks. Although the improvement over the plain MLP on the rotational symmetry task is modest, we have not performed exhaustive hyperparameter searches or architectural optimization for this case.
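For clarity, a minimal sketch of the freeze setting is shown below: the cloned layer's weights are held fixed while the kernel parameters fed into it (and the classification head) remain trainable. The module names, kernel shape, and optimizer are illustrative assumptions.

```python
import torch

def freeze_setting(cloned_layer, head, k=3, lr=1e-3):
    """`cloned_layer` and `head` are assumed, already-constructed modules."""
    for p in cloned_layer.parameters():
        p.requires_grad = False                          # keep cloned symmetry fixed
    kernel = torch.nn.Parameter(torch.randn(1, 1, k, k)) # trainable "filter" input
    optimizer = torch.optim.Adam([kernel] + list(head.parameters()), lr=lr)
    return kernel, optimizer
```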
With group-equivariant convolutions as a starting point, both figuratively and literally, symmetry-cloning can be framed as an effective weight initializer: it "warm-starts" an architecture using an effectively unlimited supply of teacher-generated data. Symmetry-cloning then allows models to adapt to a lower symmetry than the one they were trained on. Practically, we show that through simple supervised learning, a universal function approximator can be trained to capture both symmetric and symmetry-breaking features in a dataset without hardcoded feature engineering.
| Translated MNIST | | | Rotational MNIST | | |
| Model | Symmetry | Symmetry Breaking | Model | Symmetry | Symmetry Breaking |
| MLP | | | MLP | | |
| CNN | | | GCNN | | |
| mlp2cnn freeze | | | mlp2gcnn freeze | | |
| - | | | - | | |
| - | | | - | | |
| - | | | - | | |
| mlp2cnn unfreeze | | | mlp2gcnn unfreeze | | |
3.1 Limitations
Most of our results are presented empirically, and as such, we provide no theoretical bounds on the approximate clonability of a given symmetry by a given unconstrained model. Most tunable hyperparameters were chosen arbitrarily to demonstrate the general applicability of our approach. However, this also leaves much room for improvement in finding more rigorous approaches and better-suited architectures.
4 Future Work
We present a unique learning task at the intersection of equivariant learning, model distillation, and model extraction; finding novel connections to these more established fields may allow us to study the efficacy of symmetry-cloning in more practical applications. As immediate next steps, we aim to optimize the symmetry-cloning process, explore ways to speed up convergence, and make the pipeline applicable to a more extensive range of networks. For example, extending symmetry-cloning to 3D symmetry groups could allow for more optimized learning of molecules in voxel representations using existing 3D-UNet architectures (Özgün Çiçek et al. 2016).
We also lay the groundwork for a wide range of follow-up research, including but not limited to extending symmetry-cloning to more complex groups and to other architectures such as Transformers (Vaswani et al. 2017) and Kolmogorov-Arnold Networks (Liu et al. 2024). Additionally, studying the theoretical aspects of symmetry-cloning, such as convergence properties, data requirements, or even the applicability of symmetry-cloning as a metric for approximate equivariance, could be invaluable. We believe these directions could further validate and broaden the applicability of symmetry-cloning to various machine learning tasks.
5 Conclusion
In this work, we offer a novel perspective on equivariant architectures and provide a straightforward method to study the effects of equivariance under a broader spectrum of model architectures. We show that with symmetry-cloning, group-agnostic models can still leverage the inductive biases of equivariant models while retaining capabilities to adapt to symmetry-breaking tasks. In particular, we have shown empirically that general MLP architectures can learn group equivariance from group-convolution models directly through supervised learning.
6 Acknowledgements
The authors would like to acknowledge useful discussions with Luca Thiede and Abdulrahman Aldossary. Resources used in preparing this research were provided by the Digital Research Alliance of Canada. A.A.-G. thanks Anders G. Frøseth for his generous support. A.A.-G. also acknowledges the generous support of Natural Resources Canada and the Canada 150 Research Chairs program.
References
- Cohen and Welling (2016a) Cohen, T.; and Welling, M. 2016a. Group Equivariant Convolutional Networks. In Balcan, M. F.; and Weinberger, K. Q., eds., Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, 2990–2999. New York, New York, USA: PMLR.
- Cohen and Welling (2016b) Cohen, T. S.; and Welling, M. 2016b. Steerable CNNs. arXiv:1612.08498.
- Elsayed et al. (2020) Elsayed, G. F.; Ramachandran, P.; Shlens, J.; and Kornblith, S. 2020. Revisiting spatial invariance with low-rank local connectivity. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of ICML’20, 2868–2879. JMLR.org.
- Finzi, Welling, and Wilson (2021) Finzi, M.; Welling, M.; and Wilson, A. G. 2021. A Practical Method for Constructing Equivariant Multilayer Perceptrons for Arbitrary Matrix Groups. arXiv:2104.09459.
- Hinton, Vinyals, and Dean (2015) Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the Knowledge in a Neural Network. arXiv:1503.02531.
- Jaques et al. (2017) Jaques, N.; Gu, S.; Bahdanau, D.; Hernández-Lobato, J. M.; Turner, R. E.; and Eck, D. 2017. Sequence Tutor: Conservative Fine-Tuning of Sequence Generation Models with KL-control. arXiv:1611.02796.
- Kaba et al. (2023) Kaba, S.-O.; Mondal, A. K.; Zhang, Y.; Bengio, Y.; and Ravanbakhsh, S. 2023. Equivariance with learned canonicalization functions. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of ICML’23, 15546–15566. Honolulu, Hawaii, USA: JMLR.org.
- Kaba and Ravanbakhsh (2024) Kaba, S.-O.; and Ravanbakhsh, S. 2024. Symmetry Breaking and Equivariant Neural Networks. arXiv:2312.09016.
- Kim et al. (2024) Kim, J.; Nguyen, T. D.; Suleymanzade, A.; An, H.; and Hong, S. 2024. Learning Probabilistic Symmetrization for Architecture Agnostic Equivariance. arXiv:2306.02866.
- Krizhevsky, Sutskever, and Hinton (2012) Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc.
- Liu et al. (2024) Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T. Y.; and Tegmark, M. 2024. KAN: Kolmogorov-Arnold Networks. arXiv:2404.19756.
- Puny et al. (2022) Puny, O.; Atzmon, M.; Ben-Hamu, H.; Misra, I.; Grover, A.; Smith, E. J.; and Lipman, Y. 2022. Frame Averaging for Invariant and Equivariant Network Design. arXiv:2110.03336.
- Ravanbakhsh, Schneider, and Poczos (2017) Ravanbakhsh, S.; Schneider, J.; and Poczos, B. 2017. Equivariance Through Parameter-Sharing. arXiv:1702.08389.
- Shakerinava et al. (2024) Shakerinava, M.; Sohrabi, M.; Ravanbakhsh, S.; and Lacoste-Julien, S. 2024. Weight-Sharing Regularization. arXiv:2311.03096.
- Tolstikhin et al. (2021) Tolstikhin, I. O.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J.; Lucic, M.; and Dosovitskiy, A. 2021. MLP-Mixer: An all-MLP Architecture for Vision. In Advances in Neural Information Processing Systems, volume 34, 24261–24272. Curran Associates, Inc.
- Tramèr et al. (2016) Tramèr, F.; Zhang, F.; Juels, A.; Reiter, M. K.; and Ristenpart, T. 2016. Stealing Machine Learning Models via Prediction APIs. arXiv:1609.02943.
- Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
- Wang, Walters, and Yu (2022) Wang, R.; Walters, R.; and Yu, R. 2022. Approximately Equivariant Networks for Imperfectly Symmetric Dynamics. In Proceedings of the 39th International Conference on Machine Learning, 23078–23091. PMLR. ISSN: 2640-3498.
- Weiler and Cesa (2021) Weiler, M.; and Cesa, G. 2021. General E(2)-Equivariant Steerable CNNs. arXiv:1911.08251.
- Weiler, Hamprecht, and Storath (2018) Weiler, M.; Hamprecht, F. A.; and Storath, M. 2018. Learning Steerable Filters for Rotation Equivariant CNNs. arXiv:1711.07289.
- Özgün Çiçek et al. (2016) Özgün Çiçek; Abdulkadir, A.; Lienkamp, S. S.; Brox, T.; and Ronneberger, O. 2016. 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation. arXiv:1606.06650.