
Improved Canonicalization for Model Agnostic Equivariance

Siba Smarak Panigrahi, Arnab Kumar Mondal
McGill University & Mila
{siba-smarak.panigrahi,arnab.mondal}@mila.quebec
Abstract

This work introduces a novel approach to achieving architecture-agnostic equivariance in deep learning, particularly addressing the limitations of traditional layerwise equivariant architectures and the inefficiencies of existing architecture-agnostic methods. Building equivariant models using traditional methods requires designing equivariant versions of existing models and training them from scratch, a process that is both impractical and resource-intensive. Canonicalization has emerged as a promising alternative for inducing equivariance without altering model architecture, but it suffers from the need for highly expressive and expensive equivariant networks to learn canonical orientations accurately. We propose a new optimization-based method that employs any non-equivariant network for canonicalization. Our method uses contrastive learning to efficiently learn a canonical orientation and offers more flexibility in the choice of canonicalization network. We empirically demonstrate that this approach outperforms existing methods in achieving equivariance for large pretrained models and significantly speeds up the canonicalization process, making it up to 2 times faster.

1 Introduction

Equivariant deep learning has emerged as a prominent approach within deep learning, aimed at developing neural networks that inherently understand and adapt to the symmetries in their input data [31, 13, 43, 9, 15]. By constructing models that remain unaffected by transformations such as rotations or reflections, these networks preserve the core properties of the data, facilitating more efficient learning and better generalization across tasks. This notion of equivariance proves invaluable in areas such as computer vision [46, 45, 6, 10], scientific applications [25, 19, 7, 5, 37], graphs [8, 18, 20, 21], and reinforcement learning [33, 34, 38, 40, 41, 39], where the ability to recognize patterns and make robust predictions demands a nuanced grasp of underlying data symmetries.

In the realm of equivariant model design, where the focus has traditionally been on creating novel equivariant layers [13, 43, 9, 44, 15, 14], a fresh research direction has emerged that centers around architecture-agnostic approaches. These methods, including symmetrization [4, 3, 28], frame-averaging [36], and canonicalization [27, 35], aim to make models inherently equivariant to transformations of the data without the need for specialized parameterized layers and activations. These methods significantly simplify equivariant model design and, in some scenarios, make it more efficient. In particular, canonicalization has proved to be a cheap and efficient way to make any existing neural network equivariant to a group of transformations [27]. This idea becomes especially appealing when it comes to making widely used large pretrained models, including foundation models like SAM [29], completely equivariant [35].

In this work, we focus on enhancing the canonicalization process, specifically addressing its fundamental limitation: the reliance on equivariant architectures for constructing the canonicalization network. We explore an alternate optimization approach and propose a novel method that uses contrastive learning during training to learn a unique canonical orientation for inference. Our technique gives us the flexibility to use any neural network as a canonicalization network, including pretrained ones, which further eases optimization. This relaxes the architectural constraints required to build equivariant models, making them more accessible to the wider deep learning community. Moreover, we demonstrate that our simple approach not only outperforms the existing canonicalization-based method for building equivariant models but also makes the canonicalization process significantly more efficient.

2 Background

Kaba et al. [27] introduce a systematic and general method for equivariant machine learning based on learning mappings to canonical samples. Rather than trying to hand-engineer these canonicalization functions, they propose to learn them in an end-to-end fashion together with a prediction neural network. Canonicalization can be seamlessly integrated as an independent module into any existing architecture to make it equivariant to a wide range of transformation groups, discrete or continuous. This approach not only matches the expressive capabilities of methods like frame averaging by Puny et al. [36] but also surpasses them by offering simplicity, efficiency, and a systematic end-to-end learning method that replaces hand-engineered frames with learned mappings for each group.


Figure 1: Learning equivariant canonicalizer with a non-equivariant canonicalization network. All the transformations of the group are applied to the input image and passed through the canonicalization network in parallel. A dot product of the output of the canonicalization network with a reference vector gives us a distribution over the transformations to canonicalize the input. We also minimize the similarity between the vectors to get a unique canonical orientation.

2.1 Formulation

The approach formulates the invariance requirement for a function as the capability to map all members of a group orbit to the same output. This is achieved by mapping inputs to a canonical sample from their orbit before applying the function. For equivariance, elements are also mapped to a canonical sample and, following function application, transformed back according to their original position in the orbit. This can be formalized by writing the equivariant function $f$ in canonicalized form as

$f\left(\mathbf{x}\right) = c'\left(\mathbf{x}\right)\,\mathbf{p}\left(c\left(\mathbf{x}\right)^{-1}\mathbf{x}\right)$  (1)

where the function $\mathbf{p}: \mathcal{X} \to \mathcal{Y}$ is called the prediction function and the function $c: \mathcal{X} \to \rho\left(\mathcal{G}\right)$ is called the canonicalization function. Here $c\left(\mathbf{x}\right)^{-1}$ is the inverse of the representation matrix and $c'\left(\mathbf{x}\right) = \rho'\left(\rho^{-1}\left(c\left(\mathbf{x}\right)\right)\right)$ is the counterpart of $c\left(\mathbf{x}\right)$ on the output.

Kaba et al. [27] show that $f$ is $\mathcal{G}$-equivariant for any prediction function as long as the canonicalization function is itself $\mathcal{G}$-equivariant, i.e., $c\left(\rho\left(g\right)\mathbf{x}\right) = \rho\left(g\right)c\left(\mathbf{x}\right)$ for all $\left(g, \mathbf{x}\right) \in \mathcal{G} \times \mathcal{X}$. This effectively decouples the equivariance and prediction components. Moreover, they also introduce the concept of relaxed equivariance to deal with symmetric inputs in $\mathcal{X}$.
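To make Eq. 1 concrete, below is a minimal PyTorch sketch (our illustration, not the authors' code) of a canonicalized forward pass for the group $C_4$ of 90-degree rotations and an invariant task such as classification, where the output representation is trivial and $c'(\mathbf{x})$ can be dropped. The names `canonicalizer` and `predictor` are placeholders for the two networks.

```python
import torch


def canonicalized_forward(x, canonicalizer, predictor):
    """Eq. 1 specialized to an invariant task: f(x) = p(c(x)^{-1} x).

    x: image batch of shape (B, C, H, W).
    canonicalizer: maps x to an integer in {0, 1, 2, 3} per image, the C4 element c(x).
    predictor: any prediction network p, e.g. a pretrained classifier.
    """
    k = canonicalizer(x)  # predicted group element per image, shape (B,)
    # Apply c(x)^{-1}: rotate each image back by the predicted number of 90-degree turns.
    x_canonical = torch.stack(
        [torch.rot90(img, k=-int(ki), dims=(-2, -1)) for img, ki in zip(x, k)]
    )
    return predictor(x_canonical)
```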

2.2 Canonicalization Function

Kaba et al. [27] choose the canonicalization function to be any existing equivariant neural network architecture whose output is a group element, which they call the direct approach. This ensures the $\mathcal{G}$-equivariance constraint of the canonicalization function. For example, Group Convolutional Neural Networks (G-CNNs) [13] are used to design a canonicalization function that is equivariant to the group of discrete rotations.

They also provide an alternative optimization approach, in which the canonicalization function is defined as

$c\left(\mathbf{x}\right) \in \operatorname*{arg\,min}_{\rho\left(g\right)\in\rho\left(\mathcal{G}\right)} s\left(\rho\left(g\right), \mathbf{x}\right)$  (2)

where $s: \rho\left(\mathcal{G}\right) \times \mathcal{X} \to \mathbb{R}$ can be a neural network. In general, a set of elements can minimize $s$, from which one is chosen arbitrarily. The function $s$ has to satisfy the following equivariance condition

$s\left(\rho\left(g\right), \rho\left(g_{1}\right)\mathbf{x}\right) = s\left(\rho\left(g_{1}\right)^{-1}\rho\left(g\right), \mathbf{x}\right), \quad \forall\, g, g_{1} \in \mathcal{G}$  (3)

and has to be such that the argmin is a subset of a coset of the stabilizer of $\mathbf{x}$ (i.e., the minimum should be unique in each orbit up to input symmetry). These are sufficient conditions for Eq. 2 to be a suitable canonicalization function [27]. The equivariance condition on $s$ can now be satisfied not only with an equivariant architecture but also with a non-equivariant function $u: \mathcal{X} \to \mathbb{R}$ by defining:

$s\left(\rho\left(g\right), \mathbf{x}\right) = u\left(\rho\left(g\right)^{-1}\mathbf{x}\right)$

In this paper, we use this to design a novel and simpler technique for learning an equivariant canonicalization function with any existing neural network.
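As a brief sketch under our own naming (with `u_net` any non-equivariant network that maps an image batch to one scalar per image), Eq. 2 with $s(\rho(g), \mathbf{x}) = u(\rho(g)^{-1}\mathbf{x})$ for the group $C_4$ reduces to scoring every rotated copy of the input with $u$ and picking the minimizer:

```python
import torch


def canonicalize_by_argmin(x, u_net):
    """Eq. 2 with s(rho(g), x) = u(rho(g)^{-1} x) for the C4 group."""
    # Score every element of the orbit {rho(g)^{-1} x : g in C4}.
    scores = torch.stack(
        [u_net(torch.rot90(x, k=-k, dims=(-2, -1))).squeeze(-1) for k in range(4)],
        dim=1,
    )  # shape (B, 4)
    return scores.argmin(dim=1)  # index of the selected group element c(x) per image
```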

2.3 Prior Regularization

Mondal et al. [35] extend canonicalization to adapt any existing pretrained neural network to its equivariant counterpart. To enhance the canonicalization process and ensure that canonicalized inputs closely match the orientations seen in the pretraining data, they introduce a novel regularizer known as the Canonicalization Prior (CP). This approach leverages the similarity in orientations between the fine-tuning and pretraining datasets to guide canonicalization toward the orientations of inputs seen by the pretrained network during pretraining.

From a probabilistic standpoint, the canonicalization function maps each data point to a probability distribution over a group of transformations, denoted by $\mathcal{G}$. For a specific data point $\mathbf{x}$, let $\mathbb{P}_{c(\mathbf{x})}$ represent the distribution induced by the canonicalization function over $\mathcal{G}$. Assuming a canonicalization prior exists for the dataset $\mathcal{D}$, characterized by a distribution $\mathbb{P}_{\mathcal{D}}$ over $\mathcal{G}$, prior regularization aims to minimize the Kullback-Leibler (KL) divergence between $\mathbb{P}_{\mathcal{D}}$ and $\mathbb{P}_{c(\mathbf{x})}$. This leads to the loss function $\mathcal{L}_{\text{prior}} = \mathbb{E}_{\mathbf{x}\sim\mathcal{D}}\left[D_{\mathrm{KL}}(\mathbb{P}_{\mathcal{D}} \parallel \mathbb{P}_{c(\mathbf{x})})\right]$.
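A minimal sketch of this loss, assuming `log_p_cx` holds the log-probabilities over the $|\mathcal{G}|$ group elements produced by the canonicalization function for a batch and `p_prior` holds the prior $\mathbb{P}_{\mathcal{D}}$ as a probability vector (both names are ours), could look as follows:

```python
import torch
import torch.nn.functional as F


def prior_regularization(log_p_cx, p_prior):
    """L_prior = E_x[ KL(P_D || P_c(x)) ].

    log_p_cx: (B, |G|) log-probabilities of P_c(x).
    p_prior:  (|G|,) probabilities of the prior P_D over the group.
    """
    # F.kl_div(input, target) with log-space `input` computes KL(target || input),
    # i.e. KL(P_D || P_c(x)) here; "batchmean" averages over the batch.
    return F.kl_div(log_p_cx, p_prior.expand_as(log_p_cx), reduction="batchmean")
```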

3 Method

We extend the optimization approach to enable the use of any neural network for canonicalization, focusing in this work on groups of discrete transformations. The optimization formulation for a discrete group, denoted by $\mathcal{G}$, is:

$g \in \operatorname*{arg\,min}_{g\in\mathcal{G}} u\left(\rho\left(g\right)^{-1}\mathbf{x}\right)$  (4)

Assuming there are no symmetric elements in the orbit $\mathbf{x}^{\mathcal{G}} = \{\rho\left(g\right)^{-1}\mathbf{x} \mid g \in \mathcal{G}\}$, it is important to ensure that the function $u(\cdot)$ has a unique minimum to establish a canonical orientation. Additionally, should symmetric elements exist within the orbit and the minimum be attained among these symmetric positions, selecting any one of them yields a correct canonical orientation (see [27, 26]).

To design the function $u(\cdot)$, we learn it with a neural network while minimizing the similarity among the outputs of the elements in the orbit. Using any neural network $s_{\theta}(\cdot)$, we output a vector for every element in the orbit. This allows us to use techniques from the self-supervised learning literature to prevent representation collapse [42, 11, 1], including non-contrastive ones that rely on maximizing the eigenspectrum of the covariance matrix [2, 47]. In contrast, outputting scalars directly makes the optimization harder while limiting us to contrastive methods only. We then take the dot product of the outputs of $s_{\theta}(\cdot)$ with a reference vector $v_{R}$, which can either be learned or kept fixed. We obtain the distribution induced by the canonicalization function, $\mathbb{P}_{c(\mathbf{x})}$, by taking a softmax over $\{v_{R} \cdot s_{\theta}\left(\rho\left(g\right)^{-1}\mathbf{x}\right)/\tau \mid g \in \mathcal{G}\}$, where $\tau$ is the temperature parameter that controls the sharpness of the distribution and is set to $1$ in our experiments. In this formulation, $u(\cdot)$ becomes the probability mass function. The final optimization formulation becomes:

$g \in \operatorname*{arg\,min}_{g\in\mathcal{G}} \dfrac{\exp\left(v_{R} \cdot s_{\theta}\left(\rho\left(g\right)^{-1}\mathbf{x}\right)/\tau\right)}{\sum_{g'\in\mathcal{G}} \exp\left(v_{R} \cdot s_{\theta}\left(\rho\left(g'\right)^{-1}\mathbf{x}\right)/\tau\right)}$  (5)
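The following PyTorch sketch (our illustration, with placeholder names and $C_4$ as the group) computes $\mathbb{P}_{c(\mathbf{x})}$ from the output vectors of $s_{\theta}(\cdot)$ and a reference vector $v_{R}$; for simplicity it returns a hard canonical orientation, and the differentiable selection discussed in the next paragraph is omitted. The selection convention (most probable element here, consistent with the identity prior in Eq. 7; use `argmin` to follow Eq. 5 verbatim) is an implementation assumption.

```python
import torch
import torch.nn.functional as F


def canonicalizer_distribution(x, s_theta, v_ref, tau=1.0):
    """Compute P_c(x) over C4 and a hard canonical orientation for an image batch x.

    s_theta: any network mapping a (N, C, H, W) batch to (N, d) feature vectors.
    v_ref:   reference vector v_R of dimension d (learned or fixed).
    """
    # Orbit of x under C4: all four rotated copies, stacked along a group axis.
    orbit = torch.stack([torch.rot90(x, k=-k, dims=(-2, -1)) for k in range(4)], dim=1)
    b, g = orbit.shape[:2]
    feats = s_theta(orbit.flatten(0, 1)).view(b, g, -1)  # one parallel pass: (B, |G|, d)
    logits = feats @ v_ref / tau                         # v_R . s_theta(rho(g)^{-1} x) / tau
    probs = F.softmax(logits, dim=1)                     # P_c(x), shape (B, |G|)

    # Hard selection of the canonical orientation (see the sign-convention note above);
    # the straight-through trick from the next paragraph would make this differentiable.
    idx = probs.argmax(dim=1)
    x_canonical = orbit[torch.arange(b), idx]
    return x_canonical, probs, feats
```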

To make this canonicalization process differentiable, we use the straight-through gradient trick as proposed in [27]. Alternatively, to introduce more of an augmentation effect during training [35], one can use Gumbel-Softmax [24] to sample from $\mathbb{P}_{c(\mathbf{x})}$ in a differentiable way. Now, to obtain a unique canonical orientation, we train $s_{\theta}(\cdot)$ to output different vectors for every unique element in the orbit $\mathbf{x}^{\mathcal{G}}$ by minimizing the following loss, $\mathcal{L}_{\text{Opt}}$:

$\mathbb{E}_{\mathbf{x}\in\mathcal{D}}\left[\sum_{g_{i},g_{j}\in\mathcal{G},\, g_{i}\neq g_{j}} s_{\theta}\left(\rho\left(g_{i}\right)^{-1}\mathbf{x}\right) \cdot s_{\theta}\left(\rho\left(g_{j}\right)^{-1}\mathbf{x}\right)\right]$  (6)

where $\mathcal{D}$ is the training dataset. This loss prevents the collapse of the learned vectors in the output space of $s_{\theta}(\cdot)$ for different transformations of the input $\mathbf{x}$ by minimizing their similarity, measured with an elementwise dot product. Fig. 1 shows a schematic of our simple approach. The use of non-contrastive approaches [47, 2] that use the cross-correlation between these vectors to prevent representation collapse is an interesting avenue for future work.
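A short sketch of $\mathcal{L}_{\text{Opt}}$ in Eq. 6, reusing the `feats` tensor of shape (B, |G|, d) returned by the canonicalizer sketch after Eq. 5 (our placeholder names):

```python
import torch


def opt_loss(feats):
    """Eq. 6: pairwise dot products between the vectors s_theta assigns to different
    group transforms of the same input, summed over pairs and averaged over the batch."""
    gram = feats @ feats.transpose(1, 2)                            # (B, |G|, |G|)
    off_diag = gram - torch.diag_embed(torch.diagonal(gram, dim1=1, dim2=2))
    return off_diag.sum(dim=(1, 2)).mean()                          # exclude g_i == g_j terms
```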

In the context of training from scratch [27], the loss from Eq. 6 can be jointly optimized with the task loss. Similarly, for fine-tuning or zero-shot adaptation [35], an additional prior regularization loss is used. Assuming the identity transformation to be the prior for natural image datasets [35], the loss $\mathcal{L}_{\text{prior}}$ is given by:

$\mathbb{E}_{\mathbf{x}\in\mathcal{D}_{f}}\left[-\log\left(\dfrac{\exp\left(v_{R} \cdot s_{\theta}\left(\mathbf{x}\right)/\tau\right)}{\sum_{g\in\mathcal{G}}\exp\left(v_{R} \cdot s_{\theta}\left(\rho\left(g\right)^{-1}\mathbf{x}\right)/\tau\right)}\right)\right]$  (7)

where $\mathcal{D}_{f}$ is the fine-tuning dataset. As this formulation transfers the equivariance constraint of Eq. 3 to minimizing the loss in Eq. 6 over the data distribution, we can conveniently start with a pretrained $s_{\theta}(\cdot)$ to further ease the optimization process.
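With the identity prior, Eq. 7 is simply a cross-entropy that pushes $\mathbb{P}_{c(\mathbf{x})}$ towards the identity element. A sketch, assuming the identity transform sits at index 0 of the group axis and that `logits` are the scores $v_{R} \cdot s_{\theta}(\rho(g)^{-1}\mathbf{x})/\tau$ computed in the canonicalizer sketch above:

```python
import torch
import torch.nn.functional as F


def prior_loss(logits):
    """Eq. 7 with a delta prior on the identity element (assumed to be index 0)."""
    # logits: (B, |G|) scores v_R . s_theta(rho(g)^{-1} x) / tau.
    targets = torch.zeros(logits.shape[0], dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)  # -log P_c(x)[identity], averaged over the batch
```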

Typically, we choose $s_{\theta}(\cdot)$ to be smaller and faster than the large prediction network $\mathbf{p}(\cdot)$. This is based on the assumption that determining a canonical orientation is simpler than the more complex downstream task, which demands a deeper understanding of the input. Therefore, our method requires $|\mathcal{G}|$ forward passes in parallel through $s_{\theta}(\cdot)$ instead of the prediction function $\mathbf{p}(\cdot)$, making it significantly more efficient than symmetrization-based methods [4, 3, 36].

4 Results

Pretrained Large Prediction Network → |              ResNet50               |                ViT
Datasets ↓ | Model           | Acc          | C4-Avg Acc   | Acc          | C4-Avg Acc
CIFAR10    | Vanilla         | 97.33 ± 0.01 | 69.72 ± 0.25 | 98.13 ± 0.04 | 68.98 ± 0.48
           | C4-Augmentation | 95.76 ± 0.01 | 94.77 ± 0.05 | 96.61 ± 0.04 | 95.60 ± 0.03
           | EquiAdapt       | 96.19 ± 0.01 | 96.18 ± 0.02 | 96.14 ± 0.14 | 96.12 ± 0.11
           | EquiOptAdapt    | 97.16 ± 0.01 | 97.16 ± 0.01 | 96.96 ± 0.02 | 96.96 ± 0.02
STL10      | Vanilla         | 98.30 ± 0.01 | 88.61 ± 0.34 | 98.31 ± 0.09 | 78.63 ± 0.25
           | C4-Augmentation | 98.20 ± 0.05 | 95.84 ± 0.04 | 97.69 ± 0.07 | 95.79 ± 0.14
           | EquiAdapt       | 97.01 ± 0.01 | 96.98 ± 0.02 | 96.15 ± 0.05 | 96.15 ± 0.05
           | EquiOptAdapt    | 98.04 ± 0.05 | 98.04 ± 0.04 | 97.32 ± 0.01 | 97.32 ± 0.01
Table 1: Performance comparison of large pretrained models fine-tuned on different vision datasets. Both Accuracy (Acc) and $C_4$-Average Accuracy ($C_4$-Avg Acc) are reported. Acc refers to the accuracy on the original test set, and $C_4$-Avg Acc refers to the accuracy on the augmented test set obtained using the group $C_4$.

While our method applies to training equivariant models from scratch, motivated by the practical advantages of using large-scale pretrained models, we focus on their equivariant adaptation by fine-tuning them with the prior regularization loss. This section presents results from experiments on well-known, publicly available pretrained networks. Our method, EquiOptAdapt, enables equivariant adaptation of these models without any additional architectural constraints on the canonicalizer. EquiOptAdapt maintains fine-tuned model performance, increases robustness against known out-of-distribution transformations, and operates faster than conventional equivariant canonicalization approaches.

4.1 Image Classification

Network →     |     MaskRCNN       |        SAM         |    MaskRCNN       |        SAM
Setup ↓       | mAP   | C4-Avg mAP | mAP   | C4-Avg mAP | Inference time ↓  | Inference time ↓
Zero-shot     | 48.19 | 29.34      | 62.32 | 58.77      | 23m 53s           | 2h 28m 43s
EquiAdapt     | 46.80 | 46.79      | 62.10 | 62.10      | 27m 09s (+13.68%) | 2h 34m 36s (+3.96%)
EquiOptAdapt  | 48.01 | 48.01      | 62.30 | 62.30      | 25m 35s (+7.12%)  | 2h 30m 42s (+1.33%)
Table 2: Zero-shot performance comparison and inference times of large pretrained segmentation models with and without trained canonicalization functions on the validation set of the COCO 2017 dataset [32].

Experiment Setup.

The Vanilla setup consists of fine-tuning ResNet50 [22] and a Vision Transformer (ViT) [17], which are widely used for obtaining image embeddings to solve downstream tasks. Both architectures were pretrained on ImageNet-1K [16], and the checkpoints are publicly available (ResNet50 and ViT-B/16 checkpoints from PyTorch). Another strong baseline is to fine-tune the pretrained architecture using $C_4$ group data augmentation, given our prior knowledge that the evaluation is performed on a $C_4$-augmented test set.

The EquiAdapt setup [35] uses an equivariant canonicalization network to build a canonicalizer that is placed before the pretrained architecture. Both networks are fine-tuned using a cross-entropy loss for the classification task, with an additional prior regularization loss for the canonicalization network. In comparison, the canonicalizer in EquiOptAdapt uses a smaller pretrained ResNet architecture as the canonicalization network $s_{\theta}(\cdot)$. We set the output space of $s_{\theta}(\cdot)$ to 128 dimensions, and $v_{R}$ is a random fixed Gaussian vector of the same dimension. Along with the cross-entropy classification loss and $\mathcal{L}_{\text{prior}}$, the final fine-tuning loss includes $\mathcal{L}_{\text{Opt}}$ to learn an equivariant canonicalizer.

Evaluation setup.

We use a similar evaluation protocol to Mondal et al. [35]. Along with the accuracy on the original test set, we report $C_4$-Average Accuracy, which indicates accuracy on an augmented test set where each test image is rotated by every element of the $C_4$ group, i.e., the group of four discrete rotations.
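As a sketch of this protocol (our paraphrase, not the authors' evaluation script), the $C_4$-Average Accuracy can be computed by evaluating the full model on every 90-degree rotation of each test batch and pooling the results:

```python
import torch


@torch.no_grad()
def c4_average_accuracy(model, loader, device="cuda"):
    """Accuracy averaged over all four 90-degree rotations of every test image."""
    correct, total = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        for k in range(4):  # the four elements of C4
            preds = model(torch.rot90(images, k=k, dims=(-2, -1))).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total
```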

Results.

We present the fine-tuning results for the different setups on CIFAR10 [30] and STL10 [12] in Tab. 1. Our findings demonstrate that both EquiOptAdapt and EquiAdapt exhibit comparable performance to the Vanilla setup in terms of test-set accuracy, with EquiOptAdapt showing superior performance. This suggests that a pretrained non-equivariant canonicalization network can further ease the optimization, thereby enhancing its ability to learn the mapping from an input to a unique element within the orbit of the considered group. Similar to Mondal et al. [35], we observe that more expressive canonicalizers lead to higher performance. Further, there is no gap between accuracy and $C_4$-average accuracy, demonstrating the successful learning of an equivariant canonicalizer and, hence, the equivariant adaptation of the considered models. The Vanilla and $C_4$-Augmentation models perform significantly worse than the equivariant-adaptation-based models when tested on the $C_4$-augmented test set.

4.2 Zero-shot Instance Segmentation

Experiment Setup.

Next, we compare the zero-shot instance segmentation results for MaskRCNN [23] and the Segment Anything Model (SAM) [29] on COCO 2017 [32]. In particular, we evaluate promptable instance segmentation for SAM, with bounding boxes as prompts. We keep the same setups as Sec. 4.1, with fine-tuning replaced by zero-shot evaluation. Following the strategy in Mondal et al. [35], where a canonicalizer is trained on the COCO dataset with prior regularization $\mathcal{L}_{\text{prior}}$, we only train our canonicalizer, with the additional optimization loss $\mathcal{L}_{\text{Opt}}$, to make the canonicalization process equivariant. As in Sec. 4.1, we initialize our non-equivariant canonicalizer with a pretrained WideResNet-50 architecture.

Evaluation setup.

We use the mean average precision (mAP) and $C_4$-Average mAP scores. The $C_4$-Average mAP score indicates the mAP on an augmented validation set of COCO 2017, where each image (and its bounding boxes) is rotated by every element of the $C_4$ group, while mAP indicates the score on the original validation set.

We also compare the relative wall-clock time (in minutes) to learn the prior distribution $\mathbb{P}_{c(\mathbf{x})}$ during training with [35]. Given that our chosen prior is effectively a $\delta$-distribution centred on the identity element $e$ of the group, we evaluate the accuracy of learning this prior as the identity metric.
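A sketch of this identity metric under our assumptions (identity element at index 0 of the group axis, `probs` being $\mathbb{P}_{c(\mathbf{x})}$ of shape (B, |G|), and the canonical element taken as the most probable one, as in the canonicalizer sketch above):

```python
import torch


def identity_metric(probs):
    """Percentage of inputs whose selected group element is the identity (index 0)."""
    return 100.0 * (probs.argmax(dim=1) == 0).float().mean().item()
```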

Results.

The results for the various setups are presented in Tab. 2. Our analysis reveals that EquiAdapt and EquiOptAdapt effectively achieve architecture-agnostic equivariant adaptation of large pretrained models while maintaining their mean average precision (mAP) performance. Notably, EquiOptAdapt again outperforms EquiAdapt in this regard. Additionally, we report the total inference times for each setup in Tab. 2. The inference times for EquiOptAdapt and EquiAdapt indicate that the canonicalization process is 2$\times$ faster for EquiOptAdapt.

Moreover, Fig. 2 plots the relative wall-time for EquiOptAdapt and EquiAdapt against the identity metric. We demonstrate that our proposed EquiOptAdapt learns the prior distribution faster than EquiAdapt. This results from the ability to use any existing non-equivariant pretrained WideResNet model, which trains and runs faster than the equivariant WideResNet architecture used in EquiAdapt [35]. Therefore, our findings suggest that EquiOptAdapt generally offers better performance and faster training and inference times compared to EquiAdapt.


Figure 2: Identity metric vs. relative wall-time (in minutes). We define the identity metric as the percentage of input images mapped to the identity group element $e$, on which our prior distribution is centred. This figure demonstrates that our EquiOptAdapt learns the prior faster than EquiAdapt.

5 Conclusion

Generalizing to out-of-distribution data remains a considerable obstacle for state-of-the-art deep learning models, particularly under input transformations such as rotations, scalings, and orientation changes. Large pretrained models can be made equivariant to such transformations through canonicalization [35]. However, existing approaches such as [27, 35] use equivariant networks for canonicalization, which act as a bottleneck for learning canonical orientations. This paper proposes EquiOptAdapt to address this expressivity constraint by leveraging an optimization-based approach with contrastive learning techniques, enabling the use of any neural network architecture for canonicalization. Our experiments show that EquiOptAdapt preserves the performance of large pretrained models and surpasses existing methods in robust generalization to transformations of the data while significantly accelerating the canonicalization process. These findings highlight the practicality and effectiveness of our approach in achieving robust equivariant adaptation, marking an important advancement in improving out-of-distribution generalization and equivariant model design.

6 Limitations and Future Work

An important limitation of our current work lies in its focus on groups of discrete transformations. Prior experiments with continuous groups, such as the group of 2D rotations $SO(2)$ [35], have revealed the limited ability of $E(2)$-steerable networks [43] to learn mappings from inputs to canonical orientations with prior regularization. This limitation can potentially be mitigated by utilizing more expressive, unconstrained pretrained neural networks as the canonicalization network, which could lead to enhanced optimization. However, using a continuous group would require test-time optimization over the output energy values, which can make inference significantly more expensive. We plan to investigate workarounds and introduce continuous rotations in future work.

In addition to continuous rotations, we intend to incorporate higher-order discrete rotations and compare them. Finer rotation angles present an intriguing challenge for both continuous and higher-order discrete rotations due to the artifacts introduced at the corners of images. To address this, we aim to design novel techniques to make the canonicalization network robust to such artifacts. Moreover, exploring other non-contrastive, correlation-based methods to train the canonicalizer is another interesting direction for future research.

Finally, automating prior discovery, based on the performance of the pretrained model over different transformations of the input in the fine-tuning data, can address the current limitation of manually deciding the prior. This would make the equivariant adaptation technique more general and agnostic to the choice of model, task, and data.

References

  • Balestriero et al. [2023] Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari Morcos, Shashank Shekhar, Tom Goldstein, Florian Bordes, Adrien Bardes, Gregoire Mialon, Yuandong Tian, et al. A cookbook of self-supervised learning. arXiv preprint arXiv:2304.12210, 2023.
  • Bardes et al. [2021] Adrien Bardes, Jean Ponce, and Yann LeCun. Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906, 2021.
  • Basu et al. [2023a] Sourya Basu, Pulkit Katdare, Prasanna Sattigeri, Vijil Chenthamarakshan, Katherine Driggs-Campbell, Payel Das, and Lav R Varshney. Efficient equivariant transfer learning from pretrained models. In Advances in Neural Information Processing Systems, pages 4213–4224. Curran Associates, Inc., 2023a.
  • Basu et al. [2023b] Sourya Basu, Prasanna Sattigeri, Karthikeyan Natesan Ramamurthy, Vijil Chenthamarakshan, Kush R Varshney, Lav R Varshney, and Payel Das. Equi-tuning: Group equivariant fine-tuning of pretrained models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 6788–6796, 2023b.
  • Batatia et al. [2022] Ilyes Batatia, David P Kovacs, Gregor Simm, Christoph Ortner, and Gábor Csányi. Mace: Higher order equivariant message passing neural networks for fast and accurate force fields. Advances in Neural Information Processing Systems, 35:11423–11436, 2022.
  • Bekkers et al. [2018] Erik J Bekkers, Maxime W Lafarge, Mitko Veta, Koen AJ Eppenhof, Josien PW Pluim, and Remco Duits. Roto-translation covariant convolutional networks for medical image analysis. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st International Conference, Granada, Spain, September 16-20, 2018, Proceedings, Part I, pages 440–448. Springer, 2018.
  • Bogatskiy et al. [2022] Alexander Bogatskiy, Sanmay Ganguly, Thomas Kipf, Risi Kondor, David W Miller, Daniel Murnane, Jan T Offermann, Mariel Pettee, Phiala Shanahan, Chase Shimmin, et al. Symmetry group equivariant architectures for physics. arXiv preprint arXiv:2203.06153, 2022.
  • Brandstetter et al. [2021] Johannes Brandstetter, Rob Hesselink, Elise van der Pol, Erik J Bekkers, and Max Welling. Geometric and physical quantities improve e (3) equivariant message passing. In International Conference on Learning Representations, 2021.
  • Cesa et al. [2021] Gabriele Cesa, Leon Lang, and Maurice Weiler. A program to build e (n)-equivariant steerable cnns. In International conference on learning representations, 2021.
  • Chen et al. [2023] Dongdong Chen, Mike Davies, Matthias J Ehrhardt, Carola-Bibiane Schönlieb, Ferdia Sherry, and Julián Tachella. Imaging with equivariant deep learning: From unrolled network design to fully unsupervised learning. IEEE Signal Processing Magazine, 40(1):134–147, 2023.
  • Chen et al. [2020] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
  • Coates et al. [2011] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223. JMLR Workshop and Conference Proceedings, 2011.
  • Cohen and Welling [2016] Taco Cohen and Max Welling. Group equivariant convolutional networks. In International conference on machine learning, pages 2990–2999. PMLR, 2016.
  • Cohen et al. [2018] Taco S Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical cnns. arXiv preprint arXiv:1801.10130, 2018.
  • Deng et al. [2021] Congyue Deng, Or Litany, Yueqi Duan, Adrien Poulenard, Andrea Tagliasacchi, and Leonidas J Guibas. Vector neurons: A general framework for so (3)-equivariant networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12200–12209, 2021.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
  • Duval et al. [2023a] Alexandre Duval, Simon V Mathis, Chaitanya K Joshi, Victor Schmidt, Santiago Miret, Fragkiskos D Malliaros, Taco Cohen, Pietro Liò, Yoshua Bengio, and Michael Bronstein. A hitchhiker’s guide to geometric gnns for 3d atomic systems. arXiv preprint arXiv:2312.07511, 2023a.
  • Duval et al. [2023b] Alexandre Agm Duval, Victor Schmidt, Alex Hernández-Garćia, Santiago Miret, Fragkiskos D Malliaros, Yoshua Bengio, and David Rolnick. Faenet: Frame averaging equivariant gnn for materials modeling. In International Conference on Machine Learning, pages 9013–9033. PMLR, 2023b.
  • Gasteiger et al. [2019] Johannes Gasteiger, Janek Groß, and Stephan Günnemann. Directional message passing for molecular graphs. In International Conference on Learning Representations, 2019.
  • Gasteiger et al. [2021] Johannes Gasteiger, Florian Becker, and Stephan Günnemann. Gemnet: Universal directional graph neural networks for molecules. Advances in Neural Information Processing Systems, 34:6790–6802, 2021.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • He et al. [2017] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
  • Jang et al. [2016] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
  • Kaba and Ravanbakhsh [2022] Oumar Kaba and Siamak Ravanbakhsh. Equivariant networks for crystal structures. Advances in Neural Information Processing Systems, 35:4150–4164, 2022.
  • Kaba and Ravanbakhsh [2023] Sékou-Oumar Kaba and Siamak Ravanbakhsh. Symmetry breaking and equivariant neural networks. arXiv preprint arXiv:2312.09016, 2023.
  • Kaba et al. [2023] Sékou-Oumar Kaba, Arnab Kumar Mondal, Yan Zhang, Yoshua Bengio, and Siamak Ravanbakhsh. Equivariance with learned canonicalization functions. In International Conference on Machine Learning, pages 15546–15566. PMLR, 2023.
  • Kim et al. [2023] Jinwoo Kim, Dat Nguyen, Ayhan Suleymanzade, Hyeokjun An, and Seunghoon Hong. Learning probabilistic symmetrization for architecture agnostic equivariance. In Advances in Neural Information Processing Systems, pages 18582–18612. Curran Associates, Inc., 2023.
  • Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
  • Krizhevsky et al. [2009] Alex Krizhevsky et al. Learning multiple layers of features from tiny images. 2009.
  • [31] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • Mondal et al. [2020] Arnab Kumar Mondal, Pratheeksha Nair, and Kaleem Siddiqi. Group equivariant deep reinforcement learning. arXiv preprint arXiv:2007.03437, 2020.
  • Mondal et al. [2022] Arnab Kumar Mondal, Vineet Jain, Kaleem Siddiqi, and Siamak Ravanbakhsh. Eqr: Equivariant representations for data-efficient reinforcement learning. In International Conference on Machine Learning, pages 15908–15926. PMLR, 2022.
  • Mondal et al. [2023] Arnab Kumar Mondal, Siba Smarak Panigrahi, Oumar Kaba, Sai Rajeswar Mudumba, and Siamak Ravanbakhsh. Equivariant adaptation of large pretrained models. In Advances in Neural Information Processing Systems, pages 50293–50309. Curran Associates, Inc., 2023.
  • Puny et al. [2021] Omri Puny, Matan Atzmon, Heli Ben-Hamu, Ishan Misra, Aditya Grover, Edward J Smith, and Yaron Lipman. Frame averaging for invariant and equivariant network design. arXiv preprint arXiv:2110.03336, 2021.
  • Schütt et al. [2021] Kristof Schütt, Oliver Unke, and Michael Gastegger. Equivariant message passing for the prediction of tensorial properties and molecular spectra. In International Conference on Machine Learning, pages 9377–9388. PMLR, 2021.
  • Van der Pol et al. [2020] Elise Van der Pol, Daniel Worrall, Herke van Hoof, Frans Oliehoek, and Max Welling. Mdp homomorphic networks: Group symmetries in reinforcement learning. Advances in Neural Information Processing Systems, 33:4199–4210, 2020.
  • van der Pol et al. [2021] Elise van der Pol, Herke van Hoof, Frans A Oliehoek, and Max Welling. Multi-agent mdp homomorphic networks. arXiv preprint arXiv:2110.04495, 2021.
  • Wang et al. [2022a] Dian Wang, Mingxi Jia, Xupeng Zhu, Robin Walters, and Robert Platt. On-robot learning with equivariant models. arXiv preprint arXiv:2203.04923, 2022a.
  • Wang et al. [2022b] Dian Wang, Robin Walters, Xupeng Zhu, and Robert Platt. Equivariant qq learning in spatial action spaces. In Conference on Robot Learning, pages 1713–1723. PMLR, 2022b.
  • Wang and Isola [2020] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International conference on machine learning, pages 9929–9939. PMLR, 2020.
  • Weiler and Cesa [2019] Maurice Weiler and Gabriele Cesa. General e (2)-equivariant steerable cnns. Advances in neural information processing systems, 32, 2019.
  • Worrall et al. [2017] Daniel E Worrall, Stephan J Garbin, Daniyar Turmukhambetov, and Gabriel J Brostow. Harmonic networks: Deep translation and rotation equivariance. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5028–5037, 2017.
  • Wu et al. [2023] Hai Wu, Chenglu Wen, Wei Li, Xin Li, Ruigang Yang, and Cheng Wang. Transformation-equivariant 3d object detection for autonomous driving. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2795–2802, 2023.
  • Yu et al. [2022] Hong-Xing Yu, Jiajun Wu, and Li Yi. Rotationally equivariant 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1456–1464, 2022.
  • Zbontar et al. [2021] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In International conference on machine learning, pages 12310–12320. PMLR, 2021.