
SuperSAM: Crafting a SAM Supernetwork via Structured Pruning and Unstructured Parameter Prioritization
thanks: Github: Here is a link to the repository.

Waqwoya Abebe1, Sadegh Jafari1, Sixing Yu1, Akash Dutta1,
Jan Strube2, Nathan R. Tallent2, Luanzheng Guo2,
Pablo Munoz3, Ali Jannesari1
1Iowa State University, 2Pacific Northwest National Laboratory,
3Intel Labs
{wmabebe, sadegh, yusx, adutta, jannesar}@iastate.edu,
{jan.strube, nathan.tallent, lenny.guo}@pnnl.gov,
[email protected]
Abstract

Neural Architecture Search (NAS) is a powerful approach for automating the design of efficient neural architectures. In contrast to traditional NAS methods, recently proposed one-shot NAS methods have proven more efficient at performing NAS. One-shot NAS works by generating a single weight-sharing supernetwork that acts as a search space (container) of subnetworks. Despite its achievements, designing the one-shot search space remains a major challenge. In this work, we propose a search space design strategy for Vision Transformer (ViT)-based architectures. In particular, we convert the Segment Anything Model (SAM) into a weight-sharing supernetwork called SuperSAM. Our approach automates the search space design via layer-wise structured pruning and parameter prioritization. The structured pruning applies probabilistic removal of certain transformer layers, while parameter prioritization performs weight reordering and slicing of MLP blocks in the remaining layers. We train supernetworks on several datasets using the sandwich rule. For deployment, we enhance subnetwork discovery by using a program autotuner to identify efficient subnetworks within the search space. The resulting subnetworks are 30-70% smaller than the original pre-trained SAM ViT-B, yet outperform the pretrained model. Our work introduces a new and effective method for ViT NAS search-space design.

Index Terms:
Neural Architecture Search, Segment Anything Model

I Introduction

Vision Transformers (ViTs) Dosovitskiy (2020) have transformed the landscape of computer vision by leveraging the self-attention mechanism originally developed for natural language processing. The Segment Anything Model (SAM) Kirillov et al. (2023a) is a ViT-based foundation model for image segmentation. SAM was trained on the SA1B dataset Kirillov et al. (2023a), which consists of 11M images and 1.1B mask annotations. SAM's architecture is composed of three major components: a large ViT-based image encoder, a lightweight prompt encoder, and a lightweight mask decoder. Given the size of the image encoder and its computationally expensive attention mechanism, several works have sought to compress the model for deployment in resource-constrained environments Zhang et al. (2023); Xiong et al. (2023); Fu et al. (2024).

Neural Architecture Search (NAS) techniques are advanced methods used to automatically discover efficient architectures for deep learning models. By automating the architectural design process, NAS methods aim to identify architectures that offer state-of-the-art performance while minimizing human intervention and computational cost. Recently, one-shot NAS methods have proven to be more efficient than traditional NAS techniques White et al. (2023). This is because one-shot NAS trains a single weight-sharing supernetwork that acts as a container of subnetworks rather than training new architectures from scratch. As a result, the subnetworks directly inherit their weights from the same overparameterized supernetwork. This approach provides a means of training an exponential number of architectures at a linear computational cost.

A major challenge in NAS is designing the architecture search space Muñoz et al. (2024). The search space is the set of all possible sub-architecture configurations (subnetworks) that can be derived by pruning the supernetwork (which, in our case, is initialized from the pre-trained SAM model). The supernetwork acts as an over-parameterized architecture that contains all possible sub-architectures White et al. (2023). Unsurprisingly, the search space is combinatorial, making it infeasible to enumerate and optimize for the NAS task. In general, a NAS search space is first designed to bound the pool of potential architectures that can be derived. But even after bounding, the search space may still contain millions of possible subnetwork configurations. Hence, an ideal search space not only reduces the number of candidate subnetworks but also contains promising subnetworks that require fewer resources to train.

The NAS training process iteratively samples subnetworks from the search space and optimizes them on each batch of training data. In particular, our approach follows the ‘sandwich rule’ Yu and Huang (2019): for a single batch of training data, the maximal subnetwork (supernetwork), the minimal subnetwork, and a randomly selected subnetwork are sequentially sampled and optimized. After optimization, gradients of the sampled subnetworks are pushed back into the supernetwork (main model) to update its parameters.

Several search space design strategies have been proposed in the past. For instance, Cai et al. (2019) propose Once-for-All, where 4-dimensional elasticity is applied to train a weight-sharing supernetwork for CNN architectures. They also propose progressive shrinking, a training strategy in which the size of sampled subnetworks shrinks as NAS training progresses. Recently, EFTNAS Muñoz et al. (2024) proposed performance-aware search-space design coupled with first-order weight reordering for transformer-based models. NASViT Gong and Wang (2022) designs efficient small and medium-sized models. Its search space is inspired by LeViT Graham et al. (2021), which utilizes a hybrid architecture combining convolutions and transformers. For each CNN block, the search focuses on optimizing channel widths, block depths, expansion ratios, and kernel sizes. For each transformer block, it explores the optimal number of windows, hidden feature dimensions, depths, and MLP expansion ratios. The BigNAS Yu et al. (2020) search space includes multiple dimensions such as kernel sizes, channel numbers, input resolutions, and network depths; these dimensions are searched simultaneously to identify optimal child models.

In this work, we introduce SuperSAM by transforming SAM into an ‘elastic’ supernetwork, enabling the derivation of subnetworks with varying architectures tailored to a wide range of resource constraints. The subnetworks derived from SuperSAM exhibit performance comparable to the pre-trained SAM with just a fraction of its parameters. To this end, we propose combining probabilistic layer-wise pruning with row/column-wise Wanda Sun et al. (2023b) parameter prioritization to design the search space and guide subnetwork selection. This approach first ranks the transformer layers, retaining the most important ones while assigning pruning probabilities to the rest. The Wanda row/column-wise prioritization then performs row/column reordering and slicing of MLP blocks within the remaining transformer layers to retain the most salient parameters. This dual elasticity acts as a hierarchical mechanism: the structured pruning enables the discovery of a few high-performing subnetworks, whereas the reordering and slicing operations expand the intermediate search space to discover a wide range of robust architectures.

In summary:

  • We propose a novel algorithm for search-space design for transformer-based NAS.

  • We utilize this scheme to convert the SAM model into a weight-sharing supernetwork.

  • We train supernetworks on multiple datasets, demonstrating higher training efficiency.

  • We apply OpenTuner for subnetwork discovery and extract efficient subnetworks with performance comparable to the supernetwork.

The remainder of the paper is organized as follows: Section II discusses related work, Section III presents the methodology, Section IV presents the evaluation, and Section V concludes the paper.

II Related works

Refer to caption
Figure 1: After freezing the prompt encoder and applying 2D elasticity on the image encoder, the image encoder and mask decoder are jointly optimized using the sandwich rule. As shown above, while medium-sized subnets may or may not overlap with each other, the smallest subnetwork lies in the intersectional region of all other subnetworks.

Neural Architecture Search:

The search space, consisting of a set of neural network architectures, is explored by NAS methods, which apply search and evaluation strategies to identify high-performing architectures that are often smaller and more efficient than human-crafted ones Elsken et al. (2019). Traditional NAS methods require training each architecture from scratch, which is costly. In contrast, one-shot approaches White et al. (2023) train all architectures in the search space simultaneously by using a single “supernetwork,” an over-parameterized model containing all possible subnetworks. In one-shot weight-sharing approaches, careful experimental analysis has demonstrated that it is possible to efficiently identify promising architectures from a complex search space Bender et al. (2018). After training, search algorithms, such as reinforcement learning, evolutionary algorithms, or gradient-based methods, can be used to explore the vast space of possible architectures under given constraints.

Various techniques exist for training the generated supernetwork. For instance, Once-for-All (OFA) Cai et al. (2019) introduces progressive shrinking, which enforces a training order that starts with large subnetworks and gradually moves to smaller ones. NASViT Gong and Wang (2022) offers a set of methods, including a gradient projection algorithm, a switchable layer scaling design, and a streamlined approach to data augmentation and regularization, all of which significantly enhance the convergence and performance of subnetworks. BigNAS Yu et al. (2020) challenges the conventional view that post-processing of the weights is required for good prediction accuracy: without additional retraining or post-processing, it trains a single set of shared weights. A sandwich sampling rule with in-place knowledge distillation (KD) is used to simultaneously optimize the supernet and subnetworks for each mini-batch, stabilizing training and improving convergence. EFTNAS Muñoz et al. (2024) proposes first-order weight reordering to improve the search space design; in particular, a column-wise reordering of attention and MLP layers improves the results of the elasticity operation.

Network Pruning:

Pruning is a popular technique for model compression. Generally, pruning is categorized as structured or unstructured depending on the size of the network components removed. Structured pruning involves the removal of large network components Cheng et al. (2024). For example, ShortGPT Men et al. (2024) removes redundant layers based on a metric that measures layer importance. LLM-Pruner Ma et al. (2023) adopts a structured pruning approach by selectively removing non-critical coupled structures using gradient information. Similarly, BlockPruner Zhong et al. (2024) splits each transformer layer into Multi-Head Attention (MHA) and MLP blocks, evaluates their importance using perplexity measures, and employs a heuristic search for iterative pruning to optimize model efficiency. Additionally, Isomorphic Pruning Fang et al. (2024) demonstrates its effectiveness across various network architectures, such as Vision Transformers and CNNs, and consistently delivers competitive performance across models of different sizes, further showcasing the versatility of structured pruning techniques.

Unstructured pruning, also known as weight-wise pruning, targets individual weights by eliminating redundant connections in neural networks, setting the corresponding weights to zero. For example, magnitude pruning Han et al. (2015) reduces storage and computation requirements by learning and retaining only the most important connections, without sacrificing accuracy. Wanda Sun et al. (2023a) removes weights with the smallest magnitudes, adjusted by the corresponding input activations, on a per-output basis. Similarly, SparseGPT Frantar and Alistarh (2023) frames pruning as a layer-wise sparse regression problem, solving it approximately through a sequence of efficient Hessian updates and weight reconstructions, further optimizing neural network efficiency.

SAM:

Despite the impressive performance of the Segment Anything Model (SAM), its large ViT-based image encoder imposes a substantial inference cost, making it challenging to deploy in resource-constrained environments. Several methods have been proposed to address this issue. For example, Fast Segment Anything (FastSAM) Zhao et al. (2023) employs YOLOv8-seg, an object detector adapted for instance segmentation, to significantly reduce computational overhead. MobileSAM Zhang et al. (2023) applies decoupled knowledge distillation from the ViT-H image encoder to a new tiny-ViT image encoder. EfficientSAMs Xiong et al. (2023) utilize SAMI (masked image pretraining) to enhance visual representation learning by reconstructing features from SAM's image encoder; by combining SAMI-pretrained lightweight encoders with a mask decoder, EfficientSAMs achieve both efficiency and effectiveness. SAM-Lightening Songa et al. (2024) introduces Dilated Flash Attention, a re-engineered attention mechanism that increases inference speed by approximately 30 times compared to the original SAM.

III Methodology

Refer to caption
Figure 2: The NAS search space design involves two main operations. 1. Identify prunable layers and assign a pruning probability. 2. Apply weight reordering in MLP blocks of surviving layers before a slicing window operation. These operations significantly reduce the size of the subnetworks in the search space as well as improve the quality of the subnetwork candidates.

Search Space design:

Following the intuition of previous compression attempts, we apply our proposed NAS technique to the image encoder. Unlike previous works that use direct distillation on a single architecture (e.g., Zhang et al. (2023)), we first apply elasticity to the image encoder and then conduct NAS training. In NAS literature, elasticity describes the potential variability a certain architectural component can have across subnetworks Muñoz et al. (2021). For instance, two subnetworks ($\alpha^{i}$ and $\alpha^{j}$) derived from the same supernetwork could have a different number of channels for the same layer $k$, i.e. $\alpha_{k}^{i}\neq\alpha_{k}^{j}$. This makes layer $k$ elastic.

We apply 2-dimensional elasticity to the SAM ViT image encoder. The first elastic dimension is applied to the transformer layers as a form of structured pruning (similar to Men et al. (2024)). Here, we use few-shot evaluation to determine the importance of the transformer layers with respect to a performance metric. This allows us to identify a set of prunable layers $P \subseteq \{0,1,\ldots,11\}$, each of which is assigned a pruning probability $\theta_{i}$ during subnetwork selection. In particular, during NAS training, the pruning probability determines whether the corresponding transformer layer is included in the currently sampled subnetwork.
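For illustration, the following minimal sketch shows how such pruning probabilities could drive layer selection when a subnetwork is sampled. The uniform probability of 0.5 and the function name are our assumptions; the prunable indices shown are the ones reported later in Section IV.

```python
import random

# Assumed pruning probabilities for the prunable layers (illustrative values).
PRUNE_PROB = {i: 0.5 for i in [1, 2, 5, 6, 9]}

def sample_active_layers(num_layers: int = 12) -> list:
    """Return the transformer-layer indices kept in the currently sampled subnet."""
    active = []
    for i in range(num_layers):
        if i in PRUNE_PROB and random.random() < PRUNE_PROB[i]:
            continue                      # probabilistically drop a prunable layer
        active.append(i)                  # all other layers (e.g. layer 0) are always kept
    return active
```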

The second elastic dimension applies weight reordering and slicing (windowing) to the MLP blocks of the remaining (unpruned) layers. As shown in Fig. 2, MLP blocks in the remaining layers contain two linear layers. In the ViT-B version, these linear layers have dimensions (3072 x 768) and (768 x 3072). Reordering and slicing are applied to the intermediate dimension (of size 3072). In particular, the rows of the first linear layer and the columns of the second linear layer are first ranked using the Wanda Sun et al. (2023a) metric. The row and column ranks are computed by aggregating the Wanda scores of individual parameters in each row and column, as shown in Eq. (1) below:

$S_{i}=\sum_{j=0}^{n} |W_{ij}| \cdot \lVert X_{j} \rVert_{2}$   (1)

where $S_{i}$ is the score of the $i$-th row or column, $W_{ij}$ is the magnitude of the $ij$-th parameter, and $\lVert X_{j}\rVert_{2}$ is the $\ell_{2}$ norm of the $j$-th input feature aggregated across $N$ input batches of sequence length $L$.
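For reference, a minimal PyTorch sketch of Eq. (1) under our assumptions: `weight` is a linear layer's (out_features, in_features) matrix and `acts` stacks its inputs collected over the calibration batches, i.e. shape (N*L, in_features); the function name is ours.

```python
import torch

def wanda_row_scores(weight: torch.Tensor, acts: torch.Tensor) -> torch.Tensor:
    """Aggregate Wanda scores per output row of a linear layer (Eq. 1 sketch)."""
    x_norm = acts.norm(p=2, dim=0)         # ||X_j||_2 for each input feature j
    per_weight = weight.abs() * x_norm     # |W_ij| * ||X_j||_2, broadcast over rows
    return per_weight.sum(dim=1)           # sum over j -> one score per row i
```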

Refer to caption
Figure 3: Comparing mIoU and model sizes by pruning one or more layers from the SAM ViT-B image encoder.

Corresponding row and column ranks are averaged to compute a mean rank over the intermediate dimension (i.e., across both linear layers). For instance, the rank of the 3rd row in linear layer 1 is averaged with the rank of the 3rd column in linear layer 2 to obtain the average rank of the third row-column pair. This averaging is performed across all corresponding row-column pairs to produce a mean rank over the intermediate dimension. The mean rank is then used to reorder the corresponding rows and columns of linear layers 1 and 2 in descending order, such that the reordered rows and columns stay aligned without distorting the forward/backward pass of the MLP block. After reordering, a window $w_{i}$ for $i\in\{1,2,\ldots,m\}$ is applied to the linear layers to slice them, retaining the first $w_{i}$ rows and columns. Note that slicing affects neither the input dimension of linear layer 1 (i.e., 768) nor the output dimension of linear layer 2 (i.e., 768); it only shrinks the intermediate dimension (i.e., 3072). This approach prioritizes and retains the most salient parameters in the linear layers, significantly improving the quality of the search space.
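To make the pairing concrete, the hedged sketch below reorders one MLP block in place. Here `fc1`/`fc2` stand in for the two linear layers (768 → 3072 → 768), and `rank1`/`rank2` are hypothetical Wanda-based rankings of fc1's rows and fc2's columns (lower rank = more salient); the final slicing step is indicated in comments rather than materialized.

```python
import torch
from torch import nn

def reorder_and_slice(fc1: nn.Linear, fc2: nn.Linear,
                      rank1: torch.Tensor, rank2: torch.Tensor,
                      window: int) -> None:
    """Permute fc1 rows and fc2 columns by their mean rank, keeping them aligned."""
    mean_rank = (rank1.float() + rank2.float()) / 2   # average rank per row-column pair
    order = torch.argsort(mean_rank)                  # most salient pairs first
    with torch.no_grad():
        # Permuting fc1's rows (and bias) together with fc2's columns leaves the
        # MLP block's forward/backward pass unchanged.
        fc1.weight.copy_(fc1.weight[order])
        fc1.bias.copy_(fc1.bias[order])
        fc2.weight.copy_(fc2.weight[:, order])
    # A subnetwork with window w would then keep fc1.weight[:window],
    # fc1.bias[:window], and fc2.weight[:, :window].
```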

The proposed 2D elasticity is not arbitrary. Rather, we observe that layer-wise pruning provides starting points (anchors) in the search space, whereas the windowing operations expand on those anchors. Together, layer-wise pruning and windowing act as a hierarchical mechanism for designing the search space. As previous work Men et al. (2024) has shown, and as we observed in preliminary experiments, transformer layers are not equally important: while some are critical, others are redundant enough that pruning (removing) a layer may result in anything from a dramatic drop to a marginal decrease in performance, as shown in Fig. 3. Layer pruning significantly reduces the model size, facilitating the presence of a few high-performing subnetworks. In contrast, when applying only windowing, the search space produces a substantial number of low-quality subnetworks. Nevertheless, we cannot rely solely on layer-wise pruning, as it yields only a limited set of high-quality subnets (anchors), typically numbering just a few dozen, which limits the exploration of a broader search space that includes smaller subnets dispersed between the anchors. This phenomenon is demonstrated in Section IV-B.

NAS Training:

We implement the sandwich rule Yu and Huang (2019) to train the supernetworks. In particular, for each training batch, we sample the maximal subnet, the smallest subnet, and a randomly selected subnet for sequential optimization. We compute the cost using the Dice loss, as shown in Eq. (2) below.

$\mathcal{L}_{\text{Dice}}=1-\frac{2\times|GT\cap Pred|}{|GT|+|Pred|}$   (2)

where $GT$ is the ground-truth mask and $Pred$ is the predicted mask.
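A minimal PyTorch sketch of Eq. (2) for soft (probability) masks follows; the epsilon term is our addition to avoid division by zero and is not part of the equation above.

```python
import torch

def dice_loss(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Dice loss for masks in [0, 1] of matching shape (Eq. 2 sketch)."""
    intersection = (pred * gt).sum()
    return 1.0 - (2.0 * intersection) / (pred.sum() + gt.sum() + eps)
```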

In all three cases, we keep the prompt encoder and mask decoder architecturally intact across all subnetworks; i.e., elasticity is applied only to the image encoder. This simplifies the implementation and avoids slicing into smaller architectural components, allowing all subnets to directly inherit the pre-trained architectures and weights. To train the SAM supernet, images are first resized to 1024 × 1024 and fed to the model along with prompts. SAM prompts can be one or more coordinate points and/or boxes on or around objects of interest Kirillov et al. (2023b). We freeze the prompt encoder during training and jointly optimize the image encoder and mask decoder, as shown in Fig. 1.
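The per-batch procedure can be summarized by the hedged sketch below; `forward_subnet`, the three subnet configurations, and `dice_loss` (sketched earlier) are placeholders standing in for the actual SuperSAM implementation.

```python
def train_one_batch(supernet, optimizer, images, prompts, gt_masks,
                    max_cfg, min_cfg, random_cfg, forward_subnet, dice_loss):
    """One sandwich-rule step: max, min, and one random subnet share a single update."""
    optimizer.zero_grad()
    for cfg in (max_cfg, min_cfg, random_cfg):      # sandwich rule sampling order
        pred = forward_subnet(supernet, cfg, images, prompts)
        dice_loss(pred, gt_masks).backward()        # gradients accumulate in shared weights
    optimizer.step()                                # single update of the supernet parameters
```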

Deployment:

Once the SAM model has been transformed into a supernetwork, its search space has developed sufficiently to include subnetworks that perform on par with the supernetwork. As a result, a search algorithm can be used to identify subnets that meet specific resource constraints. For instance, we can query the search space for subnets under 60M parameters with mIoU ≥ 80% on a given task.

Once SuperSAM is trained, we deploy OpenTuner Ansel et al. (2014), a program autotuner, to execute the constrained search. OpenTuner uses an ensemble of disparate search techniques simultaneously, dynamically allocating a larger proportion of tests to promising techniques. The ensemble techniques are themselves organized by a meta-technique; for instance, an AUC bandit meta-technique can combine greedy mutation, differential evolution, and hill-climber instances to tackle the search problem. In the following section, we describe experiments showcasing the efficacy of the proposed search space design strategy.
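As an illustration, the sketch below shows how such a constrained search could be expressed with OpenTuner's MeasurementInterface. The helper functions (`build_subnet`, `count_params`, `eval_miou`), the parameter names, and the 60M-parameter budget are our assumptions, not part of the released code.

```python
import opentuner
from opentuner import (ConfigurationManipulator, EnumParameter,
                       IntegerParameter, MeasurementInterface, Result)

# Placeholders for the SuperSAM-side utilities (not part of OpenTuner):
def build_subnet(cfg):      # derive a subnetwork from SuperSAM given a config dict
    raise NotImplementedError
def count_params(subnet):   # total parameter count of the subnetwork
    raise NotImplementedError
def eval_miou(subnet):      # validation mIoU in [0, 1]
    raise NotImplementedError

class SubnetTuner(MeasurementInterface):
    """Constrained subnetwork search over layer keep/drop flags and MLP windows."""
    def manipulator(self):
        m = ConfigurationManipulator()
        for i in [1, 2, 5, 6, 9]:                                # prunable layers
            m.add_parameter(IntegerParameter(f'keep_layer_{i}', 0, 1))
        m.add_parameter(EnumParameter('mlp_window', [768, 1020, 1536, 2304]))
        return m

    def run(self, desired_result, input, limit):
        cfg = desired_result.configuration.data
        subnet = build_subnet(cfg)
        if count_params(subnet) > 60e6:                          # example size constraint
            return Result(time=1.0)                              # reject oversized subnets
        return Result(time=1.0 - eval_miou(subnet))              # OpenTuner minimizes `time`

if __name__ == '__main__':
    SubnetTuner.main(opentuner.default_argparser().parse_args())
```

Run with OpenTuner's standard command-line flags, the meta-technique then allocates trials across its ensemble of search techniques.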

IV Evaluation

Refer to caption
Figure 4: Comparing the Pareto frontier of different reordering strategies on different tasks.

We evaluate the proposed approach by training supernetworks from the SAM ViT-B model on different downstream tasks. These tasks include the SA1B Kirillov et al. (2023b) dataset that was used to train SAM, as well as the MS-COCO Lin et al. (2014), ADE20K Zhou et al. (2017), and Mitochondria Lucchi et al. (2013) datasets. SA1B contains around 11 million images and over 1.1 billion masks; in our experiments, we utilize roughly 0.1% of SA1B to train our supernetwork, i.e., 10,000 images for training and 100 for validation. We set a cutoff limit of 64 objects per image, slicing the number of objects when an image exceeds the cutoff and randomly repeating objects when there are fewer. We utilize the entire MS-COCO dataset, about 92K images for training and 5K for validation; similar to the SA1B case, we set an object limit of 8 per image. Mitochondria is a small domain-specific dataset containing gray-scale images of cells; we ‘patchify’ the images into 256x256 patches to generate around 800 images for training and 800 for validation. Finally, for ADE20K we use the entire training set of 20K images for training and a subset of 100 images for validation. We present example outputs of the smallest subnet in Figs. 6 and 7.
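The per-image object cap can be implemented in a few lines; the sketch below is illustrative (the function name is ours), with `limit` set to 64 for SA1B and 8 for MS-COCO.

```python
import random

def cap_objects(objects: list, limit: int) -> list:
    """Slice the object list when over the cutoff, randomly repeat when under."""
    if len(objects) >= limit:
        return objects[:limit]                    # slice down to the cutoff
    padded = list(objects)
    while len(padded) < limit:
        padded.append(random.choice(objects))     # randomly repeat existing objects
    return padded
```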

We train the supernets on the instance segmentation task, where either a single point prompt per object or a bounding box is used to segment the objects of interest. In particular, we use box prompts for Mitochondria and point prompts for the other datasets. To account for prompt noise, we generate point prompts by selecting a random positive point inside the ground-truth mask. Similarly, in the case of box prompts, we generate the box around the object with a randomly chosen padding of 0-20 pixels. We train our subnets using the Dice loss Sudre et al. (2017) cost function with a learning rate of 1e-5, and apply a lambda learning-rate decay until the learning rate shrinks to 1% of its original value.
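For clarity, a hedged sketch of the prompt-generation step described above; `mask` is a binary H x W array for a single object, and the function names are ours.

```python
import random
import numpy as np

def point_prompt(mask: np.ndarray):
    """Return a random positive (x, y) point inside the ground-truth mask."""
    ys, xs = np.nonzero(mask)
    k = random.randrange(len(ys))
    return int(xs[k]), int(ys[k])

def box_prompt(mask: np.ndarray, max_pad: int = 20):
    """Return an (x0, y0, x1, y1) box around the object with random padding."""
    ys, xs = np.nonzero(mask)
    pad = random.randint(0, max_pad)              # random padding in [0, 20] pixels
    h, w = mask.shape
    return (max(int(xs.min()) - pad, 0), max(int(ys.min()) - pad, 0),
            min(int(xs.max()) + pad, w - 1), min(int(ys.max()) + pad, h - 1))
```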

In designing the search space, we use a small fraction of SA1B (100 images) to compare the importance of the model's transformer layers. Similar to observations made by ShortGPT Men et al. (2024), we notice that not all layers of the ViT-B image encoder are equally important. As shown in Fig. 3, layer 0 exhibits the highest importance: removing it causes an approximately 60% drop in mIoU. Moreover, after pruning layers [1, 2, 5, 6, 9], performance drops by only about 8% while the model size shrinks to 62% of the original pre-trained model. With this insight, we design our search space by applying structured elasticity to layers [1, 2, 5, 6, 9]. Next, we apply row/column-wise elasticity windows of size [768, 1020, 1536, 2304] to the remaining MLP blocks. Because of layer 0's importance, we keep it intact by shielding it from both layer-wise and row/column-wise elasticity operations.

IV-A Reordering techniques

TABLE I: Comparing computation cost for the smallest subnet to reach a target mIoU.
| Dataset (Task) | Target mIoU | Reordering technique | Iterations | Energy (kWh) | Time (hrs.) | Carbon footprint (kg CO2e) |
| --- | --- | --- | --- | --- | --- | --- |
| SA1B | 70% | No reordering | 840 | 5.35 | 7.68 | 2.34 |
| SA1B | 70% | Magnitude reordering | 1056 | 6.73 | 9.66 | 2.94 |
| SA1B | 70% | Wanda reordering | 672 | 4.28 | 6.14 | 1.87 |
| MS-COCO | 70% | No reordering | 4312 | 10.12 | 8.12 | 4.42 |
| MS-COCO | 70% | Magnitude reordering | 5432 | 12.75 | 12.23 | 5.57 |
| MS-COCO | 70% | Wanda reordering | 3192 | 7.49 | 6.01 | 3.27 |
| Mitochondria | 84% | No reordering | 1000 | 0.95 | 10.10 | 0.42 |
| Mitochondria | 84% | Magnitude reordering | 1050 | 0.99 | 10.60 | 0.43 |
| Mitochondria | 84% | Wanda reordering | 650 | 0.61 | 6.56 | 0.27 |
| ADE20K | 55% | No reordering | 808 | 0.32 | 1.07 | 0.14 |
| ADE20K | 55% | Magnitude reordering | 1212 | 0.47 | 1.61 | 0.21 |
| ADE20K | 55% | Wanda reordering | 606 | 0.23 | 0.81 | 0.10 |

Carbon footprint computed using a local emission factor of 0.437 MT CO2e/MWh.

We start by training supernets on various tasks: SA1B, MS-COCO, Mitochondria, and ADE20K. In these experiments, we maintain identical layer-wise elasticity settings and vary only the reordering technique used to train the supernets. After training, we sample 100 subnetworks spanning a wide range of sizes from the trained supernets. In particular, the sampled architectures range from 40M to 73M parameters, compared to the supernet's roughly 90M parameters. In Fig. 4, we compare the mIoU performance of the sampled architectures. For the SA1B and MS-COCO tasks, our proposed Wanda reordering strategy provides marginal performance gains over “no reordering” and magnitude reordering. Interestingly, “no reordering” performs slightly better on the ADE20K dataset, while it is outperformed on the domain-specific Mitochondria dataset.

A significant challenge in NAS is the large amount of computational resources required to train the subnetworks. While one-shot NAS approaches help reduce these costs, it remains expensive to train a supernet with a large search space. Therefore, it is imperative to consider not just whether a supernet can be trained but also the resources required to train it. In this regard, we compare the resources required to train the supernets to a given milestone, i.e., a target accuracy for the smallest subnet in the search space. Table I shows that our proposed approach offers the fastest convergence on all tasks, requiring fewer iterations to reach the target accuracy; consequently, it requires less time and consumes less energy.

IV-B Effect of structured pruning

Refer to caption
Figure 5: Comparing the Pareto frontier of different search space design strategies. Supernet A was generated via 1D elasticity (just windowing), while Supernet B uses the proposed 2D elasticity.

We conducted an experiment to demonstrate the effect of structured pruning on the design of the search space. In particular, we generated supernets on the SA1B and ADE20K tasks using two approaches. In the SA1B task, we designed the search space for the first supernet (A) using only windowing in all MLP blocks (except layer 0). For the second supernet (B), we applied the proposed method, i.e., structured pruning in conjunction with the same windowing strategy as in supernet A. As shown in Fig. 5, supernet B generates more robust subnetworks than supernet A, despite containing significantly smaller subnetworks that add to the training burden. Hence, even though the smallest subnet in B (approximately 33.7M parameters) slows its convergence, subnets in B exhibit superior performance to comparable subnets in A, whose smallest subnet has approximately 51.4M parameters.

In the ADE20K task, we set out to implement a fairer comparison by including smaller subnets in supernet A. We achieved this by adding an additional window $w_{i}$ for the MLP blocks [256, 768, 1020, 1536, 2304], as well as adding a new window for the KQV blocks [512, 768]. This effectively reduced the smallest subnet in A to approximately 33.8M parameters. In this scenario, supernet B has a significant performance edge over A, as shown in Fig. 5. This supports our intuition that structured pruning can enhance the design of the search space.

Refer to caption
Figure 6: Sample data points selected from the SA1B validation set segmented using the smallest subnetwork (33.7M parameters) trained via proposed method.
Refer to caption
Figure 7: Sample data points selected from the Mitochondria validation set segmented using the smallest subnetwork (33.7M parameters) trained via proposed method.

V Conclusion

In this paper, we propose a new approach to search space design for one-shot NAS in transformer-based vision models. We propose 2-dimensional elasticity, where the first dimension applies probabilistic layer-wise structured pruning and the second applies Wanda-based row/column-wise reordering and slicing of the MLP blocks in the remaining layers. We show that this hierarchical (2D) approach exploits the strengths of structured pruning to discover high-performing subnetworks and uses windowing (slicing) to expand the search space and discover additional robust subnetworks that lie in the intermediate space. We demonstrate the proposed technique on the Segment Anything Model (SAM), a recently developed foundation model for image segmentation. After training supernets on several tasks, we are able to discover and extract smaller networks with robust performance. In the SA1B case, for instance, our subnetworks outperform the pretrained input model, SAM ViT-B, despite the supernet being trained on just 0.1% of the dataset. Moreover, our proposed Wanda-based reordering and windowing technique shows superior convergence compared to the ‘no reordering’ and ‘magnitude reordering’ cases. This is particularly important, as one of the main challenges of NAS is the computation required to train the subnetworks in the search space.

Acknowledgments

This research is supported by the U.S. Department of Energy (DOE) through the Office of Advanced Scientific Computing Research’s “Orchestration for Distributed & Data-Intensive Scientific Exploration” and the “Cloud, HPC, and Edge for Science and Security” LDRD at Pacific Northwest National Laboratory. PNNL is operated by Battelle for the DOE under Contract DE-AC05-76RL01830. This research is also supported by the National Science Foundation (NSF) under Award Number 2211982. We thank Intel Labs and the Research IT team of Iowa State University for their support.

References

  • Ansel et al. [2014] Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O’Reilly, and Saman Amarasinghe. Opentuner: An extensible framework for program autotuning. In Proceedings of the 23rd international conference on Parallel architectures and compilation, pages 303–316, 2014.
  • Bender et al. [2018] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and simplifying one-shot architecture search. In International conference on machine learning, pages 550–559. PMLR, 2018.
  • Cai et al. [2019] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once-for-all: Train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791, 2019.
  • Cheng et al. [2024] Hongrong Cheng, Miao Zhang, and Javen Qinfeng Shi. A survey on deep neural network pruning: Taxonomy, comparison, analysis, and recommendations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  • Dosovitskiy [2020] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Elsken et al. [2019] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. Journal of Machine Learning Research, 20(55):1–21, 2019.
  • Fang et al. [2024] Gongfan Fang, Xinyin Ma, Michael Bi Mi, and Xinchao Wang. Isomorphic pruning for vision models. arXiv preprint arXiv:2407.04616, 2024.
  • Frantar and Alistarh [2023] Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning, pages 10323–10337. PMLR, 2023.
  • Fu et al. [2024] Jianhai Fu, Yuanjie Yu, Ningchuan Li, Yi Zhang, Qichao Chen, Jianping Xiong, Jun Yin, and Zhiyu Xiang. Lite-sam is actually what you need for segment everything. arXiv preprint arXiv:2407.08965, 2024.
  • Gong and Wang [2022] Chengyue Gong and Dilin Wang. Nasvit: Neural architecture search for efficient vision transformers with gradient conflict-aware supernet training. ICLR Proceedings 2022, 2022.
  • Graham et al. [2021] Benjamin Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, and Matthijs Douze. Levit: a vision transformer in convnet’s clothing for faster inference. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12259–12269, 2021.
  • Han et al. [2015] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. Advances in neural information processing systems, 28, 2015.
  • Kirillov et al. [2023a] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023a.
  • Kirillov et al. [2023b] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023b.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • Lucchi et al. [2013] Aurélien Lucchi, Yunpeng Li, and Pascal Fua. Learning for structured prediction using approximate subgradient descent with working sets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1987–1994, 2013.
  • Ma et al. [2023] Xinyin Ma, Gongfan Fang, and Xinchao Wang. Llm-pruner: On the structural pruning of large language models. Advances in neural information processing systems, 36:21702–21720, 2023.
  • Men et al. [2024] Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. Shortgpt: Layers in large language models are more redundant than you expect. arXiv preprint arXiv:2403.03853, 2024.
  • Muñoz et al. [2021] J. Pablo Muñoz, Nikolay Lyalyushkin, Yash Akhauri, Anastasia Senina, Alexander Kozlov, and Nilesh Jain. Enabling nas with automated super-network generation. arXiv preprint arXiv:2112.10878, 2021.
  • Muñoz et al. [2024] J. Pablo Muñoz, Yi Zheng, and Nilesh Jain. Eftnas: Searching for efficient language models in first-order weight-reordered super-networks. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 5596–5608, 2024.
  • Songa et al. [2024] Yanfei Songa, Bangzheng Pua, Peng Wanga, Hongxu Jiang, Dong Donga, and Yiqing Shen. Sam-lightening: A lightweight segment anything model with dilated flash attention to achieve 30 times acceleration. arXiv preprint arXiv:2403.09195, 2024.
  • Sudre et al. [2017] Carole H Sudre, Wenqi Li, Tom Vercauteren, Sebastien Ourselin, and M Jorge Cardoso. Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: Third International Workshop, DLMIA 2017, and 7th International Workshop, ML-CDS 2017, Held in Conjunction with MICCAI 2017, Québec City, QC, Canada, September 14, Proceedings 3, pages 240–248. Springer, 2017.
  • Sun et al. [2023a] Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023a.
  • Sun et al. [2023b] Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023b.
  • White et al. [2023] Colin White, Mahmoud Safari, Rhea Sukthanker, Binxin Ru, Thomas Elsken, Arber Zela, Debadeepta Dey, and Frank Hutter. Neural architecture search: Insights from 1000 papers. arXiv preprint arXiv:2301.08727, 2023.
  • Xiong et al. [2023] Yunyang Xiong, Bala Varadarajan, Lemeng Wu, Xiaoyu Xiang, Fanyi Xiao, Chenchen Zhu, Xiaoliang Dai, Dilin Wang, Fei Sun, Forrest Iandola, Raghuraman Krishnamoorthi, and Vikas Chandra. Efficientsam: Leveraged masked image pretraining for efficient segment anything. arXiv:2312.00863, 2023.
  • Yu and Huang [2019] Jiahui Yu and Thomas S Huang. Universally slimmable networks and improved training techniques. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1803–1811, 2019.
  • Yu et al. [2020] Jiahui Yu, Pengchong Jin, Hanxiao Liu, Gabriel Bender, Pieter-Jan Kindermans, Mingxing Tan, Thomas Huang, Xiaodan Song, Ruoming Pang, and Quoc Le. Bignas: Scaling up neural architecture search with big single-stage models. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16, pages 702–717. Springer, 2020.
  • Zhang et al. [2023] Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong Seon Hong. Faster segment anything: Towards lightweight sam for mobile applications. arXiv preprint arXiv:2306.14289, 2023.
  • Zhao et al. [2023] Xu Zhao, Wenchao Ding, Yongqi An, Yinglong Du, Tao Yu, Min Li, Ming Tang, and Jinqiao Wang. Fast segment anything. arXiv preprint arXiv:2306.12156, 2023.
  • Zhong et al. [2024] Longguang Zhong, Fanqi Wan, Ruijun Chen, Xiaojun Quan, and Liangzhi Li. Blockpruner: Fine-grained pruning for large language models. arXiv preprint arXiv:2406.10594, 2024.
  • Zhou et al. [2017] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641, 2017.