YOLOBench: Benchmarking Efficient Object Detectors on Embedded Systems

Ivan Lazarevich Matteo Grimaldi Ravish Kumar Saptarshi Mitra
Shahrukh Khan Sudhakar Sah
Deeplite
[email protected]

Abstract

We present YOLOBench, a benchmark comprised of 550+ YOLO-based object detection models on 4 different datasets and 4 different embedded hardware platforms (x86 CPU, ARM CPU, Nvidia GPU, NPU). We collect accuracy and latency numbers for a variety of YOLO-based one-stage detectors at different model scales by performing a fair, controlled comparison of these detectors with a fixed training environment (code and training hyperparameters). Pareto-optimality analysis of the collected data reveals that, if modern detection heads and training techniques are incorporated into the learning process, multiple architectures of the YOLO series achieve a good accuracy-latency trade-off, including older models like YOLOv3 and YOLOv4. We also evaluate training-free accuracy estimators used in neural architecture search on YOLOBench and demonstrate that, while most state-of-the-art zero-cost accuracy estimators are outperformed by a simple baseline like MAC count, some of them can be effectively used to predict Pareto-optimal detection models. We showcase that by using a zero-cost proxy to identify a YOLO architecture competitive against a state-of-the-art YOLOv8 model on a Raspberry Pi 4 CPU. The code and data are available at https://github.com/Deeplite/deeplite-torch-zoo.

1 Introduction

Object detection constitutes a pivotal task in the field of computer vision, entailing the critical process of identifying and localizing objects present within an image. Applications of object detection models include autonomous vehicles, surveillance, robotics, and augmented reality [7]. The central problem of deploying deep learning-based object detection solutions on embedded hardware platforms is the amount of computation, memory, and power required for their inference [18]. This necessitates the development of efficient object detection models specialized for low-footprint hardware devices to achieve an optimal trade-off of accuracy and latency.

Refer to caption — Figure 1: Pareto frontiers of YOLOBench models fine-tuned on the VOC dataset (on several target resolutions) from COCO-pretrained weights on 4 different hardware platforms. Each point represents a single model in the mAP-latency space, with the model family coded with color and marker shape (all YOLOv6-3.0 models are represented by the same color). Refer to Appendix B for Pareto frontier plots on other datasets.

For years, the state-of-the-art (SOTA) deep learning approach to object detection has been the series of YOLO architectures [31]. In recent years, remarkable strides have been taken in advancing YOLO-like single-stage object detectors, prioritizing real-time operation while simultaneously striving for higher accuracy and deployability on low-power devices. These advancements have primarily focused on enhancing various components of the detection pipeline. Key areas of improvement include the design of accurate and efficient backbone and neck structures within the network [34], exploration of different detection head designs (e.g. anchor-based [34] vs. anchor-free [11]), utilization of diverse loss functions [22], and implementation of novel training procedures including innovative data augmentation techniques [16]. These collective efforts have continually refined and evolved YOLO-like architectures, enhancing object detection effectiveness and efficiency in real-time scenarios. The differences between consecutive YOLO versions, such as YOLOv5 [14] and YOLOv6 [20], span various pipeline components, making it challenging to isolate their individual contributions. This paper aims to address these challenges by providing a fair comparison of recent YOLO versions under controlled conditions (e.g. same training loop for all models) to demonstrate the impact of the backbone and neck structure of YOLO-based models in embedded inference applications. We also use the collected accuracy and latency data for multiple YOLO-based detector variations to empirically evaluate training-free performance predictors commonly used in neural architecture search [3]. We summarize our contributions as follows:

•

We provide a latency-accuracy benchmark of $550$ + YOLO-based object detection models on $4$ different datasets, called YOLOBench. All the models are validated on $4$ different embedded hardware platforms (Intel CPU, ARM CPU, Nvidia GPU, NPU),
•

We show that if modern detection heads and training techniques are implemented for the detector training pipeline, multiple backbone and neck variations, including those of older architectures (e.g. YOLOv3 and YOLOv4), can be used to achieve state-of-the-art latency-accuracy trade-off,
•

Looking at YOLOBench as a neural architecture search (NAS) space, we demonstrate that, while most of the state-of-the-art zero-cost (training-free) proxies for model accuracy estimation are outperformed by simple baselines such as MAC count, the NWOT estimator [27] can be effectively used to identify potential Pareto-optimal YOLO detectors in a training-free manner,
•

We showcase the effectiveness of the NWOT estimator for optimal detector prediction by using it to identify a YOLO-like model with FBNetV3 backbone that outperforms YOLOv8 on the Raspberry Pi 4 ARM CPU.

2 Related Work

Table 1: Pareto-optimal YOLOBench models on 3 datasets and 3 hardware platforms. Shown are the best models in terms of mAP_50-95 under a given latency threshold (max. latency). For each model, the scaling parameters are given (d33w25 means depth factor

=0.33

and width factor

=0.25

), corresponding input resolution of the models is indicated in brackets.

HW/max.	VOC	VOC	SKU-110k	SKU-110k	WIDERFACE	WIDERFACE
latency	model	mAP_50-95	model	mAP_50-95	model	mAP_50-95
Nano/0.1 sec	YOLOv7	0.657	YOLOv8	0.567	YOLOv7	0.336
	d1w5 (288)		d1w25 (480)		d1w25 (480)
VIM3/0.05 sec	YOLOv6l	0.620	YOLOv6s	0.556	YOLOv6m	0.318
	d67w25 (416)		d33w25 (480)		d67w25 (480)
Raspi4/0.5 sec	YOLOv6l	0.669	YOLOv4	0.569	YOLOv7	0.336
	d67w5 (384)		d1w25 (480)		d1w25 (480)

There has been a tremendous amount of progress in efficient object detection in recent years pushing the accuracy-latency frontier, including architectures like YOLOv7 [34], YOLOv6-3.0 [20], DAMO-YOLO [37], RTMDet [26], RT-DETR [25] and PP-YOLOE [36]. These works oftentimes improve upon state-of-the-art latency-accuracy trade-offs, providing comparisons of several generations of detectors on the COCO dataset. Benchmarks of different model families are also provided by framework developers, such as MMYOLO [5] and Ultralytics [15]. Additionally, there exist third-party benchmarks of several architectures from the YOLO series on server-grade and embedded GPUs as well as specialized accelerators [17, 28, 10, 39]. We identify a few limitations of the existing efficient detector benchmarks that have served as motivation for YOLOBench:

•

Comparisons of different YOLO versions are frequently done either by using a proxy metric for the actual latency like MAC count and number of parameters or by reporting latency values on server-grade GPUs, neither of which is directly indicative of latency on embedded devices,
•

Accuracy metrics are usually reported on the COCO dataset, which could be considered too large-scale with respect to actual practical use cases,
•

Some architecture parameters (like input resolution) are often considered to be fixed in detector benchmarking, while it is known that they serve as important factors in optimal CNN scaling [13],
•

Different YOLO variations being compared to one another are typically trained with different training codebases, training techniques (loss functions, data augmentations), and hyperparameter values, making it hard to disentangle the contribution of the training pipeline improvements vs. better architecture design.

To address these issues, we conduct a thorough accuracy and latency benchmarking of state-of-the-art YOLO detector versions in controlled, fixed conditions to study the impact of backbone and neck design proposed by several YOLO model families.

3 Methodology

Table 2: YOLOBench architecture space (variation of backbone/neck, depth, width, and input resolution).

Model	Backbone	Neck
YOLOv3 [29]	DarkNet53	FPN
YOLOv4 [4]	CSPDNet53	SPP-PAN
YOLOv5 [14]	CSPDNet53-C3	SPPF-PAN-C3
YOLOv6s-3 [20]	EfficientRep	RepBiFPAN
YOLOv6m-3 [20]	CSPBep (e=2/3)	CSPRepBiFPAN
YOLOv6l-3 [20]	CSPBep (e=1/2)	CSPRepBiFPAN
YOLOv7 [34]	E-ELAN	SPPF-ELAN-PAN
YOLOv8 [15]	CSPDNet53-C2f	SPPF-PAN-C2f

Width factor $\in$ {0.25, 0.5, 0.75, 1.0}
Depth factor $\in$ {0.33, 0.67, 1.0}
Input resolution $\in$ {160:480:32}

The purpose of the current study is to thoroughly study the impact of the backbone and neck and its parameters (width, depth, input resolution) on the performance of YOLO detectors in terms of their accuracy and latency. For the rest of the factors influencing the accuracy-latency trade-off, such as choice of the detection head, loss function, training pipeline, and hyperparameters, we aim to have a fixed, controlled setup, so that we can isolate the effect of backbone and neck design on model performance. For this reason, we use the anchor-free decoupled detection head of YOLOv8 [31], as well as CIoU and DFL losses for bounding box prediction used in YOLOv8, as they have been shown to produce state-of-the-art results on the COCO dataset. Anchor-free detection in YOLO models has been also shown to provide latency benefits in the end-to-end detection pipelines [25]. Hence, the main source of variation in YOLOBench models is the structure and parameters of the backbone and neck. We also use the same training code and hyperparameters for all models, as set by default in the YOLOv8 training code released by Ultralytics [15], which provides a relatively simple training loop capable of producing SOTA results.

The flow of candidate model generation, pre-selection, and training is shown in Figure 2. First, we generate the full architecture space consisting of about $1000$ models by independently varying the backbone/neck structure, depth factor, width factor, and input resolution (Table 2). For each architecture, we consider its variations trained and tested on $11$ different input resolutions (from $160$ x $160$ to $480$ x $480$ with a step of $32$ ) and $12$ variations in depth and width, aside from $4$ usually considered scaling variants ( $n$ , $s$ , $m$ and $l$ ). The only exception is the YOLOv7 models, for which we only vary the width factor producing $4$ variations of the model. For YOLOv6 models, we use the v3.0 version [20], for which provided $s$ , $m$ and $l$ variations actually represent different architectures aside from different depth and width factors (see Table 2). Hence, we consider YOLOv6s, YOLOv6m and YOLOv6l as different model families and generate the same 12 depth-width combinations for each one.

Latency measurements. The actual inference latency for each model might vary significantly depending on the deployment environment and runtime. Therefore, we collect the latency measurements for each of the models by running inference on $4$ different hardware platforms (runtime and inference precisions specified in brackets):

•

NVIDIA Jetson Nano GPU (ONNX Runtime, FP32)
•

Khadas VIM3 NPU (AML NPU SDK, INT16)
•

Raspberry Pi 4 Model B CPU (TFLite with XNNPACK, FP32)
•

Intel^® Core™i7-10875H CPU (OpenVINO, FP32)

We did not consider latency measurements for INT8 precision, as depending on the quantization scheme (e.g. per-tensor vs. per-channel) and approach (e.g. post-training quantization vs. quantization-aware training), there can be a varied impact of INT8 quantization on accuracy. Adding INT8 results for both accuracy and latency in YOLOBench is a matter of future work. All latency measurements were performed with a batch size of $1$ averaged over $200$ inference cycles (with $5$ warmup steps). We measured the inference time required to execute the YOLO model graph, without taking bounding box post-processing (e.g. non-maximum suppression) into account. Note that for VIM3 NPU measurements, the bounding box decoding post-processing operations (operations after the last convolutional layers of the network) were also skipped due to the limitations of VIM3 SDK.

Training pipeline. To obtain the accuracy metric values for the models, we consider the following $4$ datasets: (i) PASCAL VOC (20 object categories) [9], (ii) SKU-110k (1 class, retail item detection) [12], (iii) WIDER FACE (1 class, face detection) [38], (iv) COCO (80 object categories) [24]. Our motivation to include several smaller-scale (with respect to COCO) but challenging datasets stems from the fact that for many practical deployment use cases, the number of object categories to detect and the amount of available data might be limited. The metric of interest for all datasets is mAP_50-95. For all selected models, the training procedure starts with pretraining on the COCO dataset (for $300$ epochs, with a batch size of $64$ and $640$ x $640$ resolution), afterward the best COCO weights are used as initialization for other datasets, on which we perform fine-tuning for $100$ epochs (batch size of $64$ ) on all $11$ YOLOBench resolutions and select the best weights (in terms of mAP_50-95 value) for each one. For the COCO dataset, we do not perform fine-tuning on target resolutions, rather we evaluate the model trained on $640$ x $640$ images on all target resolutions (to mimic the deployment of pre-trained COCO weights). All other training hyperparameters are set as per default values of the Ultralytics YOLOv8 codebase [15]. Model weights are randomly initialized for all experiments (e.g. no transfer of ImageNet weights for the backbone is performed).

Candidate model pre-selection. In order to reduce the number of training runs on the COCO dataset, we filter out some of the least promising model candidates from the YOLOBench architecture space as an initial step of our benchmarking procedure. To determine the most promising models in terms of the accuracy-latency trade-off, we compute a proxy metric that is well correlated with the final mAP values of the models fine-tuned on the target datasets. A natural choice for such a metric is model performance when trained on a smaller-scale representative dataset. For this purpose, we use the VOC dataset to train all the model candidates from scratch (random initialization) for $100$ epochs and use the resulting mAP_50-95 value as a proxy metric to predict performance on all target datasets (with models fine-tuned on these datasets from COCO pre-trained weights). We observe a good correlation of such a training-based accuracy proxy with final metrics on all considered datasets (even on datasets from other domains, like SKU-110k; see Appendix C). We also examine the performance of training-free accuracy estimators for this task and compare it to mAP of VOC training from scratch (see Section 4.2).

Once we have the accuracy proxy values and latency measurements for all models in the dataset, we determine the models with the best accuracy-latency trade-off (the Pareto frontier models). We use the OApackage software library [8] to determine the Pareto optimal elements in the latency-accuracy space. We define the second Pareto set as the set of models that are Pareto-optimal if the initial Pareto set models are removed (so that the “second best” models in terms of latency-accuracy trade-off become the best). Correspondingly, we define the $N$ -th Pareto set.

For our model pre-selection procedure, we consider the models contained in the first and second Pareto fronts (in terms of mAP_50-95 in VOC training from scratch), with latency for each considered hardware platform separately. We merge all the first and second Pareto sets for each HW platform to form the list of promising architectures to be selected for COCO pre-training. After the COCO pre-training phase is finished for a model, variations of that architecture on multiple resolutions are considered in the benchmark.

Table 3: Performance of training-free accuracy predictors on YOLOBench models and two datasets (VOC and SKU-110k, from COCO-pretrained weights) compared to using metrics of models trained from scratch on the VOC dataset as a predictor. Refer to Appendix C for the data on all considered zero-cost metrics.

	VOC, mAP_50-95			SKU-110k, mAP_50-95
Predictor metric	global $\tau$	top-15% $\tau$	%Pareto pred. (GPU)	global $\tau$	top-15% $\tau$	%Pareto pred. (GPU)
JacobCov	0.095	-0.078	0.015	0.541	0.136	0.025
ZiCo	0.195	0.016	0.015	0.115	0.081	0.025
Zen	0.255	0.092	0.062	0.146	0.121	0.050
Fisher	0.280	0.156	0.015	-0.380	-0.096	0.025
SNIP	0.336	0.217	0.015	-0.290	-0.059	0.025
#params	0.399	0.372	0.031	0.256	0.119	0.050
SynFlow	0.558	0.227	0.062	0.512	0.254	0.100
MACs	0.739	0.520	0.123	0.604	0.314	0.125
NWOT	0.756	0.622	0.262	0.703	0.321	0.200
NWOT (pre-act)	0.827	0.623	0.292	0.765	0.406	0.200
VOC training	0.847	0.665	0.369	0.739	0.374	0.425
from scratch (mAP_50-95)

4 Results

4.1 Pareto-optimal YOLO models

By computing the proxy metric for model accuracy (mAP_50-95 in VOC training from scratch) and latency values for the whole YOLOBench architecture space on several hardware platforms, we determine the Pareto sets containing the most promising models (in terms of latency-accuracy trade-off) for each HW platform. The first and second Pareto sets for each device are merged into a unified list of best architectures, which is comprised of 52 backbone/neck combinations for COCO pre-training. Same architectures with different input resolutions are considered as the same data points in this list since COCO pre-training is regardless done on a fixed resolution of 640x640. The COCO pre-training phase is followed by fine-tuning at 11 different resolutions (from 160x160 to 480x480 with a step of 32) on all downstream datasets (except for COCO), resulting in 572 models total for each dataset.

Finally, with the obtained fine-tuned model accuracy on several datasets and latency measurements on several devices, we compute the actual Pareto sets for each particular dataset/HW platform combination. Figure 1 shows the Pareto frontiers of YOLOBench models fine-tuned on the VOC dataset on 4 different devices. Notably, significant differences emerge in these Pareto frontiers between different devices. In particular, the Pareto-optimal set for VIM3 NPU is mostly comprised of YOLOv6 models, with some YOLOv5, YOLOv7, and YOLOv8 models present in the higher accuracy region. This is not found to be the case for the Pareto sets of Intel and ARM CPUs. Despite containing a few YOLOv6 models in the lower latency region, these sets also encompass numerous YOLOv5 and YOLOv7 variations, with limited representation from other model families such as YOLOv3 and YOLOv4. While the Pareto sets for Intel and ARM CPUs exhibit a certain degree of similarity, the Jetson Nano GPU stands out from the rest of the devices. It showcases a non-uniform distribution of model families, with YOLOv5, YOLOv6, YOLOv7, and YOLOv8 models all represented across the entire accuracy/latency space. Table 1 shows representative Pareto-optimal models for 3 different datasets (VOC, SKU-110k, WIDERFACE) and 3 hardware platforms under certain latency thresholds. Note that although there are similarities of model family distributions in Pareto sets computed for different datasets (see Appendix B), the exact optimal model for a given latency threshold depends on the specific dataset of interest.

Next, we analyze the statistics of Pareto-optimal models depending on the dataset and hardware platform. Figure 3 shows the distribution of depth factor, width factor, and input resolution values in Pareto frontier models for VOC and SKU-110k datasets on Jetson Nano GPU (data for other datasets and devices are available in Appendix B). The general trend indicates that models at lower input resolutions mostly have lower depth and width factors. This suggests that achieving an optimal latency-accuracy trade-off involves scaling down both the architecture’s depth and width before reducing the input resolution. This effect is more pronounced in some datasets (SKU-110k and WIDERFACE), where almost all optimal models are either at the maximal resolution we considered (480x480) with variation in width and depth, or at lower resolutions with minimal width and depth factors. This effect is dataset-dependent, as a more relaxed trend is observed for VOC and COCO datasets, where many optimal models with a variation in width and depth factor are found at resolutions lower than 480x480.

To summarize, we demonstrate that with a state-of-the-art training pipeline and detection head structure, YOLO-based models with various backbone/neck combinations could achieve good latency-accuracy trade-offs in various deployment scenarios, including older backbone/neck structures from YOLOv4 and YOLOv3 models. Furthermore, we show that depth/width reduction precedes input resolution down-scaling in optimal YOLO-based detectors.

4.2 Ranking training-free accuracy predictors

With an increasing number of architecture blocks and hyperparameter combinations, the size of the candidate model space in YOLOBench can further grow exponentially. Hence, it is important to develop efficient methods of filtering out bad architecture proposals before running them through the full training pipeline, including pre-training on the COCO dataset. In the field of neural architecture search, recent works have proposed a handful of training-free, zero-cost (ZC) estimators that have been shown to perform well on various (relatively simple) benchmarks [27, 3, 21].

Zero-cost estimators were originally proposed by Mellor et al. [27], and later expanded by Abdelfattah et al. [3] as a means to quickly evaluate the performance of an architecture using only a mini-batch of data. These estimators work by extracting statistics obtained from a forward (and/or backward) pass of a few mini-batches of data through the network, hence eliminating the need for full training of the model. Despite the fact that over 20 different zero-cost accuracy estimators have been introduced in recent years, simple baselines like the number of parameters and MAC count are still found to be hard to outperform [21].

The vast search space of YOLO-like architectures necessitates the development of effective training-free estimators to filter out bad candidates and reduce the search space. We examine the performance of a representative subset of zero-cost estimators on YOLOBench, namely: Fisher [32], GradNorm [3], GraSP [33], JacobCov [3], Plain [3], SNIP [19], SynFlow [30], ZiCo [21], Zen-score [23] and NWOT [27]. The NWOT metric is computed by measuring the Hamming distance between binary codes produced by each layer’s activations [27]. Although originally proposed for ReLU-based networks, we observe that it works well in practice for YOLO variations, most of which contain SiLU activations. The NWOT metric can also be computed by taking the signs of each layer’s output features before the activation layer to form the binary code. We refer to that version of the NWOT metric as NWOT (pre-act) (”pre-activation”) and find that its performance might differ significantly from the original NWOT metric, primarily because the binary codes are computed before the normalization layers followed by the activations. We also compare the performance of the zero-cost predictors with simple baselines such as the number of trainable parameters and MAC count, as well as with a training-based proxy that we have used to pre-select models for YOLOBench (mAP_50-95 in training from scratch on the VOC dataset).

All zero-cost metrics are computed on randomly initialized models with the same loss function as used for training of all YOLOBench models and using a single mini-batch of data with a corresponding image resolution (except for ZiCo, which requires two different mini-batches of data [21]). We empirically evaluate the considered set of zero-cost proxies on YOLOBench using the following metrics:

•

Kendall $\tau$ (global): Kendall rank correlation coefficient evaluated on all YOLOBench models
•

Kendall $\tau$ (top-15%): Kendall rank correlation coefficient evaluated on the top-15% performing YOLOBench models (in terms of mAP_50-95 value)
•

Percentage of all actual Pareto-optimal models in the Pareto set determined with the zero-cost estimator in the zero-cost proxy-latency space (recall for Pareto-optimal model prediction using the ZC-based Pareto set)

The last metric effectively measures how accurate the computed Pareto set would be if the proxy values are used instead of actual mAP to rank models. It is calculated by determining Pareto fronts for model rankings based on zero-cost proxies (and real latency measurements) and then estimating how many models present in the actual Pareto set are also present in the ZC-based Pareto set. In other words, a recall value of 0.7 would mean that by taking the models from the ZC-based Pareto set as candidates, we find 70% of all actual Pareto-optimal models in that candidate set. We report values for Pareto fronts computed with latency measurements on the Jetson Nano GPU in Table 3.

We generally find that all of the zero-cost predictors we consider (except for NWOT) are outperformed by the simple baseline of MAC count both in terms of Kendall-Tau scores as well as in the percentage of predicted Pareto-optimal models (see Table 3). Furthermore, when compared with using mAP_50-95 on VOC training from scratch as a predictor, we observe that only NWOT comes close to it in terms of ranking scores. We also find that the pre-activation version of NWOT tends to work better than standard NWOT on YOLOBench. For the task of predicting mAP_50-95 of models fine-tuned on SKU-110k, we notably observe that pre-activation NWOT outperforms VOC training from scratch metric in terms of Kendall-Tau scores (possibly due to domain difference between VOC and SKU-110k datasets), but the VOC-based proxy metric still performs better for Pareto-optimal model prediction on SKU-110k. For the data on the sensitivity of NWOT predictions to hyperparameter values please refer to Appendix C.

Table 4: COCO test and minival mAP and inference latency on Raspberry Pi 4 CPU (TFLite, FP32) for YOLOv8s vs. a model identified from the NWOT-latency Pareto frontier. For latency, mean and standard deviation over 5 runs (each run done for 100 iterations) are shown, with 640x640 input resolution. For mAP, the mean and standard deviation over three random seeds are shown.

Model	mAP ${}^{test}_{50-95}$	mAP ${}^{val}_{50-95}$	Latency, ms
YOLOv8s	43.17% (0.12%)	44.43% (0.23%)	1476.09 (1.49)
YOLOv8s (HSwish)	42.90% (0.00%)	44.23% (0.10%)	1381.62 (7.34)
YOLO-FBNetV3-D-PAN-C3	43.87% (0.05%)	45.30% (0.08%)	1355.21 (9.93)

In trying to capture all the real Pareto-optimal models using ZC scores, one could expand the ZC-based candidate pool by calculating subsequent Pareto sets (second, third, fourth, and so forth) and incorporating them into the candidate pool. By applying this strategy, it’s possible to identify the complete set of actual Pareto-optimal models while examining only a subset of the entire dataset (e.g., the first $N$ ZC-based Pareto fronts). In this context, we compute candidate pools consisting of $N$ ZC Pareto fronts for each ZC metric and look at the percentage of actual Pareto-optimal models found in the pool versus the pool size (as % of the full dataset size). Looking at the pool size is motivated by the observation that the number of models in ZC-based Pareto fronts can significantly vary depending on the specific ZC metric used.

Figure 4 shows the percentage of predicted real Pareto-optimal models on the VOC dataset contained in pools of $N$ first Pareto fronts for 4 different predictors (VOC training from scratch, NWOT, pre-activation NWOT, and MAC count). For ARM and Intel CPUs, we observe a general trend of VOC training from scratch being the best predictor and MAC count being the worst at all points. Interestingly, for Jetson Nano GPU NWOT performs close to VOC training from scratch for $N=1,2$ but starts to perform worse with more models in the pool. Surprisingly, MAC count and pre-activation NWOT, which are training-free predictors, outperform VOC training from scratch in predicting Pareto-optimal models on VIM3 NPU.

4.3 Pareto-optimal detector identification using NWOT score

To demonstrate the potential of using ZC-based Pareto sets in identifying promising detector architectures with good accuracy-latency trade-off, we additionally generate multiple candidate architectures based on CNN backbones provided by the timm library [35]. The architectures are generated by using one of the 347 CNN-based backbones available in timm as a feature extractor followed by a modified Path Aggregation Network (PAN) (same structure with C3 blocks as in YOLOv5 is used, with the number of channels corresponding to YOLOv5s, without the SPPF block) and a YOLOv8 detection head, as in all other YOLOBench models.

We compute the pre-activation NWOT scores as well as measure inference latency on Raspberry Pi 4 ARM CPU with TFLite for all candidate models. We then use the NWOT score and latency values for each model to compute the Pareto frontier in the NWOT-latency space (see Appendix D). We then train one of the models identified to belong to the NWOT-based Pareto frontier (YOLO with FBNetV3-D backbone) on the COCO dataset with a similar setup used to pre-train YOLOBench models (640x640 input resolution, 500 epochs, batch size 256, other hyperparameters set to default of Ultralytics YOLOv8 [15])¹¹1Note that YOLOv8s results provided by Ultralytics [15] are slightly higher than the ones we report. However, no script to reproduce those results has been released to date.. The resulting model is found to be more accurate and faster than YOLOv8s (a model in a similar latency range) when tested on Raspberry Pi 4 CPU with TFLite (FP32, XNNPACK backend) (see Table 4). Furthermore, we look at the accuracy and latency of a YOLOv8s modification with SiLU activations replaced with Hardswish activations (Table 4), as we observe the choice of activation function to be a significant factor affecting TFLite inference latency. We find that the identified NWOT-Pareto model (also containing Hardswish activations in the backbone, neck, and head) still outperforms YOLOv8s-HSwish in terms of latency and accuracy.

5 Conclusion

In this work, we present YOLOBench, a latency-accuracy benchmark of several hundred YOLO-based models on 4 different object detection datasets and 4 different hardware platforms. The accuracy and latency data are collected in a fixed, controlled environment with the only variation in backbone/neck structure and input image resolution of the detectors. We use these data to demonstrate that it is possible to achieve Pareto-optimal results with a range of different backbone structures, including those of the older architectures in the YOLO series, such as YOLOv3 and YOLOv4. We also observe that depth and width scaling precede input resolution scaling in optimal YOLO-based detectors.

Finally, we use YOLOBench to evaluate zero-cost accuracy predictors, and find that, while many of the existing state-of-the-art predictors perform poorly, pre-activation NWOT score can be effectively used to identify Pareto-optimal detectors for specific target devices of interest. We demonstrate that by using NWOT to find a YOLO backbone (FBNetV3-D) that outperforms a state-of-the-art YOLOv8 model when deployed on a Raspberry Pi 4 ARM CPU.

References

[1] Coco detection challenge. https://codalab.lisn.upsaclay.fr/competitions/7384.
[2] Pycocotools pypi package. https://pypi.org/project/pycocotools/.
[3] Mohamed S Abdelfattah, Abhinav Mehrotra, Łukasz Dudziak, and Nicholas D Lane. Zero-cost proxies for lightweight nas. arXiv preprint arXiv:2101.08134, 2021.
[4] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.
[5] MMYOLO Contributors. MMYOLO: OpenMMLab YOLO series toolbox and benchmark. https://github.com/open-mmlab/mmyolo, 2022.
[6] Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, and Jian Sun. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13733–13742, 2021.
[7] Tausif Diwan, G Anirudh, and Jitendra V Tembhurne. Object detection using yolo: Challenges, architectural successors, datasets and applications. multimedia Tools and Applications, 82(6):9243–9275, 2023.
[8] Pieter Thijs Eendebak and Alan Roberto Vazquez. Oapackage: A python package for generation and analysis of orthogonal arrays, optimal designs and conference designs. Journal of Open Source Software, 4(34):1097, 2019.
[9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results, 2012.
[10] Haogang Feng, Gaoze Mu, Shida Zhong, Peichang Zhang, and Tao Yuan. Benchmark analysis of yolo performance on edge intelligence devices. Cryptography, 6(2):16, 2022.
[11] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021.
[12] Eran Goldman, Roei Herzig, Aviv Eisenschtat, Jacob Goldberger, and Tal Hassner. Precise detection in densely packed scenes. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5227–5236, 2019.
[13] Kai Han, Yunhe Wang, Qiulin Zhang, Wei Zhang, Chunjing Xu, and Tong Zhang. Model rubik’s cube: Twisting resolution, depth and width for tinynets. Advances in Neural Information Processing Systems, 33:19353–19364, 2020.
[14] Glenn Jocher. YOLOv5 by Ultralytics, May 2020.
[15] Glenn Jocher, Ayush Chaurasia, and Jing Qiu. YOLO by Ultralytics, Jan. 2023.
[16] Parvinder Kaur, Baljit Singh Khehra, and Er Bhupinder Singh Mavi. Data augmentation for object detection: A review. In 2021 IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), pages 537–543. IEEE, 2021.
[17] Stereo Labs. Performance benchmark of yolo v5, v7 and v8. https://www.stereolabs.com/blog/performance-of-yolo-v5-v7-and-v8/, 2023.
[18] Nicholas D Lane, Sourav Bhattacharya, Akhil Mathur, Petko Georgiev, Claudio Forlivesi, and Fahim Kawsar. Squeezing deep learning into mobile and embedded devices. IEEE Pervasive Computing, 16(3):82–88, 2017.
[19] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. Snip: Single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340, 2018.
[20] Chuyi Li, Lulu Li, Yifei Geng, Hongliang Jiang, Meng Cheng, Bo Zhang, Zaidan Ke, Xiaoming Xu, and Xiangxiang Chu. Yolov6 v3. 0: A full-scale reloading. arXiv preprint arXiv:2301.05586, 2023.
[21] Guihong Li, Yuedong Yang, Kartikeya Bhardwaj, and Radu Marculescu. Zico: Zero-shot nas via inverse coefficient of variation on gradients. arXiv preprint arXiv:2301.11300, 2023.
[22] Xiang Li, Wenhai Wang, Lijun Wu, Shuo Chen, Xiaolin Hu, Jun Li, Jinhui Tang, and Jian Yang. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. Advances in Neural Information Processing Systems, 33:21002–21012, 2020.
[23] Ming Lin, Pichao Wang, Zhenhong Sun, Hesen Chen, Xiuyu Sun, Qi Qian, Hao Li, and Rong Jin. Zen-nas: A zero-shot nas for high-performance image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 347–356, 2021.
[24] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2014. cite arxiv:1405.0312Comment: 1) updated annotation pipeline description and figures; 2) added new section describing datasets splits; 3) updated author list.
[25] Wenyu Lv, Shangliang Xu, Yian Zhao, Guanzhong Wang, Jinman Wei, Cheng Cui, Yuning Du, Qingqing Dang, and Yi Liu. Detrs beat yolos on real-time object detection. arXiv preprint arXiv:2304.08069, 2023.
[26] Chengqi Lyu, Wenwei Zhang, Haian Huang, Yue Zhou, Yudong Wang, Yanyi Liu, Shilong Zhang, and Kai Chen. Rtmdet: An empirical study of designing real-time object detectors. arXiv preprint arXiv:2212.07784, 2022.
[27] Joe Mellor, Jack Turner, Amos Storkey, and Elliot J Crowley. Neural architecture search without training. In International Conference on Machine Learning, pages 7588–7598. PMLR, 2021.
[28] Sovit Rath and Vikas Gupta. Performance comparison of yolo object detection models – an intensive study. https://learnopencv.com/performance-comparison-of-yolo-models/, 2022.
[29] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[30] Hidenori Tanaka, Daniel Kunin, Daniel L Yamins, and Surya Ganguli. Pruning neural networks without any data by iteratively conserving synaptic flow. Advances in neural information processing systems, 33:6377–6389, 2020.
[31] Juan Terven and D Cordova-Esparza. A comprehensive review of yolo: From yolov1 and beyond. arXiv preprint arXiv:2304.00501, 2023.
[32] Jack Turner, Elliot J Crowley, Michael O’Boyle, Amos Storkey, and Gavin Gray. Blockswap: Fisher-guided block substitution for network compression on a budget. arXiv preprint arXiv:1906.04113, 2019.
[33] Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. arXiv preprint arXiv:2002.07376, 2020.
[34] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7464–7475, 2023.
[35] Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
[36] Shangliang Xu, Xinxin Wang, Wenyu Lv, Qinyao Chang, Cheng Cui, Kaipeng Deng, Guanzhong Wang, Qingqing Dang, Shengyu Wei, Yuning Du, et al. Pp-yoloe: An evolved version of yolo. arXiv preprint arXiv:2203.16250, 2022.
[37] Xianzhe Xu, Yiqi Jiang, Weihua Chen, Yilun Huang, Yuan Zhang, and Xiuyu Sun. Damo-yolo: A report on real-time object detection design. arXiv preprint arXiv:2211.15444, 2022.
[38] Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang. Wider face: A face detection benchmark. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5525–5533, 2016.
[39] Jiawei Zhu, Haogang Feng, Shida Zhong, and Tao Yuan. Performance analysis of real-time object detection on jetson device. In 2022 IEEE/ACIS 22nd International Conference on Computer and Information Science (ICIS), pages 156–161. IEEE, 2022.

Appendix A Latency measurements.

Details regarding hardware platforms used to collect latency measurements are outlined in Table S5. Figures S5 and S6 show the difference of latency value distributions between devices computed for the full initial YOLOBench architecture space consisting of $\sim$ 1000 models. While generally good correlation is observed between model inference latencies on different devices (see also Figure S7), notably latency values measured on Khadas VIM3 NPU differ significantly from latency values on other devices. That is, for models with roughly the same latency on Jetson Nano GPU or Raspi4 ARM CPU, the difference in VIM3 NPU latency could be up to several times. This difference between NPU values from other common GPU/CPU-based platforms highlights the necessity to develop hardware-aware architecture design and search methods. The difference in the NPU benchmark is also reflected in the structure of model Pareto frontiers (Figs. 1, S10, S8) and the performance of zero-cost predictors in identifying Pareto-optimal models (Figs. 4, S22).

Appendix B YOLOBench Pareto frontiers for different datasets.

YOLOBench Pareto frontiers for SKU-110k, WIDER FACE, and COCO datasets are shown in Figs. S8, S9, S10, correspondingly. Note that while mAP_50-95 values for VOC, SKU-110k, and WIDER FACE datasets are obtained by fine-tuning COCO pre-trained weights (all trained at 640x640 image resolution) on multiple image resolutions considered in YOLOBench (11 values from 160 to 480 with a step of 32), the mAP_50-95 values on the COCO dataset are obtained by directly evaluating pre-trained COCO weights, without fine-tuning on the corresponding target image resolutions. This corresponds to the situation of deployment of pre-trained COCO weights without any additional training.

Table S6 shows the identified Pareto-optimal YOLO models on 3 different datasets and 4 hardware platforms under several latency thresholds. It can be noted that under the same latency threshold on a given hardware platform, the optimal YOLO model family and input image resolution are typically dataset-dependent.

Figures S11 and S12 show the statistics of architecture scaling parameters (width factor, depth factor, image resolution) in Pareto-optimal models on Raspberry Pi4 CPU and VIM3 NPU, respectively. Although some differences are observed between devices and datasets (in particular depth factor distributions), there is a general trend in all computed Pareto fronts where a variation in depth/width factors is observed at higher resolutions, and resolution is reduced when the depth/width factors (especially the width factor) already have low values.

Table S5: Details on hardware platforms and corresponding runtimes used for benchmarking.

	Raspberry Pi 4 Model B	Jetson Nano (NVIDIA)	Khadas VIM3	Lambda tensorbook
CPU	Quad Core Cortex-A72, 64-bit SoC @1.8GHz	Quad Core Cortex-A57 MPCore, 64-bit SoC @1.43GHz	Quad Core Cortex-A73 @2.2Ghz, Dual Core Cortex-A53 @1.8Ghz	Intel^® Core™i7-10875H CPU @ 2.30GHz
Memory	4GB LPDDR4-3200 SDRAM	4 GB 64-bit LPDDR4, 1600MHz 25.6 GB/s	4GB LPDDR4/4X	64GB DDR4 SDRAM
AI-chip	-	NVIDIA Maxwell GPU, 128 NVIDIA CUDA^® cores	Custom NPU INT8 inference up to 1536 MAC	NVIDIA RTX 2080 Super Max-Q
Ops	-	472 GFLOPs	5.0 TOPS	-
Framework/runtime	TensorFlow Lite (FP32, XNNPACK backend)	ONNX Runtime (FP32, GPU)	AML NPU SDK (INT16)	OpenVINO (CPU, FP32)

Appendix C Performance of zero-cost accuracy predictors on YOLOBench.

The performance of zero-cost accuracy predictors used in neural architecture search [3] is empirically evaluated on YOLOBench models on VOC and SKU-110k. Table S7 shows the Kendall-Tau scores and Pareto-optimal model prediction recall values obtained by a variety of zero-cost predictors. The zero-cost predictor values are computed using a randomly sampled batch of test set data with batch size $=16$ (the used batch was the same for all ZC metrics). MAC count and the number of parameters are computed for models in evaluation mode, with normalization layers fused into preceding convolutions (if possible), and RepVGG-style blocks [6] also fused, if present in the model. Hence, the performance of MAC and parameter counts might slightly differ if computed for models in training mode. Most predictors perform poorly and are outperformed by the MAC count baseline, except for the NWOT score (in particular the pre-activation version of it). The good performance of NWOT can be also observed in Fig. S14, where scatter plots of fine-tuned model mAP_50-95 vs. zero-cost predictor value are shown for a few predictors. Some predictors (notably parameter count, ZiCo, and Zen-score) can be observed to produce very close values for subsets of models with significantly different accuracy. This is an indication of the fact that these predictors perform poorly in estimating accuracy differences in models when the underlying architecture is fixed, but the input image resolution is varied.

We also test the performance of a training-based predictor on YOLOBench which is the mAP_50-95 values of models trained on a representative dataset (VOC) from scratch for 100 epochs. This predictor sets a strong baseline to be outperformed by training-free predictors, as it is generally found to perform well on a variety of datasets (see Fig. S13), including datasets from different visual domains (e.g. SKU-110k).

We further look into the robustness of the results obtained with the pre-activation NWOT estimator. Since this zero-cost estimator does not require computing the loss function, the main parameters that could influence its performance are the exact batch of data sampled, the batch size, and the dataset split (training or test data) used to sample the batch. Figure S15 shows the global Kendall-Tau scores achieved with pre-activation NWOT on VOC YOLOBench models with different batches sampled, different batch sizes and different data splits used. There is an observed variance in performance depending on the sampled batch, which is higher when the test set data are used (with an absolute difference of up to $0.05$ in global Kendall-Tau score). Notably, scores computed on training set data (with augmentations) performed better on average compared to test set data, and performance is observed to decrease with increasing batch size. Furthermore, Table S8 shows the mean and standard deviation of Kendall-Tau scores for the standard and pre-activation versions of NWOT on VOC YOLOBench models computed on 5 different batches of size 16. We also estimate the performance of the mean predictor values averaged over the 5 sampled batches, which is expectedly found to outperform predictors computed on single batches. Moreover, we compute the pre-activation NWOT scores for all layers in YOLO models except the ones contained in detection heads. This is motivated by the fact that the larger distances between binary activation codes in NWOT are meant to correlate with better performance for the feature extraction layers (e.g. layers in the backbone and neck of YOLO), not the last layers used to compute model predictions. We find an overall performance boost in terms of Kendall-Tau scores for the case when the NWOT score is computed only for the layers in the backbone and neck (Table S8).

Appendix D Pareto-optimal model prediction using training-free proxies.

We evaluate the training-free accuracy predictors (and the training-based one, VOC training from scratch) for the task of predicting Pareto-optimal models. That is, if one computes the ZC values for each model and determines the Pareto set of models in the ZC value-real latency two-dimensional space, we want to estimate how many models in that Pareto set are going to also be present in the actual Pareto frontier (computed in the two-dimensional mAP_50-95-latency space). Two metrics are of importance here: recall (how many of actual mAP-latency Pareto-optimal models are captured by a ZC-based Pareto set) and precision (how many of ZC-based Pareto set models are actually Pareto optimal in the real mAP-latency space). Additionally, one could consider the first $N$ ( $N=1,2,3,...$ ) ZC-based Pareto sets to expand the set of potential model candidates. We look at how precision and recall values change with $N$ for a few well-performing predictors (NWOT, pre-activation NWOT, MAC count, and VOC training from scratch) with latency values taken from different target devices.

Recall values for several zero-cost predictors for Pareto models on Jetson Nano GPU and VOC dataset are shown in Fig. S16. Corresponding precision values for a few well-performing predictors on 3 different HW platforms are shown in Fig. S17. Recall values for these best-performing predictors on the SKU-110k dataset are shown in Fig. S22.

A different way to evaluate the predictors on YOLOBench is to treat models with the same architectures but different input image resolutions as identical data points. That is, if a certain architecture is predicted by ZC-based Pareto front to be optimal on a certain resolution, we count that as a correct prediction if that same architecture on a different resolution is found to be really Pareto-optimal. Such a way to evaluate ZC performance stems from the fact that in practice one typically wishes to predict the most promising architectures, not necessarily the particular optimal image resolution (since that architecture would be pre-trained with a certain fixed resolution, e.g. 640x640 on a dataset like COCO for further fine-tuning on the target dataset). Recall and precision values for such an evaluation protocol for the VOC dataset are shown in Figs. S18, S19.

We also evaluate the performance of the best training-free predictor (pre-activation NWOT) in predicting Pareto-optimal models, when the latency values used are different from actual latency measurements, but either are computed via a latency proxy like MAC count or measurement on another device. Note that in the case of MAC count as a latency predictor, the whole Pareto-frontier computation process is zero-cost: the approximation for mAP is given by the pre-activation NWOT score, the approximation for latency by MAC count. One might wonder how such a fully zero-cost approach performs in practice. Figures S20 and S21 show the recall and precision values when accuracy predictor is taken to be pre-activation NWOT and latency predictors are varied from MAC count to latencies from other (proxy) devices. Interestingly, MAC count is found to perform relatively well in terms of recall, specifically for Raspberry Pi 4 CPU. Notably, none of the latency proxies work well to predict Pareto-optimal models on VIM3 NPU. Also, perhaps not surprisingly, using Intel CPU latency measurements works well to predict Pareto-optimal models on Raspberry Pi 4 CPU, but does not significantly outperform MAC count.

Finally, we test the pre-activation NWOT accuracy estimator to predict potentially well-performing models out of a set of YOLO models we generated with different CNN backbones from the timm package [35]. We have computed the NWOT-latency Pareto set for YOLO-PAN-C3 models with timm backbones on input images of 480x480 resolution, with latency measured on Raspberry Pi 4 ARM CPU (TFLite, FP32). The neck structure (PAN-C3) for each of the candidate models was taken to be that of YOLOv5s and the detection head to be that of YOLOv8 (same as for all YOLOBench models), with Hardswish activations in the neck and head, and activation function(s) in the backbone kept the same as originally implemented in timm. Table S9 shows examples of predicted Pareto-optimal models (a subset of the full NWOT-latency Pareto set). Based on these observations, we have selected FBNetV3-D as a potential backbone of a YOLO model to be trained on the COCO dataset and compared it to a reference YOLOv8 model in a similar latency range (YOLOv8s).

Table S11 shows COCO minival mAP_50-95 and inference latency results for a YOLO-FBNetV3-D-PAN-C3 model trained on the COCO dataset for 300 epochs and profiled on 640x640 input resolution on Raspi4 CPU with TFLite. We observe that the choice of activation function significantly affects TFLite model inference latency, so for a more fair comparison we also train and profile a Hardswish-based version of YOLOv8s in addition to its default SiLU-based version. While we observe a significant reduction in inference latency with a negligible mAP drop shifting from SiLU to Hardswish, the FBNetV3-based model still outperforms YOLOv8s-HSwish. Furthermore, we train and profile a ReLU-based version of YOLO-FBNetV3-D-PAN-C3 (with activation functions in the backbone kept to be those of the original backbone, i.e. Hardswish, but neck and detection head activations replaced with ReLU) and observe further latency improvements at the cost of $\sim 0.56\%$ drop in mAP_50-95. However, this model is still found to outperform YOLOv8s in terms of both accuracy and latency (see Table S11). Furthermore, we train the same models for 500 epochs with a batch size of $256$ , which is found to achieve better results on COCO minival and test (Table 4). Although we could not exactly reproduce COCO minival mAP results for YOLOv8s reported by Ultralytics [15], we find that the FBNetV3-based model outperforms both our YOLOv8s mAP results as well as those of Ultralytics, with lower latency on Raspberry Pi 4 CPU. The COCO minival mAP_50-95 values reported in Table 4 were obtained using pycocotools [2] (with IoU threshold for NMS $=0.6$ and object confidence threshold for detection $=0.001$ ), and mAP values on test-dev were obtained using the same evaluation parameters by submitting to the competition server [1]. More details on the performance comparison of models on COCO test-dev are shown in Table S10.

Table S6: Pareto-optimal YOLOBench models on 3 datasets and 4 hardware platforms. Shown are the best models in terms of mAP_50-95 under a given latency threshold (max. latency). For each model, the scaling parameters are given (d33w25 means depth factor

=0.33

and width factor

=0.25

), corresponding input resolution of the models is indicated in brackets.

HW/max.	VOC	VOC	SKU-110k	SKU-110k	WIDERFACE	WIDERFACE
latency	model	mAP_50-95	model	mAP_50-95	model	mAP_50-95
Nano/0.5 sec	YOLOv8	0.726	YOLOv7	0.593	YOLOv7	0.382
	d67w1 (448)		d1w75 (480)		d1w75 (480)
Nano/0.3 sec	YOLOv7	0.701	YOLOv7	0.589	YOLOv7	0.369
	d1w5 (480)		d1w5 (480)		d1w5 (480)
Nano/0.1 sec	YOLOv7	0.657	YOLOv8	0.567	YOLOv7	0.336
	d1w5 (288)		d1w25 (480)		d1w25 (480)
VIM3/0.3 sec	YOLOv8	0.726	YOLOv7	0.593	YOLOv7	0.382
	d67w1 (448)		d1w75 (480)		d1w75 (480)
VIM3/0.1 sec	YOLOv6l	0.669	YOLOv8	0.567	YOLOv6m	0.350
	d67w5 (384)		d1w25 (480)		d33w5 (480)
VIM3/0.05 sec	YOLOv6l	0.620	YOLOv6s	0.556	YOLOv6m	0.318
	d67w25 (416)		d33w25 (480)		d67w25 (480)
Intel/0.08 sec	YOLOv8	0.719	YOLOv7	0.593	YOLOv7	0.382
	d1w75 (416)		d1w75 (480)		d1w75 (480)
Intel/0.04 sec	YOLOv7	0.701	YOLOv7	0.589	YOLOv7	0.369
	d1w5 (480)		d1w5 (480)		d1w5 (480)
Intel/0.02 sec	YOLOv6l	0.682	YOLOv6l	0.576	YOLOv6l	0.346
	d6w5 (448)		d33w5 (480)		d33w5 (480)
Raspi4/3 sec	YOLOv8	0.719	YOLOv7	0.593	YOLOv7	0.382
	d1w75 (416)		d1w75 (480)		d1w75 (480)
Raspi4/1 sec	YOLOv7	0.701	YOLOv7	0.589	YOLOv7	0.369
	d1w5 (480)		d1w5 (480)		d1w5 (480)
Raspi4/0.5 sec	YOLOv6l	0.669	YOLOv4	0.569	YOLOv7	0.336
	d67w5 (384)		d1w25 (480)		d1w25 (480)

Table S7: Performance of training-free accuracy predictors on YOLOBench models and two datasets (VOC and SKU-110k, from COCO-pretrained weights) compared to using mAP_50-95 of models trained from scratch on the VOC dataset as a predictor.

	VOC, mAP_50-95			SKU-110k, mAP_50-95
Predictor metric	global $\tau$	top-15% $\tau$	%Pareto pred. (GPU)	global $\tau$	top-15% $\tau$	%Pareto pred. (GPU)
GraSP	-0.011	-0.068	0.062	0.040	0.032	0.025
Plain	0.029	0.069	0.015	-0.388	-0.176	0.025
JacobCov	0.095	-0.078	0.015	0.541	0.136	0.025
ZiCo	0.195	0.016	0.015	0.115	0.081	0.025
Zen	0.255	0.092	0.062	0.146	0.121	0.050
GradNorm	0.262	0.173	0.015	-0.331	-0.072	0.025
Fisher	0.280	0.156	0.015	-0.380	-0.096	0.025
L2 norm	0.326	0.090	0.015	0.189	0.118	0.025
SNIP	0.336	0.217	0.015	-0.290	-0.059	0.025
#params	0.399	0.372	0.031	0.256	0.119	0.050
SynFlow	0.558	0.227	0.062	0.512	0.254	0.100
MACs	0.739	0.520	0.123	0.604	0.314	0.125
NWOT	0.756	0.622	0.262	0.703	0.321	0.200
NWOT (pre-act)	0.827	0.623	0.292	0.765	0.406	0.200
VOC training	0.847	0.665	0.369	0.739	0.374	0.425
from scratch (mAP_50-95)

Table S8: Mean and standard deviation of the global Kendall-Tau scores for NWOT metrics computed for 5 different randomly sampled batches of size 16 on VOC YOLOBench models. The metric denoted as ”no head” was computed only for the layers contained in the neck and backbone of YOLO models, not in the detection head. The second column shows Kendall-Tau scores for prediction with the mean ZC metric values averaged over the 5 batches.

ZC metric	global $\tau$	global $\tau$ (prediction with mean ZC value)
NWOT	0.7839 (0.0159)	0.7895
NWOT (pre-act)	0.8402 (0.0191)	0.8486
NWOT (pre-act, no head)	0.8472 (0.0194)	0.8570

Table S9: Example YOLO-PAN-C3 models with timm backbones identified in the NWOT-latency Pareto frontier, with pre-activation NWOT score computed on the VOC dataset. Latency values are measured on Raspberry Pi 4 ARM CPU with TFLite (FP32), batch size 1.

Model name	Input resolution	Raspi4 CPU latency, sec	NWOT (pre-act)
yolo_pan_efficientnet_b4	480	1.72	511.84
yolo_pan_tf_efficientnet_b4_ap	480	1.71	511.77
yolo_pan_gc_efficientnetv2_rw_t	480	1.41	508.73
yolo_pan_tf_efficientnet_lite4	480	1.08	506.67
yolo_pan_fbnetv3_d	480	0.71	502.71
yolo_pan_tf_efficientnet_lite1	480	0.61	493.48
yolo_pan_efficientnet_lite1	480	0.61	493.32
yolo_pan_mobilenetv2_110d	480	0.54	480.92
yolo_pan_mobilenetv2_075	480	0.45	480.14
yolo_pan_tf_mobilenetv3_large_075	480	0.45	468.85
yolo_pan_mobilenetv2_035	480	0.37	457.41
yolo_pan_tf_mobilenetv3_small_minimal_100	480	0.36	451.10

Table S10: COCO test mAP values and inference latency on Raspberry Pi 4 CPU (TFLite with XNNPACK backend, FP32) for YOLOv8s vs. a model identified from the NWOT-latency Pareto frontier (YOLO-FBNetV3-D-PAN). For mAP values, the mean and standard deviation over three random seeds are shown. For inference time, mean and standard deviation of inference time over 5 runs (each one 100 iterations) are shown.

Model	AP ${}^{test}_{50-95}$	AP ${}^{test}_{50}$	AP ${}^{test}_{75}$	AP ${}^{test}_{S}$	AP ${}^{test}_{M}$	AP ${}^{test}_{L}$	Latency, ms
YOLOv8s	43.17% (0.12%)	60.53% (0.09%)	46.5% (0.08%)	22.7% (0.14%)	47.13% (0.17%)	57.0% (0.22%)	1476.09 (1.49)
YOLOv8s-HSwish	42.90% (0.0%)	60.3% (0.0%)	46.30% (0.0%)	22.46% (0.09%)	47.0% (0.08%)	56.39% (0.08%)	1381.62 (7.34)
YOLO-FBNetV3-D-PAN	43.87% (0.05%)	61.53% (0.09%)	47.23% (0.05%)	22.67% (0.19%)	47.87% (0.05%)	58.36% (0.12%)	1355.21 (9.93)

Table S11: COCO minival mAP and inference latency on Raspberry Pi 4 CPU (TFLite with XNNPACK backend, FP32) for YOLOv8s vs. a model identified from the NWOT-latency Pareto frontier (YOLO-FBNetV3-D-PAN). Mean and standard deviation of inference time over 5 runs (each one 100 iterations) are shown. The input image resolution used was 640x640, batch size

=1

for latency measurements. Models were trained for 300 epochs, with batch size

=64

Model	COCO mAP ${}^{val}_{50-95}$	Raspberry Pi 4 ARM CPU latency, ms
YOLOv8s	43.64%	1476.09 (1.49)
YOLOv8s-HSwish	43.55%	1381.62 (7.34)
YOLO-FBNetV3-D-PAN	44.63%	1355.21 (9.93)
YOLO-FBNetV3-D-PAN-ReLU	44.07%	1344.50 (8.06)