Revisiting DNN Training for Intermittently-Powered Energy-Harvesting Micro-Computers

Cyan Subhra Mishra, Deeksha Chaudhary, Jack Sampson, Mahmut Taylan Knademir, Chita Das
The Pennsylvania State University {cyan, dmc6955, jms1257, mtk2, cxd12}@psu.edu

Abstract

The deployment of Deep Neural Networks (DNNs) in energy-constrained environments, such as Energy Harvesting Wireless Sensor Networks (EH-WSNs), presents unique challenges, primarily due to the “intermittent” nature of power availability. To address these challenges, this study introduces and evaluates a novel training methodology tailored for DNNs operating within such contexts. In particular, we propose a dynamic dropout technique that adapts to both the architecture of the device and the variability in energy availability inherent in energy harvesting scenarios.

Our proposed approach leverages a device model that incorporates specific parameters of the network architecture and the energy harvesting profile to optimize dropout rates dynamically during the training phase. By modulating the network’s training process based on predicted energy availability, our method not only conserves energy but also ensures sustained learning and inference capabilities under power constraints. Our preliminary results demonstrate that this strategy provides $6\%$ – $22\%$ accuracy improvements compared to the state of the art with $\leq 5\%$ additional compute. This paper details the development of the device model, describes the integration of energy profiles with intermittency aware dropout and quantization algorithms, and presents a comprehensive evaluation of the proposed approach using real-world energy harvesting data. The work also includes a new dataset towards deploying energy harvesting based computation in real world.

1 Introduction

The increasing demand for ubiquitous, sustainable and energy-efficient computing combined with the maturation of energy harvesting systems has spurred a plethora of recent research and advancements Gobieski et al. (2019); Resch et al. (2020); Mishra et al. (2021); Saffari et al. (2021); Afzal et al. (2022) in the direction of battery-less devices. Such platforms represent the future of the Internet of Things (IoT), and energy harvesting wireless sensor networks (EH-WSNs): equipped with modern machine learning (ML) techniques, these devices can revolutionize computing, monitoring and analytics in remote, risky and critical environments such as oil wells, mines, deep forests, oceans, remote industries and smart cities. However, the intermittent and limited energy income of these deployments demand optimizations for the ML applications starting from the algorithm Yang et al. (2017); Shen et al. (2022); Mendis et al. (2021), orchestration Maeng and Lucia (2018); Mishra et al. (2021), compilation Gobieski et al. (2018), and hardware development Qiu et al. (2020); Islam et al. (2022); Mishra et al. (2024) layers. Despite these advancements, achieving consistent and accurate inference – thereby meeting service level objectives (SLOs) – in such intermittent environments remains a significant challenge. This difficulty is exacerbated by unpredictable resources, form-factor limitations, and variable computational availability, particularly when employing task-optimized deep neural networks (DNNs).

There are two major problems with performing DNN inference with intermittent power. (I.) Energy Variability: Even though DNNs can be tailored to match the average energy income of the energy harvesting (EH) source through pruning, quantization, distillation or network architecture search (NAS) Yang et al. (2018, 2017); Mendis et al. (2021), there is no guarantee that the energy income consistently meets or exceeds this average. When the income falls below the threshold, the system halts the inference, and checkpoints the intermediate states (via software or persistent hardware) Maeng and Lucia (2018); Qiu et al. (2020), and resumes upon energy recovery. Depending on the EH profile, this might lead to significant delays and SLO violations. (II.) Computational Approximation: To address (I) and maintain continuous operation, EH-WSNs may skip some of the compute during energy shortfalls, by dropping neurons (zero padding) or by approximating (quantization). Adding further approximation to save energy atop an already heavily reduced network can propagate errors through the layers leading to significant accuracy drop Islam and Nirjon (2019); Kang et al. (2022); Lv and Xu (2022); Kang et al. (2020), further violating SLOs.

In certain energy critical scenarios, even EH-WSNs applying state of the art techniques fail to consistently meet SLOs, at times skipping entire inferences to deliver at least some results on time. Fundamentally, while current DNNs can be trained or fine-tuned to fit within a given resource budget – be it compute, memory or energy – they are not trained to expect a variable or intermittent resource income. Although intermittency-aware NAS Mendis et al. (2021) could alleviate certain problems, it still falls into the first bucket many times. This calls for revisiting the entire training process, i.e., NAS is not the sole approach given the resource intermittency and we need to train the DNN in such a way that it is aware of the intermittency and adapts to it. runtime.

Motivated by these challenges, we propose NExUME (Neural Execution Under InterMittent Environment), a novel learning method designed specifically for environments with intermittent power and EH-WSNs, with potential applications in any ultra-low-power inference system. NExUME incorporates energy variability aware network architecture search (NAS) conducted across multiple commercial off-the-shelf IoT devices using a dataset of EH traces. This approach enables the identification of the optimal network architecture tailored to the specific constraints of a given hardware platform and resource budget. Recognizing that oracular knowledge of energy availability in intermittent environments is impractical, NExUME learns to dynamically adapt the DNN’s inference computations based on the current energy profile. This adaptation involves an innovative strategy of learning an instantaneous energy-aware dynamic dropout and quantization selection. The method also includes targeted fine-tuning – both during and after training – that not only regularizes the model but also prevents over-fitting and enhances the robustness of the network to fluctuations in resource availability. The key contributions of this paper can be summarized as follows:

•

DynAgent: A dynamic “co-optimization manager” that coordinates network search and dynamic training processes, adapting to varying environmental, platform, model-specific, and SLO requirements. DynAgent, serving as a repository of EH traces and device configurations, significantly enhances the flexibility and efficiency of resource utilization across different operational contexts.
•

DynFit: A novel “training optimizer” that embeds energy variability awareness directly into the DNN training process. This optimizer allows for “dynamic adjustments” of dropout rates and quantization levels based on real-time energy availability, thus maintaining learning stability and improving model accuracy under power constraints.
•

DynInfer: An intermittency- and platform-aware “task scheduler” that optimizes computational tasks for intermittent power supply, ensuring consistent and reliable DNN operation. DynInfer takes advantage of software-compiler-hardware codesign to manage, fuse, and deploy tasks. With the help of DynAgent and DynFit DynInfer provides $6\%$ – $22\%$ accuracy improvements with $\leq 5\%$ additional compute.
•

Dataset: A first-of-its-kind machine status monitoring dataset, which involves mounting multiple types of EH sensors at various locations on a Bridgeport machine to monitor its activity status.

2 Background and Related Work

Energy Harvesting and Intermittent Computing: The exploding usage of IoTs, connected devices, and wearable electronics project the number of battery operated devices to be 24.1 Billion by 2030 Insights (2023). This has a significant economic (users, products and data generating dollar value) as well as environmental (battery and e-waste) impact Mishra et al. (2024). In fact, advances in EH has lead to a staggering development in intermittently powered battery-free devices Maeng and Lucia (2018); Gobieski et al. (2019); Qiu et al. (2020); Saffari et al. (2021); Afzal et al. (2022). A typical EH setup consists of 5 components, namely, energy capture (solar panel, thermocouple, etc), power conditioning, voltage regulation (buck or boost converter), energy storage (super capacitor) and compute unit (refer §Appendix B for details about each of them). To cater towards the sporadic power income and failures, an existing body of works explores algorithms, orchestration, compiler support, and hardware development Yang et al. (2017, 2018); Mendis et al. (2021); Maeng and Lucia (2018); Gobieski et al. (2018); Qiu et al. (2020); Islam et al. (2022); Mishra et al. (2024, 2021); Ma et al. (2016, 2017); Liu et al. (2015). Most of these works rely on software checkpointing (static and dynamic Maeng and Lucia (2018), refer §Appendix C) to save and restore, while some of the prior works developed nonvolatile hardware Ma et al. (2016, 2017) which inherently takes care of the checkpointing. Considering the scope of these initiatives, it is crucial to acknowledge that, despite the substantial support for energy harvesting and intermittency management, developing intermittency-aware applications and hardware necessitates multi-dimensional efforts that span from theoretical foundations to circuit design.

Intermittent DNN Execution/Training: As the applications deployed on such EH devices demand analytics, executing DNNs on EH devices and EH-WSNs have become prominent Lv and Xu (2022); Gobieski et al. (2019); Qiu et al. (2020); Mishra et al. (2021). However, due to computational constraints, limited memory capacity and restricted operating frequencies, many of these applications fail to complete inference execution with satisfactory SLOs, despite comprehensive software and hardware support Mishra et al. (2021). While the works relying on loop-decomposition or task partition (e.g., see Qiu et al. (2020); Gobieski et al. (2019) and the references therein) ensure “forward progress”, they do not guarantee an inference completion while meeting SLOs. Optimizing DNNs for the energy constraints Yang et al. (2018, 2017), or performing early exit and depth-first slicing Lv and Xu (2022); Islam and Nirjon (2019) does ensure more forward progress, but such approaches compromise accuracy while often imposing scheduling overheads and higher memory footprint. One major issue is, most of the works leverage “pre-existing” DNNs, which are typically designed for running on a stable resource environment, while being deployed on an intermittent environment with pseudo notion of stability via check-pointing, and therefore, one direction of works Mendis et al. (2021) looks for performing network architecture search for intermittent devices. However, this research direction only accounts for fixed lower and upper bounds of energy and compute capacities, overlooking the “sporadic” nature of energy availability and the elasticity of the compute hardware (i.e., the ability to dynamically scale frequency, compute, and memory). Moreover, while the DNN is designed to operate within a specific power window, it is not trained to adapt to these fluctuations. Consequently, during extended periods of energy scarcity, the system lacks mechanisms for computational approximation, such as dynamic dropouts (neuron skipping) and dynamic quantization. Essentially, the DNN is trained to manage within a static resource budget, ignoring the “dynamism” of the resources. In contrast, our work prioritizes the integration of this dynamism in both the network architecture search (NAS) and the training phases, adapting more effectively to fluctuating energy and compute conditions.

3 NExUME Framework

To address the issues with “intermittency-aware” DNN training, we propose NExUME: (Neural Execution Under InterMittent Environment). NExUME has 3 different components: (1) Intermittency- and platform-aware neural architecture search (DynNAS+); (2) Intermittency- and platform-aware DNN training with dynamic dropouts and quantization (DynFit); and finally, (3) Intermittency and platform aware task scheduling (DynInfer). Each of these components could work individually towards optimizing DNNs for intermittent environment. However, the combination of all of them is expected to provide the best results. Another major component of NExUME is the Resource Model (DynAgent) – which works like an agent to understand the environment (energy income and consumption) and platform (compute, memory, and activation constraints) as well as the model. In this section, we elaborate on the different components of NExUME and their roles.

3.1 DynAgent: Intermittency Aware Resource Model

One of the first step to design an intermittency aware DNN execution is to understand the execution framework. This includes: 1. EH environment: the type, fidelity and magnitude of EH; 2. Hardware platform: accessing the compute capabilities, parallelization, memory, elasticity and energy consumption; 3. SLOs: determining acceptable accuracy, latency and the allowable slack for these metrics. This model aids in determining the constraints on the network design and hyper-parameter tuning. Towards this, we introduce the Intermittency-Aware Resource Model (DynAgent). DynAgent serves as an environment manager for the NExUME framework. That is, given the EH environment and hardware platform, DynAgent interfaces with DynNAS+ and DynFit (section 3.2) to give an estimation of instantaneous resource availability, hardware configuration and algorithmic framework, to facilitate intermittency aware NAS and training. DynAgent is essentially a repository of multiple EH traces (photovoltaic power with outdoor sun, photovoltaic power with indoor lighting, RF power with WiFi in a home setting, RF power with WiFi in an office setting, thermal power with home HVAC, piezoelectric power from the vibration of industrial machines), hardware platforms (TI MSP430FR5994 Instruments (2024a), Arduino Nano Arduino (2024), Adafruit TensorFlow kit Adafruit (2024)), and available computation libraries. DynAgent uses micro-profiling ( $\mu-profile$ ) with predetermined ordered access along with size and stride sweep to understand the capabilities of hardware platform, i.e., to know the cache sizes, cache access latencies, memory access latency, etc. DynAgent implements a sweep of memory accesses to better profile the hardware and know its limitations. Although many of the device specifications are typically well known, the $\mu-profile$ helps us understand the capabilities of a new hardware tailored for the target applications. Similarly, to understand the latency of kernel execution, DynAgent has a $\mu-profile$ of different DNN kernels (GeMM, Matrix Vector Multiplication, HadamardProduct2D, conv1D, conv2D, DepthWiseSeparableConv2D, etc.) with their execution latencies. Note that, all of these kernels are built with loop decomposition for checkpointing under intermittent execution (refer to Appendix C.1 for an example of such a for GeMM with loop decomposition and checkpointing).

Along with the database of the hardware profile, DynAgent also includes an “accelerator library” targeted for CNN applications. Most modern micro-controllers are equipped with digital signal processing (DSP) libraries, and hardware Instruments (2024b) which mostly include multiply accumulation function (MADD) and 1D convolution (conv1D) functions (typically implemented as inline assembly). Although these libraries suffice for 1D time-series data (audio, accelerometer, etc.), they lack the functions like 2D convolution (Conv2D), depth-wise separable convolution (DWSConv2D), etc. To bridge this gap, DynAgent implements the CNN-specific functions using the available primitives for most efficient execution.¹¹1Note that, it is possible to implement these functions without relying on libraries, but doing so generally results in a significant increase in energy and memory consumption, which is impractical for intermittent systems. Implementing Conv2D using conv1D is straightforward, and involves treating each row or column of the input matrix as a separate 1D input. On the other hand, depth-wise separable convolution, which involves a separate filter for each input channel followed by a 1x1 convolution, is not straightforward as it requires integrating both channel-wise and spatial filtering which complicates the usage of simple 1D convolutions as a direct primitive. To cater towards this, DynAgent implements an algorithm (a detailed explanation of this implementation using a micro-controller environment is given in the Appendix F) to represent depth-wise separable convolution as a function of the standard available micro-controller primitives. DynAgent further interacts with the rest of the components for successful NAS and training.

3.2 DynFit: Intermittency-Aware Learning

Traditional energy scavenging systems employ deep neural networks (DNNs) optimized via post-training techniques such as resource-aware pruning, quantization, and distillation Yang et al. (2018, 2017). While these methods are practical, they can compromise accuracy and scalability. To address these limitations, Mendis et al. (2021) introduced iNAS, tailored for intermittent computing systems. However, iNAS faces significant challenges: firstly, it presumes the existence of task-based DNN deployment facilitating efficient checkpointing, which demands comprehensive knowledge of the system architecture as well as the energy-harvesting (EH) environments. Inaccurate task partitioning can lead to suboptimal checkpointing, causing energy waste or computational inaccuracies. Secondly, it relies on conventional load-store Von Neumann architecture. Emerging research in intermittent computing advocates for non-volatile and non-Von Neumann architectures, such as resistive RAM-based crossbar architectures, to enhance energy efficiency and enable hardware-driven checkpointing. To overcome the said issues, we propose iNAS+, an iteration of iNAS that integrates support for non-Von Neumann architectures and utilizes a broad spectrum of EH profiles (DynAgent 3.1). This modification expands the hardware search space, facilitating the evaluation and selection of optimal network configurations. Despite these enhancements, determining effective checkpointing strategies in intricate task-based systems remains a persistent challenge.

Towards this, DynFit introduces QuantaTask – a discrete task unit executable without interruption under a specific EH profile and hardware setup. Our profiling suggests that loops, particularly in the convolutional (conv) and fully connected (FC) layers, dominate the inference process (accounting for 83% to 90% of execution time but only about 25% of program coverage in assembly code). This disparity renders these loops prone to intermittency failures, often resulting in erroneous outcomes or necessitating re-executions. An analysis using WiFi EH for human activity recognition with accelerometer data indicates a completion rate of about 60%, consistent across different datasets and energy conditions . To mitigate these failures, DynFit defines execution quanta as the “smallest task unit” that can be performed uninterrupted. It optimizes the number of loop iterations in GeMM or conv kernels, treating them as trainable parameters to achieve optimal task partitioning. DynFit closely integrates with DynAgent, which serves as a “repository” for EH profiles and hardware characteristics. Let $\mathcal{Q}$ denote the set of execution quanta, where each quanta $q\in\mathcal{Q}$ is defined by a tuple $(l,e)$ , with $l$ representing the number of loop iterations and $e$ the estimated energy required for these iterations. The optimization objective is: $\text{minimize}\quad\sum_{q\in\mathcal{Q}}E_{q}\quad\text{subject to}\quad E_{q}\leq E_{b}$ , where $E_{q}$ is the energy consumed by quanta $q$ and $E_{b}$ is the energy budget. We iterate through various loop iteration lengths to identify the most optimal quanta. If executing multiple quanta exceeds the energy budget, we fuse the tasks to minimize overheads:

l_{i}=\underset{l_{i}}{\text{argmin}}\,\sum_{q\in\mathcal{Q}}E_{q}\quad\text{subject to}\quad E_{q}\leq E_{b}.

This optimization occurs during the forward pass of the DNN, leveraging energy estimations from DynAgent. By incorporating these mechanisms into the training process, DynFit ensures that the DNN operates efficiently within the constraints of energy scavenging systems . In scenarios where the available energy allows the execution of multiple quanta, DynFit fuses tasks to minimize load-store and checkpointing overheads. An implementation example of QuantaTask-based convolution using microcontroller primitives is provided in Algorithm 3. While this approach significantly reduces the likelihood of intermittency-related failures, it does not guarantee the completion of the entire inference process. Systems may resort to approximation techniques, such as neuron skipping or quantization, to fulfill computation within given SLOs.

3.2.1 Dynamic Dropout for Energy-Constrained Environments

In environments characterized by extremely low energy availability, approximations in computation may become necessary. One such strategy involves skipping some neurons in each layer by dynamically enabling dropout during execution, thereby ensuring the completion of all layers under constrained conditions. However, traditional neural network architectures do not support the incorporation of dropout during inference phases. To address this, we propose the adoption of “dynamic dropout” during the training phase to prepare the network for such operational scenarios. We employ various algorithms (see Appendix D for the list and formulation) to dynamically adjust dropout rates based on specific criteria, each designed to optimize network resilience under energy constraints.

Conceptually, let $\mathcal{D}$ represent the set of dynamic dropout algorithms, defined as: $\mathcal{D}=\{d_{1},d_{2},d_{3},d_{4},d_{5},d_{6}\}$ . Let $\mathbf{W}$ denote the weights of the neural network, and $\mathbf{X}$ represent the input data. The output $\mathbf{Y}$ of the network can be modeled as: $\mathbf{Y}=f(\mathbf{W},\mathbf{X},\mathbf{m})$ , where $\mathbf{m}$ is the dropout mask applied to the weights during the forward pass. Consider $\mathcal{Q}$ as the set of execution quanta, where each quanta $q\in\mathcal{Q}$ is defined by a tuple $(l,e)$ , with $l$ representing the number of loop iterations and $e$ is the estimated energy required for these iterations. The energy consumption $E_{q}$ for a given quanta $q$ and the overall energy budget $E_{b}$ for the inference process are defined. The objective is to optimize the network’s performance under these constraints:

\text{minimize}\quad\mathcal{L}(\mathbf{Y},\mathbf{\hat{Y}})\quad\text{subject to}\quad\sum_{q\in\mathcal{Q}}E_{q}\leq E_{b}.

3.2.2 Dynamic Quantization for Energy-Constrained Environments

Neuron skipping, while effective for managing energy constraints, often results in significant accuracy degradation. To address this issue, we propose dynamic quantization of neurons that are candidates for dropout to lower bitwidths, such as reducing from 16-bit to 12-bit, 8-bit, or even 4-bit. This strategy allows for some level of computation to persist, potentially enhancing accuracy compared to outright neuron dropout. Define $\mathbf{W}$ as the weights of the neural network, $\mathbf{X}$ as the input data, and $\mathbf{m}$ as the dropout mask applied during the forward pass. We introduce $\mathbf{Q}$ as the set of quantization levels: $\mathbf{Q}=\{q_{1},q_{2},q_{3},q_{4}\}$ , where $q_{1}=16\text{ bits}$ , $q_{2}=12\text{ bits}$ , $q_{3}=8\text{ bits}$ , and $q_{4}=4\text{ bits}$ . During training, we optimize not only for the weights $\mathbf{W}$ but also for the selection of quantization levels $\mathbf{q}$ for neurons at the risk of being dropped. The network’s output $\mathbf{Y}$ can thus be expressed as: $\mathbf{Y}=f(\mathbf{W},\mathbf{X},\mathbf{m},\mathbf{q})$ , where $\mathbf{q}$ is the vector representing quantization levels applied to the weights. To effectively integrate dynamic quantization into the training phase, we propose adding a term to the loss function that penalizes higher bit-widths, thereby promoting configurations that are more energy-efficient. Define the modified loss function $\mathcal{L}^{\prime}$ as follows:

\mathcal{L}^{\prime}=\mathcal{L}(\mathbf{Y},\mathbf{\hat{Y}})+\lambda\sum_{i=1}^{n}c_{i}\cdot q_{i}.

Here, $n$ represents the number of neurons, $c_{i}$ denotes the cost associated with the quantization level $q_{i}$ , and $\lambda$ is a regularization parameter that balances accuracy against energy efficiency. The energy consumption $E_{q}$ for a given quantization level $q$ and the total energy budget $E_{b}$ for the inference process are defined as: $\text{minimize}\quad\mathcal{L}^{\prime}\quad\text{subject to}\quad\sum_{q\in\mathbf{Q}}E_{q}\leq E_{b}$ .

By embedding this dynamic quantization strategy into the training regimen, we enable the neural network to adaptively quantize neurons based on the available energy. While dynamic dropouts only update the non-dropped weights, with dynamic quantization all the weights are updated, providing a robust network. This approach strikes a delicate balance between maintaining accuracy and adhering to stringent energy constraints which not only is essential for intermittent environments, but also helpful for (ultra) low power platforms.

3.2.3 Fine-tuning for Regularization and Prevention of Overfitting

Training deep neural networks under intermittent energy conditions introduces unique challenges, particularly in maintaining uniform parameter updates. The implementation of dynamic dropout and quantization, designed to adapt to fluctuating energy levels, can result in certain weights being under-trained. This discrepancy may cause the network to overfit on weights that are consistently updated. To address this issue, we propose an adaptive regularization strategy that actively monitors and adjusts the update frequency of each weight throughout the training process. Define $U_{i}(t)$ as the update status of weight $i$ at training iteration $t$ , with $U_{i}(t)=1$ indicating an update and $U_{i}(t)=0$ otherwise. A threshold parameter $\theta$ represents the minimum proportion of iterations a weight must be updated to avoid being considered under-trained. The update ratio for weight $i$ is computed as $\text{update\_ratio}_{i}=\frac{1}{T}\sum_{t=1}^{T}U_{i}(t)$ . If $\text{update\_ratio}_{i}<\theta$ after $T$ iterations, these under-trained weights undergo additional training cycles without dropout or quantization, ensuring uniform training across all weights we train additional cycles on $w_{i}$ if $\text{update\_ratio}_{i}<\theta$ . After completing the standard training process, weights that have met or exceeded the threshold are frozen, and fine-tuning phases focus specifically on the previously under-trained weights. This strategy has been empirically shown to enhance model robustness and improve generalization under varying operational conditions. $\text{Fine-tune on }w_{i}\text{ for }i\text{ where }\text{update\_ratio}_{i}<\theta$ . This approach not only ensures that all parts of the model receive appropriate training attention but also mitigates the risk of overfitting, thus preserving the model’s ability to generalize across diverse energy availability scenarios.

3.3 DynInfer: Intermittency Aware Task Scheduling for Inference

Effective task scheduling in intermittently-powered environments requires precise control over computational tasks to align with fluctuating energy availability. Contrary to the other components of NExUME, DynInfer takes a system and architecture-level approach to implement a task scheduler that adjusts in real-time to the energy conditions reported by DynAgent. This ensures that the network operations are not only energy-efficient but also robust against power uncertainties. DynInfer utilizes information about the current energy state and computational demands to decompose deep neural network (DNN) operations into smaller, manageable tasks. These tasks are then scheduled based on their priorities and energy requirements. The primary goals during this decomposition are to minimize latency and avoid SLO violations by prioritizing tasks critical to inference completion and to ensure that no individual task exceeds the current available energy, thus avoiding mid-operation failures. Tasks are prioritized according to a dynamically-generated schedule that considers energy profile (predicted short-term energy availability from DynAgent), task criticality (importance of the task in contributing to the accuracy and stability of the DNN output, e.g., important kernels are scheduled first), and deadline sensitivity (tasks closer to their deadlines are given higher priority to reduce the risk of SLO violations). The scheduler adjusts task execution order in real-time, responding to updated forecasts and actual energy harvests, thus maintaining a balance between operational demands and available resources. To further optimize the processing efficiency, DynInfer employs the dynamic task fusion strategy. In anticipation of power failures, DynInfer integrates tightly with hardware-level and software-level checkpointing mechanisms. Hardware checkpointing utilizes non-volatile memory components to quickly save the state of computation at minimal energy cost, whereas software checkpointing manages the state of higher-level features and data structures that are not handled by hardware checkpointing. Upon recovery, DynInfer coordinates with DynAgent to efficiently restore the computation state and resume task execution, minimizing data loss and computational redundancy. DynInfer’s effectiveness is enhanced through its integration with DynAgent, which provides real-time data on energy availability, and DynFit, which adjusts computational tasks according to the current system state and energy forecasts. A detailed description of the DynInfer execution flow with block diagrams are given in Appendix C.2

4 Experimental Results

NExUME can be seamlessly integrated as a “plug-in” for both training and inference frameworks in deep neural network (DNN) applications, specifically designed for intermittent and (ultra) low-power deployments. In this section, we discuss the effectiveness of NExUME across two distinct types of environments, highlighting its versatility and broad applicability. Firstly, we evaluate NExUME using publicly available datasets (§4.2) commonly utilized in embedded applications across multiple modalities—including image, time series sensor, and audio data. These datasets represent typical use cases in embedded systems where energy efficiency and minimal computational overhead are crucial. We use both commercial-off-the-shelf (COTS) hardware and state-of-the-art ReRAM Xbar-based hardware for this evaluation. Secondly, we introduce a novel dataset aimed at advancing research in predictive maintenance and Industry 4.0 Lasi et al. (2014), and test NExUME on a real manufacturing testbed (§4.3) with COTS hardware. We have developed a first-of-its-kind machine status monitoring dataset, available at https://hackmd.io/@Galben/rk7YN6jmR, which involves mounting multiple types of sensors at various locations on a Bridgeport machine to monitor its activity status.

4.1 Development and Profiling of NExUME

NExUME uses a combination of programming languages and technologies to optimize its functionality in intermittent and low-power computing environments. The software stack comprises Python3 (2.7k lines of code), CUDA (1.1k lines of code), and Embedded C (2.1k lines of code, not including DSP libraries). Our training infrastructure utilizes NVIDIA A6000 GPUs with 48 GiB of memory, supported by a 24-core Intel Xeon Gold 6336Y CPU. We employ Pytorch v2.3.0 coupled with CUDA version 11.8 as our primary training framework. To assess the computational overhead introduced by DynFit, a component of NExUME, we use NVIDIA Nsight Compute. During the training sessions enhanced by DynFit, we observed an increase in the number of instructions ranging from a minimum of 11.4% to a maximum of 34.2%. While the overhead in streaming multi-processor (SM) utilization was marginal (within 5%), there was a noticeable increase in memory bandwidth usage, ranging from 6% to 17%. Moreover, we have implemented a modified version of the matrix multiplication operation that strategically skips the loading of rows and/or columns from the input matrices into the GPU’s shared memory and register files. This adaptation is guided by the dropout mask vector and the specific type of sparse matrix operation being performed. This technique effectively reduces the number of load operations by an average of 12%, thereby enhancing the efficiency of computations under energy constraints and contributing to the overall performance improvements in NExUME.

4.2 NExUME on Publicly Available Datasets

Datasets: For image data, we consider the fashion-MNSIT Xiao et al. (2017) and CIFAR10 Alex (2009) datasets; for time series sensor data, we focus on popular human activity recognition (HAR) datasets, MHEALTH Banos et al. (2014) and PAMAP2 Reiss and Stricker (2012); and for audio, we use the audioMNIST Becker et al. (2023) dataset.

Inference Deployment Embedded Platforms: For commercially off-the-shelf micro-controllers, we choose Texas Instruments MSP430fr5994 Instruments (2024a), and Arduino Nano 33 BLE Sense Arduino (2024) as our deployment platform with a Pixel-5 phone as the host device. The host device is used for data logging – collecting SLOs, violations, power failures, etc., along with running the “baseline” inferences without intermittency.

Datasets	Full Power	TI MSP on RF Power from WiFi					Arduino Nano on vibration power with Piezoelectric
Datasets	Full Power	AP	PT	iNAS+PT	NExUME	F1 Score	AP	PT	iNAS+PT	NExUME	F1 Score
FMNIST	98.70	79.20	83.60	87.10	93.50	0.93	67.28	77.35	78.73	85.05	0.92
CIFAR10	89.81	62.13	67.86	70.00	82.71	0.79	49.55	59.04	63.94	72.09	0.81
MHEALTH	89.62	67.50	72.10	76.60	86.88	0.93	54.23	63.34	69.34	77.05	0.87
PAMAP	87.30	64.23	69.50	73.33	82.93	0.85	54.67	61.70	62.52	70.78	0.91
AudioMNIST	88.20	71.40	76.33	79.82	84.71	0.85	64.84	70.31	70.04	74.89	0.88

Table 1: Accuracy and F1 score of NExUME over other approaches.

Baseline: Since we are the first work to propose a new training approach targeted for intermittent devices and inference optimizations, we take the combination of best available approaches as “baseline”. All these DNNs are executed with the state-of-the-art checkpointing and scheduling approach Maeng and Lucia (2018). Baseline Full Power is a DNN designed by iNAS for running while the system is battery-powered and have to hit a target SLO (latency < 500ms). Baseline AP is a DNN compressed to fit to the average power of the EH environment using iNAS. On the other hand, baseline PT takes the Full Power DNN and uses techniques proposed by Yang et al. (2018) and Yang et al. (2017) to prune, quantize, and compress the model. Baseline iNAS+PT designs the network from ground-up while combining the work of Mendis et al. (2021), Yang et al. (2018) and Yang et al. (2017).

Table 1 shows the accuracy of our approach against the baselines described above using TI MSP running on RF Power from WiFi and Arduino Nano on Piezo Electric Power. The inferences meeting the SLO requirements are the only ones considered for accuracy, i.e., a correct classification violating the latency SLO is considered as “incorrect”. While NExUME is unable to surpass the accuracy of a fully-powered system (due to power failures), we observe it outperforming the iNAS+PT baseline by $\approx 14\%$ over all the power traces and embedded platforms tested (a pictorial setup of NExUME and additional results with alternate power sources and platforms are available in Appendix A).

4.3 NExUME on Machine Status Monitoring [Our New Dataset]

Setup and Sensor Arrangement: Two different types of 3-axis accelerometers (with 100Hz and 200Hz sampling rate) were placed in three different locations of a Bridgeport machine to collect and analyze data under different operating status. There were 5 operating status: three different speed of rotation of the spindle (R1, R2, R3 with no job), spindle under job (SJ), and spindle idle (SL). We collected over 700,000 samples over a period of 2 hours for each of the sensors. The sensor data were cleaned, normalized, and converted to the power spectrum density for further analysis.

iNAS+ Cofigs	Perplexity (CNN)		SLO
iNAS+ Cofigs	Validation	Test	Given	Latency Actual
4 x CONV2D: 8[3x3], 8[5x5], 16[5x5], 16[5x5], AvgPooL, L2Drop, FC	89.9	85.3	Accuracy: 88%; Latency: 300ms; Energy Profile: Piezo; HW Profile: TIMSP	380ms
3 x CONV2D: 8[3x3], 16[3x3], 16[3x3], AvgPooL, OBD_Drop, FC	88.3	84.75		260ms
3 x CONV2D: 8[3x3], 16[5x3], 16[5x3], AvgPooL, L2Drop, FC	89.2	85.1		280ms

Table 2: Sample configurations generated by DynNAS+ with their constraints, SLOs, and perplexity.

We use DynNAS+ and DynFit to evaluate the energy traces and perform a NAS to find the DNNs meeting the energy income. Table 2 shows the different configurations generated under the given constraints. We use DynFit to train these models and report the perplexity of the top-3 models. Note that, among the 1st and the 3rd models (in Table 2), the prior one offered minimal accuracy benefits compared to the latter while requiring more latency and was hence rejected as a configuration. Table 7 in Appendix A shows the accuracy of classification tasks against the baseline. We observe an anomalous behavior with R3 (rotating at 300 RPM). This was because of the interference of another machine running at the same rate next to it. The anomaly was accentuated because the rotations being in resonance with the main motor (working at 60 RPM).

4.4 Sensitivity and Ablation Studies of NExUME

To elucidate the influence of variable SLOs and hardware-specific settings on system performance, we conducted a comprehensive sensitivity study. This study involved adjusting the acceptable latency and the capacitance of the energy harvesting (EH) setup to assess their impacts on accuracy. As shown in Figure 1(a), the accuracy improves with increased latency, but with diminishing returns. Similarly, Figure 1(b) demonstrates that, while increasing capacitance should theoretically stabilize the system, its charging characteristics can lead to extended charging times, thus exceeding the latency SLO. Notably, some anomalies in the data were attributed to abrupt power failures, a common challenge in intermittent energy harvesting systems.

An ablation study evaluates the contributions of individual components within NExUME. The results, plotted in Figure 1(c), indicate that the greatest improvements are derived from the “synergistic operation” of all components, particularly DynFit and DynInfer. Although iNAS+ enhances network selection, its lack of intermittency awareness significantly impacts accuracy.

4.5 Limitations and Discussion

NExUME is especially advantageous in intermittent environments and its utility extends to ultra-low power or energy scavenging systems. However, the efficacy of DynAgent is contingent upon the breadth and depth of the available database. Additionally, profiling devices to ascertain their energy consumption, computational capabilities and memory footprint necessitates detailed micro-profiling using embedded programming. This process, while informative, yields only approximate models that are inherently prone to errors. DynFit, with its stochastic dropout features, occasionally leads to overfitting, necessitating meticulous fine-tuning.

While effective in smaller networks, our studies involving larger datasets (such as ImageNet) and more complex network architectures (like MobileNetV2 and ResNet) reveal challenges in achieving convergence without precise fine-tuning. DynFit tends to introduce multiple intermediate states during the training process, resulting in approximately 14% additional wall-time on average. The development of DynInfer requires an in-depth understanding of microcontroller programming and compiler directives. The absence of comprehensive library functions along with the need of computational efficiency frequently necessitates the development of in-line assembly code for certain computational kernels.

5 Conclusions

This paper introduces and evaluates NExUME (Neural Execution Under Intermittent Environment), a versatile framework enhancing DNN operations in energy-harvesting, intermittently-powered settings like remote sensing and wearable technologies. It dynamically integrates neural architecture search, dropout, and quantization adapted to energy variability, thereby significantly advancing the efficiency of DNN deployments. Key innovations include 1) DynNAS+ for optimizing DNN architectures under energy constraints, 2) DynFit for real-time adaptive dropout and quantization, and 3) DynInfer for responsive task scheduling. Further, another component, DynAgent, monitors and adapts to the shifting energy and hardware landscape. These contributions notably elevate IoT sustainability, promote accessibility in power-unstable regions, and catalyze advancements in energy harvesting and embedded ML technologies. NExUME sets a new standard for future research and deployment of energy-efficient, robust neural networks in diverse application scenarios.

References

Adafruit [2024] Adafruit. Tensorflow lite for microcontrollers kit. https://www.adafruit.com/product/4317, 2024. Accessed: 05/19/2024.
Afzal et al. [2022] Sayed Saad Afzal, Waleed Akbar, Osvy Rodriguez, Mario Doumet, Unsoo Ha, Reza Ghaffarivardavagh, and Fadel Adib. Battery-free wireless imaging of underwater environments. Nature communications, 13(1):5546, 2022.
Alex [2009] Krizhevsky Alex. Learning multiple layers of features from tiny images. https://www. cs. toronto. edu/kriz/learning-features-2009-TR. pdf, 2009.
Arduino [2024] Arduino. Arduino nano 33 ble sense with headers. https://store-usa.arduino.cc/products/arduino-nano-33-ble-sense-with-headers, 2024. Accessed on 05/19/2024.
Banos et al. [2014] Oresti Banos, Rafael Garcia, Juan A Holgado-Terriza, Miguel Damas, Hector Pomares, Ignacio Rojas, Alejandro Saez, and Claudia Villalonga. mhealthdroid: a novel framework for agile development of mobile health applications. In Ambient Assisted Living and Daily Activities: 6th International Work-Conference, IWAAL 2014, Belfast, UK, December 2-5, 2014. Proceedings 6, pages 91–98. Springer, 2014.
Becker et al. [2023] Sören Becker, Johanna Vielhaben, Marcel Ackermann, Klaus-Robert Müller, Sebastian Lapuschkin, and Wojciech Samek. Audiomnist: Exploring explainable artificial intelligence for audio analysis on a simple benchmark. Journal of the Franklin Institute, 2023. ISSN 0016-0032. doi: https://doi.org/10.1016/j.jfranklin.2023.11.038. URL https://www.sciencedirect.com/science/article/pii/S0016003223007536.
Gobieski et al. [2018] Graham Gobieski, Nathan Beckmann, and Brandon Lucia. Intermittent deep neural network inference. In SysML Conference, pages 1–3, 2018.
Gobieski et al. [2019] Graham Gobieski, Brandon Lucia, and Nathan Beckmann. Intelligence beyond the edge: Inference on intermittent embedded systems. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 199–213, 2019.
Insights [2023] Transforma Insights. Iot & ai market forecasts. https://transformainsights.com/research/tam/market, 2023. Accessed: 05/19/2021.
Instruments [2024a] Texas Instruments. Msp430fr5994 mixed-signal microcontrollers. https://www.ti.com/product/MSP430FR5994, 2024a. Accessed: 05/19/2024.
Instruments [2024b] Texas Instruments. Msp dsp library: Low energy accelerator (lea) user’s guide. https://software-dl.ti.com/msp430/msp430_public_sw/mcu/msp430/DSPLib/1_30_00_02/exports/html/usersguide_lea.html, 2024b. Accessed: 05/19/2024.
Islam and Nirjon [2019] Bashima Islam and Shahriar Nirjon. Zygarde: Time-sensitive on-device deep inference and adaptation on intermittently-powered systems. arXiv preprint arXiv:1905.03854, 2019.
Islam et al. [2022] Sahidul Islam, Jieren Deng, Shanglin Zhou, Chen Pan, Caiwen Ding, and Mimi Xie. Enabling fast deep learning on tiny energy-harvesting iot devices. In 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 921–926. IEEE, 2022.
Kang et al. [2020] Chih-Kai Kang, Hashan Roshantha Mendis, Chun-Han Lin, Ming-Syan Chen, and Pi-Cheng Hsiu. Everything leaves footprints: Hardware accelerated intermittent deep inference. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 39(11):3479–3491, 2020.
Kang et al. [2022] Chih-Kai Kang, Hashan Roshantha Mendis, Chun-Han Lin, Ming-Syan Chen, and Pi-Cheng Hsiu. More is less: Model augmentation for intermittent deep inference. ACM Transactions on Embedded Computing Systems (TECS), 21(5):1–26, 2022.
Lasi et al. [2014] Heiner Lasi, Peter Fettke, Hans-Georg Kemper, Thomas Feld, and Michael Hoffmann. Industry 4.0. Business & information systems engineering, 6(4):239–242, 2014.
Liu et al. [2015] Yongpan Liu, Zewei Li, Hehe Li, Yiqun Wang, Xueqing Li, Kaisheng Ma, Shuangchen Li, Meng-Fan Chang, Sampson John, Yuan Xie, et al. Ambient energy harvesting nonvolatile processors: From circuit to system. In Proceedings of the 52nd Annual Design Automation Conference, pages 1–6, 2015.
Lv and Xu [2022] Mingsong Lv and Enyu Xu. Efficient dnn execution on intermittently-powered iot devices with depth-first inference. IEEE Access, 10:101999–102008, 2022.
Ma et al. [2016] Kaisheng Ma, Xueqing Li, Karthik Swaminathan, Yang Zheng, Shuangchen Li, Yongpan Liu, Yuan Xie, John Jack Sampson, and Vijaykrishnan Narayanan. Nonvolatile processor architectures: Efficient, reliable progress with unstable power. IEEE Micro, 36(3):72–83, 2016.
Ma et al. [2017] Kaisheng Ma, Xueqing Li, Jinyang Li, Yongpan Liu, Yuan Xie, Jack Sampson, Mahmut Taylan Kandemir, and Vijaykrishnan Narayanan. Incidental computing on iot nonvolatile processors. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pages 204–218, 2017.
Maeng and Lucia [2018] Kiwan Maeng and Brandon Lucia. Adaptive dynamic checkpointing for safe efficient intermittent computing. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 129–144, 2018.
Mendis et al. [2021] Hashan Roshantha Mendis, Chih-Kai Kang, and Pi-cheng Hsiu. Intermittent-aware neural architecture search. ACM Transactions on Embedded Computing Systems (TECS), 20(5s):1–27, 2021.
Mishra et al. [2021] Cyan Subhra Mishra, Jack Sampson, Mahmut Taylan Kandemir, and Vijaykrishnan Narayanan. Origin: Enabling on-device intelligence for human activity recognition using energy harvesting wireless sensor networks. In 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1414–1419. IEEE, 2021.
Mishra et al. [2024] Cyan Subhra Mishra, Jack Sampson, Mahmut Taylan Kandemir, Vijaykrishnan Narayanan, and Chita R Das. Usas: A sustainable continuous-learning´ framework for edge servers. In 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 891–907. IEEE, 2024.
Qiu et al. [2020] Keni Qiu, Nicholas Jao, Mengying Zhao, Cyan Subhra Mishra, Gulsum Gudukbay, Sethu Jose, Jack Sampson, Mahmut Taylan Kandemir, and Vijaykrishnan Narayanan. Resirca: A resilient energy harvesting reram crossbar-based accelerator for intelligent embedded processors. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 315–327. IEEE, 2020.
Reiss and Stricker [2012] Attila Reiss and Didier Stricker. Introducing a new benchmarked dataset for activity monitoring. In 2012 16th international symposium on wearable computers, pages 108–109. IEEE, 2012.
Resch et al. [2020] Salonik Resch, S Karen Khatamifard, Zamshed I Chowdhury, Masoud Zabihi, Zhengyang Zhao, Husrev Cilasun, Jian-Ping Wang, Sachin S Sapatnekar, and Ulya R Karpuzcu. Mouse: Inference in non-volatile memory for energy harvesting applications. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 400–414. IEEE, 2020.
Saffari et al. [2021] Ali Saffari, Sin Yong Tan, Mohamad Katanbaf, Homagni Saha, Joshua R Smith, and Soumik Sarkar. Battery-free camera occupancy detection system. In Proceedings of the 5th International Workshop on Embedded and Mobile Deep Learning, pages 13–18, 2021.
Shen et al. [2022] Tianyi Shen, Cyan Subhra Mishra, Jack Sampson, Mahmut Taylan Kandemir, and Vijaykrishnan Narayanan. An efficient edge-cloud partitioning of random forests for distributed sensor networks. IEEE Embedded Systems Letters, 2022.
Xiao et al. [2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
Yang et al. [2017] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5687–5695, 2017.
Yang et al. [2018] Tien-Ju Yang, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze, and Hartwig Adam. Netadapt: Platform-aware neural network adaptation for mobile applications. In Proceedings of the European conference on computer vision (ECCV), pages 285–300, 2018.

Appendix A More Results on Other Platforms and EH Sources

Datasets	Full Power	AP	PT	iNAS+PT	NExUME	Better
Datasets	Full Power	MSP on Piezo
FMNIST	98.70	71.90	79.72	83.68	88.90	6.24%
CIFAR10	89.81	55.05	62.00	66.98	76.29	13.90%
MHEALTH	89.62	59.76	65.40	71.56	80.75	12.84%
PAMAP	87.30	57.38	65.77	65.38	75.16	14.97%
AudioMNIST	88.20	67.29	73.16	75.41	80.01	6.10%

Table 3: Accuracy of NExUME on MSP board using vibration from a Piezoelectric harvestor. Better refers to the improvement over iNAS+PT baseline.

Datasets	Full Power	AP	PT	iNAS+PT	NExUME	Better
Datasets	Full Power	MSP on Thermal
FMNIST	98.70	80.92	86.32	88.93	95.62	7.53%
CIFAR10	89.81	64.78	69.29	71.53	83.78	17.13%
MHEALTH	89.62	69.77	73.99	77.70	89.62	15.34%
PAMAP	87.30	66.33	71.84	74.47	85.24	14.46%
AudioMNIST	88.20	73.84	78.03	81.60	87.64	7.40%

Table 4: Accuracy of NExUME on MSP board using thermocouple based thermal harvester. Better refers to the improvement over iNAS+PT baseline.

Datasets	Full Power	AP	PT	iNAS+PT	NExUME	Better
Datasets	Full Power	Arduino on RF
FMNIST	98.70	74.44	79.63	83.61	90.44	8.17%
CIFAR10	89.81	58.11	63.91	65.01	79.60	22.44%
MHEALTH	89.62	63.52	67.40	74.30	83.86	12.87%
PAMAP	87.30	61.39	67.24	69.45	77.00	10.87%
AudioMNIST	88.20	66.11	74.28	76.60	78.87	2.97%

Table 5: Accuracy of NExUME on Arduino nano board using WiFi based RF harvester. Better refers to the improvement over iNAS+PT baseline.

Datasets	Full Power	AP	PT	iNAS+PT	NExUME	Better
Datasets	Full Power	Arduino on Thermal
FMNIST	98.70	77.04	80.44	83.08	89.90	8.20%
CIFAR10	89.81	60.38	65.90	66.98	80.70	20.48%
MHEALTH	89.62	65.74	69.88	72.41	85.75	18.42%
PAMAP	87.30	62.76	65.93	71.46	81.27	13.73%
AudioMNIST	88.20	69.12	73.86	77.79	83.54	7.39%

Table 6: Accuracy of NExUME on Arduino nano board using thermocouple based thermal harvester. Better refers to the improvement over iNAS+PT baseline.

Class	Full Power	TI MSP on Piezoelectric EH Device
Class	Full Power	AP	PT	iNAS+PT	NExUME	F1 Score
R1	84.93	74.46	77.02	79.62	80.85	0.89
R2	85.85	76.21	79.18	80.36	83.58	0.86
R3	81.09	72.43	75.38	78.18	80.61	0.79
SJ	90.95	82.33	85.00	87.58	89.04	0.91
SI	94.76	85.31	88.05	89.90	91.42	0.94

Table 7: Accuracy of NExUME for industry status monitoring dataset using TI MSP board and a piezoelectric EH source harvesting from the vibrations of the machines.

Appendix B Details on Energy Harvesting

A typical energy harvesting (EH) setup captures and converts environmental energy into usable electrical power, which can then support various electronic devices. Here’s a simplified breakdown of the process:

1.

Energy Capture: The setup begins with a harvester, such as a solar panel, piezoelectric sensor, or thermocouple. These devices are designed to collect energy from their surroundings—light, mechanical vibrations, or heat, respectively.
2.

Power Conditioning: Once energy is harvested, it often needs to be converted and stabilized for use. This is done using a rectifier, which transforms alternating current (AC) into a more usable direct current (DC).
3.

Voltage Regulation: After rectification, the power might not be at the right voltage for the device it needs to support. A matching circuit, including components like buck or boost converters, adjusts the voltage to the appropriate level, ensuring the device receives the correct current and voltage.
4.

Energy Storage: Finally, to ensure a continuous power supply even when the immediate energy source is inconsistent (like when a cloud passes over a solar panel), the system includes a temporary storage unit, such as a super-capacitor. This component helps smooth out the supply, providing steady power to the compute circuit.

By integrating these components, an EH system can sustainably power devices without relying on traditional power grids, making it ideal for remote or mobile applications.

Appendix C Intermittent Computing and Check-pointing

C.1 Intermittency-Aware General Matrix Multiplication (GeMM)

Here we explain the operation of an energy-aware algorithm for performing General Matrix Multiplication (GeMM). The algorithm is designed to operate in environments where energy availability is intermittent, such as in devices powered by energy harvesting. It includes mechanisms for loop tiling, checkpointing, and resumption to manage computation across power interruptions effectively.

C.1.1 Algorithm Overview

The GeMM operation, typically expressed as $C=A\times B$ , where $A$ , $B$ , and $C$ are matrices, is implemented with considerations for energy limitations. The algorithm breaks the matrix multiplication into smaller chunks (tiles), periodically saves the state before potential power losses, and resumes computation from the last saved state upon power restoration.

C.1.2 Function Definitions

•

SAVE_STATE: Saves the current indices and the partial result of the output matrix $C$ to non-volatile memory to allow recovery after a power interruption.
•

LOAD_STATE: Retrieves the last saved indices and partial result from non-volatile memory to resume computation.

C.1.3 Loop Tiling

The algorithm uses loop tiling to divide the computation into smaller blocks that can be managed between power interruptions. This tiling not only makes the computation manageable but also optimizes memory usage and cache performance, which is critical in constrained environments.

C.1.4 Check-pointing Mechanism

Before each power interruption, detected through an energy monitoring system, the algorithm saves the current state using the SAVE_STATE function. This state includes the loop indices and the current value of the element being processed in $C$ . This ensures that no computation is lost when the power goes out.

C.1.5 Resumption Mechanism

Upon resuming, the algorithm loads the saved state using the LOAD_STATE function. This state is used to continue the computation exactly where it left off, minimizing redundant operations and ensuring efficiency.

C.2 DynInfer with Hardware-Software Support

We also provide a full software-compiler-hardware driven execution framework for commercial devices with non-volatility support (like MSP-EXP430FR5994 with FeRAM). Figure 3 shows a detailed overview of our design execution. To support user programs ( ) we implement a moving window based power predictor ( ) which takes its input from the on-board EH capacitor. Considering the energy available, the predictor makes an informed decision on how to proceed. The compiler deconstructs the program into jobs to perform seamless program execution. These jobs form the functional program execution DAG. For example, for a DNN execution, the jobs could be CONV2D ( ), batch normalization ( ) etc. However, certain jobs could be too big to execute atomically on harvested energy. Therefore, we profile the task using the compute platform (in this case using the MSP-EXP430FR5994 and the LEA in it) to further divide the jobs into Power Atomic Tasks (Tasks hence further). These Tasks are carefully coded with optimized assembly language to maximize their efficiency. We take advantage of Hardware Support (the on-board NV FeRAM) to perform backup and restore in case of Power Emergencies. In case of a power emergency (as shown in Figure 3 ) the task is abandoned and a hardware assisted Backup ( ) and Restore ( ) is performed.

Appendix D Formulation of Dynamic Dropouts:

D.1 L2 Dynamic Dropout with QuantaTask Optimization

L2 Dynamic Dropout leverages the L2 norm of the weights to influence dropout rates, combined with the QuantaTask optimization to handle energy constraints in intermittent systems.

Mathematical Formulation: Let $\mathbf{W}$ be the weight matrix of a layer. The L2 norm of the weights is calculated as:

\|\mathbf{W}\|_{2}=\sqrt{\sum_{i,j}W_{ij}^{2}}

Define the dropout probability $p_{i}$ for neuron $i$ based on the L2 norm of its corresponding weights. The idea is to use the inverse of the L2 norm to determine the probability:

p_{i}=\frac{\alpha}{\|\mathbf{W}_{i}\|_{2}+\epsilon}

where $\alpha$ is a scaling factor to adjust the overall dropout rate, and $\epsilon$ is a small constant to avoid division by zero. Define a binary dropout mask $\mathbf{m}=[m_{1},m_{2},\ldots,m_{n}]$ where $m_{i}\in\{0,1\}$ . Each element of the mask is determined by sampling from a Bernoulli distribution with probability $1-p_{i}$ :

m_{i}\sim\text{Bernoulli}(1-p_{i})

Apply the dropout mask during the forward pass. Let $\mathbf{a}_{i}$ denote the activation of neuron $i$ :

\mathbf{a}_{i}^{\text{dropout}}=\mathbf{a}_{i}\cdot m_{i}

Training with L2 Dynamic Dropout and QuantaTask Optimization: Initialize the network parameters $\mathbf{W}$ , dropout mask $\mathbf{m}$ , and scaling factor $\alpha$ . Define the energy budget $E_{b}$ for a single quanta and for the entire inference. Initialize the loop iteration parameters $l$ . Compute the activations $\mathbf{a}$ and apply the dropout mask:

\mathbf{a}_{i}^{\text{dropout}}=\mathbf{a}_{i}\cdot m_{i}

Compute the loss $\mathcal{L}(\mathbf{Y},\mathbf{\hat{Y}})$ where $\mathbf{Y}$ is the output of the network and $\mathbf{\hat{Y}}$ is the target output. Calculate the gradients of the loss with respect to the weights:

\frac{\partial\mathcal{L}}{\partial W_{ij}}

For each layer $L$ and loop $i$ within the layer, estimate the energy $E_{i}$ required for the current quanta size $l_{i}$ :

E_{i}\leftarrow\text{DynAgent.estimateEnergy}(L,i,l_{i})

If $E_{i}>E_{b}$ , fuse tasks to reduce the overhead:

\text{FuseTasks}(L,i,l_{i},E_{b})

Update $E_{i}$ after task fusion:

E_{i}\leftarrow\text{DynAgent.estimateEnergy}(L,i,l_{i})

Update the dropout mask $\mathbf{m}$ based on the L2 norm of the weights:

p_{i}=\frac{\alpha}{\|\mathbf{W}_{i}\|_{2}+\epsilon}

m_{i}=\begin{cases}0&\text{if }\text{Bernoulli}(1-p_{i})=0\\ 1&\text{otherwise}\end{cases}

Perform the backward pass to update the network weights, considering the dropout mask:

\mathbf{W}\leftarrow\mathbf{W}-\eta\frac{\partial\mathcal{L}}{\partial\mathbf{W}}\odot\mathbf{m}

where $\eta$ is the learning rate and $\odot$ denotes element-wise multiplication.

Inference with L2 Dynamic Dropout and QuantaTask Optimization: Check the available energy using DynAgent. If energy is below a threshold, increase the dropout rate to ensure the inference can be completed within the energy budget. Otherwise, maintain or reduce the dropout rate to improve accuracy. Perform the forward pass with the updated dropout mask to obtain the output $\mathbf{Y}$ . This approach ensures that the network is robust to varying energy conditions by incorporating dynamic dropout influenced by the L2 norm of the weights, along with the QuantaTask optimization to handle energy constraints.

D.2 Optimal Brain Damage Dropout with QuantaTask Optimization

Optimal Brain Damage Dropout leverages a simplified version of the Optimal Brain Damage pruning method to adjust dropout rates, combined with the QuantaTask optimization to handle energy constraints in intermittent systems.

Mathematical Formulation: Let $\mathbf{W}$ be the weight matrix of a layer. The sensitivity of each weight $W_{ij}$ is calculated using the second-order Taylor expansion of the loss function $\mathcal{L}$ :

\Delta\mathcal{L}\approx\frac{1}{2}\sum_{i,j}\frac{\partial^{2}\mathcal{L}}{\partial W_{ij}^{2}}(W_{ij})^{2}

where $\frac{\partial^{2}\mathcal{L}}{\partial W_{ij}^{2}}$ is the second-order derivative (Hessian) of the loss with respect to the weights.

Define the dropout probability $p_{i}$ for neuron $i$ based on the sensitivity of its corresponding weights. The idea is to use the sensitivity to determine the probability:

p_{i}=\frac{\beta\sum_{j}\frac{\partial^{2}\mathcal{L}}{\partial W_{ij}^{2}}(W_{ij})^{2}}{\max\left(\sum_{j}\frac{\partial^{2}\mathcal{L}}{\partial W_{ij}^{2}}(W_{ij})^{2}\right)+\epsilon}

where $\beta$ is a scaling factor to adjust the overall dropout rate, and $\epsilon$ is a small constant to avoid division by zero.

Define a binary dropout mask $\mathbf{m}=[m_{1},m_{2},\ldots,m_{n}]$ where $m_{i}\in\{0,1\}$ . Each element of the mask is determined by sampling from a Bernoulli distribution with probability $1-p_{i}$ :

m_{i}\sim\text{Bernoulli}(1-p_{i})

Apply the dropout mask during the forward pass. Let $\mathbf{a}_{i}$ denote the activation of neuron $i$ :

\mathbf{a}_{i}^{\text{dropout}}=\mathbf{a}_{i}\cdot m_{i}

Training with Optimal Brain Damage Dropout and QuantaTask Optimization: Initialize the network parameters $\mathbf{W}$ , dropout mask $\mathbf{m}$ , and scaling factor $\beta$ . Define the energy budget $E_{b}$ for a single quanta and for the entire inference. Initialize the loop iteration parameters $l$ .

Compute the activations $\mathbf{a}$ and apply the dropout mask:

\mathbf{a}_{i}^{\text{dropout}}=\mathbf{a}_{i}\cdot m_{i}

Compute the loss $\mathcal{L}(\mathbf{Y},\mathbf{\hat{Y}})$ where $\mathbf{Y}$ is the output of the network and $\mathbf{\hat{Y}}$ is the target output.

Calculate the gradients and Hessians of the loss with respect to the weights:

\frac{\partial\mathcal{L}}{\partial W_{ij}},\quad\frac{\partial^{2}\mathcal{L}}{\partial W_{ij}^{2}}

For each layer $L$ and loop $i$ within the layer, estimate the energy $E_{i}$ required for the current quanta size $l_{i}$ :

E_{i}\leftarrow\text{DynAgent.estimateEnergy}(L,i,l_{i})

If $E_{i}>E_{b}$ , fuse tasks to reduce the overhead:

\text{FuseTasks}(L,i,l_{i},E_{b})

Update $E_{i}$ after task fusion:

E_{i}\leftarrow\text{DynAgent.estimateEnergy}(L,i,l_{i})

Update the dropout mask $\mathbf{m}$ based on the sensitivities:

p_{i}=\frac{\beta\sum_{j}\frac{\partial^{2}\mathcal{L}}{\partial W_{ij}^{2}}(W_{ij})^{2}}{\max\left(\sum_{j}\frac{\partial^{2}\mathcal{L}}{\partial W_{ij}^{2}}(W_{ij})^{2}\right)+\epsilon}

m_{i}=\begin{cases}0&\text{if }\text{Bernoulli}(1-p_{i})=0\\ 1&\text{otherwise}\end{cases}

Perform the backward pass to update the network weights, considering the dropout mask:

\mathbf{W}\leftarrow\mathbf{W}-\eta\frac{\partial\mathcal{L}}{\partial\mathbf{W}}\odot\mathbf{m}

where $\eta$ is the learning rate and $\odot$ denotes element-wise multiplication.

Inference with Optimal Brain Damage Dropout and QuantaTask Optimization: Check the available energy using DynAgent. If energy is below a threshold, increase the dropout rate to ensure the inference can be completed within the energy budget. Otherwise, maintain or reduce the dropout rate to improve accuracy. Perform the forward pass with the updated dropout mask to obtain the output $\mathbf{Y}$ . This approach ensures that the network is robust to varying energy conditions by incorporating dynamic dropout influenced by the sensitivity of the weights, along with the QuantaTask optimization to handle energy constraints.

D.3 Feature Map Reconstruction Error Dropout with QuantaTask Optimization

Feature Map Reconstruction Error Dropout leverages the reconstruction error of feature maps to adjust dropout rates, combined with the QuantaTask optimization to handle energy constraints in intermittent systems.

Mathematical Formulation: Let $\mathbf{W}$ be the weight matrix of a layer and $\mathbf{F}$ be the feature maps produced by the layer. The reconstruction error of a feature map $F_{i}$ is calculated as:

\text{RE}_{i}=\|\mathbf{F}_{i}-\hat{\mathbf{F}}_{i}\|_{2}

where $\hat{\mathbf{F}}_{i}$ is the reconstructed feature map, and $\|\cdot\|_{2}$ denotes the L2 norm.

Define the dropout probability $p_{i}$ for neuron $i$ based on the reconstruction error of its corresponding feature map. The idea is to use the reconstruction error to determine the probability:

p_{i}=\frac{\gamma\,\text{RE}_{i}}{\max(\text{RE})+\epsilon}

where $\gamma$ is a scaling factor to adjust the overall dropout rate, and $\epsilon$ is a small constant to avoid division by zero.

m_{i}\sim\text{Bernoulli}(1-p_{i})

Apply the dropout mask during the forward pass. Let $\mathbf{a}_{i}$ denote the activation of neuron $i$ :

\mathbf{a}_{i}^{\text{dropout}}=\mathbf{a}_{i}\cdot m_{i}

Training with Feature Map Reconstruction Error Dropout and QuantaTask Optimization: Initialize the network parameters $\mathbf{W}$ , dropout mask $\mathbf{m}$ , and scaling factor $\gamma$ . Define the energy budget $E_{b}$ for a single quanta and for the entire inference. Initialize the loop iteration parameters $l$ .

Compute the activations $\mathbf{a}$ and apply the dropout mask:

\mathbf{a}_{i}^{\text{dropout}}=\mathbf{a}_{i}\cdot m_{i}

Compute the loss $\mathcal{L}(\mathbf{Y},\mathbf{\hat{Y}})$ where $\mathbf{Y}$ is the output of the network and $\mathbf{\hat{Y}}$ is the target output.

Calculate the gradients of the loss with respect to the weights:

\frac{\partial\mathcal{L}}{\partial W_{ij}}

For each layer $L$ and loop $i$ within the layer, estimate the energy $E_{i}$ required for the current quanta size $l_{i}$ :

E_{i}\leftarrow\text{DynAgent.estimateEnergy}(L,i,l_{i})

If $E_{i}>E_{b}$ , fuse tasks to reduce the overhead:

\text{FuseTasks}(L,i,l_{i},E_{b})

Update $E_{i}$ after task fusion:

E_{i}\leftarrow\text{DynAgent.estimateEnergy}(L,i,l_{i})

Update the dropout mask $\mathbf{m}$ based on the reconstruction error of the feature maps:

p_{i}=\frac{\gamma\,\text{RE}_{i}}{\max(\text{RE})+\epsilon}

m_{i}=\begin{cases}0&\text{if }\text{Bernoulli}(1-p_{i})=0\\ 1&\text{otherwise}\end{cases}

Perform the backward pass to update the network weights, considering the dropout mask:

\mathbf{W}\leftarrow\mathbf{W}-\eta\frac{\partial\mathcal{L}}{\partial\mathbf{W}}\odot\mathbf{m}

where $\eta$ is the learning rate and $\odot$ denotes element-wise multiplication.

Inference with Feature Map Reconstruction Error Dropout and QuantaTask Optimization: Check the available energy using DynAgent. If energy is below a threshold, increase the dropout rate to ensure the inference can be completed within the energy budget. Otherwise, maintain or reduce the dropout rate to improve accuracy. Perform the forward pass with the updated dropout mask to obtain the output $\mathbf{Y}$ . This approach ensures that the network is robust to varying energy conditions by incorporating dynamic dropout influenced by the reconstruction error of the feature maps, along with the QuantaTask optimization to handle energy constraints.

D.4 Learning Sparse Masks Dropout with QuantaTask Optimization

Learning Sparse Masks Dropout adapts dropout masks as learnable parameters within the network, inspired by Wen et al. (2016), combined with the QuantaTask optimization to handle energy constraints in intermittent systems.

Mathematical Formulation: Let $\mathbf{W}$ be the weight matrix of a layer. Define a binary dropout mask $\mathbf{m}=[m_{1},m_{2},\ldots,m_{n}]$ where $m_{i}\in\{0,1\}$ . In Learning Sparse Masks Dropout, the dropout masks are treated as learnable parameters. The mask values are determined using a sigmoid function to ensure they lie between 0 and 1:

m_{i}=\sigma(z_{i})

where $z_{i}$ are learnable parameters and $\sigma(\cdot)$ is the sigmoid function.

Apply the dropout mask during the forward pass. Let $\mathbf{a}_{i}$ denote the activation of neuron $i$ :

\mathbf{a}_{i}^{\text{dropout}}=\mathbf{a}_{i}\cdot m_{i}

Compute the loss $\mathcal{L}(\mathbf{Y},\mathbf{\hat{Y}})$ where $\mathbf{Y}$ is the output of the network and $\mathbf{\hat{Y}}$ is the target output.

DynFit integrates closely with DynAgent, which serves as a repository of EH profiles and hardware characteristics. Let $\mathcal{Q}$ represent the set of execution quanta, where each quanta $q\in\mathcal{Q}$ is defined by a tuple $(l,e)$ :

q=(l,e)

Here, $l$ is the number of loop iterations and $e$ is the estimated energy required for these iterations. The goal is to optimize the loop iteration parameter $l$ such that the energy consumption $E_{q}$ for each quanta $q$ is within the energy budget $E_{b}$ :

\text{minimize}\quad\sum_{q\in\mathcal{Q}}E_{q}\quad\text{subject to}\quad E_{q}\leq E_{b}

Training with Learning Sparse Masks Dropout and QuantaTask Optimization: Initialize the network parameters $\mathbf{W}$ , dropout mask parameters $\mathbf{z}$ , and scaling factor $\alpha$ . Define the energy budget $E_{b}$ for a single quanta and for the entire inference. Initialize the loop iteration parameters $l$ .

Compute the activations $\mathbf{a}$ and apply the dropout mask:

m_{i}=\sigma(z_{i})

\mathbf{a}_{i}^{\text{dropout}}=\mathbf{a}_{i}\cdot m_{i}

Compute the loss $\mathcal{L}(\mathbf{Y},\mathbf{\hat{Y}})$ . Calculate the gradients of the loss with respect to the weights and dropout mask parameters:

\frac{\partial\mathcal{L}}{\partial W_{ij}},\quad\frac{\partial\mathcal{L}}{\partial z_{i}}

For each layer $L$ and loop $i$ within the layer, estimate the energy $E_{i}$ required for the current quanta size $l_{i}$ :

E_{i}\leftarrow\text{DynAgent.estimateEnergy}(L,i,l_{i})

If $E_{i}>E_{b}$ , fuse tasks to reduce the overhead:

\text{FuseTasks}(L,i,l_{i},E_{b})

Update $E_{i}$ after task fusion:

E_{i}\leftarrow\text{DynAgent.estimateEnergy}(L,i,l_{i})

Update the dropout mask parameters $\mathbf{z}$ based on the gradients:

z_{i}\leftarrow z_{i}-\eta\frac{\partial\mathcal{L}}{\partial z_{i}}

Perform the backward pass to update the network weights, considering the dropout mask:

\mathbf{W}\leftarrow\mathbf{W}-\eta\frac{\partial\mathcal{L}}{\partial\mathbf{W}}\odot\mathbf{m}

where $\eta$ is the learning rate and $\odot$ denotes element-wise multiplication.

Inference with Learning Sparse Masks Dropout and QuantaTask Optimization: Check the available energy using DynAgent. If energy is below a threshold, increase the dropout rate to ensure the inference can be completed within the energy budget. Otherwise, maintain or reduce the dropout rate to improve accuracy. Perform the forward pass with the updated dropout mask to obtain the output $\mathbf{Y}$ . This approach ensures that the network is robust to varying energy conditions by incorporating dynamic dropout with learnable mask parameters, along with the QuantaTask optimization to handle energy constraints.

D.5 Neuron Shapley Value Dropout with QuantaTask Optimization

Neuron Shapley Value Dropout applies the concept of Shapley values from game theory (Aas et al., 2021) to assess neuron importance for dropout, combined with the QuantaTask optimization to handle energy constraints in intermittent systems.

Mathematical Formulation: The Shapley value $\phi_{i}$ of neuron $i$ is a measure of its contribution to the overall network performance. It is calculated by considering all possible subsets of neurons and computing the marginal contribution of neuron $i$ to the network’s output:

\phi_{i}=\frac{1}{|\mathcal{N}|!}\sum_{S\subseteq\mathcal{N}\setminus\{i\}}\frac{|S|!(|\mathcal{N}|-|S|-1)!}{|\mathcal{N}|}\left[\mathcal{L}(S\cup\{i\})-\mathcal{L}(S)\right]

where $\mathcal{N}$ is the set of all neurons, $S$ is a subset of neurons not containing $i$ , and $\mathcal{L}(\cdot)$ denotes the loss function.

Define the dropout probability $p_{i}$ for neuron $i$ based on its Shapley value. Neurons with lower Shapley values are more likely to be dropped:

p_{i}=\frac{\delta}{\phi_{i}+\epsilon}

where $\delta$ is a scaling factor to adjust the overall dropout rate, and $\epsilon$ is a small constant to avoid division by zero.

m_{i}\sim\text{Bernoulli}(1-p_{i})

Apply the dropout mask during the forward pass. Let $\mathbf{a}_{i}$ denote the activation of neuron $i$ :

\mathbf{a}_{i}^{\text{dropout}}=\mathbf{a}_{i}\cdot m_{i}

Training with Neuron Shapley Value Dropout and QuantaTask Optimization: Initialize the network parameters $\mathbf{W}$ , dropout mask $\mathbf{m}$ , and scaling factor $\delta$ . Define the energy budget $E_{b}$ for a single quanta and for the entire inference. Initialize the loop iteration parameters $l$ .

Compute the activations $\mathbf{a}$ and apply the dropout mask:

\mathbf{a}_{i}^{\text{dropout}}=\mathbf{a}_{i}\cdot m_{i}

Compute the loss $\mathcal{L}(\mathbf{Y},\mathbf{\hat{Y}})$ where $\mathbf{Y}$ is the output of the network and $\mathbf{\hat{Y}}$ is the target output.

Calculate the Shapley values $\phi_{i}$ for each neuron based on their contribution to the network’s performance.

For each layer $L$ and loop $i$ within the layer, estimate the energy $E_{i}$ required for the current quanta size $l_{i}$ :

E_{i}\leftarrow\text{DynAgent.estimateEnergy}(L,i,l_{i})

If $E_{i}>E_{b}$ , fuse tasks to reduce the overhead:

\text{FuseTasks}(L,i,l_{i},E_{b})

Update $E_{i}$ after task fusion:

E_{i}\leftarrow\text{DynAgent.estimateEnergy}(L,i,l_{i})

Update the dropout mask $\mathbf{m}$ based on the Shapley values:

p_{i}=\frac{\delta}{\phi_{i}+\epsilon}

m_{i}=\begin{cases}0&\text{if }\text{Bernoulli}(1-p_{i})=0\\ 1&\text{otherwise}\end{cases}

Perform the backward pass to update the network weights, considering the dropout mask:

\mathbf{W}\leftarrow\mathbf{W}-\eta\frac{\partial\mathcal{L}}{\partial\mathbf{W}}\odot\mathbf{m}

where $\eta$ is the learning rate and $\odot$ denotes element-wise multiplication.

Inference with Neuron Shapley Value Dropout and QuantaTask Optimization: Check the available energy using DynAgent. If energy is below a threshold, increase the dropout rate to ensure the inference can be completed within the energy budget. Otherwise, maintain or reduce the dropout rate to improve accuracy. Perform the forward pass with the updated dropout mask to obtain the output $\mathbf{Y}$ . This approach ensures that the network is robust to varying energy conditions by incorporating dynamic dropout influenced by the Shapley values of the neurons, along with the QuantaTask optimization to handle energy constraints.

D.6 Taylor Expansion Dropout with QuantaTask Optimization

Taylor Expansion Dropout uses Taylor expansion (Li et al., 2016) to evaluate the impact of neurons on loss for dropout adjustments, combined with the QuantaTask optimization to handle energy constraints in intermittent systems.

Mathematical Formulation: Let $\mathbf{W}$ be the weight matrix of a layer. The impact of neuron $i$ on the loss function $\mathcal{L}$ can be approximated using the first-order Taylor expansion:

\Delta\mathcal{L}_{i}\approx\left|\frac{\partial\mathcal{L}}{\partial\mathbf{a}_{i}}\mathbf{a}_{i}\right|

where $\mathbf{a}_{i}$ is the activation of neuron $i$ , and $\frac{\partial\mathcal{L}}{\partial\mathbf{a}_{i}}$ is the gradient of the loss with respect to the activation.

Define the dropout probability $p_{i}$ for neuron $i$ based on the Taylor expansion approximation of its impact on the loss:

p_{i}=\frac{\lambda}{\left|\frac{\partial\mathcal{L}}{\partial\mathbf{a}_{i}}\mathbf{a}_{i}\right|+\epsilon}

where $\lambda$ is a scaling factor to adjust the overall dropout rate, and $\epsilon$ is a small constant to avoid division by zero.

m_{i}\sim\text{Bernoulli}(1-p_{i})

Apply the dropout mask during the forward pass. Let $\mathbf{a}_{i}$ denote the activation of neuron $i$ :

\mathbf{a}_{i}^{\text{dropout}}=\mathbf{a}_{i}\cdot m_{i}

Training with Taylor Expansion Dropout and QuantaTask Optimization: Initialize the network parameters $\mathbf{W}$ , dropout mask $\mathbf{m}$ , and scaling factor $\lambda$ . Define the energy budget $E_{b}$ for a single quanta and for the entire inference. Initialize the loop iteration parameters $l$ .

Compute the activations $\mathbf{a}$ and apply the dropout mask:

\mathbf{a}_{i}^{\text{dropout}}=\mathbf{a}_{i}\cdot m_{i}

Compute the loss $\mathcal{L}(\mathbf{Y},\mathbf{\hat{Y}})$ where $\mathbf{Y}$ is the output of the network and $\mathbf{\hat{Y}}$ is the target output.

Calculate the gradients of the loss with respect to the activations:

\frac{\partial\mathcal{L}}{\partial\mathbf{a}_{i}}

For each layer $L$ and loop $i$ within the layer, estimate the energy $E_{i}$ required for the current quanta size $l_{i}$ :

E_{i}\leftarrow\text{DynAgent.estimateEnergy}(L,i,l_{i})

If $E_{i}>E_{b}$ , fuse tasks to reduce the overhead:

\text{FuseTasks}(L,i,l_{i},E_{b})

Update $E_{i}$ after task fusion:

E_{i}\leftarrow\text{DynAgent.estimateEnergy}(L,i,l_{i})

Update the dropout mask $\mathbf{m}$ based on the Taylor expansion approximation:

p_{i}=\frac{\lambda}{\left|\frac{\partial\mathcal{L}}{\partial\mathbf{a}_{i}}\mathbf{a}_{i}\right|+\epsilon}

m_{i}=\begin{cases}0&\text{if }\text{Bernoulli}(1-p_{i})=0\\ 1&\text{otherwise}\end{cases}

Perform the backward pass to update the network weights, considering the dropout mask:

\mathbf{W}\leftarrow\mathbf{W}-\eta\frac{\partial\mathcal{L}}{\partial\mathbf{W}}\odot\mathbf{m}

where $\eta$ is the learning rate and $\odot$ denotes element-wise multiplication.

Inference with Taylor Expansion Dropout and QuantaTask Optimization: Check the available energy using DynAgent. If energy is below a threshold, increase the dropout rate to ensure the inference can be completed within the energy budget. Otherwise, maintain or reduce the dropout rate to improve accuracy. Perform the forward pass with the updated dropout mask to obtain the output $\mathbf{Y}$ . This approach ensures that the network is robust to varying energy conditions by incorporating dynamic dropout influenced by the Taylor expansion approximation of the neurons’ impact on the loss, along with the QuantaTask optimization to handle energy constraints.

Appendix E Workings of Re-RAM Crossbar

E.1 Re-RAM cross-bar for DNN inference:

ReRAM x-bars are an emerging class of computing devices that leverage resistive random-access memory (ReRAM) technology for efficient and low-power computing. These devices can perform multiplication and addition operations in a single operation, making them ideal for many signal processing and machine learning applications. Moreover, these devices can also be used for performing convolution operations, which are widely used in image and signal processing applications.

E.1.1 Simple Single Cell Example:

consider a simple example of a ReRAM crossbar array with two cells, where V1 and V2 are the input voltages, G1 and G2 are the conductance values of the ReRAM devices, and I1 and I2 are the resulting output currents. To perform multiplication-addition, we first apply the input voltages V1 and V2 to the rows of the crossbar array. The conductance values G1 and G2 of the ReRAM devices are set to the corresponding weight values for the multiplication operation. The output currents I1 and I2 are then computed as follows:

	$\displaystyle I=I1+I2$
	$\displaystyle=G1\times V1+G2\times V2$

Here, the output currents I1 and I2 are the result of the multiplication of the input voltages V1 and V2 by their respective weight values, which are summed together using the crossbar wires. Please refer to Figure 4(a) for more details. As we can see, the input voltages V1 and V2 are applied to the rows of the crossbar array, while the conductance values G1 and G2 are applied to the columns. The output currents I1 and I2 are the result of the multiplication-addition operation, and are obtained by summing the currents flowing through the ReRAM devices.

In practice, ReRAM crossbar arrays can have many more cells, and can be used to perform more complex multiplication-addition and convolution operations. However, the basic principle remains the same, where the input signals are applied to the rows, the weights are applied to the columns, and the output signals are obtained by summing the currents flowing through the ReRAM devices.

E.1.2 Extending to Complex Compute:

In order to perform multiplication-addition in ReRAM x-bars, two arrays of weights and inputs are used. The inputs are fed to the x-bar, which is a two-dimensional array of ReRAM crossbar arrays. The crossbar arrays are composed of a set of row and column wires that intersect at a set of ReRAM devices (refer Figure 4(b)). The ReRAM devices are programmed to have different resistance values, which are used to store the weights.

During the multiplication-addition operation, the input signals are applied to the rows of the x-bar, and the weights are applied to the columns. The output of each ReRAM device is the product of the input and weight signals, which are added together using the crossbar wires. This results in a single output signal that represents the sum of the weighted inputs.

To perform convolution, ReRAM x-bars use a similar approach, but with a more complex circuit. The input signal is applied to the x-bar in the same way, but the weights are now applied in a more structured way. Specifically, the weights are arranged in a way that mimics the convolution operation, such that each weight corresponds to a specific location in the input signal. To perform the convolution operation, the input signal is applied to the rows of the x-bar, and the weights are applied to the columns in a structured way. The output signal is obtained by summing the weighted input signals over a sliding window, which moves across the input signal to compute the convolution.

At the circuit level, the ReRAM x-bar for multiplication-addition typically includes several components, such as digital-to-analog converters (DACs), analog-to-digital converters (ADCs), shift registers, and hold capacitors. The DACs and ADCs are used to convert the digital input and weight signals into analog signals that can be applied to the rows and columns of the x-bar. The shift registers are used to apply the weight signals in a structured way, and the hold capacitors are used to store the analog signals during the multiplication-addition operation. Similarly, for performing convolution, the ReRAM x-bar typically includes additional components, such as delay lines and adders. The delay lines are used to implement the sliding window for the convolution operation, while the adders are used to sum the weighted input signals over the sliding window.

Appendix F Pseudo Codes

F.1 Depth-wise Separable Convolution 2D Using TI LEA

Depth-wise separable convolution is an efficient form of convolution that reduces the computational cost compared to standard convolution. Here we describe the implementation of depth-wise separable convolution 2D using the Low Energy Accelerator (LEA) in Texas Instruments’ MSP430 microcontrollers.

F.1.1 depth-wise Separable Convolution 2D Using Conv1D

The pseudo code described in Algorithm 1 implements a depth-wise separable convolution 2D (DWSConv2D) using a 1D convolution primitive function (conv1D). The DWSConv2D function takes four inputs: an input matrix, depth-wise kernels (DWsKernels), point-wise kernels (PtWsKernel), and an output matrix. The depth-wise separable convolution is performed in two main steps: depth-wise convolution and point-wise convolution.

Algorithm 1 Implementing Depth-wise Separable Convolution - DWSConv2D() using CONV1D ()

1:Function DWSepConv2D(

inputMatrix

DWsKernels

PtWsKernel

outputMatrix

2: Initialize

DWsOutput

with zero values, same shape as

inputMatrix

3: # Depth-wise Separable (DWs) convolution

4: for

c\leftarrow 0

channels(inputMatrix)-1

5: # Apply 1D convolution along rows

6: for

i\leftarrow 0

rows(inputMatrix[c])-1

7: conv1D(

inputMatrix[c][i,:]

DWsKernels[c][0,:]

DWsOutput[c][i,:]

)

8: # Apply 1D convolution along columns

9: for

j\leftarrow 0

cols(DWsOutput[c])-1

10: conv1D(

DWsOutput[c][:,j]

DWsKernels[c][:,0]

DWsOutput[c][:,j]

)

11: # Point-wise (PtWs) convolution

12: Initialize

finalOutput

with zero values, with shape [rows(

DWsOutput

), cols(

DWsOutput

), channels(

PtWsKernel

)]

13: for

i\leftarrow 0

rows(DWsOutput)-1

14: for

j\leftarrow 0

cols(DWsOutput)-1

15: for

k\leftarrow 0

channels(PtWsKernel)-1

16: Initialize

PtWsSum\leftarrow 0

17: for

c\leftarrow 0

channels(DWsOutput)-1

18:

PtWsSum\leftarrow PtWsSum+DWsOutput[c][i][j]\times PtWsKernel[c][k]

19:

finalOutput[i][j][k]\leftarrow PtWsSum

20: return

finalOutput

F.1.2 Pseudocode with micro-controller primitives

The following pseudocode describes the steps to implement depth-wise separable convolution using LEA primitives from TI’s DSP Library.

Algorithm 2 depth-wise Separable Convolution 2D Using TI LEA

1:function DWSepConv2D(

inputMatrix

DWsKernels

PtWsKernel

outputMatrix

)

2: Initialize

tempMatrix1

and

tempMatrix2

with zero values, same shape as

inputMatrix

3: // Depth-wise convolution

4: for

c\leftarrow 0

channels(inputMatrix)-1

5: // Apply 1D convolution along rows

6: for

i\leftarrow 0

rows(inputMatrix[c])-1

7: msp_conv_iq31(

inputMatrix[c][i,:]

DWsKernels[c][0,:]

tempMatrix1[c][i,:]

cols(inputMatrix)

FILTER\_SIZE

)

8: end for

9: // Apply 1D convolution along columns

10: for

j\leftarrow 0

cols(tempMatrix1[c])-1

11: msp_conv_iq31(

tempMatrix1[c][:,j]

DWsKernels[c][:,0]

tempMatrix2[c][:,j]

rows(tempMatrix1)

FILTER\_SIZE

)

12: end for

13: end for

14: // Point-wise convolution

15: Initialize

finalOutput

with zero values, shape [rows(

tempMatrix2

), cols(

tempMatrix2

), channels(

PtWsKernel

)]

16: for

i\leftarrow 0

rows(tempMatrix2)-1

17: for

j\leftarrow 0

cols(tempMatrix2)-1

18: for

k\leftarrow 0

channels(PtWsKernel)-1

19: Initialize

PtWsSum\leftarrow 0

20: for

c\leftarrow 0

channels(tempMatrix2)-1

21:

PtWsSum\leftarrow PtWsSum+tempMatrix2[c][i][j]\times PtWsKernel[c][k]

22: end for

23:

finalOutput[i][j][k]\leftarrow PtWsSum

24: end for

25: end for

26: end for

27: return

finalOutput

28:end function

F.1.3 Implementation Code

C code that implements the pseudo-code using TI’s LEA Instruments [2024b] functions.

#include <msp430.h>
#include "DSPLib.h"

#define ROWS 64
#define COLS 64
#define CHANNELS 3
#define FILTER_SIZE 3

// Initialize your input, depth-wise kernels, point-wise kernels,
// and output matrices appropriately
_q31 inputMatrix[CHANNELS][ROWS][COLS];
_q31 DWsKernels[CHANNELS][FILTER_SIZE][FILTER_SIZE];
_q31 PtWsKernel[CHANNELS][CHANNELS];
_q31 tempMatrix1[CHANNELS][ROWS][COLS];
_q31 tempMatrix2[CHANNELS][ROWS][COLS];
_q31 finalOutput[ROWS][COLS][CHANNELS];

void DWSepConv2D() {
    // Depth-wise convolution
    for (int c = 0; c < CHANNELS; c++) {
        // Apply 1D convolution along rows
        for (int i = 0; i < ROWS; i++) {
            msp_conv_iq31(&inputMatrix[c][i][0], DWsKernels[c][0],
            &tempMatrix1[c][i][0], COLS, FILTER_SIZE);
        }
        // Apply 1D convolution along columns
        for (int j = 0; j < COLS; j++) {
            msp_conv_iq31(&tempMatrix1[c][0][j], DWsKernels[c][0],
            &tempMatrix2[c][0][j], ROWS, FILTER_SIZE);
        }
    }

    // Point-wise convolution
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            for (int k = 0; k < CHANNELS; k++) {
                _q31 PtWsSum = 0;
                for (int c = 0; c < CHANNELS; c++) {
                    PtWsSum += tempMatrix2[c][i][j] * PtWsKernel[c][k];
                }
                finalOutput[i][j][k] = PtWsSum;
            }
        }
    }
}

F.2 Task-Based Conv2D

Here we describe the implementation of a task-based ‘CONV2D‘ function using the Low Energy Accelerator (LEA) in Texas Instruments’ MSP430 microcontrollers. The function is designed to handle energy constraints by decomposing the convolution loops into smaller quanta tasks. Foloowing are the outline of the requirements:

1.

Define ‘QuantaTask‘ as the minimum iterations that can run.
2.

Decomposable loops: Each ‘QuantaTask‘ runs a certain part of the loop.
3.

Check for sufficient energy before launching a ‘QuantaTask‘.
4.

Fuse multiple ‘QuantaTask‘s to minimize load/store operations.
5.

Check for power loss after each ‘QuantaTask‘ or fused ‘QuantaTask‘ and checkpoint if necessary.

Algorithm 3 Task-Based CONV2D Using TI LEA

1:Define

QuantaTask

as the minimum iterations we can run

2:function TaskBasedCONV2D(

inputMatrix

kernel

outputMatrix

)

3: Initialize

tempMatrix

with zero values, same shape as

inputMatrix

rows\leftarrow\text{rows of }inputMatrix

cols\leftarrow\text{cols of }inputMatrix

kernelSize\leftarrow\text{size of }kernel

i\leftarrow 0

8: while

i<rows

j\leftarrow 0

10: while

j<cols

11:

remainingEnergy\leftarrow\textsc{CheckEnergy}(QuantaTask)

12: if

remainingEnergy

is sufficient then

13: ExecuteQuantaTask(

i,j,inputMatrix,kernel,tempMatrix

)

14: UpdateProgress(

i,j,QuantaTask

)

15: if PowerLossDetected then

16: Checkpoint(

i,j,tempMatrix

)

17: break

18: end if

19: else

20: wait for energy to replenish

21: end if

22: end while

23: end while

24: FuseTasks

25: return

outputMatrix

26:end function

27:function ExecuteQuantaTask(

i,j,inputMatrix,kernel,tempMatrix

)

28: for

ki\leftarrow 0

kernelSize-1

29: for

kj\leftarrow 0

kernelSize-1

30: msp_conv_iq31(

inputMatrix[i+ki][j+kj]

kernel[ki][kj]

tempMatrix[i][j]

cols

kernelSize

)

31: end for

32: end for

33:end function

34:function FuseTasks

35:

remainingEnergy\leftarrow\textsc{CheckEnergy}(multiple\_QuantaTask)

36: while

remainingEnergy

is sufficient do

37: ExecuteQuantaTask(

i,j,inputMatrix,kernel,tempMatrix

)

38: UpdateProgress(

i,j,multiple\_QuantaTask

)

39:

remainingEnergy\leftarrow\textsc{CheckEnergy}(multiple\_QuantaTask)

40: if PowerLossDetected then

41: Checkpoint(

i,j,tempMatrix

) break

42: end if

43: end while

44:end function

45:function CheckEnergy(

QuantaTask

)

46: # Check if there is enough energy to run the quanta task

47: return

remainingEnergy

48:end function

49:function PowerLossDetected

50: # Check if power loss is detected

51: return

powerLoss

52:end function

53:function Checkpoint(

i,j,tempMatrix

)

54: # Save the current state to non-volatile memory

55:end function

56:function UpdateProgress(

i,j,QuantaTask

)

57: # Update loop indices based on the quanta task executed

58:

j\leftarrow j+QuantaTask

59: if

j\geq cols

then

60:

j\leftarrow 0

61:

i\leftarrow i+QuantaTask

62: end if

63:end function

F.2.1 Implementation Code

#include <msp430.h>
#include "DSPLib.h"

#define ROWS 64
#define COLS 64
#define KERNEL_SIZE 3
#define QuantaTask 8

// Define the FeRAM addresses for storing the checkpoint data
#define FERAM_ADDR_I 0xF000
#define FERAM_ADDR_J 0xF002
#define FERAM_ADDR_TEMPMATRIX 0xF004

_q31 inputMatrix[ROWS][COLS];
_q31 kernel[KERNEL_SIZE][KERNEL_SIZE];
_q31 tempMatrix[ROWS][COLS];
_q31 outputMatrix[ROWS][COLS];

void TaskBasedCONV2D() {
    int rows = ROWS;
    int cols = COLS;
    int kernelSize = KERNEL_SIZE;
    int i = 0;

    while (i < rows) {
        int j = 0;
        while (j < cols) {
            int remainingEnergy = CheckEnergy(QuantaTask);
            if (remainingEnergy > 0) {
                ExecuteQuantaTask(i, j, inputMatrix, kernel, tempMatrix);
                UpdateProgress(&i, &j, QuantaTask);
                if (PowerLossDetected()) {
                    Checkpoint(i, j, tempMatrix);
                    break;
                }
            } else {
                // Wait for energy to replenish
            }
        }
    }

    FuseTasks();
}

void ExecuteQuantaTask(int i, int j, _q31 inputMatrix[][COLS],
    _q31 kernel[][KERNEL_SIZE], _q31 tempMatrix[][COLS]) {
    for (int ki = 0; ki < KERNEL_SIZE; ki++) {
        for (int kj = 0; kj < KERNEL_SIZE; kj++) {
            msp_conv_iq31(&inputMatrix[i + ki][j + kj],
                &kernel[ki][kj], &tempMatrix[i][j], COLS, KERNEL_SIZE);
        }
    }
}

void FuseTasks() {
    int remainingEnergy = CheckEnergy(QuantaTask);
    while (remainingEnergy > 0) {
        ExecuteQuantaTask(i, j, inputMatrix, kernel, tempMatrix);
        UpdateProgress(&i, &j, QuantaTask);
        remainingEnergy = CheckEnergy(QuantaTask);
        if (PowerLossDetected()) {
            Checkpoint(i, j, tempMatrix);
            break;
        }
    }
}

int CheckEnergy(int QuantaTask) {
    // Energy checking  - HW interrupt
    return 1;
}

int PowerLossDetected() {
    // ower loss detection - HW interrupt logic
    return 0;
}

void Checkpoint(int i, int j, _q31 tempMatrix[][COLS]) {
    // Disable interrupts to prevent corruption during the write process
    __disable_interrupt();

    // Save the indices i and j to FeRAM
    *((volatile int*)FERAM_ADDR_I) = i;
    *((volatile int*)FERAM_ADDR_J) = j;

    // Save the current state of tempMatrix to FeRAM
    // Assuming tempMatrix is a 2D array of dimensions [ROWS][COLS]
    for (int row = 0; row < ROWS; row++) {
        for (int col = 0; col < COLS; col++) {
            ((volatile _q31*)FERAM_ADDR_TEMPMATRIX)[row * COLS + col]
                = tempMatrix[row][col];
        }
    }

    // Re-enable interrupts
    __enable_interrupt();
}

void RestoreCheckpoint(int *i, int *j, _q31 tempMatrix[][COLS]) {
    // Disable interrupts
    __disable_interrupt();

    // Restore the indices i and j from FeRAM
    *i = *((volatile int*)FERAM_ADDR_I);
    *j = *((volatile int*)FERAM_ADDR_J);

    // Restore the state of tempMatrix from FeRAM
    for (int row = 0; row < ROWS; row++) {
        for (int col = 0; col < COLS; col++) {
            tempMatrix[row][col] = ((volatile _q31*)
                FERAM_ADDR_TEMPMATRIX)[row * COLS + col];
        }
    }

    // Re-enable interrupts
    __enable_interrupt();
}


void UpdateProgress(int *i, int *j, int QuantaTask) {
    *j += QuantaTask;
    if (*j >= COLS) {
        *j = 0;
        *i += QuantaTask;
    }
}