Generating Synthetic Net Load Data with Physics-informed Diffusion Model
Abstract
This paper presents a novel physics-informed diffusion model for generating synthetic net load data, addressing the challenges of data scarcity and privacy concerns. The proposed framework embeds physical models within denoising networks, offering a versatile approach that can be readily generalized to unforeseen scenarios. A conditional denoising neural network is designed to jointly train the parameters of the transition kernel of the diffusion model and the parameters of the physics-informed function. Using real-world smart meter data from Pecan Street, we validate the proposed method and conduct a thorough numerical study comparing its performance with state-of-the-art generative models, including generative adversarial networks, variational autoencoders, normalizing flows, and a well-calibrated baseline diffusion model. A comprehensive set of evaluation metrics is used to assess the accuracy and diversity of the generated synthetic net load data. The numerical study results demonstrate that the proposed physics-informed diffusion model outperforms state-of-the-art models across all quantitative metrics, yielding an improvement of at least 20%.
Index Terms:
Net load, synthetic data, diffusion model, physics-informed machine learning.
I Introduction
Having access to energy consumption data at the customer level is crucial to the development of distribution system operation and planning tools [1]. As advanced metering infrastructure expands, it provides a wealth of net load measurements that are instrumental for informed decision-making [2], promoting efficient and reliable operation of the distribution system. Nonetheless, it is still challenging for some electric utilities to obtain net load data for all customers due to the significant costs of installing and maintaining smart meters [3]. Furthermore, industry developers and academic researchers often struggle to acquire real-world smart meter data due to privacy and security concerns [4, 5]. Even when access to smart meter data is granted, the data available under extreme operating conditions are often very limited.
To address these challenges, generating synthetic net-load data [1] has emerged as a promising approach to provide realistic energy consumption data for research and development purposes. By preserving the essential spatio-temporal correlations of real-world data [6], synthetic net-load data becomes a cornerstone for many power system studies such as power flow analysis, stability assessments, fault analysis, and demand-side management [7, 8]. Synthetic datasets are not only instrumental in representing a wide variety of operating scenarios for these studies but also crucial for multi-period economic dispatch, unit commitment, system planning, and long-term reliability assessments [9, 10]. Electric load data synthesis [11] is also an effective technique that provides ample training data for developing data-driven and machine learning algorithms in power distribution systems.
To generate synthetic load data, a variety of methods have been developed and refined over time. Model-based methods involve creating physical models that can generate synthetic load curves. They use historical data to capture power consumption patterns and then learn physical properties in different scenarios. For example, residential energy consumption can be simulated by modeling household appliances and physical characteristics of buildings [12, 13]. However, model-based methods require detailed physics-based models and accurate parameters of the physical systems, making them difficult to adopt and generalize across different scenarios [11].
To solve the above problems, a growing body of research on load synthesis is turning towards data-driven methods, such as principal component analysis [7], nonlinear independent component estimation [1], probabilistic models [14, 15], autoregressive models and Markov chains [16]. Researchers also used clustering techniques such as K-means [17] and fuzzy c-means clustering [18] to segment load profiles into different groups, which represent various customer categories. The clustering techniques can be combined with other data-driven methods such as Markov models to synthesize residential loads in a top-down manner [19].
The data-driven methods have evolved with the advancement of machine learning algorithms, with earlier works focusing on end-to-end learning by neural networks. In [20], an artificial neural network (ANN) based method is proposed to synthesize load profiles for a target region using its weather data as inputs. In [21], mixture density network (MDN) model was integrated with a multi-layered long short-term memory (LSTM) network to synthesize energy consumption data. In [22], a framework that combined transfer learning with the domain adaptation approach was developed for load profile generation in medium voltage networks.
In recent years, researchers who study synthetic load data generation have embraced sophisticated generative models, such as variational auto-encoders (VAEs), normalizing flows (NFs), and generative adversarial networks (GANs). For instance, VAEs have been effectively utilized for generating electric vehicle (EV) load profiles, as demonstrated in [23], while conditional VAEs have been tailored to generate contextual load profiles based on temporal conditions and grid interactions in [24]. However, VAEs suffer from inherent shortcomings, such as the difficulty of tuning hyperparameters and of generalizing a specific generative model structure to other databases [25].
GANs have also been employed to create synthetic load patterns and energy usage behaviors [26] and realistic building electric load profiles [27], and predict daily load profiles [28]. GANs have proven to be quite capable of synthesizing fine-grained energy consumption time series [29], and recovering high-resolution load profiles from low-resolution ones [30]. More advanced GAN models have also been used to generate synthetic load data. To preserve customers’ data privacy, differentially private Wasserstein generative adversarial networks (DPWGAN) were developed to synthesize high-quality load profiles [31]. Considering spatial-temporal correlations, multi-load generative adversarial network (MultiLoad-GAN) was proposed to generate a group of synthetic load profiles simultaneously.
With the increasing integration of intermittent energy resources, such as rooftop solar photovoltaics (PVs), EVs and demand response programs, the pattern of the net load becomes more complex, which increases the difficulty of the net load synthesis task. However, most of the work using GAN models to create synthetic load data did not benefit from conditional information such as weather forecasts and solar PV capacity. This is because GAN can be tailored to condition on discrete variables but not on continuous and vector-valued ones in the synthetic load data generation task. Furthermore, the training of GAN models is notoriously unstable because of the two constantly competing components: the generator and the discriminator [32].
To use continuous variables as conditional information, some researchers have started to use NF models to create synthetic load data [33, 2]. However, the generalizability of flow-based models is limited because of their reliance on specialized architectures, which must be individually designed for each case to establish reversible transformations [11].
Diffusion models overcome several key challenges in other generative models: the problem of aligning posterior distributions in VAEs, the unstable adversarial objective in GANs, the long training-time of Markov Chain Monte Carlo (MCMC) methods in energy-based models (EBMs), and imposing network constraints as in normalizing flows [34]. Diffusion models have showcased tremendous success on many applications such as image generation [35] and audio generation [36].
Addressing the limitations of GAN, VAE and NF, diffusion models have emerged as the leading choice for generative models. Their applications extend well beyond image synthesis. In recent years, diffusion models have been effectively utilized in various time series generation tasks. These include generation of sleep electroencephalography signals [37], wind power scenario generation [38], and EV charging scenario generation [39]. Demonstrating significant advantages over other generative models, diffusion models excel at capturing the complex statistical properties and temporal dynamics inherent in time series data.
Therefore, we adapt diffusion models to generate net load data in this paper. To fully utilize physical models, we propose a physics-informed diffusion model (PDM) by embedding the solar PV System Performance Model (PVSPM) into the baseline diffusion model (BDM). We propose to jointly estimate the parameters of the physics-informed function and the transition kernel of the BDM in a tailored conditional denoising network. We evaluate the proposed PDM against state-of-the-art generative models, namely BDM, GAN, VAE, and NF, using publicly available net-load data from Pecan Street [40].
The main contributions of this paper are listed below:
• We propose a physics-informed diffusion framework for net load data synthesis. This framework integrates a physical solar PV performance model directly into the denoising network, making it more interpretable and capable of generalizing to unforeseen conditions.
• To capture the temporal correlations of the net load profiles and fully utilize the physical model, we design both the baseline and the physics-informed denoising networks. These networks combine LSTM units, multi-head self-attention mechanisms, multi-layer perceptrons, and physical models, among other components, to optimize model performance.
• The superior performance of the proposed method is verified through comprehensive numerical studies. We compare the performance of our proposed PDM with an extensive list of generative models, including GAN, VAE, NF and BDM, on the net load synthesis task. The proposed PDM achieves over 20% improvement over the other methods across all evaluation metrics.
II Problem Formulation
The aim of this paper is to generate the net load data of residential customers with solar PV systems. The net load readings can be decomposed into the solar PV generation and electric load consumption. According to the net load definition, the net load, electric load, and solar generation of a customer satisfy the following equality constraint:
$$\mathbf{x} = \mathbf{l} - \mathbf{s} \tag{1}$$
where $\mathbf{x}$ denotes the net load measurements for a residential customer across a day, which is also referred to as the net load profile in this paper. Here, $\mathbf{l}$ and $\mathbf{s}$ represent the electric load consumption and solar generation, respectively, each over the time horizon of one day.
In general, net load profiles differ daily and vary among customers. Fig. 1 shows the net load profiles for two different customers in the first two months of 2018. The data sampling interval is 15 minutes and the net-load readings are obtained from the Pecan Street dataset [40]. The net load curves exhibit significant variations due to changing weather conditions, customer electricity consumption behaviors, and solar PV system configurations. Furthermore, the daily net load curves of a specific customer exhibit significant variations across time.
Our task is to generate net load profiles utilizing customer ID, static solar PV system information, and other variables associated with the date. This task calls for the development of a conditional generative model, which is more challenging than designing a basic generative model without context information. Let $\mathbf{c}$ represent the conditional information required for data generation, $\mathbf{c} = [\mathbf{c}_u, \mathbf{c}_s, \mathbf{c}_d]$, where $\mathbf{c}_u$, $\mathbf{c}_s$, and $\mathbf{c}_d$ represent the conditional information that encodes the user ID, solar PV system information, and variables associated with the date, respectively. The solar PV system information includes the size of solar PV systems oriented towards the south, west, and east. The variables associated with specific dates are represented using one-hot encoding, which includes 12 dimensions for the month, 31 dimensions for the date within a month, and 7 dimensions for the day of the week. To generate net load data in specific scenarios, we need to learn the conditional distribution $p(\mathbf{x} \mid \mathbf{c})$ of the net load profile $\mathbf{x}$ using generative models. Note that it is quite challenging to learn the generative model for net load data generation, because $\mathbf{c}$ contains both continuous and discrete variables.
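As an illustration of how such a conditional vector could be assembled, the following minimal sketch concatenates a one-hot user ID, the orientation-specific PV sizes, and the one-hot date encodings described above. The function names, the number of users, and the example PV sizes are hypothetical and are not taken from the paper's implementation.

```python
from datetime import date


def one_hot(index, size):
    """Return a list of `size` zeros with a 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def build_condition(user_index, num_users, pv_sizes_kw, day):
    """Concatenate user ID, PV system sizes, and date encodings.

    pv_sizes_kw: hypothetical (south, west, east) capacities in kW.
    """
    c_user = one_hot(user_index, num_users)
    c_solar = [float(s) for s in pv_sizes_kw]
    c_date = (
        one_hot(day.month - 1, 12)     # month
        + one_hot(day.day - 1, 31)     # date within the month
        + one_hot(day.weekday(), 7)    # day of the week, Monday = 0
    )
    return c_user + c_solar + c_date


# Example values are made up for illustration only.
c = build_condition(3, 25, (4.0, 0.0, 1.5), date(2018, 2, 14))
print(len(c))  # 25 + 3 + (12 + 31 + 7) = 78
```

The resulting vector mixes binary indicator entries with continuous PV capacities, which is the mixed discrete/continuous conditioning that the generative model has to handle.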

III Technical methods
III-A Denoising Diffusion Probabilistic Model for Net Load Profile Generation
In this work, we build on the denoising diffusion probabilistic model (DDPM), which achieves state-of-the-art results in fields such as image generation [35] and speech synthesis [36]. Aimed at producing higher-quality time series data, our implementation of the DDPM primarily draws inspiration from the speech synthesis model described in [36]. The key modification is the use of the $L_2$ norm as the loss function, which is more suitable for net load profile generation based on numerical study results.
Diffusion in the context of statistics refers to transforming a complex data distribution $q(\mathbf{x}_0)$ to a simple prior distribution $p(\mathbf{x}_N)$ on the same domain. In the DDPM, this process is called the forward process. Conversely, the reverse process transforms $p(\mathbf{x}_N)$ to $q(\mathbf{x}_0)$. In practice, we let $p(\mathbf{x}_N)$ be a Gaussian distribution. Both the forward and the reverse processes can be modeled as Markov chains, and the reverse data transformation can be learned by a deep denoising neural network. The overall framework of the diffusion process for net load profile generation is shown in Fig. 2.

III-A1 Forward Process
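The diffusion forward process in DDPM utilizes a Markov chain to progressively transform the data distribution $q(\mathbf{x}_0)$ into a predefined prior distribution, specifically a standard Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$. This transformation involves a series of steps where each transition from $\mathbf{x}_{n-1}$ to $\mathbf{x}_n$ is governed by a Gaussian distribution:
$$q(\mathbf{x}_n \mid \mathbf{x}_{n-1}) = \mathcal{N}\big(\mathbf{x}_n;\ \sqrt{1-\beta_n}\,\mathbf{x}_{n-1},\ \beta_n \mathbf{I}\big) \tag{2}$$
Under some fixed noise schedule $\beta_1, \ldots, \beta_N$, the diffusion forward process with $n$ steps can be implemented using a closed-form solution:
$$q(\mathbf{x}_n \mid \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_n;\ \sqrt{\bar{\alpha}_n}\,\mathbf{x}_0,\ (1-\bar{\alpha}_n)\,\mathbf{I}\big) \tag{3}$$
where $\alpha_n = 1 - \beta_n$, and $\bar{\alpha}_n = \prod_{s=1}^{n} \alpha_s$.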
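The closed-form forward process can be sketched in a few lines. The snippet below assumes the standard DDPM formulation, in which a noisy sample is drawn as sqrt(alpha_bar_n) * x_0 + sqrt(1 - alpha_bar_n) * eps with eps standard Gaussian. The schedule endpoints, the number of steps, and the sine-shaped toy profile are illustrative assumptions, not the paper's settings.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start, beta_end):
    """Linearly spaced noise variances beta_1..beta_N (num_steps >= 2)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha_bar(betas):
    """alpha_bar_n = product over s <= n of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, alpha_bar_n, rng):
    """Draw x_n ~ N(sqrt(abar) * x0, (1 - abar) * I); return (x_n, eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    scale = math.sqrt(alpha_bar_n)
    sigma = math.sqrt(1.0 - alpha_bar_n)
    x_n = [scale * v + sigma * e for v, e in zip(x0, eps)]
    return x_n, eps


rng = random.Random(0)
betas = linear_beta_schedule(1000, 1e-4, 0.02)   # illustrative values
alpha_bars = cumulative_alpha_bar(betas)
# Toy 96-point daily profile (15-minute resolution), not real data.
x0 = [math.sin(2 * math.pi * t / 96) for t in range(96)]
x_mid, eps_mid = q_sample(x0, alpha_bars[499], rng)
x_end, eps_end = q_sample(x0, alpha_bars[-1], rng)
```

Because the cumulative product shrinks toward zero, the final sample is almost pure Gaussian noise, which is why the reverse process can start from a standard Gaussian.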
III-A2 Reverse Process
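The reverse process employs a Markov chain to transition from the prior distribution $p(\mathbf{x}_N)$ back to the original data distribution $q(\mathbf{x}_0)$, effectively “denoising” the data. The reverse process still follows a Gaussian distribution but lacks a closed-form transition kernel. We can use neural networks to parameterize the transition kernel:
$$p_\theta(\mathbf{x}_{n-1} \mid \mathbf{x}_n) = \mathcal{N}\big(\mathbf{x}_{n-1};\ \boldsymbol{\mu}_\theta(\mathbf{x}_n, n),\ \boldsymbol{\Sigma}_\theta(\mathbf{x}_n, n)\big) \tag{4}$$
Instead of parameterizing both the mean and covariance matrix directly, we make the covariance matrix independent of $\mathbf{x}_n$, i.e., $\boldsymbol{\Sigma}_\theta(\mathbf{x}_n, n) = \sigma_n^2 \mathbf{I}$, by following [35]. Furthermore, we reparameterize the model to condition on the continuous noise level $\sqrt{\bar{\alpha}_n}$ instead of the discrete iteration index $n$, which is shown to perform better in the audio generation task [36]. Specifically, $\boldsymbol{\mu}_\theta(\mathbf{x}_n, \mathbf{c}, \sqrt{\bar{\alpha}_n})$, where $\mathbf{c}$ contains the conditioning features, is reparameterized as:
$$\boldsymbol{\mu}_\theta\big(\mathbf{x}_n, \mathbf{c}, \sqrt{\bar{\alpha}_n}\big) = \frac{1}{\sqrt{\alpha_n}}\Big(\mathbf{x}_n - \frac{\beta_n}{\sqrt{1-\bar{\alpha}_n}}\,\boldsymbol{\epsilon}_\theta\big(\mathbf{x}_n, \mathbf{c}, \sqrt{\bar{\alpha}_n}\big)\Big) \tag{5}$$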
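The denoising network $\boldsymbol{\epsilon}_\theta$ can be trained by denoising score matching [35]. The training objective is:
$$\min_\theta\ \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}, n}\Big[\big\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\big(\mathbf{x}_n, \mathbf{c}, \sqrt{\bar{\alpha}_n}\big)\big\|_2^2\Big] \tag{6}$$
where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and $\mathbf{x}_n = \sqrt{\bar{\alpha}_n}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_n}\,\boldsymbol{\epsilon}$ is the noisy sample at noise level $\sqrt{\bar{\alpha}_n}$.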
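After training the model, sampling can be implemented using Langevin dynamics:
$$\mathbf{x}_{n-1} = \frac{1}{\sqrt{\alpha_n}}\Big(\mathbf{x}_n - \frac{\beta_n}{\sqrt{1-\bar{\alpha}_n}}\,\boldsymbol{\epsilon}_\theta\big(\mathbf{x}_n, \mathbf{c}, \sqrt{\bar{\alpha}_n}\big)\Big) + \sigma_n \mathbf{z} \tag{7}$$
where $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and $\sigma_n$ is the fixed standard deviation of the reverse transition. The pseudo-code for model training and scenario generation is summarized in Algorithm 1 and Algorithm 2, respectively.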
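The sampling loop can be sketched as follows, assuming the standard DDPM ancestral update. Two assumptions are labeled in the code: the reverse variance is set to the posterior variance (one common choice, which may differ from the paper's), and `toy_eps_model` is a hypothetical stand-in for the trained denoising network. It is the exact noise predictor when the data distribution is a single point, so the sampler should recover that point.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and its cumulative alpha_bar products."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        alpha_bars.append(running)
    return betas, alpha_bars


def sample(eps_model, cond, length, betas, alpha_bars, rng):
    """Ancestral sampling from x_N ~ N(0, I) down to x_0."""
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]
    for n in reversed(range(len(betas))):
        beta, abar = betas[n], alpha_bars[n]
        eps_hat = eps_model(x, cond, math.sqrt(abar))
        coef = beta / math.sqrt(1.0 - abar)
        mean = [(xi - coef * ei) / math.sqrt(1.0 - beta)
                for xi, ei in zip(x, eps_hat)]
        if n > 0:
            # Assumption: posterior variance as the reverse variance.
            var = (1.0 - alpha_bars[n - 1]) / (1.0 - abar) * beta
            x = [m + math.sqrt(var) * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean  # no noise is added at the final step
    return x


def toy_eps_model(x, cond, noise_level):
    """Hypothetical stand-in for the trained network: exact noise
    predictor when the data distribution is a point mass at `cond`."""
    sigma = math.sqrt(1.0 - noise_level ** 2)
    return [(xi - noise_level * ci) / sigma for xi, ci in zip(x, cond)]


rng = random.Random(0)
betas, alpha_bars = make_schedule(200)
target = [math.sin(2 * math.pi * t / 96) for t in range(96)]
profile = sample(toy_eps_model, target, 96, betas, alpha_bars, rng)
```

In a real setting the toy predictor would be replaced by the trained conditional denoising network, and `cond` would carry the conditional information rather than the target itself.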
III-B Solar PV Generation Basis Profiles
This subsection presents how we create solar PV generation basis profiles, which will be later used in the proposed PDM. The solar PV generation depends on the weather and solar PV system specifications. By leveraging the technical parameters of a solar PV system along with pertinent weather data, the output of a solar PV system can be accurately estimated using the PVSPM. Our research employs models from two key sources: the PV performance model of Sandia National Laboratory [41] and PVWatts from the National Renewable Energy Laboratory [42]. The technical parameters of a solar PV system include the system’s DC rating in kW, the tilt angle and azimuth angle ($a$) of the solar PV array, the nominal efficiency of the inverter, and the system’s overall loss. Weather data encompass temperature, wind speed, direct normal irradiance (DNI), diffuse horizontal irradiance (DHI), and global horizontal irradiance (GHI), collectively represented by $\mathbf{w}$.
The following procedure is used to create the solar PV generation basis profiles. We pick a few representative azimuth angles $a$ and form a set $\mathcal{A}$. The tilt angle is set equal to the latitude of the customer location. The representative parameters of the solar PV system, including the nominal inverter efficiency and panel loss, are obtained from the Sandia Module database [43]. In the dataset, all customers are located in the same city, so the same longitude and latitude are used for all solar PV systems. Thus, for a given day, the weather data $\mathbf{w}$ can be determined from the date-related conditional information $\mathbf{c}_d$, i.e., $\mathbf{w} = \mathbf{w}(\mathbf{c}_d)$. Consequently, variations in the solar PV generation basis profiles are attributed solely to the system size and azimuth angle. The solar PV output, given the specific date information $\mathbf{c}_d$ and azimuth angle $a$, can be calculated as follows:
$$\mathbf{b}_{a} = f_{\mathrm{PV}}\big(\mathbf{w}(\mathbf{c}_d),\ a\big) \tag{8}$$
where $f_{\mathrm{PV}}$ represents the PVSPM model, which uses the weather data $\mathbf{w}(\mathbf{c}_d)$ and the azimuth angle $a$ as inputs. The output, $\mathbf{b}_a$, serves as one of the solar PV generation basis profiles, associated with the azimuth angle $a$ and date $\mathbf{c}_d$.
The entire set of solar PV generation basis profiles for all $a \in \mathcal{A}$ can be calculated:
$$\mathbf{B} = \big[\mathbf{b}_{a_1},\ \mathbf{b}_{a_2},\ \ldots,\ \mathbf{b}_{a_{|\mathcal{A}|}}\big]^{\top} \tag{9}$$
where $\mathbf{B}$ is the entire solar PV generation basis under the date condition $\mathbf{c}_d$. $|\mathcal{A}|$ denotes the cardinality of set $\mathcal{A}$, indicating the number of azimuth angles in the solar PV generation basis profiles.
III-C Physics-informed Diffusion Framework
Although diffusion models are successful in generating images and audio, they struggle to produce high quality net-load time series data. The net-load data can be partitioned into two distinct components: the electric load and solar PV generation. The second component can be calculated by a physical equation using the PVSPM. To effectively utilize the physical model, we propose a physics-informed diffusion framework. This innovative framework integrates physical models into the diffusion process, enhancing its generalizability and accuracy.
Let us assume that a signal $\mathbf{x}$ to be generated can be decomposed into two main components:
$$\mathbf{x} = \mathbf{x}^{(1)} + \mathbf{x}^{(2)} \tag{10}$$
where $\mathbf{x}^{(1)}$ is the component that follows an unknown distribution, which will be learned by diffusion models. Thus, $\mathbf{x}^{(1)}$ is calculated through a Markov chain $\mathbf{x}^{(1)}_N \to \cdots \to \mathbf{x}^{(1)}_0 = \mathbf{x}^{(1)}$, given $\mathbf{x}^{(1)}_N$ and transition kernel (7). $\mathbf{x}^{(2)}$, on the other hand, can be calculated using a parameterized function with the basis profiles of a physical model and contextual variables.
Suppose the physical component can be represented as:
$$\mathbf{x}^{(2)} = g_\phi(\mathbf{c}, \mathbf{B}) \tag{11}$$
where $\mathbf{c}$ contains the contextual features and $\phi$ denotes the unknown parameters. $\mathbf{B}$ is the PV generation basis profiles, which can be calculated through the PVSPM model. The details about $g_\phi$ will be introduced in the next subsection.
The first component of the time series, $\mathbf{x}^{(1)}$, is the end point of a Markov chain, whereas the second component $\mathbf{x}^{(2)}$ can be calculated through a parameterized physics-informed function. After both the Markov chain and the physics-informed function are learned, computing $\mathbf{x}$ becomes straightforward. However, since both the transition kernel and the physics-informed function contain unknown parameters, i.e., $\theta$ and $\phi$, it is difficult to learn them together.
The intuitive strategy might involve alternating updates for the two sets of parameters. However, this approach requires long training time and encounters instability and convergence issues. To address these challenges, we propose to learn $\theta$ and $\phi$ simultaneously by embedding the solar PV generation basis profiles into the denoising network. The details of this approach will be elaborated in the following subsection.
III-D Denoising Networks Design
The most popular denoising network architecture in the image generation field is the U-net, which achieves state-of-the-art performance due to its ability to capture the spatial correlations of pixels in images. However, when it comes to generating net load profiles, the denoising network should be tailored to learn the temporal correlations and unique daily patterns in energy usage. For instance, residential energy consumption profiles often peak during morning and early evening hours, which coincide with the daily routines of residential customers. Similarly, solar generation profiles are closely linked to the intensity of solar radiation and have strong temporal correlations. In this subsection, we design a denoising network that is tailored for synthesizing net load time series data.
The overall architectures of both the baseline and the physics-informed denoising networks are illustrated in Fig. 3. Both networks integrate neural network components designed to handle the complex temporal correlations in the net load time series data. The proposed PDM’s network combines LSTM networks, multi-head self-attention mechanisms, and a solar PV embedding module to enhance the model’s accuracy and efficiency. The structure and operational mechanism of these components are presented below.
LSTM Embedding: The LSTM network is central to capturing temporal correlations in net load profiles. A 1-layer LSTM network processes the Gaussian noise input and generates latent states that encapsulate temporal patterns and dependencies. The dimension of the LSTM’s hidden layer is a crucial hyperparameter that determines the level of temporal complexity the network can represent.
Positional Embedding: To condition the network on the noise level of the diffusion/denoising process, the Positional Embedding module employs a Transformer-style sinusoidal embedding function. This approach encodes the noise level within the model, ensuring that the temporal dynamics influenced by the diffusion process are accurately represented. The output of this module is adjusted to match the LSTM’s hidden layer dimension.
Multi-head Self Attention: A multi-head self-attention module, functioning as a Transformer encoder, analyzes and integrates information across different segments of the input data. Both the input and output dimensions of this module equal the hidden dimension. The module enhances the model’s ability to discern complex interdependencies within the net-load data.
MLP: Following the multi-head self-attention module, a multi-layer perceptron (MLP) module with the Leaky ReLU activation function is introduced. This module retains the hidden dimension.
Conditional Embedding: This module is another MLP, which encodes the conditional information and produces an output whose dimension matches that of the primary MLP module.
Linear Layer: Concluding the network architecture, a Linear Layer module maps the data from the hidden dimension to the final output dimension. This step aligns the processed data with the original signal space, so that the output matches the targeted net-load profile.


The details of the PV embedding module are shown in Fig. 4. This module contains 5 submodules. The operational flow and functionality of each submodule are outlined below:
Weather Query: This initial step takes specific date information, denoted as $\mathbf{c}_d$, and queries the corresponding weather data. The outcome is a set of weather variables for the given date, denoted as $\mathbf{w}$, which includes temperature, wind speed, DNI, DHI and GHI.
PVSPM: Leveraging the weather data acquired from the previous step, along with the set of representative azimuth angles $\mathcal{A}$, this submodule employs the PVSPM to calculate the solar PV generation basis profiles. The output, $\mathbf{B}$, is a matrix in $\mathbb{R}^{|\mathcal{A}| \times T}$, where $|\mathcal{A}|$ represents the number of representative azimuth angles, and $T$ signifies the time dimension.
Flattened Layer: Upon receiving $\mathbf{B}$, this layer reshapes the matrix into a one-dimensional vector so that it can be processed by the subsequent neural network modules.
PV-basis Embedding: Following the flattening process, this submodule employs an MLP with three layers and the Tanh activation function. It processes the flattened vector to generate a feature representation that captures the solar generation potential under varying weather conditions.
Cond-PV Embedding: Parallel to the PV-basis Embedding, this submodule is another MLP, with two layers and the Tanh activation function. Its primary role is to further process and refine the conditional information, preparing it for final integration into the diffusion model.
The outputs of the PV-basis Embedding and Cond-PV Embedding submodules are then combined by element-wise multiplication ($\odot$). This operation fuses the learned representations of solar generation and other conditional information into a single feature vector that is used within the diffusion model’s framework. This integrated approach allows the model to leverage detailed solar PV generation information, thereby enhancing its accuracy and relevance for applications involving physical signals. Finally, the input size and output size for each module are summarized in Table I.
Module | Input size | Output size |
---|---|---|
LSTM Embedding | ||
Positional Embedding | ||
Multi-head Self Attention | ||
MLP | ||
Conditional Embedding | ||
PV Embedding | ||
PVSPM | ||
Flattened layer | ||
Cond-PV Embedding | ||
PV-basis Embedding | ||
Linear Layer |
IV Numerical Studies
IV-A Dataset and Preprocessing
Net load data. The net load data for residential customers in Austin, Texas, collected by Pecan Street Inc. are used in our experiments [40]. This dataset includes the daily energy consumption, solar PV production, and net-load time series of individual customers. Specifically, the net load data of 25 residential households from January 1, 2018, to December 30, 2018 are provided. Measurements were recorded at 15-minute intervals, achieving a 99% completeness rate across all intervals for the 25 homes. The missing values are imputed using the average net load of the adjacent hours. Given the 15-minute sampling rate, the dimensionality of the daily net load profile equals $24 \times 4 = 96$. Furthermore, the net load profiles for each customer are min-max normalized to fall within the range from to .
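The preprocessing steps above can be sketched as follows. The exact imputation rule is not specified beyond "adjacent hours", so this sketch assumes that a missing reading is replaced by the mean of the valid readings within one hour (four 15-minute samples) on either side. The target range of the normalization is passed as a parameter, and the values used in the example are placeholders.

```python
def impute_missing(series, window=4):
    """Fill None entries with the mean of valid readings within
    `window` samples on either side (4 samples = 1 hour at 15 min)."""
    filled = list(series)
    for i, value in enumerate(series):
        if value is None:
            lo = max(0, i - window)
            hi = min(len(series), i + window + 1)
            neighbours = [v for v in series[lo:hi] if v is not None]
            if neighbours:
                filled[i] = sum(neighbours) / len(neighbours)
    return filled


def min_max_normalize(series, low, high):
    """Affinely map `series` so that its minimum goes to `low`
    and its maximum goes to `high`."""
    s_min, s_max = min(series), max(series)
    if s_max == s_min:
        return [(low + high) / 2.0 for _ in series]
    scale = (high - low) / (s_max - s_min)
    return [low + (v - s_min) * scale for v in series]


# Made-up readings in kW; negative net load means export to the grid.
readings = [1.0, None, 3.0, 5.0, -1.0]
clean = impute_missing(readings)
normalized = min_max_normalize(clean, 0.0, 1.0)
```

Normalizing per customer, as described above, means `min_max_normalize` would be applied to each customer's full series separately rather than to the pooled data.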
Conditional information. The conditional information contains user ID, solar PV system information, and variables associated with the specific dates. For user identification, we employ one-hot encoding for the 25 users, denoted as . The dataset provides detailed solar PV system information, including the total and orientation-specific (west, south, east) capacities of solar PV systems installed at each household, represented as . To account for the variations in electricity usage behavior across time, we represent specific dates using the following variables: one-hot encoding for month (12 dimensions), day of the week (7 dimensions), and day of the month (31 dimensions), resulting in . Consequently, the total dimension of the conditional information, is .
Weather data. Hourly weather variables, such as DHI, GHI, DNI, temperature and wind speed, are collected from the National Solar Radiation Database [44]. The meteorological variables are then converted to 15-minute intervals through linear interpolation to align with the net load data. The approximate longitude and latitude of Austin, Texas are used as a common proxy location for all customers, as their exact locations are not available. The tilt angle used to calculate the solar PV generation basis profiles is set as the latitude, i.e., . The azimuth angle is set from to with interval of . Then we can generate AC output for panels facing different directions ranging from east to west.
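The conversion from hourly to 15-minute resolution can be illustrated with a small linear interpolation routine. This is a minimal sketch with made-up irradiance values; the function name is hypothetical.

```python
def upsample_linear(hourly, factor=4):
    """Insert `factor - 1` linearly interpolated points between
    each pair of consecutive hourly values."""
    out = []
    for left, right in zip(hourly, hourly[1:]):
        for k in range(factor):
            out.append(left + (right - left) * k / factor)
    out.append(hourly[-1])
    return out


ghi_hourly = [0.0, 100.0, 300.0]   # made-up irradiance values, W/m^2
ghi_15min = upsample_linear(ghi_hourly)
print(ghi_15min)
# [0.0, 25.0, 50.0, 75.0, 100.0, 150.0, 200.0, 250.0, 300.0]
```

Note that 24 hourly values yield only 93 points with this routine, so covering all 96 quarter-hour slots of a day requires the first hourly value of the following day as well.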
In the experiments, we randomly select net load profiles for customers with days, thus the total number of dataset contains trajectories. For each customer, we randomly select of the dataset for training and the remaining dataset are used for testing.
IV-B Experimental Setup
Our benchmark models include WGAN, VAE and NF developed in [25]. The idea behind GANs is the adversarial training of two neural networks: the generator and the discriminator. The state-of-the-art Wasserstein GAN with gradient penalty (WGAN-GP) constrains the gradient norm of the discriminator’s output with respect to its input to enforce the 1-Lipschitz condition [45], which can reduce mode collapse and improve the stability of training. A VAE contains an encoder and a decoder network, which are jointly trained to maximize a lower bound on the likelihood. An NF learns a sequence of transformations, a flow, from a density known analytically, e.g., a normal distribution, to a complex target distribution [25].
Our study also uses a baseline diffusion model for benchmarking purposes. To ensure a fair evaluation, both the baseline diffusion model and the physics-informed model were configured with identical hyperparameters. To balance sample generation quality and speed, we design the noise schedule, , as a linear function for . To accelerate training, we employ a learning rate scheduler that applies a decay factor of 0.9 to the learning rate every 1000 steps. During the diffusion process, the result of each iteration is clipped to be within the range from to . Additionally, we utilize an exponential moving average (EMA) for parameter updates in our model. The EMA model is refined at each step by combining times the current EMA model with times the newly updated weights from the most recent forward and backward pass. Here is called the “smoothing factor”.
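The learning rate decay and the EMA update described above can be sketched as follows. The blend is assumed to take the usual form, smoothing factor times the current EMA weights plus one minus the smoothing factor times the new weights. The base learning rate and smoothing factor in the example are placeholders, not the paper's values.

```python
def step_decay_lr(base_lr, step, decay=0.9, every=1000):
    """Learning rate after `step` updates with staircase decay:
    multiplied by `decay` once every `every` steps."""
    return base_lr * decay ** (step // every)


def ema_update(ema_weights, new_weights, smoothing):
    """Assumed EMA form: smoothing * EMA + (1 - smoothing) * new."""
    return [smoothing * e + (1.0 - smoothing) * w
            for e, w in zip(ema_weights, new_weights)]


lr_at_2500 = step_decay_lr(1e-3, 2500)          # two decays applied
ema = ema_update([1.0, 2.0], [0.0, 4.0], 0.9)   # placeholder weights
```

A smoothing factor close to one makes the EMA model change slowly, which typically yields smoother samples than the raw weights at the cost of lagging behind the latest updates.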
All the hyperparameters of the diffusion models are shown in Table II. Training is completed on a Linux machine with an Nvidia 2080 Ti GPU. The training time for the diffusion models exceeds 2 hours, and the sampling time for one trajectory is about 15 s. In this paper, we mainly focus on improving the quality of the synthetic data; we plan to reduce the training and sampling time in future work.
Hyperparameter | Value |
---|---|
Hidden dimension () | 1000 |
Learning rate | |
Epoch | |
EMA decay rate | |
Batch size | |
Diffusion steps () | |
Number of heads in attention layer | |
Smoothing factor |
IV-C Result and Analysis
For evaluating synthetic images, the key metrics are the Fréchet Inception Distance (FID) and Inception Score (IS). These metrics have been widely applied to assess the quality and diversity of the images produced by generative models, providing insights into how well a model can generate new, realistic images that resemble a given dataset. However, the evaluation metrics developed for synthetic images cannot be directly applied to net-load profile generation, because the number of net-load profile samples is much smaller than that of image datasets. Thus, we introduce an evaluation system with a set of complementary metrics to thoroughly assess the quality and diversity of synthetic net load data. Our evaluation framework is structured into four parts:
Synthetic Conditional Net Load Data. This part of the evaluation compares synthetic net load profiles generated under given conditions with actual data. The objective is to evaluate the generative models’ ability to capture complex net load patterns accurately.
t-SNE Visualizations for Each Customer’s Net Load Profile. Utilizing t-SNE visualizations for each customer, this part of the evaluation aims to uncover the generative models’ ability to reproduce the nuanced conditional distributions that are uniquely characteristic of individual customers. It provides insight into whether the generative models can learn the diverse and specific net-load patterns of different users.
Probabilistic Forecasting of Marginal Distribution. In this part, continuous ranked probability score (CRPS) and quantile score (QS) are employed to assess the accuracy of probabilistic forecasts. CRPS evaluates the models’ skill in forecasting marginal distributions for specified times of the day. QS is used to compare the accuracy of forecasts at specific quantiles, highlighting the models’ precision in predicting extreme conditions.
Other Quantitative Metrics. In this part of the evaluation, we select 6 quantitative metrics to measure the overall similarity between the distribution of the generated and actual data.
IV-C1 Synthetic Conditional Net Load Data
The complex patterns of net load profiles shift significantly with the conditional information, which includes the user ID, solar PV system information, and date. Not all generative models can capture the highly nonlinear relationships between the conditional information and the net load time series. To evaluate how the generative models respond to different conditions, we first select 5 typical conditions from the test data set, each associated with a different net load pattern. Next, under each condition, we generate samples with every baseline model and the proposed model. The synthetic net load data are illustrated in Fig. 5. For every condition, the PDM yields the best performance. In contrast, the GAN, VAE, and NF models are unable to accurately replicate the diverse net load patterns. While both the BDM and the proposed PDM emulate each pattern well, the BDM exhibits higher approximation error and variance than the proposed PDM.

The approximation error of a sample created by generative models can be measured by mean squared error (MSE). The MSE of the synthetic net-load data for a given condition can be calculated as:
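$$\mathrm{MSE}_{c,t} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{x}_{c,t}^{(i)} - x_{c,t}^{(i)}\right)^{2} \qquad (12)$$
where $N$ represents the number of samples, $\hat{x}_{c,t}^{(i)}$ denotes the value predicted by the generative model, and $x_{c,t}^{(i)}$ indicates the actual value at time $t$ for the $i$-th sample under condition $c$.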
For each customer, we calculate the MSEs between the predicted and actual net load profiles over time across all 5 conditions. We then determine the mean and variance of these MSEs. The results are presented in Fig. 6. For each customer, the proposed PDM achieves the lowest mean and variance at nearly all time points, while the BDM achieves the second-best result. The results show that the proposed PDM is much more capable than state-of-the-art generative models of learning the conditional distribution of the net load data, particularly in scenarios that occur infrequently.
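As a minimal sketch of this calculation, the snippet below computes the MSE over samples for every condition and time step, then takes the mean and variance across conditions. The array layout (conditions, samples, time steps), the function names, and the toy sizes are assumptions made for illustration and are not taken from the authors' implementation.

```python
import numpy as np


def mse_per_time_step(generated, actual):
    """Average squared error over samples.

    Both inputs have the assumed shape (conditions, samples, time steps);
    the result has shape (conditions, time steps).
    """
    generated = np.asarray(generated, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.mean((generated - actual) ** 2, axis=1)


def mse_mean_and_variance(generated, actual):
    """Mean and variance of the per-condition MSE at every time step."""
    mse = mse_per_time_step(generated, actual)
    return mse.mean(axis=0), mse.var(axis=0)


# Toy data only: 5 conditions, 20 samples, 96 time steps (hypothetical sizes).
rng = np.random.default_rng(0)
actual = rng.normal(size=(5, 20, 96))
generated = actual + 0.1 * rng.normal(size=actual.shape)
mean_mse, var_mse = mse_mean_and_variance(generated, actual)
print(mean_mse.shape, var_mse.shape)
```

Plotting `mean_mse` and `var_mse` against the time index for each model would give curves of the kind the text describes for one customer.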

IV-C2 t-SNE Visualizations for Each Customer’s Net Load Profile
We employ t-SNE to project the high-dimensional net load data onto a lower-dimensional space in order to explore the ability of generative models to capture the nuanced conditional distributions of individual customers’ net load profiles, which are influenced by their unique electricity consumption habits and solar PV system configurations. The t-SNE visualizations in Fig. 7 illustrate the degree of agreement between the distributions of the generated synthetic data and the actual net load data for specific customers.
As shown in the figure, while the GAN and VAE models sometimes create data points that resemble the real data in the low-dimensional space, their performance is inconsistent. Although capable of approximating the overall structure of the data, these two models falter in replicating intricate patterns. In contrast, the NF model makes fewer errors and generates data that align closely with actual observations. Nevertheless, it falls short of representing the entire spectrum of patterns in the dataset, indicating somewhat limited representation power. Both the BDM and the PDM demonstrate a balanced ability to learn data distributions, avoiding the pitfalls of being either overly conservative or excessively aggressive in data generation. Notably, the proposed PDM stands out for its ability to accurately model the conditional distributions, making it the best choice among all evaluated models.
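A comparison of this kind can be sketched with scikit-learn's `TSNE`. Because t-SNE has no transform for unseen points, the real and synthetic profiles are stacked and embedded together, then split again for plotting. The function name, perplexity value, and toy data below are assumptions for illustration; the paper does not state its t-SNE settings.

```python
import numpy as np
from sklearn.manifold import TSNE


def tsne_embed(real, synthetic, perplexity=15.0, seed=0):
    """Jointly embed real and synthetic daily profiles in 2-D.

    real, synthetic: arrays of shape (num_profiles, time_steps).
    Returns the 2-D coordinates of the real and the synthetic profiles.
    """
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    stacked = np.vstack([real, synthetic])
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca",
                random_state=seed)
    embedded = tsne.fit_transform(stacked)
    return embedded[: len(real)], embedded[len(real):]


# Toy data only: 40 real and 40 synthetic profiles with 96 time steps each.
rng = np.random.default_rng(0)
real = rng.normal(size=(40, 96))
synthetic = real + 0.2 * rng.normal(size=real.shape)
real_2d, synthetic_2d = tsne_embed(real, synthetic)
print(real_2d.shape, synthetic_2d.shape)
```

Scatter-plotting `real_2d` and `synthetic_2d` in two colors shows how well the synthetic points overlap the real ones. The perplexity must be smaller than the total number of stacked profiles.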

IV-C3 Probabilistic Forecasting of Marginal Distribution
The quantile score (QS) and continuous ranked probability score (CRPS) are used as evaluation metrics for probabilistic forecasting of the marginal distribution [25]. We calculate the QS for different quantile levels and plot it against the quantile level in Fig. 8(a). A lower score indicates that the predicted quantiles are closer to the true quantiles of the actual data. The proposed PDM achieves the lowest quantile score at every quantile level, showing that it excels at predicting the true quantiles of the net load distribution.
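For illustration, the quantile score is commonly computed as the pinball loss between an observation and a predicted quantile. The sketch below estimates each quantile from synthetic samples and averages the loss over actual observations and time steps. The array shapes, function names, and the absence of any scaling factor are assumptions; the exact definition used in the paper follows [25].

```python
import numpy as np


def pinball_loss(y, q_hat, q):
    """Pinball (quantile) loss of predicted quantile q_hat at level q."""
    diff = np.asarray(y, dtype=float) - np.asarray(q_hat, dtype=float)
    return np.maximum(q * diff, (q - 1.0) * diff)


def quantile_scores(synthetic, actual, quantiles):
    """Average pinball loss for each quantile level.

    synthetic: (num_synthetic, time_steps), used to estimate the quantiles.
    actual:    (num_actual, time_steps), the observations being scored.
    """
    synthetic = np.asarray(synthetic, dtype=float)
    actual = np.asarray(actual, dtype=float)
    scores = []
    for q in quantiles:
        q_hat = np.quantile(synthetic, q, axis=0)  # one quantile per time step
        scores.append(pinball_loss(actual, q_hat, q).mean())
    return np.array(scores)


# Toy data only: 200 synthetic and 50 actual profiles with 24 time steps.
rng = np.random.default_rng(0)
synthetic = rng.normal(size=(200, 24))
actual = rng.normal(size=(50, 24))
levels = np.linspace(0.1, 0.9, 9)
scores = quantile_scores(synthetic, actual, levels)
print(scores.shape)
```

The asymmetry of the loss is what makes it sensitive to the tails: at a level of 0.9, an observation above the predicted quantile is penalized nine times more heavily than one the same distance below it.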
CRPS takes into account both the accuracy and the sharpness of the predictions. It measures not only how close the predictions are to the actual values but also how concentrated the predictive distribution is (i.e., a narrower predicted distribution is better if it is accurate). The CRPS values for all time slots of a given day are shown in Fig. 8(b). The PDM achieves the lowest CRPS for nearly all time slots, which demonstrates its superior performance in probabilistic forecasting.
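One standard sample-based estimator of the CRPS makes this trade-off explicit: the first term rewards accuracy and the second term subtracts half the ensemble spread. The sketch below applies it per time slot; the array shapes and function names are assumptions for illustration, and the paper may use a different estimator.

```python
import numpy as np


def crps_from_samples(samples, y):
    """Sample-based CRPS: E|X - y| - 0.5 * E|X - X'|."""
    samples = np.asarray(samples, dtype=float)
    accuracy = np.mean(np.abs(samples - y))
    spread = np.mean(np.abs(samples[:, None] - samples[None, :]))
    return accuracy - 0.5 * spread


def crps_per_time_slot(synthetic, actual):
    """Average CRPS over actual observations, one value per time slot.

    synthetic: (num_synthetic, time_steps); actual: (num_actual, time_steps).
    """
    synthetic = np.asarray(synthetic, dtype=float)
    actual = np.asarray(actual, dtype=float)
    scores = np.empty(synthetic.shape[1])
    for t in range(synthetic.shape[1]):
        scores[t] = np.mean(
            [crps_from_samples(synthetic[:, t], y) for y in actual[:, t]]
        )
    return scores


# Toy data only: 100 synthetic and 30 actual profiles with 24 time slots.
rng = np.random.default_rng(0)
synthetic = rng.normal(size=(100, 24))
actual = rng.normal(size=(30, 24))
crps = crps_per_time_slot(synthetic, actual)
print(crps.shape)
```

When all samples coincide, the spread term vanishes and the CRPS reduces to the absolute error, which is why it is often described as a probabilistic generalization of the MAE.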

IV-C4 Quantitative Metrics Evaluation
Multiple quantitative metrics are used to evaluate the quality of the synthetic net load data: mean absolute error (MAE), root mean squared error (RMSE), averaged QS, averaged CRPS, energy score (ES), and variogram score (VS) [25]. The MAE and RMSE evaluate the accuracy of the generated results under specific conditions. The averaged QS enables a detailed assessment of forecast quality at specific probability levels, revealing over-forecasting or under-forecasting, particularly in the tails of the predictive distribution [46]. The ES and VS are evaluation metrics for multivariate data generation [47]. In this paper, we treat the daily net load trajectory as a multivariate random vector. For the calculation of the VS, we use equal weights across all hours of the day and set the order parameter to 0.5 [25].
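The two multivariate scores can be sketched from their usual sample-based forms: the ES extends the CRPS estimator to vectors with the Euclidean norm, and the VS compares pairwise differences between time steps raised to the chosen order. Taking all pairwise weights equal to 1 and the function names below are assumptions for illustration; normalization conventions vary between implementations.

```python
import numpy as np


def energy_score(samples, y):
    """Sample-based ES: E||X - y|| - 0.5 * E||X - X'||.

    samples: (num_samples, time_steps); y: (time_steps,).
    """
    samples = np.asarray(samples, dtype=float)
    y = np.asarray(y, dtype=float)
    accuracy = np.mean(np.linalg.norm(samples - y, axis=1))
    pairwise = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=2)
    return accuracy - 0.5 * pairwise.mean()


def variogram_score(samples, y, order=0.5):
    """Sample-based VS with equal weights (assumed to be 1 for every pair)."""
    samples = np.asarray(samples, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = np.abs(y[:, None] - y[None, :]) ** order
    expected = np.mean(
        np.abs(samples[:, :, None] - samples[:, None, :]) ** order, axis=0
    )
    return np.sum((observed - expected) ** 2)


# Toy data only: 100 synthetic daily trajectories with 24 time steps.
rng = np.random.default_rng(0)
samples = rng.normal(size=(100, 24))
y = rng.normal(size=24)
print(energy_score(samples, y) >= 0.0, variogram_score(samples, y) >= 0.0)
```

Both scores are zero when every sample equals the observation. The VS depends only on differences between time steps, so it responds to errors in the temporal correlation structure that the ES may not separate well.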
The results for the 6 quantitative metrics are presented in Table III. The BDM outperforms the GAN, VAE, and NF models on all six metrics, and the PDM in turn surpasses the BDM on every metric. The improvement of the PDM exceeds 20% for all metrics. The numerical study results suggest that the proposed PDM excels in modeling the complex patterns and unique distribution of net load profiles. Its effectiveness can be attributed to the incorporation of domain knowledge through a physics-informed model. The BDM, although not as powerful as the PDM, also outperforms the other state-of-the-art generative models, which implies that diffusion models have an advantage over other generative models in the task of net load profile generation.
Model | MAE | RMSE | Averaged QS | Averaged CRPS | ES | VS |
---|---|---|---|---|---|---|
WGAN | 1.38 | 3.35 | 0.49 | 0.90 | 11.07 | 1504.04 |
VAE | 1.13 | 2.35 | 0.38 | 0.70 | 8.89 | 1280.41 |
NF | 1.20 | 2.94 | 0.35 | 0.64 | 8.69 | 1303.79 |
BDM | 1.08 | 2.01 | 0.32 | 0.58 | 7.22 | 1078.53 |
PDM | 0.73 | 1.05 | 0.24 | 0.45 | 5.72 | 803.24 |
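
The size of the improvement can be checked directly from the table values. The short calculation below computes the relative reduction of each metric for the PDM with respect to the BDM, the strongest baseline. It assumes the third and fourth value columns are the averaged QS and averaged CRPS, following the order in which the text lists the metrics; the dictionary keys and function name are illustrative.

```python
# Values copied from the BDM and PDM rows of the results table.
bdm = {"MAE": 1.08, "RMSE": 2.01, "QS": 0.32, "CRPS": 0.58,
       "ES": 7.22, "VS": 1078.53}
pdm = {"MAE": 0.73, "RMSE": 1.05, "QS": 0.24, "CRPS": 0.45,
       "ES": 5.72, "VS": 803.24}


def relative_improvement(baseline, proposed):
    """Percentage reduction of a lower-is-better metric."""
    return 100.0 * (baseline - proposed) / baseline


improvements = {
    name: relative_improvement(bdm[name], pdm[name]) for name in bdm
}
for name, value in improvements.items():
    print(f"{name}: {value:.1f}%")
```

Every metric improves by more than 20% relative to the BDM. The smallest gain is on the ES, at roughly 20.8%, and the largest is on the RMSE, at roughly 47.8%.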
V Conclusion
In this paper, we propose a novel physics-informed diffusion model for generating synthetic net load data by embedding a solar PV system performance model into the baseline diffusion model. To mitigate the convergence problem in the model training process, we propose to jointly learn the parameters of the diffusion model for load and the physics-based model for the solar PV system. A unique denoising neural network with conditioning is designed to generate the net load data for different customers and weather conditions. Comprehensive numerical results on the Pecan Street dataset reveal that the proposed PDM significantly outperforms other generative models such as GANs, VAEs, NFs, and the BDM. Specifically, the proposed PDM yields more than 20% improvement across all evaluation metrics compared to state-of-the-art generative models. Future extensions of this work include scaling the model to larger datasets and considering other types of behind-the-meter resources.
References
- [1] W. N. Silva, L. H. Bandória, B. H. Dias, M. C. de Almeida, and L. W. de Oliveira, “Generating realistic load profiles in smart grids: An approach based on nonlinear independent component estimation (NICE) and convolutional layers,” Applied Energy, vol. 351, p. 121902, 2023.
- [2] L. Zhang and B. Zhang, “Scenario forecasting of residential load profiles,” IEEE Journal on Selected Areas in Communications, vol. 38, no. 1, pp. 84–95, 2019.
- [3] A. Dehdarian, “Scenario-based system dynamics modeling for the cost recovery of new energy technology deployment: The case of smart metering roll-out,” Journal of cleaner production, vol. 178, pp. 791–803, 2018.
- [4] D. Lee and D. J. Hess, “Data privacy and residential smart meters: Comparative analysis and harmonization potential,” Utilities Policy, vol. 70, p. 101188, 2021.
- [5] S. Lee and D.-H. Choi, “Federated reinforcement learning for energy management of multiple smart homes with distributed energy resources,” IEEE Trans. Ind. Informat., vol. 18, no. 1, pp. 488–497, 2020.
- [6] B. Yilmaz and R. Korn, “Synthetic demand data generation for individual electricity consumers: Generative adversarial networks (GANs),” Energy and AI, vol. 9, p. 100161, 2022.
- [7] A. Pinceti, O. Kosut, and L. Sankar, “Data-driven generation of synthetic load datasets preserving spatio-temporal features,” in 2019 IEEE Power & Energy Society General Meeting (PESGM), pp. 1–5.
- [8] A. Menati, K. Lee, and L. Xie, “Modeling and analysis of utilizing cryptocurrency mining for demand flexibility in electric energy systems: A synthetic Texas grid case study,” IEEE Trans. Energy Mark. Policy Regul., vol. 1, no. 1, pp. 1–10, 2023.
- [9] B. Asare-Bediako, W. Kling, and P. Ribeiro, “Future residential load profiles: Scenario-based analysis of high penetration of heavy loads and distributed generation,” Energy and Buildings, vol. 75, pp. 228–238, 2014.
- [10] G. Pinto, Z. Wang, A. Roy, T. Hong, and A. Capozzoli, “Transfer learning for smart buildings: A critical review of algorithms, applications, and future perspectives,” Advances in Applied Energy, vol. 5, p. 100084, 2022.
- [11] Z. Wang and H. Zhang, “Customized load profiles synthesis for electricity customers based on conditional diffusion models,” arXiv preprint arXiv:2304.12076, 2023.
- [12] L. Diao, Y. Sun, Z. Chen, and J. Chen, “Modeling energy consumption in residential buildings: A bottom-up analysis based on occupant behavior pattern clustering and stochastic simulation,” Energy and Buildings, vol. 147, pp. 47–66, 2017.
- [13] J. M. G. López, E. Pouresmaeil, C. A. Canizares, K. Bhattacharya, A. Mosaddegh, and B. V. Solanki, “Smart residential load simulator for energy management in smart grids,” IEEE Trans. Ind. Electron., vol. 66, no. 2, pp. 1443–1452, 2018.
- [14] J. Dickert and P. Schegner, “A time series probabilistic synthetic load curve model for residential customers,” in 2011 IEEE Trondheim PowerTech, pp. 1–6.
- [15] J. K. Gruber and M. Prodanovic, “Residential energy load profile generation using a probabilistic approach,” in 2012 Sixth UKSim/AMSS European Symposium on Computer Modeling and Simulation, pp. 317–322.
- [16] X. Liu, N. Iftikhar, H. Huo, R. Li, and P. S. Nielsen, “Two approaches for synthesizing scalable residential energy consumption data,” Future Generation Computer Systems, vol. 95, pp. 586–600, 2019.
- [17] A. Al-Wakeel, J. Wu, and N. Jenkins, “K-means based load estimation of domestic smart meter measurements,” Applied energy, vol. 194, pp. 333–342, 2017.
- [18] Y.-I. Kim, S.-J. Kang, J.-M. Ko, and S.-H. Choi, “A study for clustering method to generate typical load profiles for smart grid,” in 8th International Conference on Power Electronics-Asia, 2011, pp. 1102–1109.
- [19] W. Labeeuw and G. Deconinck, “Residential electrical load model based on mixture model clustering and Markov models,” IEEE Trans. Ind. Informat., vol. 9, no. 3, pp. 1561–1569, 2013.
- [20] G. G. Pillai, G. A. Putrus, and N. M. Pearsall, “Generation of synthetic benchmark electrical load profiles using publicly available load and weather data,” Int. J. Electr. Power & Energy Syst., vol. 61, pp. 1–10, 2014.
- [21] J. Sarochar, I. Acharya, H. Riggs, A. Sundararajan, L. Wei, T. Olowu, and A. I. Sarwat, “Synthesizing energy consumption data using a mixture density network integrated with long short term memory,” in 2019 IEEE Green Technologies Conference (GreenTech), pp. 1–4.
- [22] M. Salazar, I. Dukovska, P. H. Nguyen, R. Bernards, and H. J. Slootweg, “Data driven framework for load profile generation in medium voltage networks via transfer learning,” in 2020 IEEE PES Innovative Smart Grid Technologies Europe (ISGT-Europe), pp. 909–913.
- [23] Z. Pan, J. Wang, W. Liao, H. Chen, D. Yuan, W. Zhu, X. Fang, and Z. Zhu, “Data-driven EV load profiles generation using a variational auto-encoder,” Energies, vol. 12, no. 5, p. 849, 2019.
- [24] C. Wang, S. H. Tindemans, and P. Palensky, “Generating contextual load profiles using a conditional variational autoencoder,” in 2022 IEEE PES Innovative Smart Grid Technologies Conference Europe, pp. 1–6.
- [25] J. Dumas, A. Wehenkel, D. Lanaspeze, B. Cornélusse, and A. Sutera, “A deep generative model for probabilistic energy forecasting in power systems: normalizing flows,” Applied Energy, vol. 305, p. 117871, 2022.
- [26] S. El Kababji and P. Srikantha, “A data-driven approach for generating synthetic load patterns and usage habits,” IEEE Trans. Smart Grid, vol. 11, no. 6, pp. 4984–4995, 2020.
- [27] Z. Wang and T. Hong, “Generating realistic building electrical load profiles through the generative adversarial network (GAN),” Energy and Buildings, vol. 224, p. 110299, 2020.
- [28] N. M. M. Bendaoud, N. Farah, and S. B. Ahmed, “Comparing generative adversarial networks architectures for electricity demand forecasting,” Energy and Buildings, vol. 247, p. 111152, 2021.
- [29] J. Li, Z. Chen, L. Cheng, and X. Liu, “Energy data generation with wasserstein deep convolutional generative adversarial networks,” Energy, vol. 257, p. 124694, 2022.
- [30] L. Song, Y. Li, and N. Lu, “Profilesr-GAN: A GAN based super-resolution method for generating high-resolution load profiles,” IEEE Trans. Smart Grid, vol. 13, no. 4, pp. 3278–3289, 2022.
- [31] J. Huang, Q. Huang, G. Mou, and C. Wu, “DPWGAN: High-quality load profiles synthesis with differential privacy guarantees,” IEEE Trans. Smart Grid, vol. 14, no. 4, pp. 3283–3295, 2023.
- [32] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in International conference on machine learning. PMLR, 2017, pp. 214–223.
- [33] Y. Chen, Y. Wang, D. Kirschen, and B. Zhang, “Model-free renewable scenario generation using generative adversarial networks,” IEEE Trans. Power Syst., vol. 33, no. 3, pp. 3265–3275, 2018.
- [34] H. Cao, C. Tan, Z. Gao, Y. Xu, G. Chen, P.-A. Heng, and S. Z. Li, “A survey on generative diffusion model,” arXiv preprint arXiv:2209.02646, 2022.
- [35] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
- [36] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, “Wavegrad: Estimating gradients for waveform generation,” arXiv preprint arXiv:2009.00713, 2020.
- [37] B. Aristimunha, R. Y. de Camargo, S. Chevallier, O. Lucena, A. G. Thomas, M. J. Cardoso, W. H. L. Pinaya, and J. Dafflon, “Synthetic sleep EEG signal generation using latent diffusion models,” in Deep Generative Models for Health Workshop NeurIPS 2023, 2023.
- [38] X. Dong, Z. Mao, Y. Sun, and X. Xu, “Short-term wind power scenario generation based on conditional latent diffusion models,” IEEE Trans. Sustain. Energy, vol. 15, no. 2, pp. 1074–1085, 2024.
- [39] S. Li, H. Xiong, and Y. Chen, “Diffcharge: Generating EV charging scenarios via a denoising diffusion model,” IEEE Trans. Smart Grid, 2024.
- [40] Pecan Street Inc., “Pecan street dataport,” 2020. [Online]. Available: https://dataport.pecanstreet.org/
- [41] J. S. Stein, “The photovoltaic performance modeling collaborative (PVPMC),” in 2012 38th IEEE Photovoltaic Specialists Conference. IEEE, 2012, pp. 003048–003052.
- [42] A. P. Dobos, “PVWatts version 5 manual,” National Renewable Energy Lab.(NREL), Golden, CO (United States), Tech. Rep., 2014.
- [43] National Renewable Energy Laboratory, “System Advisor Model (SAM),” https://github.com/NREL/SAM/tree/develop/deploy/libraries, 2023, accessed: 2024-05-28.
- [44] M. Sengupta, Y. Xie, A. Lopez, A. Habte, G. Maclaurin, and J. Shelby, “The national solar radiation data base (NSRDB),” Renewable and sustainable energy reviews, vol. 89, pp. 51–60, 2018.
- [45] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein GANs,” Advances in neural information processing systems, vol. 30, 2017.
- [46] P. Lauret, M. David, and P. Pinson, “Verification of solar irradiance probabilistic forecasts,” Solar Energy, vol. 194, pp. 254–271, 2019.
- [47] M. Scheuerer and T. M. Hamill, “Variogram-based proper scoring rules for probabilistic forecasts of multivariate quantities,” Monthly Weather Review, vol. 143, no. 4, pp. 1321–1334, 2015.