
Physics-guided Residual Learning for
Probabilistic Power Flow Analysis

Kejun Chen and Yu Zhang K. Chen and Y. Zhang are with the Department of Electrical and Computer Engineering at the University of California, Santa Cruz. Emails: {kchen158,yzhan419}@ucsc.edu This work was supported in part by the Faculty Research Grant of UC Santa Cruz and the Hellman Fellowship (Corresponding author: Yu Zhang).
Abstract

Probabilistic power flow (PPF) analysis is critical to power system operation and planning. PPF aims at obtaining probabilistic descriptions of the state of the system under stochastic power injections (e.g., renewable power generation and load demands). Given power injection samples, numerical methods repeatedly run classic power flow (PF) solvers to find the voltage phasors. However, the computational burden is heavy due to the many PF simulations. Recently, many data-driven PF solvers have been proposed thanks to the availability of sufficient measurements. This paper proposes a novel neural network (NN) framework that accurately approximates the non-linear AC-PF equations. The trained NN works as a rapid PF solver, significantly reducing the heavy computational burden of classic PPF analysis. Inspired by residual learning, we develop a fully connected linear layer between the input and output of the multilayer perceptron (MLP). To improve the NN training convergence, we propose three schemes to initialize the weights of this shortcut connection layer based on the physical characteristics of the AC-PF equations. Specifically, two model-based methods require knowledge of the system topology and line parameters, while the purely data-driven method works without power grid parameters. Numerical tests on five benchmark systems show that our proposed approaches achieve higher accuracy in estimating voltage phasors than existing methods. In addition, the three initialization schemes help the NN training process converge faster, which is appealing when training time is limited.

Index Terms:
Data-driven, neural network, probabilistic power flow, physics-guided initialization, residual learning.

I Introduction

Renewable power generation technology has developed rapidly due to its great advantages in economic savings and environmental friendliness [1]. However, compared with conventional generation, renewable generation (e.g., solar and wind power) is highly dependent on weather conditions, such as ambient temperature, solar irradiance, relative humidity, and wind speed [2]. Hence, renewable generation introduces substantial uncertainty into power system operation, e.g., significant fluctuations of voltage phasors and branch flows. Probabilistic power flow (PPF) analysis is essential for characterizing power system uncertainties under various random variations [3]. PPF focuses on obtaining the probabilistic distributions of voltage phasors and branch flows (output variables) under the uncertainties of power injections (input variables). Existing schemes for solving PPF problems can be divided into analytical, approximate, and numerical methods.

Analytical methods use the first-order Taylor expansion of AC power flow (AC-PF) equations around the operating point. The uncertainties of voltage phasors (including voltage magnitudes and voltage angles) are represented as linear combinations of the variations of power injections. For example, cumulant methods use arithmetic operations to calculate moments of voltage phasors and obtain their probability density functions (PDFs) with various series expansions [4]. Their computational efficiency is high for large-scale power systems. However, the linearization approximation suffers significant errors when power injections deviate far from the operating point [5]. In addition, these series expansions (e.g., Gram–Charlier, Edgeworth, and Cornish–Fisher expansions) cannot always guarantee convergence [4].

Point estimation methods are the most commonly used approximate methods, which conduct deterministic PF analysis over a limited number of selected sample points [6]. Given the first few statistical moments of input variables, they calculate the moments of output variables and then apply series expansions to obtain their PDFs. However, calculating exact higher‐order moments of many input variables is challenging, leading to decreased estimation accuracy in medium- and large-scale bus systems [7]. In addition, to cope with the potential divergence issue brought about by the series expansions, the stochastic response surface method has been applied to PPF analysis. It can describe the system responses based on polynomial chaos expansion and obtain the PDFs using kernel density estimation, e.g., [8] and [9].

Numerical methods, whose canonical example is Monte Carlo simulation (MCS), have been widely applied in PPF analysis [10]. MCS generates samples from stochastic power injection distributions and calculates their corresponding voltage phasors using the Newton–Raphson (NR) algorithm. In addition, some improved MCS-like approaches achieve better computation efficiency by reducing the sampling number, such as Latin supercube sampling [11] and Quasi-Monte Carlo (QMC) [12]. However, these improved MCS-like methods still need to call the NR solver repeatedly over many samples to guarantee accurate estimation of the PDFs of voltage phasors. Therefore, the cumulative computational time of these samples is long.

Speeding up the PF analysis of each sample without sacrificing accuracy is another promising direction for reducing the total computational time. Recently, widely deployed phasor measurement units (PMUs) and supervisory control and data acquisition (SCADA) systems have made sufficient measurement data available. Thus, data-driven PF solvers have gained increasing attention. They learn the mapping from power injections to voltage phasors based on historical input-output data pairs [13]. For example, [14] and [15] use linear regression to approximate the decoupled linear PF functions. Their proposed approaches do not rely on power grid parameters and perform better than model-based decoupled linear PF models. However, these linear models suffer from accuracy limitations because they cannot extract the non-linear features of the PF functions. In addition, [16] and [17] use Gaussian process regression, which can only obtain the PDF of a single target quantity at a time. This property prevents its application in medium- and large-scale power systems if the PDFs of all buses' voltage phasors are required.

Exploiting the impressive capability of neural networks (NNs) in function approximation, we focus on employing the NN as a rapid PF solver. The motivations are primarily two-fold. First, the universal approximation theorem states that NNs can approximate arbitrary continuous functions [18]. Thus, they are expected to approximate the AC-PF equations accurately. Second, the time-consuming training process is implemented offline. Once the training is completed, the trained NN is ready for PF analysis. Specifically, it takes power injections as the input and outputs the corresponding voltage phasors. The forward propagation is fast; therefore, the total prediction time over many input power injection samples is negligible. The PDFs of the voltage phasors can then be obtained from the output samples.

NN-based approaches have recently become popular in PF and PPF analysis. For example, [19] and [20] employ multilayer perceptrons (MLPs) to approximate the AC-PF equations and achieve much shorter computational time than MCS. The loss function consists of the mean square errors of voltage magnitudes/angles and active/reactive branch flows. However, the capability of the NN may be insufficient to minimize these four errors simultaneously [21]. Besides, the NN training time increases significantly due to the involvement of branch flow calculations. In addition, [22] employs MLPs while incorporating the system topology to solve the PF problem. Under physical guidance, their proposed topology-pruned bilinear neural network (TPBNN) performs better than the MLP. However, the model outputs are the real and imaginary parts of the voltage phasors. After transforming them to voltage magnitudes and angles, the calculation errors can get magnified. In addition, electrical grids can be abstracted as sparsely connected graphs composed of nodal buses and power branches. Graph convolutional neural networks can exploit topological information and aggregate local information, and [23] and [24] have applied them to PF analysis. However, the formulation of the AC-PF equations shows that the power injections at each bus affect not only its neighboring buses but also all other buses in the electrical grid. Therefore, the benefits of graph convolutional layers over fully connected layers may be limited [25].

In addition, residual neural networks (ResNets) have been widely used in image recognition. ResNets consist of many stacked residual building blocks, which are obtained by introducing skip-layer connections among the layers of MLPs. A key design is to attach a direct path between the input and output of one residual block, the so-called identity shortcut connection. These shortcut connections help deal with vanishing gradients and accuracy saturation in deep NNs [26]. The motivation is that learning the residual with respect to an identity mapping should be easier than learning the desired mapping directly [27].

In this paper, we propose a novel physics-guided NN framework to approximate the inverse AC-PF equations. We further employ it as the rapid PF solver to reduce the total computational time of many samples in PPF analysis. The PDF can be obtained using the non-parametric kernel density estimation method based on data samples. The main contributions of this paper are summarized below:

  • Given sufficient historical operational input-output data pairs, we train the NN to learn the mapping from power injections to voltage phasors. The trained NN will serve as a rapid data-driven PF solver in PPF analysis.

  • Motivated by residual learning, we introduce a fully connected linear layer between the input and output in the MLP structure. Therefore, the original MLP essentially learns the non-linear corrections to the linear mapping from the power injections to voltage phasors.

  • Leveraging the linearized formulations of AC-PF equations, we design three initialization methods for the weights of the shortcut connection layer. Furthermore, the proposed data-driven initialization scheme does not require the knowledge of system topology and line parameters.

The remaining part of this paper is organized as follows. Section II describes the problem formulation. Section III details the proposed network and three novel initialization methods. Section IV shows the simulation results tested on five benchmark systems. Finally, Section V presents the concluding remarks.

Notation: Upper (lower) boldface letters are used for matrices (column vectors). Sets are denoted by calligraphic letters. $(\cdot)^{\top}$ is the vector/matrix transpose; $\|\cdot\|_2$ denotes the vector $\ell_2$-norm; $(\cdot)^{-1}$ and $(\cdot)^{\dagger}$ denote the inverse and pseudo-inverse, respectively.

II Problem Formulation

Deterministic PF analysis is the cornerstone of PPF analysis. This section first introduces the system description and the problem formulation. Then, we show the connection between PF analysis and PPF analysis.

II-A System Description

PF analysis aims to analyze the steady-state operating points of an electrical grid. The operational data includes power generation, load demands, voltage phasors, and branch flows. There are three types of buses: PQ buses, PV buses, and one slack bus. A PQ bus (a.k.a. load bus) has no generator attached, where its active and reactive power injections are fixed. A PV bus (generator bus) has generators connected, and its active power injection and voltage magnitude are known. Lastly, the slack bus has the given voltage angle and voltage magnitude. Therefore, the unknown variables include the voltage angles and voltage magnitudes of PQ buses and the voltage angles of PV buses.

Consider a power grid with $N$ buses, comprising one slack bus, $N_g$ PV buses (denoted by set $\mathcal{N}_g$), and $(N-N_g-1)$ PQ buses (denoted by set $\mathcal{N}_l$). The number of unknown variables is $2\times(N-N_g-1)+N_g$, which equals the number of power balance equations.

II-B AC Power Flow Equations

Let $P_i$ and $Q_i$ represent the active and reactive power injections at bus $i$, while $V_i$ and $\theta_i$ are the voltage magnitude and angle. The AC-PF equations are the forward mappings from voltage phasors to power injections, given as:

$P_i = \sum_{j=1}^{N} V_i V_j (G_{ij}\cos\theta_{ij} + B_{ij}\sin\theta_{ij}), \quad \forall i \in \mathcal{N}_g \cup \mathcal{N}_l,$  (1a)
$Q_i = \sum_{j=1}^{N} V_i V_j (G_{ij}\sin\theta_{ij} - B_{ij}\cos\theta_{ij}), \quad \forall i \in \mathcal{N}_l,$  (1b)

where $\theta_{ij} := \theta_i - \theta_j$ is the voltage angle difference between bus $i$ and bus $j$. $G_{ij}$ and $B_{ij}$ are the real and imaginary parts of the $(i,j)$-th element of the nodal admittance matrix $\mathbf{Y} \in \mathbb{C}^{N\times N}$. Moreover, the active and reactive branch power flows between two connected buses $i$ and $j$ are:

$P_{ij} = V_i V_j (G_{ij}\cos\theta_{ij} + B_{ij}\sin\theta_{ij}) - G_{ij}V_i^2,$  (2a)
$Q_{ij} = V_i V_j (G_{ij}\sin\theta_{ij} - B_{ij}\cos\theta_{ij}) + B_{ij}V_i^2 - \frac{b_{ij}^c}{2}V_i^2,$  (2b)

where $b_{ij}^c$ is the total line-charging susceptance.
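To make the forward mapping concrete, here is a minimal numpy sketch of evaluating (1); the 2-bus line impedance and operating point below are assumed purely for illustration:

```python
import numpy as np

def power_injections(V, theta, G, B):
    """Evaluate the AC-PF equations (1): nodal active/reactive injections."""
    dth = theta[:, None] - theta[None, :]          # theta_ij = theta_i - theta_j
    VV = V[:, None] * V[None, :]                   # V_i * V_j for all bus pairs
    P = np.sum(VV * (G * np.cos(dth) + B * np.sin(dth)), axis=1)
    Q = np.sum(VV * (G * np.sin(dth) - B * np.cos(dth)), axis=1)
    return P, Q

# Illustrative 2-bus system: one line with assumed series impedance r + jx (p.u.)
y_line = 1.0 / (0.01 + 0.1j)
Y = np.array([[y_line, -y_line], [-y_line, y_line]])   # nodal admittance matrix
G, B = Y.real, Y.imag
V = np.array([1.0, 0.98])                              # magnitudes (p.u.)
theta = np.array([0.0, -0.02])                         # angles (rad)
P, Q = power_injections(V, theta, G, B)
```

A quick consistency check is the complex-power identity $S_i = V_i e^{j\theta_i}\,\overline{(\mathbf{Y}\mathbf{v})_i}$, whose real and imaginary parts reproduce (1a) and (1b).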

Clearly, the power injections and branch flows are determined by all the voltage phasors, which are defined as the state of the system:

$\mathbf{z} := [\theta_s, \bm{\theta}_g, \bm{\theta}_l, V_s, \mathbf{V}_g, \mathbf{V}_l]^{\top},$

where subscripts $(\cdot)_s$, $(\cdot)_g$, and $(\cdot)_l$ denote the quantities corresponding to the slack bus, generator buses, and load buses, respectively. Let $\mathbf{P}_g$, $\mathbf{P}_l$, and $\mathbf{Q}_l$ denote the active power of generator buses and the active/reactive power injections of load buses. Partitioning and reorganizing the admittance matrix $\mathbf{Y}$ in the same manner yields

$\tilde{\mathbf{Y}} = \begin{bmatrix} \mathbf{Y}_{ss} & \mathbf{Y}_{sg} & \mathbf{Y}_{sl} \\ \mathbf{Y}_{gs} & \mathbf{Y}_{gg} & \mathbf{Y}_{gl} \\ \mathbf{Y}_{ls} & \mathbf{Y}_{lg} & \mathbf{Y}_{ll} \end{bmatrix},$  (6)

where $\mathbf{Y}_{gs}$ is formed from the rows of $\mathbf{Y}$ that correspond to the generator buses and the column for the slack bus. All other blocks are defined similarly. Define

$\mathbf{x} = [\mathbf{P}_g; \mathbf{P}_l; \mathbf{Q}_l]^{\top},$
$\mathbf{y} = [\bm{\theta}_g; \bm{\theta}_l; \mathbf{V}_l]^{\top}.$

Hence, the inverse mappings of the AC-PF equations can be compactly expressed as:

$\mathbf{y} = \mathbf{f}(\mathbf{x}).$  (7)

II-C PPF Analysis

PPF analysis is closely related to the aforementioned AC-PF problem. Consider an electricity grid with stochastic renewable power generation and load demands. PPF studies aim to characterize the uncertainties of voltage phasors induced by the fluctuations of power injections. As a new effort in PPF studies, NNs are employed to learn the end-to-end mapping (7) from historical data pairs $(\mathbf{x}, \mathbf{y})$. NNs take in power injections $\mathbf{x}$ and predict the corresponding voltage phasors $\mathbf{y}$. The PDFs of the voltage phasors can then be inferred from the output samples. By shifting the time-consuming training process offline, NNs can rapidly predict the voltage phasors corresponding to new power injection samples in the testing stage. Even with many new input samples, NNs remain computationally efficient because the forward propagation during the test stage is very fast.
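For instance, once the trained NN produces voltage samples, the PDF at a single bus can be recovered with a one-dimensional Gaussian kernel density estimate; the sample distribution below is a stand-in for NN outputs, and Silverman's rule-of-thumb bandwidth is one common choice:

```python
import numpy as np

def gaussian_kde_1d(samples, grid):
    """Gaussian kernel density estimate with Silverman's rule-of-thumb bandwidth."""
    n = samples.size
    h = 1.06 * samples.std(ddof=1) * n ** (-0.2)       # Silverman's rule
    z = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * h * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
# Stand-in for NN-predicted voltage magnitudes at one bus (assumed distribution)
v_samples = 1.0 + 0.01 * rng.standard_normal(5000)
grid = np.linspace(0.95, 1.05, 201)
pdf = gaussian_kde_1d(v_samples, grid)
```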

III Physics-guided PPF Analysis

This section presents the details of our proposed NN structure designed for approximating the inverse AC-PF mappings (7). Furthermore, we develop three novel initialization methods for the shortcut connection layer in different scenarios. Two are model-based initialization schemes, which require the knowledge of system topology and line parameters, while the other data-driven initialization scheme does not require the power grid’s parameters.

III-A Residual Learning

Unlike the building blocks in the MLP, residual blocks introduce an identity mapping as a shortcut connection that skips one or more layers [27]. Fig. 1a illustrates the structure of one residual building block, where $\mathbf{G}(\cdot)$ denotes the mapping from $\mathbf{u}$ to $\bar{\mathbf{v}}$. The motivation for the residual building block is as follows. An MLP consists of a few stacked layers and exhibits universal function approximation capabilities. Thus, it is reasonable to assume that the MLP can approximate the residual function $\mathbf{R}(\cdot)$. Then, if the stacked layers are only used to approximate $\mathbf{R}(\cdot)$, the original function $\mathbf{G}(\mathbf{u}) := \mathbf{R}(\mathbf{u}) + \mathbf{u}$ can be obtained by introducing an extra identity shortcut connection. It has been shown that ResNets are also universal function approximators [28].

Figure 1: (a) One residual building block that skips two weight layers. (b) The proposed architecture tailored to PF analysis.

III-B Designed framework for PF Analysis

However, the benefits of ResNets over MLPs in PF analysis are not obvious. NNs applied in the image recognition domain are usually composed of dozens or even hundreds of layers, where deep ResNets show clear advantages over deep MLPs because they avoid the degradation issue. In contrast, very deep NNs may not be necessary for approximating the inverse AC-PF equations, so these advantages are less evident here.

In addition, [27] argued that if the desired function is close to the identity function, it is easier for the stacked layers to learn the perturbations with respect to the identity than to approximate the desired function directly. Inspired by their work, we propose that, compared with MLPs that directly approximate the complicated inverse AC-PF equations, it may be easier to push the residual correction with respect to a linear relationship toward zero. This motivates us to design a novel architecture with initialization methods tailored for PF analysis.

In practical power systems, the voltage magnitudes of buses are approximately 1 per unit, and the voltage angle differences are small [29]. Many existing works use these two properties to linearize the AC-PF equations, and numerical results have shown that these linear models perform well. Therefore, we make novel modifications to the MLP structure to better exploit this linear property. As shown in Fig. 1b, we replace the identity mapping with a fully connected linear layer. Compared with the MLP that tries to approximate the mapping from $\mathbf{x}$ to $\mathbf{y}$ directly, our designed framework uses a set of stacked layers to approximate only the latent residual function $\mathbf{R}(\cdot)$. The stacked layers are composed of fully connected linear layers and rectified linear unit (ReLU) activation functions [30]. Let $\mathbf{W}_s$ and $\mathbf{b}_s$ denote the weight matrix and bias of the shortcut connection linear layer, respectively. $\mathbf{W}_i$ and $\mathbf{b}_i$ are the weight matrix and bias of the $i$-th fully connected linear layer, and $\bm{\sigma}$ is the activation function. The residual output $\mathbf{y}_r$ and the final output $\mathbf{y}$ can be written as:

$\mathbf{y}_r = \mathbf{W}_3(\bm{\sigma}(\mathbf{W}_2(\bm{\sigma}(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1)) + \mathbf{b}_2)) + \mathbf{b}_3,$  (8)
$\mathbf{y} = \mathbf{y}_r + (\mathbf{W}_s\mathbf{x} + \mathbf{b}_s).$  (9)
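A minimal numpy sketch of the forward pass (8)-(9), with toy dimensions and weight scales assumed for illustration; when the residual-branch weights are small, the output is dominated by the linear shortcut term:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, Ws, bs, W1, b1, W2, b2, W3, b3):
    """Residual MLP with a linear shortcut connection, eqs. (8)-(9)."""
    yr = W3 @ relu(W2 @ relu(W1 @ x + b1) + b2) + b3   # residual branch, eq. (8)
    return yr + (Ws @ x + bs)                          # plus shortcut, eq. (9)

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 4, 6, 3                           # toy sizes (assumed)
W1, b1 = 0.01 * rng.standard_normal((d_hid, d_in)), np.zeros(d_hid)
W2, b2 = 0.01 * rng.standard_normal((d_hid, d_hid)), np.zeros(d_hid)
W3, b3 = 0.01 * rng.standard_normal((d_out, d_hid)), np.zeros(d_out)
Ws, bs = rng.standard_normal((d_out, d_in)), rng.standard_normal(d_out)

x = rng.standard_normal(d_in)
y = forward(x, Ws, bs, W1, b1, W2, b2, W3, b3)
```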

III-C Physics-guided Initialization Methods

In NN training, the learnable weights are typically randomly initialized, and the initialization can be critical to NN performance. We develop two physics-guided initialization methods for the weights of the shortcut connection layer. These two methods significantly accelerate the training process and improve NN performance compared with random initialization. The goal is to make $\mathbf{y}_r$ close to zero in the initial stage of the training process by initializing $\mathbf{W}_s$ and $\mathbf{b}_s$ properly. If $\mathbf{W}_s$ and $\mathbf{b}_s$ are randomly initialized, the value of $\mathbf{y}_r = \mathbf{y} - (\mathbf{W}_s\mathbf{x} + \mathbf{b}_s)$ is far from zero. To drive $\mathbf{y}_r$ toward zero, $\mathbf{W}_s\mathbf{x} + \mathbf{b}_s$ should be close to the output $\mathbf{y}$. In other words, after initialization, the shortcut connection linear layer should serve as a good linear approximation of the mapping from $\mathbf{x}$ to $\mathbf{y}$. Therefore, we adapt two linear PF models to initialize $\mathbf{W}_s$ and $\mathbf{b}_s$ in the proposed framework.

III-C1 Pre-initialization using linearized PF model

A decoupled linear PF model is proposed in [31], and the linearized AC-PF equations can be expressed as:

$P_i = \sum_{j=1}^{N} -B'_{ij}\theta_j + \sum_{j=1}^{N} G_{ij}V_j, \quad i \in \mathcal{N}_g \cup \mathcal{N}_l,$  (10a)
$Q_i = \sum_{j=1}^{N} -G_{ij}\theta_j + \sum_{j=1}^{N} -B_{ij}V_j, \quad i \in \mathcal{N}_l,$  (10b)

where $B'_{ij}$ is the imaginary part of the $(i,j)$-th element of the nodal admittance matrix without shunt elements and line-charging susceptance. Note that the summations in (10) run over all buses. Since the voltage phasor of the slack bus and the voltage magnitudes of the PV buses are known, we separate them from the remaining unknown voltage phasors. After the separation, (10) can be compactly rewritten as

$\mathbf{x} = \mathbf{E}\mathbf{c} + \mathbf{F}\mathbf{y},$  (11)

where

$\mathbf{E} = \begin{bmatrix} -\mathbf{B}'_{gs} & \mathbf{G}_{gs} & \mathbf{G}_{gg} \\ -\mathbf{B}'_{ls} & \mathbf{G}_{ls} & \mathbf{G}_{lg} \\ -\mathbf{G}_{ls} & -\mathbf{B}_{ls} & -\mathbf{B}_{lg} \end{bmatrix} \in \mathbb{R}^{(2N-N_g-2)\times(N_g+2)},$
$\mathbf{F} = \begin{bmatrix} -\mathbf{B}'_{gg} & -\mathbf{B}'_{gl} & \mathbf{G}_{gl} \\ -\mathbf{B}'_{lg} & -\mathbf{B}'_{ll} & \mathbf{G}_{ll} \\ -\mathbf{G}_{lg} & -\mathbf{G}_{ll} & -\mathbf{B}_{ll} \end{bmatrix} \in \mathbb{R}^{(2N-N_g-2)\times(2N-N_g-2)},$
$\mathbf{c} = [\theta_s, V_s, \mathbf{V}_g]^{\top} \in \mathbb{R}^{N_g+2}.$

Assuming that the network topology and line parameters are known, the matrices $\mathbf{E}$ and $\mathbf{F}$ are fixed. The vector $\mathbf{c}$ is also fixed since the voltage phasor of the slack bus and the voltage magnitudes of the PV buses are known. According to (11), the linear mapping from $\mathbf{x}$ to $\mathbf{y}$ becomes:

$\mathbf{y} = \mathbf{F}^{\dagger}\mathbf{x} - \mathbf{F}^{\dagger}\mathbf{E}\mathbf{c},$  (12)

where $\mathbf{F}^{\dagger}$ is the pseudo-inverse of $\mathbf{F}$. Note that if there are zero power injections at the PV and PQ buses, $\mathbf{F}$ is not invertible. Thus, we can use $\mathbf{F}^{\dagger}$ and $-\mathbf{F}^{\dagger}\mathbf{E}\mathbf{c}$ to pre-initialize $\mathbf{W}_s$ and $\mathbf{b}_s$, respectively.
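The pre-initialization step can be sketched as follows, with small random matrices standing in for the blocks $\mathbf{E}$ and $\mathbf{F}$ (toy sizes assumed); for any $\mathbf{x}$ consistent with (11), the initialized layer recovers $\mathbf{y}$ exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 5, 3                        # toy sizes: 2N - Ng - 2 = 5, Ng + 2 = 3
F = rng.standard_normal((m, m))    # stand-in for the admittance block matrix F
E = rng.standard_normal((m, k))    # stand-in for E
c = rng.standard_normal(k)         # stand-in for c = [theta_s, V_s, V_g]

Ws = np.linalg.pinv(F)             # pre-initialize shortcut weights, eq. (12)
bs = -Ws @ E @ c                   # pre-initialize shortcut bias, eq. (12)

# For any y consistent with the linear model (11), the initialization recovers it
y_true = rng.standard_normal(m)
x = E @ c + F @ y_true
y_lin = Ws @ x + bs
```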

III-C2 Pre-initialization using Jacobian matrix model

The AC-PF equations can also be linearized via the first-order Taylor expansion around a nominal operating point, denoted as $(\mathbf{x}_0, \mathbf{y}_0)$. Expanding (1) around the operating point and ignoring higher-order terms, the linearized AC-PF equations can be expressed as:

$\mathbf{x} = \mathbf{x}_0 + \mathbf{J}(\mathbf{y} - \mathbf{y}_0),$  (13)
$\mathbf{y} = \mathbf{J}^{-1}\mathbf{x} - \mathbf{J}^{-1}\mathbf{x}_0 + \mathbf{y}_0,$  (14)

where the Jacobian matrix $\mathbf{J} \in \mathbb{R}^{(2N-N_g-2)\times(2N-N_g-2)}$ is evaluated at $\mathbf{y}_0$. We can use $\mathbf{J}^{-1}$ and $-\mathbf{J}^{-1}\mathbf{x}_0 + \mathbf{y}_0$ to pre-initialize $\mathbf{W}_s$ and $\mathbf{b}_s$.

The rationale for using the Jacobian matrix for pre-initialization is straightforward. If the input power injections are slightly perturbed around $\mathbf{x}_0$, the corresponding output voltage phasors should be close to $\mathbf{y}_0$. In the first training iteration, the residual output then represents the higher-order remainder of the Taylor series, so its value should be closer to zero than under random initialization.
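The idea can be illustrated on a toy nonlinear forward map standing in for (1) (the map $g$ below is assumed for the example); initializing the linear layer from the Jacobian, as in (14), makes it an accurate inverse near the operating point:

```python
import numpy as np

def g(y):
    """Toy nonlinear forward map standing in for the AC-PF equations x = g(y)."""
    return np.array([y[0] ** 2 + y[1], np.sin(y[0]) + y[1] ** 2])

y0 = np.array([0.5, 1.0])          # nominal operating point (assumed)
x0 = g(y0)

# Finite-difference Jacobian of g at y0 (the analytic PF Jacobian in practice)
eps = 1e-6
J = np.column_stack([(g(y0 + eps * e) - x0) / eps for e in np.eye(2)])

Ws = np.linalg.inv(J)              # shortcut weights, eq. (14)
bs = -Ws @ x0 + y0                 # shortcut bias, eq. (14)

# Near the operating point, the initialized layer approximates the inverse map
y_true = y0 + np.array([0.01, -0.02])
x = g(y_true)
y_est = Ws @ x + bs
```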

III-D Initialization based on Data-driven PF Model

The line parameter profiles are often only available from grid planning files, which are likely outdated [22]. If accurate information on the line parameters is unavailable, the aforementioned physics-guided initialization methods are not applicable. Therefore, we propose a data-driven initialization scheme that leverages ridge regression to deal with this situation. The motivation for using ridge regression is as follows. First, [32] shows that load demand and power generation data have similar rise and fall patterns in adjacent areas. The high correlations among the input variables lead to multicollinearity, weakening the performance of the simple linear regression model. Ridge regression is an effective method for analyzing data that suffer from multicollinearity [33]. It introduces $\ell_2$ regularization, which shrinks the regression coefficients to reduce model variance and avoid overfitting. Second, the initial weight parameters of NNs are typically random numbers following certain distributions. Small values are generally better than large ones for preventing gradients from exploding or vanishing [34]. Due to the penalty term, ridge regression tends to give smaller weights than simple linear regression. Therefore, we employ the coefficients obtained from ridge regression to initialize the shortcut connection layer, which provides better initial values for stabilizing and speeding up the training process.

With $n$ training samples, let $\mathbf{X} \in \mathbb{R}^{n\times(2N-N_g-2)}$ and $\mathbf{y}^o_i \in \mathbb{R}^{n\times 1}$ denote the input data and the $i$-th dimension of the output data, respectively. Ridge regression finds the coefficient vector $\mathbf{w}_i \in \mathbb{R}^{(2N-N_g-2)\times 1}$ and bias $b_i \in \mathbb{R}$ by solving the problem:

$\operatorname*{arg\,min}_{\mathbf{w}_i, b_i} \ L := \|\mathbf{y}^o_i - (\mathbf{X}\mathbf{w}_i + b_i\mathbf{1}_n)\|_2^2 + \lambda\|\mathbf{w}_i\|_2^2,$  (15)

where $\mathbf{1}_n \in \mathbb{R}^n$ is the all-ones column vector. $\mathbf{w}_i^{\top}$ and $b_i$ will be used to initialize the $i$-th row of $\mathbf{W}_s$ and the $i$-th entry of $\mathbf{b}_s$. The first term in the loss function (15) measures the fitting error over the data pairs, while the second term shrinks the estimated coefficients. $\lambda$ is a tuning parameter that balances the relative strength of the least-squares error and the penalty term. When $\lambda = 0$, (15) reduces to least-squares linear regression.

The closed-form solution to (15) is derived as follows. We can rewrite the objective function $L$ as:

$L = \mathbf{w}_i^{\top}(\mathbf{X}^{\top}\mathbf{X} + \lambda\mathbf{I})\mathbf{w}_i - 2\mathbf{w}_i^{\top}\mathbf{X}^{\top}\mathbf{y}^o_i + 2b_i\mathbf{w}_i^{\top}\mathbf{X}^{\top}\mathbf{1}_n - 2b_i\mathbf{1}_n^{\top}\mathbf{y}^o_i + b_i^2\mathbf{1}_n^{\top}\mathbf{1}_n + \mathbf{y}^{o\top}_i\mathbf{y}^o_i,$  (16)

where $\mathbf{I} \in \mathbb{R}^{(2N-N_g-2)\times(2N-N_g-2)}$ is the identity matrix. The gradients of the convex objective function are

$\nabla_{\mathbf{w}_i}L = -2(\mathbf{X}^{\top}\mathbf{y}^o_i - \mathbf{X}^{\top}\mathbf{X}\mathbf{w}_i - b_i\mathbf{X}^{\top}\mathbf{1}_n - \lambda\mathbf{w}_i),$  (17)
$\nabla_{b_i}L = -2\mathbf{1}_n^{\top}\mathbf{y}^o_i + 2\mathbf{1}_n^{\top}\mathbf{X}\mathbf{w}_i + 2b_i\mathbf{1}_n^{\top}\mathbf{1}_n.$  (18)

Setting $\nabla_{\mathbf{w}_i}L = \mathbf{0}$ and $\nabla_{b_i}L = 0$, we get:

$\mathbf{w}_i = (\mathbf{X}^{\top}\mathbf{X} + \lambda\mathbf{I})^{-1}(\mathbf{X}^{\top}\mathbf{y}^o_i - b_i\mathbf{X}^{\top}\mathbf{1}_n),$  (19)
$b_i = \frac{1}{n}(\mathbf{1}_n^{\top}\mathbf{y}^o_i - \mathbf{1}_n^{\top}\mathbf{X}\mathbf{w}_i).$  (20)

Plugging (20) into (19) eliminates $b_i$ and yields the closed-form solution for the optimal $\mathbf{w}_i$:

$\mathbf{w}_i = \left(\mathbf{X}^{\top}\mathbf{X} + \lambda\mathbf{I} - \mathbf{X}^{\top}\mathbf{H}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}(\mathbf{I} - \mathbf{H})\mathbf{y}^o_i$  (21)

with $\mathbf{H} = \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^{\top} \in \mathbb{R}^{n\times n}$.
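The closed-form solution (20)-(21) translates directly into numpy; the synthetic regression data below are assumed for illustration:

```python
import numpy as np

def ridge_fit(X, yo, lam):
    """Closed-form ridge regression solution from eqs. (20)-(21)."""
    n, d = X.shape
    H = np.full((n, n), 1.0 / n)                         # H = (1/n) 1 1^T
    A = X.T @ X + lam * np.eye(d) - X.T @ H @ X
    w = np.linalg.solve(A, X.T @ (np.eye(n) - H) @ yo)   # eq. (21)
    b = (yo.sum() - (X @ w).sum()) / n                   # eq. (20)
    return w, b

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
yo = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.3 + 0.01 * rng.standard_normal(50)
w, b = ridge_fit(X, yo, lam=0.1)
```

The fit can be verified by checking that the gradients (17)-(18) vanish at the returned $(\mathbf{w}_i, b_i)$.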

IV Numerical Results

With different datasets, the effectiveness of our proposed approaches is verified on the IEEE-30, IEEE-118, IEEE-300 [35], SouthCarolina-500 [36], and PEGASE-1354 [37] bus systems. Detailed results and insights are presented in this section.

IV-A Simulation Setup

IV-A1 Test systems and datasets

We use synthetic and real-world data for a comprehensive evaluation. The real-world load demand and power generation data are provided by the Global Energy Forecasting Competition 2012 [38] and PV plants installed in California [39], respectively. They are scaled to match the system capacity and avoid violations. Since the number of real-world data samples is insufficient for our simulated systems, we generate synthetic data as a supplement. For the IEEE-30 bus system, the active power generation is modeled as a multivariate Gaussian distribution; the ratio of standard deviation to mean value is 0.2, and the correlation coefficient between different PV buses is 0.2. The ratios of standard deviation to mean value of the load demands are 0.1, 0.01, and 0.2 for the IEEE-118, IEEE-300, and SouthCarolina-500 bus systems, respectively [40]. The correlation coefficient is 0.8 for injections at the same bus and 0.2 between different load buses. For the PEGASE-1354 bus system, the ratio of standard deviation to mean value is 0.1 for both active power generation and load demands. To consider various renewable energy uncertainties, we add six wind generators following the Weibull distribution and eight PV generators following the beta distribution. Besides, for the IEEE-118 bus system, two generators may have power outages following the binomial distribution. We use IEEE-118$^*$ and PEGASE-1354$^*$ to denote those case studies. Finally, we conduct PF analysis using the NR solver in MATPOWER 7.0 to generate the input-output data pairs [35]. The voltage angles are measured in radians.
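The correlated Gaussian sampling described above can be sketched as follows; the mean demand values are placeholders, while the 0.1 std-to-mean ratio and 0.2 cross-correlation follow the setup in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.4, 0.6, 0.5])        # illustrative mean demands (p.u., assumed)
sigma = 0.1 * mu                      # std-to-mean ratio of 0.1
corr = np.full((3, 3), 0.2)           # cross-correlation between load buses
np.fill_diagonal(corr, 1.0)
cov = np.outer(sigma, sigma) * corr   # covariance from correlations and stds
samples = rng.multivariate_normal(mu, cov, size=10000)
```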

IV-A2 Methods for comparison

Based on the proposed framework shown in Fig. 1b, we adopt four different initialization schemes: random [41], data-driven PF model, linearized PF model, and Jacobian matrix model, referred to as Random, Data-driven, Linearized PF, and Jacobian, respectively. The random initialization serves as the baseline to show the benefits of our well-designed initialization methods. We also compare our work with the following existing approaches.

  • FC [19]: MLPs are used to learn the inverse AC-PF mappings. Model-based initialization methods and loss functions are proposed.

  • TPBNN [22]: MLPs are applied to learn the inverse AC-PF equations, together with an auxiliary task that rebuilds the forward PF mapping. The training loss function consists of both the estimation error and the reconstruction error.

  • ResNet [27]: ResNets consist of stacked residual building blocks. Each block replaces the identity mapping (cf. Fig. 1a) with a fully connected linear layer to improve the generalization capability. The output layer is also linear because voltage angles can be negative.

  • LPF [42]: AC-PF equations are linearized around the operating point by the first-order Taylor expansion.

  • SML [8]: The stochastic response surface method adopts the polynomial chaos expansion to model the relationship between the input and output random variables.

  • KNN: K nearest neighbor algorithm predicts the corresponding voltage phasors of new power injection samples based on their feature similarity measure to the training data points. It belongs to non-linear regressors and has various applications in power systems, e.g., state estimation, load forecasting, and fault detection [43].

  • RR: We employ ridge regression to learn the inverse AC-PF mappings (see eq. (15)).

  • SVR [13]: Support vector regression can approximate the non-linear AC-PF equations with the kernel trick.

  • QMC [12]: Uses low-discrepancy sequences to achieve a faster convergence rate than traditional MCS, which uses simple random sampling. It can reduce the computational burden by decreasing the number of samples without sacrificing accuracy.
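As a concrete sketch of the proposed framework in Fig. 1b, the following NumPy code builds an MLP with a fully connected linear shortcut from the input power injections to the output voltage phasors, and initializes the shortcut weights with the data-driven ridge-regression scheme (cf. eq. (15)). The class and function names are illustrative, not the authors' implementation:

```python
import numpy as np

class ShortcutMLP:
    """MLP plus a fully connected linear shortcut from the input power
    injections to the output voltage phasors. `dims` follows Table I,
    e.g., [53, 100, 100, 53] for the 30-bus case."""

    def __init__(self, dims, seed=0):
        rng = np.random.default_rng(seed)
        self.W = [rng.normal(0.0, np.sqrt(2.0 / d_in), (d_in, d_out))
                  for d_in, d_out in zip(dims[:-1], dims[1:])]
        self.b = [np.zeros(d) for d in dims[1:]]
        # Shortcut connection layer: input dim -> output dim.
        self.Ws = np.zeros((dims[0], dims[-1]))

    def init_shortcut_data_driven(self, X, Y, lam=1e-3):
        """Data-driven initialization: ridge regression from injections X
        to voltage phasors Y; no grid topology or line parameters needed."""
        A = X.T @ X + lam * np.eye(X.shape[1])
        self.Ws = np.linalg.solve(A, X.T @ Y)

    def forward(self, X):
        H = X
        for W, b in zip(self.W[:-1], self.b[:-1]):
            H = np.maximum(H @ W + b, 0.0)       # ReLU hidden layers
        residual = H @ self.W[-1] + self.b[-1]   # linear output layer
        return X @ self.Ws + residual            # shortcut + non-linear correction
```

With the shortcut carrying the (approximately linear) PF mapping, the stacked layers only need to learn the non-linear residual, which is what lets the designed initializations converge quickly.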

IV-A3 Details of training

The Adam optimizer with a mini-batch size of 32 is used for training. Table I shows the network structure and the dataset size of the five benchmark systems. The validation dataset is used to tune the hyperparameters. The mean square error (MSE) is adopted as the loss function. The training process stops when the validation loss has stabilized with no further improvement [44]. We train and test all models five times to mitigate randomness.
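The early-stopping rule ("stop when the validation loss has stabilized") can be sketched as a small patience-based helper; the class name and parameters are our own illustrative assumptions:

```python
class EarlyStopping:
    """Stop training when the validation loss shows no further improvement,
    as in [44]. `patience` counts epochs without a new best validation
    loss before stopping; `min_delta` is the minimum improvement."""

    def __init__(self, patience=10, min_delta=1e-6):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Return True if training should stop after this epoch."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Inside the epoch loop one would call `if stopper.step(val_loss): break` after each validation pass.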

TABLE I: Hyperparameters of the NN structures. The second column indicates the number of neurons in each layer. The last column shows the size of each dataset.
Cases Structure [Training, Validation, Testing]
30 [53 100 100 53] [12k, 4k, 4k]
118 [181 300 300 181] [20k, 5k, 5k]
300 [530 200 200 200 530] [20k, 5k, 5k]
500 [909 300 300 300 300 909] [28k, 6k, 6k]
1354 [2447 300 300 300 300 2447] [34k, 8k, 8k]

IV-B Model Evaluation Criteria

IV-B1 Average root mean square error (ARMSE)

Let $\mathbf{O}, \hat{\mathbf{O}}\in\mathbb{R}^{M\times D}$ denote the matrices containing the true and estimated values, where $M$ and $D$ are the numbers of samples and output dimensions, respectively. $O_{i,j}$ and $\hat{O}_{i,j}$ are the $(i,j)$-th elements of $\mathbf{O}$ and $\hat{\mathbf{O}}$. The ARMSE is calculated by:

\text{ARMSE}=\frac{1}{D}\sum_{j=1}^{D}\sqrt{\frac{1}{M}\sum_{i=1}^{M}\big(\hat{O}_{i,j}-O_{i,j}\big)^{2}}\,. (22)

IV-B2 Mean absolute percentage error (MAPE)

The MAPE for the $j$-th element of the outputs is given as:

\text{MAPE}=\frac{1}{M}\sum_{i=1}^{M}\left|\frac{\hat{O}_{i,j}-O_{i,j}}{O_{i,j}}\right|\times 100\%\,. (23)
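Eqs. (22) and (23) translate directly into NumPy; a sketch (function names are ours):

```python
import numpy as np

def armse(O_hat, O):
    """Average RMSE over the D output dimensions, eq. (22)."""
    return np.mean(np.sqrt(np.mean((O_hat - O) ** 2, axis=0)))

def mape(O_hat, O, j):
    """MAPE (%) of the j-th output dimension, eq. (23)."""
    return 100.0 * np.mean(np.abs((O_hat[:, j] - O[:, j]) / O[:, j]))
```

For example, a uniform offset of 0.1 in every entry yields an ARMSE of exactly 0.1.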

IV-B3 Average Wasserstein distance (AWD)

Wasserstein distance measures the similarity between two probability distributions in the same metric space. Let $\rho_{j}$ and $\hat{\rho}_{j}$ be the distributions of the $j$-th column of $\mathbf{O}$ and $\hat{\mathbf{O}}$, respectively. The first-order Wasserstein distance loss $l_{\mathrm{wd}}$ between the two distributions is calculated by:

l_{\mathrm{wd}}(\hat{\rho}_{j},\rho_{j})=\inf_{\gamma\in\Gamma(\hat{\rho}_{j},\rho_{j})}\int_{\mathbb{R}\times\mathbb{R}}|x-y|\,\mathrm{d}\gamma(x,y)\,, (24)

where $\Gamma(\hat{\rho}_{j},\rho_{j})$ denotes the set of all measures on $\mathbb{R}\times\mathbb{R}$ whose marginal distributions are $\hat{\rho}_{j}$ and $\rho_{j}$ on the first and second factors, respectively. Then, the average loss over all output dimensions is:

\text{AWD}=\frac{1}{D}\sum_{j=1}^{D}l_{\mathrm{wd}}(\hat{\rho}_{j},\rho_{j})\,. (25)
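For 1-D empirical distributions with equal sample sizes, the first-order Wasserstein distance in eq. (24) reduces to the mean absolute gap between sorted samples, so the AWD of eq. (25) can be sketched as:

```python
import numpy as np

def awd(O_hat, O):
    """Average first-order Wasserstein distance over output dimensions,
    eqs. (24)-(25). Each column's W1 is computed via the 1-D closed form:
    sort both empirical samples and average the absolute gaps."""
    per_dim = np.mean(np.abs(np.sort(O_hat, axis=0) - np.sort(O, axis=0)), axis=0)
    return per_dim.mean()
```

This matches per-column calls to `scipy.stats.wasserstein_distance` when both columns have the same number of samples.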

IV-C Power Flow Analysis Results

Table II shows that our proposed approaches outperform the other methods in predicting the voltage magnitudes and angles for the PF analysis. We also make the following observations:

  • Classical non-linear regressors in the machine learning field, including KNN, RR, and SVR, have significant estimation errors in medium- and large-scale bus systems. These solvers cannot effectively extract the data features and recover the complicated mapping relationship in the inverse AC-PF equations.

  • The NN-based approaches achieve more accurate predictions than the linear LPF method, the non-linear SML method, and classical non-linear regressors, especially in the voltage angle estimates.

  • The performance of the FC method is comparable to that of the ResNet and Random approaches. This shows that only adopting the residual framework offers no apparent advantage over the vanilla MLP structure.

  • The three designed initialization schemes are conceptually simple, yet they significantly outperform the FC and random initialization methods in the accuracy of the voltage phasor estimates.

In addition, Fig. 2 and Fig. 3 show the MAPEs of voltage angles and magnitudes on the IEEE-118 bus system. The average MAPEs of voltage angles and magnitudes are 0.028% and 0.0042% for the Data-driven method, 0.023% and 0.0065% for the Linearized PF method, and 0.023% and 0.0042% for the Jacobian method.

Table III shows the errors of the branch flow calculations. The Data-driven, Linearized PF, and Jacobian schemes achieve the best performance among all the NN-based solvers. However, on the IEEE-300 and PEGASE-1354 bus systems, the LPF and RR methods obtain accurate branch flow calculations despite their poor performance in voltage phasor estimation. The reason is as follows. The mapping from voltage angles to voltage angle differences is not bijective, and branch flows between two buses depend on the voltage angle differences rather than the individual angles. However, the outputs of our proposed models are voltage angles instead of voltage angle differences. Therefore, our methods may predict $\theta_{i}$ and $\theta_{j}$ accurately, while there is no guarantee of a similar level of accuracy for the $\theta_{ij}$ estimate. This property further affects the calculation accuracy of branch flows [21].
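A small numerical example illustrates the point. Under the lossless-line approximation $P_{ij} = V_i V_j \sin(\theta_i - \theta_j)/X$ (our simplification for illustration only; the tested systems are not lossless), tiny absolute errors in $\theta_i$ and $\theta_j$ can still produce a large relative error in $\theta_{ij}$ and hence in the branch flow:

```python
import numpy as np

# Illustrative per-unit numbers, not taken from the tested systems.
X = 0.05                                  # branch reactance
V_i, V_j = 1.0, 1.0
theta_i, theta_j = 0.100, 0.080           # true angles (rad)
theta_i_hat, theta_j_hat = 0.101, 0.079   # each off by only 0.001 rad (1%)

p_true = V_i * V_j * np.sin(theta_i - theta_j) / X
p_hat = V_i * V_j * np.sin(theta_i_hat - theta_j_hat) / X

# Per-angle errors are 1%, but the angle-difference error is 10%,
# which carries over almost linearly to the branch flow estimate.
rel_err = abs(p_hat - p_true) / p_true
print(rel_err)  # ~0.10
```

Accurate individual angle estimates therefore do not guarantee accurate angle-difference, and hence branch-flow, estimates.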

TABLE II: ARMSEs of the voltage phasor calculations for different cases ($10^{-4}$).
Cases Voltage phasor LPF SML KNN RR SVR FC TPBNN ResNet Random Data-driven Linearized PF Jacobian
30 angle 188.40 479.04 110.45 3.66 1.51 2.96 6.29 2.17 2.06 1.43 1.35 1.94
magnitude 13.72 41.33 25.21 1.78 1.27 2.17 5.20 1.44 1.23 1.08 1.02 1.16
118 angle 289.31 1661.08 342.04 24.33 5.61 7.36 61.37 8.14 7.14 2.70 2.46 2.72
magnitude 11.93 20.95 8.39 1.43 4.36 4.07 9.84 5.09 2.93 0.79 1.24 0.85
118* angle 87.67 1754.87 385.79 17.44 5.76 7.50 15.00 10.82 5.54 2.55 2.71 2.48
magnitude 5.03 17.54 8.57 1.55 4.40 5.30 9.94 6.80 2.10 0.76 0.80 0.77
300 angle 302.38 2476.92 579.51 108.55 324.10 21.71 32.66 35.11 23.59 12.87 7.80 6.96
magnitude 27.94 24.76 9.73 11.23 9.63 5.58 8.20 8.81 10.65 2.42 2.18 1.94
500 angle 309.28 4284.63 401.38 259.15 239.32 16.18 798.32 13.40 7.83 8.95 6.93 8.26
magnitude 58.46 357.44 46.81 46.68 238.68 10.91 15.13 8.48 3.35 3.67 3.31 3.43
1354 angle 207.19 6145.52 1189.77 164.72 2973.84 32.20 - 30.84 13.79 12.77 9.13 8.78
magnitude 7.82 48.95 16.78 6.59 60.23 10.23 - 10.11 15.53 4.34 3.53 3.54
1354* angle 80.47 4707.09 374.57 65.61 2059.95 18.88 - 29.85 21.15 6.74 5.11 5.20
magnitude 3.84 45.73 11.00 3.58 36.48 10.67 - 12.05 12.31 3.50 3.31 3.13
TABLE III: ARMSEs of the branch flow calculations for different cases
Cases Branch flow LPF SML KNN RR SVR FC TPBNN ResNet Random Data-driven Linearized PF Jacobian
30 active 3.998 5.049 1.329 0.038 0.080 0.108 0.618 0.057 0.029 0.028 0.031 0.028
reactive 1.064 1.774 0.660 0.048 0.088 0.103 0.457 0.056 0.036 0.034 0.041 0.033
118 active 2.836 19.944 4.472 0.282 0.294 0.391 3.585 0.745 0.255 0.071 0.105 0.077
reactive 1.341 4.579 1.063 0.153 0.643 0.321 1.654 0.440 0.221 0.067 0.108 0.065
118* active 0.678 21.288 5.004 0.156 0.347 0.588 4.235 0.924 0.204 0.089 0.070 0.078
reactive 0.710 4.447 1.173 0.191 0.611 0.403 1.719 0.529 0.172 0.075 0.065 0.063
300 active 0.930 14.010 2.889 0.209 2.212 1.718 2.430 4.526 5.146 0.496 0.557 0.508
reactive 2.314 3.566 0.911 0.686 1.239 0.924 2.080 1.817 2.869 0.471 0.452 0.427
500 active 1.658 37.855 7.821 1.427 9.974 4.394 17.896 2.834 0.970 0.756 0.767 0.741
reactive 3.843 18.909 3.899 3.095 21.054 3.015 81.021 2.091 1.205 0.909 0.798 0.859
1354 active 1.103 62.524 18.677 1.131 42.54 16.164 - 15.843 11.94 7.066 5.808 5.922
reactive 2.197 15.189 4.201 1.857 20.48 7.800 - 8.189 19.65 6.841 5.395 5.406
1354* active 0.513 44.810 8.420 0.420 50.087 7.660 - 19.056 34.828 3.986 4.055 3.923
reactive 0.996 9.812 2.252 0.812 21.279 10.821 - 14.658 23.948 5.626 5.033 5.305
Figure 2: MAPEs of the voltage angles for the IEEE-118 bus system.
Figure 3: MAPEs of the voltage magnitudes for the IEEE-118 bus system.

IV-D Comparison Results of Probabilistic Characteristics

IV-D1 Wasserstein distance comparison

The Wasserstein distance measures the minimum effort in transforming the probability mass from one distribution to the other. It quantifies the distribution difference between the model outputs and the targets [45]. Table IV and Table V show the AWD of voltage phasor and branch flow distributions. Two physics-guided initialization schemes achieve the best performance among all the methods.

TABLE IV: AWDs of the voltage phasor distributions for different cases ($10^{-4}$).
Cases Voltage phasor LPF SML KNN RR SVR FC TPBNN ResNet Random Data-driven Linearized PF Jacobian
30 angle 152.99 33.50 45.84 2.07 0.67 0.86 2.03 0.70 0.76 0.53 0.49 0.70
magnitude 10.73 4.34 13.70 0.59 0.57 0.51 1.99 0.35 0.37 0.28 0.25 0.32
118 angle 257.96 66.79 158.91 12.7 2.64 1.96 10.43 2.22 2.48 0.92 0.86 0.91
magnitude 10.17 0.97 3.64 0.58 2.26 0.94 2.98 1.32 0.74 0.22 0.52 0.20
118* angle 78.95 109.10 175.09 11.12 2.65 2.04 3.55 2.70 2.03 0.96 0.98 0.89
magnitude 3.88 1.21 4.00 0.46 2.15 1.09 2.80 1.41 0.45 0.24 0.20 0.22
300 angle 266.14 212.63 144.58 69.72 84.50 4.57 5.81 7.07 4.75 3.49 2.36 2.22
magnitude 22.44 3.08 3.46 4.08 4.23 1.29 2.53 1.78 3.14 0.60 0.62 0.52
500 angle 167.55 93.60 199.68 160.44 191.47 3.26 59.16 2.82 1.75 2.44 1.71 1.82
magnitude 35.30 10.81 19.49 29.06 221.55 1.62 3.24 1.22 1.05 1.19 0.97 0.89
1354 angle 108.80 85.46 659.17 101.01 2364.51 5.79 - 5.66 5.22 3.92 2.79 2.72
magnitude 4.75 1.05 9.23 2.40 55.00 2.26 - 2.75 5.32 1.53 1.04 1.09
1354* angle 47.65 59.32 195.80 42.88 1621.93 3.76 - 4.78 3.76 2.11 1.44 1.39
magnitude 2.36 0.60 4.32 1.69 31.44 1.59 - 2.50 3.21 1.40 1.27 1.26
TABLE V: AWDs of the branch flow distributions for different cases.
Cases Branch flow LPF SML KNN RR SVR FC TPBNN ResNet Random Data-driven Linearized PF Jacobian
30 active 3.293 0.382 0.625 0.019 0.022 0.024 0.246 0.019 0.013 0.016 0.018 0.012
reactive 0.871 0.166 0.319 0.014 0.033 0.025 0.194 0.014 0.012 0.014 0.025 0.010
118 active 2.420 1.291 1.988 0.145 0.103 0.121 1.740 0.213 0.147 0.045 0.080 0.039
reactive 1.144 0.326 0.456 0.074 0.404 0.117 0.715 0.160 0.115 0.039 0.085 0.030
118* active 0.560 2.059 2.414 0.094 0.102 0.121 2.090 0.195 0.092 0.063 0.038 0.045
reactive 0.556 0.441 0.548 0.093 0.333 0.102 0.760 0.141 0.069 0.048 0.035 0.036
300 active 0.781 1.563 1.010 0.134 1.069 0.644 1.043 2.066 2.894 0.249 0.268 0.256
reactive 1.880 0.433 0.296 0.325 0.754 0.396 1.049 0.907 2.073 0.347 0.289 0.296
500 active 0.924 0.690 4.026 0.860 7.330 1.273 6.772 0.922 0.540 0.395 0.356 0.356
reactive 2.278 0.556 1.869 1.935 18.031 1.176 7.537 0.803 0.922 0.681 0.569 0.630
1354 active 0.611 1.162 10.157 0.629 28.185 7.089 - 6.815 9.806 4.493 3.077 3.148
reactive 1.274 5.949 6.703 0.890 20.103 4.523 - 5.277 14.079 5.105 3.497 3.589
1354* active 0.303 0.611 4.043 0.257 28.666 3.098 - 11.106 25.222 3.140 3.132 3.092
reactive 0.589 0.142 1.002 0.441 18.318 2.600 - 7.301 18.161 5.101 4.412 4.793
Figure 4: PDFs of the voltage angle of bus 1 for the IEEE-300 bus system.
Figure 5: PDFs of the voltage magnitude of bus 16 for the IEEE-300 bus system.
Figure 6: PDFs of the voltage angle of bus 1 for the SouthCarolina-500 bus system.
Figure 7: PDFs of the voltage magnitude of bus 1 for the SouthCarolina-500 bus system.
Figure 8: The training loss evolution (starting from when the first epoch’s training is done) for the IEEE-30 bus system.
Figure 9: The training loss evolution (starting from when the first epoch’s training is done) for the IEEE-118 bus system.
(a) Initial weights
(b) Weights after training 500 epochs
Figure 10: Weights of the shortcut connection linear layer of the Random method for the IEEE-30 bus system.
(a) Initial weights
(b) Weights after training 500 epochs
Figure 11: Weights of the shortcut connection linear layer of the data-driven method for the IEEE-30 bus system.

IV-D2 PDF estimates accuracy comparison

Fig. 4 and Fig. 5 show the estimated and target PDFs. The voltage magnitude of bus 16 has the largest standard deviation, meaning its values are spread over a broader range. Therefore, its PDF estimation accuracy is a good indicator for comparing different approaches. As shown in Fig. 5, the PDFs estimated by our proposed approaches closely match the ground-truth PDF. In addition, it is worth pointing out that the PDF estimates obtained by the LPF, SML, and RR methods exhibit significant errors.

Furthermore, the accuracy of MCS can be improved by feeding in more data samples, thus providing more precise PDF estimates. Employing a fixed number of iterations or establishing a threshold for the variance is a practical stopping criterion for the MCS [46]. Table VI shows the variance coefficients of voltage phasors for different sampling numbers of the MCS method. Nonetheless, the variance coefficient primarily reflects the mean and the variance. Achieving a precise PDF estimate may require a larger dataset than is needed for accurate mean and variance estimation. Consequently, even if the variance coefficient converges, the PDF estimate could still lack the desired level of accuracy.
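A minimal sketch of such a stopping rule uses the variance coefficient (coefficient of variation) of the sample-mean estimator, $\beta = s/(\bar{x}\sqrt{N})$; the function and its parameters are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def mcs_with_cov_stopping(sample_fn, beta_max=0.01, batch=500,
                          max_samples=50000, seed=0):
    """Draw Monte Carlo samples in batches until the variance coefficient
    of the estimator, beta = s / (|mean| * sqrt(N)), falls below `beta_max`.
    `sample_fn(rng, n)` returns n draws of the quantity of interest
    (e.g., a voltage magnitude from repeated PF solves)."""
    rng = np.random.default_rng(seed)
    draws = np.empty(0)
    mean = 0.0
    while draws.size < max_samples:
        draws = np.concatenate([draws, sample_fn(rng, batch)])
        mean = draws.mean()
        if mean != 0:
            beta = draws.std(ddof=1) / (abs(mean) * np.sqrt(draws.size))
            if beta < beta_max:
                break
    return mean, draws.size
```

For instance, with draws from a distribution whose std/mean ratio is 0.2, the rule stops once roughly $ (0.2/0.01)^2 = 400$ samples have been collected, regardless of whether the full PDF shape has converged.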

In addition, Fig. 6 and Fig. 7 show the PDFs estimated by the QMC method with different sampling numbers. The QMC method aims to reduce the number of samples while guaranteeing accurate PDF estimates. However, as the number of samples decreases, the estimated PDFs gradually drift away from the ground-truth PDF. Therefore, the sampling points cannot be reduced significantly without losing accuracy. Besides, the computational complexity of the NR solver with $k$ iterations is $\mathcal{O}(k\times N^{1.4})$ for each sample [47]. In short, the improved sampling algorithm QMC cannot significantly reduce the total computational time because enough samples and the NR solver are still required. In contrast, our proposed approaches can achieve accurate PDF estimation.

TABLE VI: The variance coefficients of voltage phasors using different sampling numbers of the MCS for different cases.
Cases 2000 samples 3000 samples 4000 samples 5000 samples
30 0.079 0.063 0.055 0.049
118 0.005 0.004 0.004 0.003
300 1.805 12.914 2.280 1.240
500 0.033 0.026 0.022 0.020
1354 5.783 20.016 2.718 5.971

IV-E Risk assessment

The performance of the proposed approaches is evaluated based on risk assessment, which is critical to power system operation. Tables VII and VIII show the probabilities of exceeding the operational limits of the voltage magnitude and the apparent branch flow on the SouthCarolina-500 bus system [48]. Compared with the MCS method, our proposed approaches achieve promising results in capturing violation probabilities on the order of $10^{-3}$. Furthermore, using the violation metrics for voltage magnitude and apparent branch flow, Tables IX and X present the 95% confidence intervals derived from the MCS alongside the average violation values yielded by our proposed methods. Notably, the mean violation degrees estimated by our proposed approaches consistently fall within the 95% confidence intervals established by the MCS technique.

TABLE VII: The lower bound constraint violation probabilities ($10^{-2}$) of the voltage magnitude on the SouthCarolina-500 bus system.
Bus index MCS LPF SML KNN RR SVR FC TPBNN ResNet Random Data-driven Linearized PF Jacobian
12 3.78 0.53 3.83 2.28 1.45 17.98 3.80 3.86 3.75 3.76 3.78 3.80 3.78
13 3.85 0.65 4.01 2.38 1.56 18.48 3.88 3.88 3.86 3.88 3.88 3.86 3.86
14 0.23 0 0.03 0 0 0 0.25 0.31 0.23 0.23 0.23 0.23 0.23
15 0.95 0 0.61 0.13 0 2.70 0.97 1.15 0.95 0.95 0.95 0.95 0.95
24 0.50 0 0.30 0 0 0.06 0.52 0.50 0.51 0.51 0.51 0.51 0.51
TABLE VIII: The constraint violation probabilities ($10^{-2}$) of the apparent branch flow on the SouthCarolina-500 bus system.
Branch index MCS LPF SML KNN RR SVR FC TPBNN ResNet Random Data-driven Linearized PF Jacobian
1 0.30 0.25 0.38 0 0.25 0.26 0.15 1.71 0.15 0.13 0.41 0.23 0.43
8 0.01 0.01 0.01 0 0.01 0 0.01 0.01 0.01 0.01 0.01 0.01 0.01
9 0.01 0.01 0.01 0 0.01 0 0.01 0.05 0.01 0 0.01 0.01 0.01
18 0.16 0.15 0.55 0 0.15 0.13 0.13 0.11 0.16 0.13 0.15 0.15 0.16
19 0.63 0.38 1.13 0 0.48 0.28 0.53 0.53 0.41 0.63 0.60 0.67 0.53
TABLE IX: The 95% confidence intervals of the voltage magnitude lower bound constraint violation values obtained by the MCS and the average violation values estimated by the proposed approaches on the SouthCarolina-500 bus system ($10^{-2}$).
Bus index MCS Random Data-driven Linearized PF Jacobian
12 [2.39, 3.08] 2.75 2.75 2.75 2.74
13 [2.44, 3.12] 2.77 2.77 2.80 2.78
14 [0.62, 1.42] 1.07 1.03 1.03 1.04
15 [1.86, 2.86] 2.41 2.43 2.40 2.42
24 [1.54, 2.48] 1.99 1.97 1.96 1.99
TABLE X: The 95% confidence intervals of the apparent branch flow upper bound constraint violation values obtained by the MCS and the average violation values estimated by the proposed approaches on the SouthCarolina-500 bus system.
Branch index MCS Random Data-driven Linearized PF Jacobian
1 [0.03, 0.09] 0.05 0.07 0.08 0.07
4 [1.33, 1.39] 1.34 1.37 1.32 1.36
10 [6.07, 6.20] 6.11 6.19 6.15 6.15
18 [0.10, 0.37] 0.22 0.21 0.19 0.19
19 [0.11, 0.26] 0.20 0.22 0.18 0.19

In addition, one common stopping criterion of the MCS method requires the variance coefficient of the violation probability to be smaller than 1% [49]. In our simulations, accurate estimates of the voltage magnitude violations and the apparent branch flow violations require at least 2700 and 5100 samples, respectively. Thus, the cumulative computational time of running Monte Carlo power flow simulations across thousands of instances can be substantial. By contrast, our proposed methods can rapidly predict voltage phasors for thousands of data samples.

IV-F Computational Efficiency and Convergence Rate

The simulations are implemented on an iMac with an i7-8007 CPU and 32 GB RAM and on a Linux server with an NVIDIA Tesla K20-5 GB GPU. The NNs are trained via PyTorch 1.7.1 in Python 3.7. Table XI shows the NN training time per epoch. Residual building blocks need slightly more time due to the extra propagation through the shortcut connection layer. Table XII shows that our proposed approaches significantly reduce the total computational time compared with the MCS and QMC (with 3000 sampling points) algorithms.

TABLE XI: Training time per epoch for different cases (seconds).
Cases FC TPBNN ResNet Proposed approaches
30 0.87 2.32 1.13 1.07
118 1.55 4.11 1.90 1.91
300 2.13 6.27 1.97 2.37
500 3.57 15.66 4.57 3.74
1354 4.58 - 6.10 5.77
TABLE XII: Testing time comparison for different cases (seconds) and the computation time ratio of MCS to our proposed approaches.
Cases MCS QMC LPF SML KNN RR SVR FC TPBNN ResNet Proposed approaches Acceleration ratio
30 7.88 6.34 0.01 0.09 0.08 0.00 71.28 0.00 0.01 0.00 0.00 1136
118 14.95 9.67 0.04 0.44 0.36 0.02 173.83 0.00 0.22 0.00 0.00 1657
300 32.11 21.96 0.22 0.63 0.96 0.05 3028.93 0.01 0.62 0.01 0.01 3642
500 48.48 26.60 0.34 0.93 2.77 0.11 322.13 0.01 1.87 0.01 0.01 2921
1354 125.47 51.60 2.75 6.59 9.34 0.75 127.81 0.04 - 0.05 0.04 2598

The learning rate affects the convergence rate of the NN training loss. A large learning rate speeds up convergence but may cause the NN weights to converge to a suboptimal solution. Hence, we adopt a relatively small learning rate of $10^{-4}$; note that this small learning rate is used only in this part to illustrate the convergence properties. The MSE evolution of the training loss during the first 500 epochs is shown in Fig. 8 and Fig. 9. After the first epoch, the MSEs of our proposed approaches are more than two orders of magnitude lower than those of the other methods. Moreover, even after 500 epochs, the other NN-based methods cannot reach the same loss level as ours. Therefore, our proposed approaches have a significant advantage in convergence rate, which is attractive under limited training time.

In addition, the three designed initialization schemes converge faster than random initialization. We therefore examine how the designed initialization methods influence the NN weight updates during training. Fig. 10a shows the randomly distributed parameters of the shortcut connection layer; after 500 training epochs, the parameter pattern is still very random, as shown in Fig. 10b. In contrast, Fig. 11 shows that the pattern of the parameters after training remains quite similar to that of the initial parameters. This phenomenon indicates that the learnable parameters are only finely tuned around the initial weights during training. Therefore, we conclude that our well-designed initial weights play a critical role in NN training.

V Conclusion

This paper proposes a novel residual learning NN framework with three different initialization schemes for rapid PF analysis, which significantly reduces the total computational time of PPF analysis. Traditional PF analysis relies on the NR solver to solve the AC-PF equations iteratively until convergence. The widely installed PMUs and SCADA systems can collect abundant measurements, which motivates the use of NNs for real-time PF analysis.

Inspired by the residual building block, we introduce a shortcut connection linear layer between the input power injections and the output voltage phasors in the MLP structure. Our proposed framework learns the non-linear correction to the linearized AC-PF equations instead of directly fitting the original non-linear AC-PF equations. In addition, the absolute values of the voltage angle differences between connected buses are typically small, while the voltage magnitudes are slightly perturbed around 1 per unit. Based on this property, we develop three initialization schemes for different scenarios. Two model-based schemes (the linearized PF and Jacobian methods) require knowledge of the network topology and line parameters. If this information is missing or inaccurate, the data-driven approach is a good alternative. Extensive simulation results show that our proposed approaches improve the estimation accuracy and significantly speed up training compared with existing methods.

References

  • [1] M. Bazilian, I. Onyeji, M. Liebreich, I. MacGill, J. Chase, J. Shah, D. Gielen, D. Arent, D. Landfear, and S. Zhengrong, “Re-considering the economics of photovoltaic power,” Renewable Energy, vol. 53, pp. 329–338, 2013.
  • [2] X. Fu, H. Sun, Q. Guo, Z. Pan, X. Zhang, and S. Zeng, “Probabilistic power flow analysis considering the dependence between power and heat,” Applied Energy, vol. 191, pp. 582–592, 2017.
  • [3] B. Borkowska, “Probabilistic load flow,” IEEE Trans. on Power App. and Syst., vol. PAS-93, no. 3, pp. 752–759, 1974.
  • [4] M. Fan, V. Vittal, G. T. Heydt, and R. Ayyanar, “Probabilistic power flow studies for transmission systems with photovoltaic generation using cumulants,” IEEE Trans. on Power Syst., vol. 27, no. 4, pp. 2251–2261, 2012.
  • [5] X. Deng, J. He, and P. Zhang, “A novel probabilistic optimal power flow method to handle large fluctuations of stochastic variables,” Energies, vol. 10, no. 10, 2017.
  • [6] J. Morales and J. Pérez-Ruiz, “Point estimate schemes to solve the probabilistic power flow,” IEEE Trans. on Power Syst., vol. 22, pp. 1594 – 1601, 12 2007.
  • [7] Y. Che, X. Lv, X. Wang, J. Liu, and Y. Zhang, “Probabilistic load flow using an improved point estimate method considering wind generation,” in IEEE PES ISGT-Asia, 2019, pp. 4080–4085.
  • [8] X. Fu, Q. Guo, and H. Sun, “Statistical machine learning model for stochastic optimal planning of distribution networks considering a dynamic correlation and dimension reduction,” IEEE Transactions on Smart Grid, vol. 11, no. 4, pp. 2904–2917, 2020.
  • [9] X. Fu, “Statistical machine learning model for capacitor planning considering uncertainties in photovoltaic power,” Protection and Control of Modern Power Systems, vol. 7, p. 5, 2022.
  • [10] G. E. Constante-Flores and M. S. Illindala, “Data-driven probabilistic power flow analysis for a distribution system with renewable energy sources using monte carlo simulation,” IEEE Trans. on Industry Applications, vol. 55, no. 1, pp. 174–181, 2019.
  • [11] M. Hajian, W. D. Rosehart, and H. Zareipour, “Probabilistic power flow by monte carlo simulation with latin supercube sampling,” IEEE Trans. on Power Syst., vol. 28, no. 2, pp. 1550–1559, 2013.
  • [12] X. Xu and Z. Yan, “Probabilistic load flow calculation with quasi-monte carlo and multiple linear regression,” Int. J. of Electr. Power and Energy Syst., vol. 88, pp. 1–12, 2017.
  • [13] J. Yu, Y. Weng, and R. Rajagopal, “Robust mapping rule estimation for power flow analysis in distribution grids,” in 2017 North American Power Symposium, 2017, pp. 1–6.
  • [14] Y. Liu, N. Zhang, Y. Wang, J. Yang, and C. Kang, “Data-driven power flow linearization: A regression approach,” IEEE Trans. on Smart Grid, vol. 10, no. 3, pp. 2569–2580, 2019.
  • [15] Y. Liu, Y. Wang, N. Zhang, D. Lu, and C. Kang, “A data-driven approach to linearize power flow equations considering measurement noise,” IEEE Trans. on Smart Grid, vol. 11, no. 3, pp. 2576–2587, 2020.
  • [16] Y. Xu, Z. Hu, L. Mili, M. Korkali, and X. Chen, “Probabilistic power flow based on a gaussian process emulator,” IEEE Trans. on Power Syst., vol. 35, no. 4, pp. 3278–3281, 2020.
  • [17] P. Pareek and H. D. Nguyen, “A framework for analytical power flow solution using gaussian process learning,” IEEE Trans. on Sustainable Energy, vol. 13, no. 1, pp. 452–463, 2022.
  • [18] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
  • [19] Y. Yang, Z. Yang, J. Yu, B. Zhang, Y. Zhang, and H. Yu, “Fast calculation of probabilistic power flow: A model-based deep learning approach,” IEEE Trans. on Smart Grid, vol. 11, no. 3, pp. 2235–2244, 2020.
  • [20] M. Xiang, J. Yu, Z. Yang, Y. Yang, H. Yu, and H. He, “Probabilistic power flow with topology changes based on deep neural network,” Int. J. of Electr. Power and Energy Syst., vol. 117, p. 105650, 2020.
  • [21] K. Chen and Y. Zhang, “Variation-cognizant probabilistic power flow analysis via multi-task learning,” in IEEE PES ISGT-North America, 2022, pp. 1–5.
  • [22] X. Hu, H. Hu, S. Verma, and Z.-L. Zhang, “Physics-guided deep neural networks for power flow analysis,” IEEE Trans. on Power Syst., vol. 36, no. 3, pp. 2082–2092, 2021.
  • [23] V. Bolz, J. Rueß, and A. Zell, “Power flow approximation based on graph convolutional networks,” in 18th IEEE ICMLA, 2019, pp. 1679–1686.
  • [24] D. Wang, K. Zheng, Q. Chen, G. Luo, and X. Zhang, “Probabilistic power flow solution with graph convolutional network,” in IEEE PES ISGT-Europe, 2020, pp. 650–654.
  • [25] T. Falconer and L. Mones, “Leveraging power grid topology in machine learning assisted optimal power flow,” IEEE Trans. on Power Syst., pp. 1–13, 2022.
  • [26] A. Veit, M. J. Wilber, and S. J. Belongie, “Residual networks behave like ensembles of relatively shallow networks,” in Proc. 30th NeurIPS, Barcelona, Spain, Dec. 2016, pp. 550–558.
  • [27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. CVPR, Las Vegas, NV, USA, Jun. 2016, pp. 770–778.
  • [28] H. Lin and S. Jegelka, “Resnet with one-neuron hidden layers is a universal approximator,” in Proc. 32nd NeurIPS, vol. 31, Palais des Congrès de Montréal, Canada, Dec. 2018, p. 6172–6181.
  • [29] S. M. Fatemi, S. Abedi, G. B. Gharehpetian, S. H. Hosseinian, and M. Abedi, “Introducing a novel dc power flow method with reactive power considerations,” IEEE Trans. on Power Syst., vol. 30, no. 6, pp. 3012–3023, 2015.
  • [30] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proc. 14th Int. Conf. AISTATS, vol. 15, Fort Lauderdale, FL, USA, Apr. 2011, pp. 315–323.
  • [31] J. Yang, N. Zhang, C. Kang, and Q. Xia, “A state-independent linear power flow model with accurate estimation of voltage magnitude,” IEEE Trans. on Power Syst., vol. 32, no. 5, pp. 3607–3617, 2017.
  • [32] J. Zhang, L. Guan, and C. Y. Chung, “Instantaneous sensitivity identification in power systems - challenges and technique roadmap,” in IEEE PES GM, 2016, pp. 1–5.
  • [33] D. N. Schreiber-Gregory, “Ridge regression and multicollinearity: An in-depth review,” Model Assisted Statistics and Applications, vol. 13, no. 4, pp. 359–365, 2018.
  • [34] B. Hanin, “Which neural net architectures give rise to exploding and vanishing gradients?” in Proc. 32nd NeurIPS, Dec. 2018, p. 580–589.
  • [35] R. D. Zimmerman, C. E. Murillo-Sánchez, and R. J. Thomas, “Matpower: Steady-state operations, planning, and analysis tools for power systems research and education,” IEEE Trans. on Power Syst., vol. 26, no. 1, pp. 12–19, 2011.
  • [36] A. B. Birchfield, T. Xu, K. M. Gegner, K. S. Shetye, and T. J. Overbye, “Grid structural characteristics as validation criteria for synthetic networks,” IEEE Trans. on Power Syst., vol. 32, no. 4, pp. 3258–3265, 2017.
  • [37] C. Josz, S. Fliscounakis, J. Maeght, and P. Panciatici, “Ac power flow data in matpower and qcqp format: itesla, rte snapshots, and pegase,” arXiv preprint arXiv:1603.01533, 03 2016.
  • [38] T. Hong, P. Pinson, and S. Fan, “Global energy forecasting competition 2012,” Int. J. Forecast., vol. 30, no. 2, pp. 357–363, 2014.
  • [39] “Solar power data for integration studies.” [Online]. Available: https://www.nrel.gov/grid/solar-power-data.html
  • [40] P. P. Biswas, P. Suganthan, R. Mallipeddi, and G. A. Amaratunga, “Optimal reactive power dispatch with uncertainties in load demand and renewable energy sources adopting scenario-based approach,” Applied Soft Computing, vol. 75, pp. 616–632, 2019.
  • [41] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proc. IEEE ICCV, Santiago, Chile, Dec. 2015, pp. 1026–1034.
  • [42] Y. C. Chen, J. Wang, A. D. Domínguez-García, and P. W. Sauer, “Measurement-based estimation of the power flow jacobian matrix,” IEEE Trans. on Smart Grid, vol. 7, no. 5, pp. 2507–2515, 2016.
  • [43] S. M. Miraftabzadeh, M. Longo, F. Foiadelli, M. Pasetti, and R. Igual, “Advances in the application of machine learning techniques for power system analytics: A survey,” Energies, vol. 14, p. 4776, 08 2021.
  • [44] L. Prechelt, Early Stopping-But When?   Berlin, Heidelberg: Springer, 2012.
  • [45] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in Proc. 34th Int. Conf. on Machine Learning, vol. 70, Sydney, Australia, Aug. 2017, pp. 214–223.
  • [46] B. R. Prusty and D. Jena, “A critical review on probabilistic load flow studies in uncertainty constrained power systems with photovoltaic generation and a new approach,” Renewable and Sustainable Energy Reviews, vol. 69, pp. 1286–1302, 2017.
  • [47] F. Alvarado, “Computational complexity in power systems,” IEEE Trans. on Power App. and Syst., vol. 95, no. 4, pp. 1028–1037, 1976.
  • [48] Y. Xu, M. Korkali, L. Mili, X. Chen, and L. Min, “Risk assessment of rare events in probabilistic power flow via hybrid multi-surrogate method,” IEEE Trans. on Smart Grid, vol. 11, no. 2, pp. 1593–1603, 2020.
  • [49] A. M. Leite da Silva and A. M. de Castro, “Risk assessment in probabilistic load flow via monte carlo simulation and cross-entropy method,” IEEE Trans. on Power Syst., vol. 34, no. 2, pp. 1193–1202, 2019.