
Online Learning and Distributed Control for Residential Demand Response

Xin Chen,  Yingying Li, Jun Shimada, Na Li X. Chen, Y. Li and N. Li are with the School of Engineering and Applied Sciences, Harvard University, USA. Email: ([email protected], [email protected], [email protected]). J. Shimada is with ThinkEco Inc., New York, USA. Email: [email protected]. The work was supported by NSF CAREER: ECCS-1553407 and NSF EAGER: ECCS-1839632.
Abstract

This paper studies automated control methods for regulating air conditioner (AC) loads in incentive-based residential demand response (DR). The critical challenge is that customer responses to load adjustment are uncertain and unknown in practice. In this paper, we formulate the AC control problem in a DR event as a multi-period stochastic optimization that integrates the indoor thermal dynamics and the customer opt-out status transition. Specifically, machine learning techniques, including Gaussian process and logistic regression, are employed to learn the unknown thermal dynamics model and the customer opt-out behavior model, respectively. We consider two typical DR objectives for AC load control: 1) minimizing the total demand and 2) closely tracking a regulated power trajectory. Based on the Thompson sampling framework, we propose an online DR control algorithm that learns customer behaviors and makes real-time AC control decisions. This algorithm accounts for the influence of various environmental factors on customer behaviors and is implemented in a distributed fashion to preserve the privacy of customers. Numerical simulations demonstrate the control optimality and learning efficiency of the proposed algorithm.

Index Terms:
Online learning, uncertain customer behavior, distributed algorithm, incentive-based demand response.

I Introduction

Due to increasing renewable generation and growing peak load, electric power systems are increasingly prone to deficiencies in reserve capacity. As a typical example, in mid-August 2019, the Texas grid experienced record electricity demand and a severe reserve emergency caused by a heat wave and reduced wind generation. The electricity price soared to \$9/kWh at one point, and the Electric Reliability Council of Texas (ERCOT) issued a Level 1 Energy Emergency Alert to call upon voluntary energy conservation and all available generation sources [1]. To cope with such problems, demand response (DR) is an economical and sustainable solution that strategically motivates load adjustment from end users to meet the needs of power supply [2]. In particular, residential loads account for a large share of the total electricity usage (e.g., about 38\% in the U.S. [3]) and can release significant power flexibility to facilitate system operation through coordinated dispatch. Moreover, the widespread deployment of advanced meters, smart plugs, and built-in controllers enables remote monitoring and control of electric appliances with two-way communications between households and load service entities (LSEs). This makes it technically feasible to implement residential DR, and well-designed DR control algorithms are needed to fully exploit the potential flexibility.

The mechanisms for residential DR are mainly categorized as “price-based” and “incentive-based” [4]. The price-based DR programs [5] use various pricing schemes, such as time-of-use pricing, critical peak pricing, and real-time pricing, to influence and guide electricity usage. In incentive-based DR programs [6, 7], the LSEs recruit customers to adjust their load demands in DR events with financial incentives, e.g., cash, coupons, raffles, rebates, etc. A typical residential DR event consists of two periods: 1) the preparation period, when the LSEs anticipate an upcoming load peak or system emergency and call upon customers to participate with incentives (day-ahead or hours-ahead); 2) the load adjustment period, when the electric appliances of participating customers are controlled to achieve certain DR goals. Meanwhile, customers are allowed to opt out and override the control commands from the LSEs if unsatisfied (e.g., by clicking the “opt out” button in a smartphone app). In the end, the LSEs pay customers according to their actual contributions. In practice, the load adjustment period usually lasts for a few hours, and the control cycle of electric appliances varies from 5 to 30 minutes depending on the actual DR setting, which is enabled by advanced meters with second- or minute-level sampling rates. This paper focuses on the real-time control of electric appliances during the load adjustment period from the perspective of LSEs.

According to the investigations in [8, 9], offering the override (opt-out) option can greatly enhance the customers’ acceptance of direct load control, which is even more effective than financial incentives. Hence, customers are generally authorized to have the opt-out option in modern residential DR programs. However, the customer opt-out behaviors are uncertain and unknown to LSEs in practice, which brings significant challenges to the real-time DR control. References [8, 9, 10, 11] indicate that customer DR behaviors are influenced by individual preference and environmental factors. The individual preference relates to customers’ intrinsic socio-demographic characteristics, e.g. income, education, age, attitude to energy saving, etc. The environmental factors refer to real-time externalities such as electricity price, indoor temperature, offered incentive, weather conditions, etc. Without considering customers’ actual willingness, a blind DR control scheme may lead to high opt-out rates and inefficient load adjustment.

To address the uncertainty issue, data-driven learning techniques can be employed to learn customer DR behaviors from historical data and through online interactions and observations. Comprehensive reviews on the application of reinforcement learning (RL) for DR are provided in [12, 13]. References [14, 15, 16] design home energy management systems to optimally schedule electric appliances under time-varying electricity prices, where $Q$-learning is used to learn customer preferences and make rescheduling decisions. In [17], a real-time DR strategy is presented for the optimal arrangement of home appliances based on deep RL and policy search algorithms, considering the uncertainty of the resident's behavior, electricity price, and outdoor temperature. Reference [18] applies batch RL to dispatch thermostatically controlled loads (TCLs) with the exploitation of exogenous data for day-ahead scheduling. Reference [19] proposes an incentive-based DR algorithm using RL and deep neural networks to assist LSEs in designing optimal incentive rates under uncertain energy prices and demands. In [20, 21], the multi-armed bandit method and its variants are adopted to select the right customers for DR participation in the preparation period to deal with unknown customer responses. However, for real-time load control in incentive-based DR, most existing works do not consider or overly simplify the uncertainty of customer behaviors, and the influence of various environmental factors is generally neglected. This causes significant mismatches between theory and practice. Hence, the development of real-time DR control algorithms that take customer opt-out behaviors into account remains largely an open problem.

Contribution. This paper studies incentive-based residential DR programs that control air conditioner (AC) loads to optimize certain DR performance objectives, e.g., minimizing the total AC load or closely tracking a target power trajectory. To this end, we propose a novel framework that models real-time DR control as a multi-period stochastic optimization problem integrating thermal dynamics and customer behavior transitions. In particular, Gaussian process (GP) regression [22] is adopted to build a non-parametric indoor thermal dynamics model from historical metering data, and logistic regression [23] is used to model the customer opt-out behaviors under the influence of environmental factors. Based on the Thompson sampling (TS) framework [24], we develop a distributed online DR control algorithm to learn the customer opt-out behaviors and make real-time AC power control decisions. The main merits of the proposed algorithm are summarized as follows:

  • 1)

    The individual preferences of customers and time-varying environmental factors are taken into account, which improves the predictions on customer opt-out behaviors and leads to efficient AC control schemes.

  • 2)

    This algorithm is implemented in a distributed manner and thus can be directly embedded in local household AC appliances, smart plugs, or smart phone apps. Moreover, the communication burdens are mitigated and the customer privacy can be preserved.

  • 3)

    Inheriting the merits of TS, this algorithm has a convenient decomposition structure of learning and optimization, and strikes an effective balance between exploration and exploitation in the online learning process.

The remainder of this paper is organized as follows: Section II provides a preliminary introduction on GP and TS. Section III presents the optimal AC control models with the learning techniques. Section IV develops the distributed online learning and AC control algorithm. Numerical tests are carried out in Section V, and conclusions are drawn in Section VI.

II Preliminaries on Learning Techniques

This section provides a preliminary introduction on the two key learning techniques used in this paper, i.e., Gaussian process and Thompson sampling.

II-A Gaussian Process

Gaussian process is a non-parametric supervised machine learning method [22] that has been widely used to model nonlinear system dynamics [27]. A formal definition of a GP over a function $f(\bm{x})$ is that any finite number of function realizations $(f(\bm{x}_{1}),f(\bm{x}_{2}),f(\bm{x}_{3}),\cdots)$ are random variables following a joint Gaussian distribution, which is fully specified by the mean function $m(\bm{x})$ and the (kernel) covariance function $k(\bm{x},\bm{x}^{\prime})$. Consider learning an unknown function $y=f(\bm{x})$ based on a training dataset $\mathcal{D}$ of $n$ noisy observations, i.e., $\mathcal{D}:=\{(\bm{x}_{i},\hat{y}_{i})\,|\,i=1,\cdots,n\}$. The goal is to infer the function value $f(\bm{x}_{*})$ at a new point $\bm{x}_{*}$. Denote $\hat{\bm{y}}_{\mathcal{D}}:=(\hat{y}_{i})_{i=1}^{n}$ and $\bm{f}_{\mathcal{D}}:=(f(\bm{x}_{i}))_{i=1}^{n}$. By the GP definition, $(\bm{f}_{\mathcal{D}},f(\bm{x}_{*}))$ are random variables that follow the joint Gaussian distribution

\begin{bmatrix}\bm{f}_{\mathcal{D}}\\ f(\bm{x}_{*})\end{bmatrix}\sim\mathcal{N}\begin{pmatrix}\begin{bmatrix}\bm{m}_{\mathcal{D}}\\ m(\bm{x}_{*})\end{bmatrix},\,\begin{bmatrix}K_{\mathcal{D},\mathcal{D}}&\bm{k}_{*,\mathcal{D}}\\ \bm{k}_{*,\mathcal{D}}^{\top}&k(\bm{x}_{*},\bm{x}_{*})\end{bmatrix}\end{pmatrix},

where vector $\bm{k}_{*,\mathcal{D}}:=[k(\bm{x}_{*},\bm{x}_{1}),\cdots,k(\bm{x}_{*},\bm{x}_{n})]^{\top}$ and $\bm{m}_{\mathcal{D}}:=(m(\bm{x}_{i}))_{i=1}^{n}$. $K_{\mathcal{D},\mathcal{D}}\in\mathbb{R}^{n\times n}$ is the covariance matrix, whose $ij$-th component is $k(\bm{x}_{i},\bm{x}_{j})$. Conditioning on the given observations $\mathcal{D}$, the posterior distribution of $f(\bm{x}_{*})|(\bm{f}_{\mathcal{D}}=\hat{\bm{y}}_{\mathcal{D}})$ is also Gaussian, i.e., $\mathcal{N}(\mu_{*|\mathcal{D}},\sigma^{2}_{*|\mathcal{D}})$, with the closed form

\mu_{*|\mathcal{D}} = m(\bm{x}_{*})+\bm{k}_{*,\mathcal{D}}^{\top}\,K_{\mathcal{D},\mathcal{D}}^{-1}\,(\hat{\bm{y}}_{\mathcal{D}}-\bm{m}_{\mathcal{D}}), (1a)
\sigma^{2}_{*|\mathcal{D}} = k(\bm{x}_{*},\bm{x}_{*})-\bm{k}_{*,\mathcal{D}}^{\top}\,K_{\mathcal{D},\mathcal{D}}^{-1}\,\bm{k}_{*,\mathcal{D}}. (1b)

Then the mean value $\mu_{*|\mathcal{D}}$ in (1a) can be used as the prediction of $f(\bm{x}_{*})$, and the variance $\sigma^{2}_{*|\mathcal{D}}$ in (1b) provides a confidence estimate for this prediction. The merits of GP include: 1) GP is a non-parametric method that avoids the bias of model selection; 2) GP works well with small datasets; 3) GP can incorporate prior domain knowledge by defining priors on hyperparameters or using a particular covariance function. The major issue of GP is its computational complexity, which scales cubically in the number of observations, i.e., $O(n^{3})$.
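As a minimal illustration of (1a)-(1b) (not part of the original paper), the sketch below computes the GP posterior with a zero mean function and a squared-exponential kernel; the hyperparameters and the small observation-noise term added to $K_{\mathcal{D},\mathcal{D}}$ are assumptions for numerical stability.

```python
import numpy as np

def rbf_kernel(A, B, length=1.0, var=1.0):
    """Squared-exponential kernel k(x, x') = var * exp(-||x - x'||^2 / (2 * length^2))."""
    sq_dist = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return var * np.exp(-0.5 * sq_dist / length**2)

def gp_posterior(X_train, y_train, X_star, noise=1e-2):
    """Posterior mean and variance of f(x_*) given noisy observations, cf. (1a)-(1b)."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))  # K_{D,D} plus noise jitter
    k_star = rbf_kernel(X_train, X_star)                             # columns are k_{*,D}
    K_inv_y = np.linalg.solve(K, y_train)                            # K^{-1}(y_hat - m), m = 0
    mean = k_star.T @ K_inv_y                                        # (1a)
    cov = rbf_kernel(X_star, X_star) - k_star.T @ np.linalg.solve(K, k_star)  # (1b)
    return mean, np.diag(cov)

# toy usage: learn y = sin(x) from 10 noisy samples and predict at 20 new points
X = np.random.uniform(0, 5, (10, 1))
y = np.sin(X).ravel() + 0.05 * np.random.randn(10)
mu, sigma2 = gp_posterior(X, y, np.linspace(0, 5, 20).reshape(-1, 1))
```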

II-B Thompson Sampling

Thompson sampling [24] is a prominent Bayesian learning framework that was originally developed to solve the multi-armed bandit (MAB) problem [25] and can be extended to tackle other online learning problems. Consider a classical $T$-period MAB problem where an agent selects an action (called an “arm”) $a_{t}$ from the action set $\mathcal{A}$ at each time $t\in\{1,2,\cdots,T\}$. After taking action $a_{t}$, the agent observes an outcome $z_{t}$ that is randomly generated from a conditional distribution $\mathcal{P}_{\theta}(\cdot|a_{t})$, and then obtains a reward $r_{t}=r(z_{t})$ with a known reward function $r(\cdot)$. The agent is initially uncertain about the parameter $\theta$ in $\mathcal{P}_{\theta}$ but aims to maximize the total expected reward using the observation feedback. To achieve good performance, it is generally required to take actions that strike an effective balance between 1) exploring poorly-understood actions to gather new information that may improve future rewards and 2) exploiting what is known to maximize the immediate reward.

TS is a straightforward online learning algorithm that strikes an effective balance between exploration and exploitation. As shown in Algorithm 1, TS treats the unknown $\theta$ as a random variable and represents the initial belief on $\theta$ with a prior distribution $\mathcal{P}$. At each time $t$, TS draws a random sample $\hat{\theta}$ from $\mathcal{P}$ (Step 3), then takes the optimal action that maximizes the expected reward based on the sample $\hat{\theta}$ (Step 4). After the outcome $z_{t}$ is observed, the Bayesian rule is applied to update the distribution $\mathcal{P}$ over $\theta$ and obtain the posterior (Step 5).

Algorithm 1 Thompson Sampling (TS) Algorithm [24]
1:  Input: Prior distribution $\mathcal{P}$ on $\theta$.
2:  for $t=1$ to $T$ do
3:     Sample $\hat{\theta}\sim\mathcal{P}$.
4:     $a_{t}\leftarrow\arg\max_{a\in\mathcal{A}}\mathbb{E}_{\mathcal{P}_{\hat{\theta}}}[r(z_{t})|a_{t}=a]$. Apply $a_{t}$ and observe $z_{t}$.
5:     Posterior update: $\mathcal{P}\leftarrow\frac{\mathcal{P}(\theta)\mathcal{P}_{\theta}(z_{t}|a_{t})}{\int_{\tilde{\theta}}\mathcal{P}(\tilde{\theta})\mathcal{P}_{\tilde{\theta}}(z_{t}|a_{t})\,d\tilde{\theta}}$.
6:  end for

The main features of the TS algorithm are listed below:

  • As outcomes accumulate, the predefined prior distribution is washed out and the posterior converges to the true distribution or concentrates on the true value of $\theta$.

  • The TS algorithm encourages exploration by the random sampling (Step 3). As the posterior distribution gradually concentrates, less exploration and more exploitation will be performed, which leads to an effective balance.

  • The key advantage of the TS algorithm is that a complex online problem is decomposed into a Bayesian learning task (Step 5) and an offline optimization task (Step 4) [26], while the optimization task retains its original formulation without being complicated by the learning task.
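As a concrete instance of Algorithm 1 (for illustration only), the sketch below applies TS to a Bernoulli bandit with Beta priors, for which the posterior update in Step 5 has a conjugate closed form; the arm reward probabilities are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = [0.3, 0.5, 0.7]           # unknown reward probabilities of three arms (hypothetical)
alpha = np.ones(3)                      # Beta posterior parameters (pseudo-successes)
beta = np.ones(3)                       # Beta posterior parameters (pseudo-failures)

for t in range(1000):
    theta_hat = rng.beta(alpha, beta)   # Step 3: sample theta from the current posterior
    a = int(np.argmax(theta_hat))       # Step 4: play the arm that looks best under the sample
    z = rng.random() < true_probs[a]    # observe a Bernoulli outcome
    alpha[a] += z                       # Step 5: conjugate Beta-Bernoulli posterior update
    beta[a] += 1 - z

print("posterior means:", alpha / (alpha + beta))  # concentrate around true_probs over time
```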

III Problem Formulation

Consider a residential DR program that controls AC power consumption for load adjustment, where a system aggregator (SA) interacts with $N$ residential customers over sequential DR events. Each DR event (to avoid confusion, in the following text we refer specifically to the load adjustment period when speaking of a “DR event”) is formulated as a finite time horizon $[T]:=\{1,2,\cdots,T\}$ with time step $\Delta t$. Depending on the practical AC control cycle, $\Delta t$ may differ (e.g., 5 minutes or 10 minutes) across DR events (generally, the control time step $\Delta t$ should be larger than the metering data collection period, which is enabled by current smart meters with second or minute sampling rates), and the time length $T$ could also differ (e.g., 2 hours or 3 hours). The SA aims to learn customers’ opt-out behaviors and make real-time AC control decisions to optimize the aggregate DR performance. In this paper, we focus on the control of household mini-split AC units, while the proposed method is also applicable to heating, ventilation and air conditioning (HVAC) systems, space heaters, and other TCLs. Moreover, the proposed method works for both the AC heating and cooling cases separately.

In this section, we first establish the AC control model for each individual customer during a DR event, then use Gaussian process and logistic regression to learn the unknown thermal dynamics and customer opt-out behaviors, respectively. With all these in place, two system-level optimization models with different DR objectives are built for the SA to generate optimal AC power control schemes over the $N$ customers.

III-A Customer-Side AC Control Model During A DR Event

III-A1 Decision Variable

During a DR event, denote $u_{i,t}$ as the AC power consumption of customer $i\in[N]:=\{1,\cdots,N\}$ at time $t\in[T]$, which is the decision variable and satisfies

0\leq u_{i,t}\leq\bar{u}_{i},\quad |u_{i,t}-u_{i,t-1}|\leq\Delta u_{i},\qquad\forall t\in[T] (2)

where $\bar{u}_{i}$ is the rated AC power capacity and $\Delta u_{i}$ denotes the AC power drift limit that prevents dramatic changes. In (2), $u_{i,t}$ is continuously controllable, since it denotes the average AC power during the time interval $[t-1,t]$, which can be realized by appropriately adjusting the AC cycling rate, i.e., the time ratio of the on status over $\Delta t$. Nevertheless, a discrete AC control model for customer $i$ with

u_{i,t}=\delta_{i,t}\cdot\bar{u}_{i},\quad \delta_{i,t}\in\{0,1\},\qquad\forall t\in[T]

can be applied as well, where the AC unit switches between the on ($\delta_{i,t}=1$) and off ($\delta_{i,t}=0$) modes.

III-A2 Opt-Out Status and Transition

Denote the binary variable $z_{i,t}\in\{0,1\}$ as the opt-out status of customer $i$ at time $t$, which equals $1$ if customer $i$ stays in and $0$ if the customer opts out. Initialize $z_{i,0}=1$ for all customers at the beginning of a DR event. As mentioned above, customer opt-out behaviors are influenced by various environmental factors. Accordingly, the binary opt-out statuses $\bm{z}_{i}:=(z_{i,t})_{t\in[T]}$ of each customer $i\in[N]$ are modelled as a time series of random variables that are independent of other customers and follow the transition probability (3):

\mathbb{P}(z_{i,t}=0\,|\,z_{i,t-1}=0) = 1, \qquad\forall t\in[T] (3a)
\mathbb{P}(z_{i,t}=1\,|\,z_{i,t-1}=1) = p_{i}(\bm{w}_{i,t}), \qquad\forall t\in[T] (3b)

Here, $\bm{w}_{i,t}$ is the column vector that collects the environmental factors at time $t$, which will be elaborated in the next part. $p_{i}(\bm{w}_{i,t})\in[0,1]$ is the transition probability function that captures how the environmental factors $\bm{w}_{i,t}$ affect the customer opt-out behavior.

Figure 1: Opt-out status transition and dynamics of AC control in a DR event.

As illustrated in Figure 1, equation (3a) enforces the DR rule that once customer $i$ opts out at a certain time, this customer remains in the opt-out status for the rest of the current DR event. Equation (3b) indicates that if customer $i$ stays in at time $t-1$, this customer may remain stay-in at the next time $t$ with probability $p_{i}(\bm{w}_{i,t})$, or choose to opt out with probability $1-p_{i}(\bm{w}_{i,t})$. We further explain the transition model (3) with the following remark.
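As an illustration of the transition model (3) (not part of the original paper), the sketch below samples an opt-out status trajectory: the status is absorbing once the customer opts out, and the stay-in probability function used here is only a placeholder.

```python
import numpy as np

def simulate_opt_out(p_fn, w_traj, rng):
    """Sample a status trajectory (z_1, ..., z_T) following the transition model (3)."""
    z, traj = 1, []
    for w_t in w_traj:                    # w_traj: environmental factors over the DR event
        if z == 1:                        # stayed in at t-1: remain with prob p_i(w_t), cf. (3b)
            z = int(rng.random() < p_fn(w_t))
        # once z == 0 it stays 0 for the rest of the event, cf. (3a)
        traj.append(z)
    return traj

rng = np.random.default_rng(1)
p_fn = lambda w: 1.0 / (1.0 + np.exp(-(4.0 - 0.5 * w)))   # placeholder stay-in probability
w_traj = np.linspace(0.0, 6.0, 12)                         # placeholder scalar factor trajectory
print(simulate_opt_out(p_fn, w_traj, rng))
```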

Remark 1.

The opt-out status transition model (3) exhibits the Markov property, where the transition probability $p_{i}(\bm{w}_{i,t})$ is a function of the environmental factors $\bm{w}_{i,t}$ at time $t$. This facilitates the subsequent development and solution of the optimal AC control models, but does not sacrifice modeling generality, because one is free to choose suitable environmental factors so that all the useful information is captured in $\bm{w}_{i,t}$ and the Markov property is preserved. Basically, by including all the necessary known information in an enlarged state at time $t$, which is known as state augmentation, any multi-period control problem can generally be modelled as a Markov decision process [28]. See the next part for the detailed definition and selection of the environmental factors.

III-A3 Environmental Factors

Based on the empirical investigations in [8, 9, 10, 11], we present below several key environmental factors that influence customers’ opt-out behaviors. In particular, the first three factors are affected by the AC control scheme, thus their dynamics models are introduced as well.

\bullet (Indoor Temperature). Denote $s_{i,t}$ as the indoor temperature of customer $i$ at time $t$. The associated thermal dynamics model can be formulated as

s_{i,t}=f_{i}\left(s_{i,t-1},s_{i,t-1}^{\mathrm{out}},u_{i,t}\right),\qquad t\in[T] (4)

where $s_{i,t}^{\mathrm{out}}$ is the outdoor temperature at time $t$, $(s_{i,0},s_{i,0}^{\mathrm{out}})$ are the initial temperatures at the beginning of the DR event, and $f_{i}(\cdot)$ denotes the thermal dynamics function.

\bullet (Accumulated Thermal Discomfort). We define $d_{i,t}$ as the accumulated thermal discomfort of customer $i$ at time $t$, which follows the dynamics (5) with $d_{i,0}=0$:

d_{i,t}=d_{i,t-1}+\Delta t\cdot\big(\max\big\{s_{i,t}-s^{\mathrm{set}}_{i},\,0\big\}\big)^{2},\qquad t\in[T] (5)

where $s_{i}^{\mathrm{set}}$ denotes the default AC temperature setpoint of customer $i$. The operator $\max\{x,0\}$ takes the larger value between $x$ and $0$, which means that only an indoor temperature higher than the setpoint causes thermal discomfort in the AC cooling case in the summer. Besides, the quadratic form in (5) captures that the thermal discomfort increases faster as the temperature deviation becomes larger [29].

\bullet (Incentive Credit). Denote $r_{i,t}$ as the incentive credit offered to customer $i$ at time $t$. We consider the general incentive scheme (6), while other incentive schemes can be used as well.

r_{i,t}=r_{0}+r_{1}\cdot\bar{u}_{i}\cdot t+r_{2}\cdot\Delta t\sum_{\tau=1}^{t}|u_{i,\tau}^{\mathrm{set}}-u_{i,\tau}|,\qquad t\in[T]. (6)

In (6), the first term $r_{0}$ is the base credit for DR participation. The second term is the stay-in bonus, which is proportional to time $t$ and the AC power capacity $\bar{u}_{i}$ with coefficient $r_{1}$. The third term is the reimbursement for the actual load adjustment with credit coefficient $r_{2}$, where $u_{i,t}^{\mathrm{set}}$ is the AC power required at time $t$ to maintain the setting temperature $s_{i}^{\mathrm{set}}$.

Essentially, $(u_{i,t}^{\mathrm{set}})_{t\in[T]}$ can be regarded as the baseline AC power consumption of customer $i$ when no DR control is implemented. With the thermal dynamics model (4) and given $(s_{i}^{\mathrm{set}},s_{i,t-1}^{\mathrm{out}})$, one can compute $u_{i,t}^{\mathrm{set}}$ by solving equation (7):

s_{i}^{\mathrm{set}}-f_{i}(s_{i}^{\mathrm{set}},s_{i,t-1}^{\mathrm{out}},u_{i,t}^{\mathrm{set}})=0, (7)

which follows from the requirement of maintaining the default indoor temperature setpoint $s_{i}^{\mathrm{set}}$.
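For concreteness, the following minimal sketch (not from the original paper) updates the accumulated discomfort (5) and the incentive credit (6) for one customer, and computes the baseline power from (7) under the linear thermal model (9) introduced in Section III-B; all numerical parameters (thermal coefficients, credit coefficients) are hypothetical.

```python
import numpy as np

# hypothetical parameters for one customer (cooling case: eta < 0)
kappa, eta = 0.03, -0.4                   # thermal coefficients of the linear model (9)
s_set, dt = 72.0, 0.25                    # AC temperature setpoint [F] and time step [h]
r0, r1, r2, u_bar = 1.0, 0.05, 0.2, 2.0   # incentive coefficients of (6) and AC capacity [kW]

def discomfort_step(d_prev, s_t):
    """One step of the accumulated thermal discomfort dynamics (5)."""
    return d_prev + dt * max(s_t - s_set, 0.0) ** 2

def baseline_power(s_out):
    """u_set that keeps the indoor temperature at s_set under the linear model (9), cf. (7)."""
    return -kappa * (s_out - s_set) / eta

def incentive(t, u_hist, u_set_hist):
    """Incentive credit r_{i,t} offered after t control steps, cf. (6)."""
    return r0 + r1 * u_bar * t + r2 * dt * np.sum(np.abs(np.array(u_set_hist) - np.array(u_hist)))

print(baseline_power(s_out=90.0))                         # baseline AC power for 90 F outdoors
print(discomfort_step(0.0, s_t=75.0))                     # discomfort after one step at 75 F indoors
print(incentive(4, [0.5] * 4, [baseline_power(90.0)] * 4))
```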

Other key environmental factors that would influence customers’ opt-out behaviors include real-time electricity prices, weather conditions, duration of the DR event, fatigue effect, etc. These factors are treated as given parameters that can be obtained or predicted ahead of time. Accordingly, the vector $\bm{w}_{i,t}$ in (3b) can be defined as the combination of the environmental factors mentioned above:

\bm{w}_{i,t}:=(s_{i,t},\,d_{i,t},\,r_{i,t},\,\text{electricity price at }t,\,\cdots). (8)
Remark 2.

We note that the definition and selection of useful environmental factors are complex and tricky in practice. For instance, the definition of $\bm{w}_{i,t}$ in (8) only contains the present status at time $t$, while past values and future predictions may also be included in $\bm{w}_{i,t}$ to capture the temporal dependence. This is related to the feature engineering problem in machine learning, which should be conducted based on real data and make a trade-off between complexity and effectiveness. Nevertheless, the proposed learning and AC control method is a general framework that is applicable to different choices of the environmental factors.

One critical issue for residential AC control is that the thermal dynamics function $f_{i}(\cdot)$ in (4) and the customer opt-out behavior function $p_{i}(\cdot)$ in (3b) are generally unknown. To address this issue, learning techniques are used to estimate the unknown models from real data, which are presented in the following two subsections.

III-B Learning for Thermal Dynamics Model

The practical implementation of AC control for residential DR is generally achieved through smart plugs or smart AC units with built-in controllers and sensors. These smart devices are able to measure, store, and communicate the temperature and AC power data in real time. Hence, the thermal dynamics model (4) can be estimated based on fine-grained historical measurement data. To this end, we provide the following two thermal model estimation schemes.

III-B1 Linear Model

Given a time series of historical indoor/outdoor temperature and AC power data, one can fit the classical linear thermal dynamics model (9) [30] and obtain the coefficients $(\kappa_{i},\eta_{i})$ via linear regression:

s_{i,t}=s_{i,t-1}+\kappa_{i}\cdot(s_{i,t-1}^{\mathrm{out}}-s_{i,t-1})+\eta_{i}\cdot u_{i,t}, (9)

where the coefficients $\kappa_{i}$ and $\eta_{i}$ specify the thermal characteristics of the air-conditioned room and its ambient environment. A positive (negative) $\eta_{i}$ indicates that the AC works in the heating (cooling) mode.
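The regression step can be illustrated with a short least-squares sketch that estimates $(\kappa_{i},\eta_{i})$ in (9) from a temperature/power time series; the synthetic data below stand in for the real metering data, and the coefficient values are hypothetical.

```python
import numpy as np

def fit_linear_thermal(s, s_out, u):
    """Estimate (kappa, eta) of model (9): s_t = s_{t-1} + kappa*(s_out_{t-1}-s_{t-1}) + eta*u_t."""
    ds = s[1:] - s[:-1]                                # temperature increments
    X = np.column_stack([s_out[:-1] - s[:-1], u[1:]])  # regressors of the two coefficients
    coef, *_ = np.linalg.lstsq(X, ds, rcond=None)
    return coef  # [kappa, eta]

# synthetic example: generate data from known coefficients and recover them
rng = np.random.default_rng(0)
kappa_true, eta_true, T = 0.03, -0.4, 500
s = np.empty(T); s[0] = 72.0
s_out = 85.0 + 5.0 * rng.standard_normal(T)
u = rng.uniform(0.0, 2.0, T)
for t in range(1, T):
    s[t] = s[t-1] + kappa_true * (s_out[t-1] - s[t-1]) + eta_true * u[t] + 0.05 * rng.standard_normal()
print(fit_linear_thermal(s, s_out, u))   # approximately [0.03, -0.4]
```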

III-B2 Gaussian Process Model

An alternative scheme is to employ the Gaussian process method (introduced in Section II) to model the thermal dynamics as (10), which can capture the nonlinearity in the data pattern:

s_{i,t}=m(\bm{x}_{i,t-1})+\bm{k}_{*,\mathcal{D}}(\bm{x}_{i,t-1})^{\top}K_{\mathcal{D},\mathcal{D}}^{-1}\,(\hat{\bm{y}}_{\mathcal{D}}-\bm{m}_{\mathcal{D}}), (10)

where $\bm{x}_{i,t-1}:=(s_{i,t-1},s_{i,t-1}^{\mathrm{out}},u_{i,t})$, and the notations with subscript $\mathcal{D}$ denote the corresponding terms associated with the historical (training) dataset as presented in (1a).

The main virtue of the linear model (9) lies in its simplicity of implementation and its interpretability. In contrast, the non-parametric GP model (10) offers more modeling flexibility, capturing nonlinear relations and avoiding the bias of model class selection, at the cost of higher computational complexity. Besides, other suitable regression methods can be applied to model the thermal dynamics as well. The choice of model depends on the practical DR requirements on computational efficiency and modeling accuracy. The historical data above refer to the datasets that have been collected by advanced meters before the DR event; thus the thermal dynamics model can be estimated in an offline fashion. Nevertheless, dynamic regression that uses the latest data to refit the model during the DR event is also applicable to further enhance the prediction accuracy.

III-C Learning for Customer Opt-Out Behaviors

Since the customer opt-out status $z_{i,t}$ is binary, logistic regression [23] is used to model the transition probability function $p_{i}(\bm{w}_{i,t})$ in (3b), because the output of logistic regression is naturally a probability value within $[0,1]$ and it is easy to implement and interpret. Moreover, logistic regression is compatible with the online Bayesian learning framework with efficient posterior update approaches (see Section IV-C for details). Accordingly, we formulate the transition probability function $p_{i}(\bm{w}_{i,t})$ as the logistic model (11):

p_{i}(\bm{w}_{i,t})=\frac{1}{1+\exp(-(\alpha_{i}+\bm{\beta}_{i}^{\top}\bm{w}_{i,t}))}, (11)

where $\bm{\beta}_{i}$ is the weight vector describing how customer $i$ reacts to the environmental factors $\bm{w}_{i,t}$, and $\alpha_{i}$ depicts the individual preference. Define $\hat{\bm{w}}_{i,t}:=(1,\bm{w}_{i,t})$ and $\bm{\theta}_{i}:=(\alpha_{i},\bm{\beta}_{i})$. Then the linear term in (11) becomes $-\hat{\bm{w}}_{i,t}^{\top}\bm{\theta}_{i}$. Without causing any confusion, we use $p_{\bm{\theta}_{i}}(\bm{w}_{i,t})$ and $p_{i}(\bm{w}_{i,t})$ interchangeably.

As a consequence, the unknown information of customer $i$’s behaviors is summarized in the vector $\bm{\theta}_{i}$, which can be estimated from the observations of $(\hat{\bm{w}}_{i,t},z_{i,t})$ in DR events. In contrast to the thermal dynamics model learning, the observation data $(\hat{\bm{w}}_{i,t},z_{i,t})$ are not historically available but can only be obtained along with the real implementation of DR events. This leads to an online customer behavior learning and AC power control problem. Thus, we employ the TS framework to develop the online AC control algorithm in Section IV to effectively balance exploration and exploitation.
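As a point of reference (not from the original paper), the sketch below evaluates the stay-in probability (11) with the augmented vector $\hat{\bm{w}}_{i,t}=(1,\bm{w}_{i,t})$; the parameter $\bm{\theta}_{i}$ and the factor values are hypothetical.

```python
import numpy as np

def stay_in_prob(theta, w):
    """Stay-in probability p_theta(w) of model (11), with theta = (alpha, beta) and w_hat = (1, w)."""
    w_hat = np.concatenate(([1.0], w))
    return 1.0 / (1.0 + np.exp(-(w_hat @ theta)))

theta_i = np.array([4.0, -0.6, -0.3, 0.8, -0.05, -0.2])   # hypothetical (alpha_i, beta_i)
w_it = np.array([2.0, 1.5, 3.0, 5.0, 0.4])                # hypothetical environmental factors at time t
print(stay_in_prob(theta_i, w_it))
```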

III-D System-Level Optimal AC Control Models

In a typical DR setting, once a customer opts out, the AC unit is automatically switched back to the default operation mode with the original customer-set temperature $s_{i}^{\mathrm{set}}$. Taking the opt-out status into account, the actual AC power consumption $\hat{u}_{i,t}$ can be formulated as

\hat{u}_{i,t}=u_{i,t}\cdot z_{i,t-1}+u_{i,t}^{\mathrm{set}}\cdot(1-z_{i,t-1}), (12)

which equals $u_{i,t}$ if customer $i$ stays in ($z_{i,t-1}=1$) or $u_{i,t}^{\mathrm{set}}$ if the customer opts out ($z_{i,t-1}=0$). Denote $\bm{u}_{i}:=(u_{i,t})_{t\in[T]}$, $\bm{s}_{i}:=(s_{i,t})_{t\in[T]}$, $\bm{d}_{i}:=(d_{i,t})_{t\in[T]}$, $\bm{r}_{i}:=(r_{i,t})_{t\in[T]}$, and $\bm{z}:=(\bm{z}_{i})_{i\in[N]}$.

To simplify the expression, we reformulate the AC control constraints (2), the opt-out status transition (3), (8), (11), and the dynamics of the environmental factors (5), (6), (9) or (10), for customer $i\in[N]$, in the following compact form (13):

(\bm{u}_{i},\bm{z}_{i},\bm{s}_{i},\bm{d}_{i},\bm{r}_{i})\in\mathcal{X}_{i}, (13)

where $\mathcal{X}_{i}$ denotes the corresponding feasible set. Then two system-level optimal control (SOC) models, i.e., (14) and (15), with different DR goals are established for the SA to solve for optimal AC power control schemes over the $N$ customers.

1) SOC-1 model (14) aims to reduce as much AC load as possible in a DR event, which can be used to flatten load peaks or mitigate reserve deficiency emergencies.

\min_{u_{i,t}}\ \mathbb{E}_{\bm{z}}\Big[\sum_{i=1}^{N}\big[\Delta t\sum_{t=1}^{T}\hat{u}_{i,t}+\rho_{i}(1-z_{i,T})\big]\Big] (14a)
\text{s.t.}\ (\bm{u}_{i},\bm{z}_{i},\bm{s}_{i},\bm{d}_{i},\bm{r}_{i})\in\mathcal{X}_{i},\quad\forall i\in[N] (14b)

where objective (14a) minimizes the expected total AC energy consumption over the DR event, plus an opt-out penalty term at the last time step. $\rho_{i}$ is the penalty coefficient, which can be tuned to balance load reduction and opt-out outcomes. $\mathbb{E}_{\bm{z}}[\cdot]$ denotes the expectation taken over the randomness of the customer opt-out status $\bm{z}$. Constraint (14b) collects the counterparts of (13) for all customers.

2) SOC-2 model (15) aims to closely track a regulated power trajectory $(L_{t})_{t\in[T]}$, which is determined by the upper-level power dispatch or the DR market.

\min_{u_{i,t}}\ \mathbb{E}_{\bm{z}}\Big[\sum_{t=1}^{T}\Delta t\big(\sum_{i=1}^{N}\hat{u}_{i,t}-L_{t}\big)^{2}+\sum_{i=1}^{N}\rho_{i}(1-z_{i,T})\Big] (15a)
\text{s.t.}\ (\bm{u}_{i},\bm{z}_{i},\bm{s}_{i},\bm{d}_{i},\bm{r}_{i})\in\mathcal{X}_{i},\quad\forall i\in[N]. (15b)

Objective (15a) minimizes the expected total squared power tracking deviation from the global target $L_{t}$, plus the same opt-out penalty term defined in (14a). Constraint (15b) is the same as (14b).

Remark 3.

The penalty term $\rho_{i}(1-z_{i,T})$ in objectives (14a) and (15a) serves as the final state cost in a finite-horizon control planning problem and is used to restrict the last control action $u_{i,T}$. Without this penalty term, the last control action $u_{i,T}$ would be overly aggressive with no regard for the opt-out outcome at time $T$ and would lead to frequent final opt-out $z_{i,T}=0$. Besides, this penalty term is a useful tool for the SA to make a trade-off between the DR goals and the customer opt-out results by adjusting the coefficient $\rho_{i}$.

The two SOC models (14) and (15) are indeed discrete-time finite-horizon control planning problems, which are in the form of nonconvex stochastic optimization, and the stochasticity results from the probabilistic opt-out status transition (3). We develop the distributed solution methods for the SOC models (14) and (15) in the next section.

IV Distributed Solution and Algorithm Design

For the real-time AC control in a DR event, we pursue a distributed implementation manner such that

  • 1)

    the control algorithm can be directly embedded in the local home electric appliances or smart phone apps;

  • 2)

    heavy communication burdens between the SA and households are avoided during the DR event;

  • 3)

    the private information of customers can be protected.

In this section, we propose the distributed solution methods for the SOC models (14) and (15), then develop the distributed online AC control algorithm based on the TS framework.

IV-A Distributed Solution of SOC-1 Model (14)

Since the opt-out status transition of one customer is assumed to be independent of other customers in (3), objective (14a) in the SOC-1 model has no substantial coupling among different customers. Hence, the SOC-1 model (14) can be equivalently decomposed into $N$ local problems, i.e., model (16) for each customer $i\in[N]$.

\min_{u_{i,t}}\ \mathbb{E}_{\bm{z}_{i}}\Big[\Delta t\sum_{t=1}^{T}\hat{u}_{i,t}+\rho_{i}(1-z_{i,T})\Big] (16a)
\text{s.t.}\ (\bm{u}_{i},\bm{z}_{i},\bm{s}_{i},\bm{d}_{i},\bm{r}_{i})\in\mathcal{X}_{i}. (16b)

The sum of objectives (16a) over all $N$ customers is essentially objective (14a) in the SOC-1 model, and constraint (16b) is the individual version of (14b) for customer $i$.

The local model (16) is a stochastic optimization with the expectation over $\bm{z}_{i}$ in the objective. Since $\bm{z}_{i}$ follows the transition (3) with the probability function $p_{\bm{\theta}_{i}}(\bm{w}_{i,t})$ in (11), we can derive the analytic form of the expectation in (16a), which leads to expression (17):

\sum_{t=1}^{T}\Big[\Delta t\,(u_{i,t}-u_{i,t}^{\mathrm{set}})\prod_{\tau=1}^{t-1}p_{\bm{\theta}_{i}}(\bm{w}_{i,\tau})\Big]-\rho_{i}\prod_{t=1}^{T}p_{\bm{\theta}_{i}}(\bm{w}_{i,t}). (17)

Expression (17) only differs from the expectation in (16a) by a constant term $\Delta t\sum_{t=1}^{T}u_{i,t}^{\mathrm{set}}+\rho_{i}$; thus they are equivalent in optimization. See Appendix A for the detailed derivation.

When $\bm{\theta}_{i}$ is given, we can obtain the optimal AC control scheme $\bm{u}_{i}^{*}$ for each customer $i$ by solving the local model (16), which is implemented in a fully distributed manner. Meanwhile, the aggregation of $\bm{u}_{i}^{*}$ over all $N$ customers is essentially an optimal solution to the SOC-1 model (14).
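For illustration, the analytic expected cost (17) can be evaluated directly once the stay-in probabilities along a candidate trajectory are fixed; the sketch below (with placeholder numbers, not the paper's implementation) shows this inner evaluation that a nonlinear solver such as IPOPT would perform repeatedly.

```python
import numpy as np

def soc1_objective(u, u_set, p, dt, rho):
    """Analytic expected cost (17): sum_t dt*(u_t - u_set_t)*prod_{tau<t} p_tau - rho*prod_t p_t."""
    cum_stay = np.concatenate(([1.0], np.cumprod(p)[:-1]))   # prod_{tau=1}^{t-1} p_tau for each t
    return float(np.sum(dt * (u - u_set) * cum_stay) - rho * np.prod(p))

# placeholder data for a 12-step DR event
T, dt, rho = 12, 0.25, 5.0
u = np.full(T, 0.8)          # candidate AC control scheme [kW]
u_set = np.full(T, 1.5)      # baseline AC power [kW]
p = np.full(T, 0.97)         # stay-in probabilities p_theta(w_t) along the trajectory
print(soc1_objective(u, u_set, p, dt, rho))
```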

IV-B Distributed Solution of SOC-2 Model (15)

The objective (15a) in the SOC-2 model couples different customers due to tracking a global power trajectory $(L_{t})_{t\in[T]}$. To solve this problem distributedly, we introduce a local tracking trajectory $\bm{l}_{i}:=(l_{i,t})_{t\in[T]}$ for each customer $i\in[N]$ with $\sum_{i=1}^{N}l_{i,t}=L_{t}$ for all $t\in[T]$. Then we substitute $L_{t}$ by $\sum_{i=1}^{N}l_{i,t}$ in objective (15a), so that (15a) can be approximated by the decomposable form (18a), which takes the form of a sum over the $N$ customers. (The approximation is made by dropping the term $2\sum_{t=1}^{T}\sum_{i\neq j}\mathbb{E}(\hat{u}_{i,t}-l_{i,t})\mathbb{E}(\hat{u}_{j,t}-l_{j,t})$ from the expansion of (15a); this term is expected to be relatively small and thus negligible.) As a result, the SOC-2 model (15) is modified as (18):

\min_{u_{i,t},\,l_{i,t}\geq 0}\ \sum_{i=1}^{N}\mathbb{E}_{\bm{z}_{i}}\Big[\sum_{t=1}^{T}\Delta t\big(\hat{u}_{i,t}-l_{i,t}\big)^{2}+\rho_{i}(1-z_{i,T})\Big] (18a)
\text{s.t.}\ (\bm{u}_{i},\bm{z}_{i},\bm{s}_{i},\bm{d}_{i},\bm{r}_{i})\in\mathcal{X}_{i},\quad\forall i\in[N] (18b)
\quad\ \ \sum_{i=1}^{N}l_{i,t}=L_{t},\quad\forall t\in[T] (18c)

where $l_{i,t}$ is also a decision variable.

Consequently, the only substantial coupling among customers in the modified SOC-2 model (18) is the equality constraint (18c). Therefore, we can introduce the dual variable $\bm{\lambda}:=(\lambda_{t})_{t\in[T]}$ for the equality constraint (18c) and employ the dual gradient algorithm [31] to solve the modified SOC-2 model (18) in a distributed manner. The specific distributed solution method is presented as Algorithm 2. The implementation of Algorithm 2 requires the two-way communication of $\bm{\lambda}$ and $\bm{l}_{i}$ between the SA and every customer in each iteration. Due to the simple structure with only one equality coupling constraint (18c), Algorithm 2 can converge quickly with appropriate step sizes, which is verified by our simulations.

Algorithm 2 Distributed Solution Algorithm.
1:  Initialization: Set the initial dual value $(\lambda_{t}^{0})_{t\in[T]}$, step size $\gamma$, convergence tolerance $\epsilon$, and iteration count $k\leftarrow 0$.
2:  Parallel Optimization: With the broadcast dual value $(\lambda_{t}^{k})_{t\in[T]}$, each customer $i$ solves the local AC control problem (19) and obtains the optimal solution $(\bm{u}_{i}^{*},\bm{l}_{i}^{*})$.
\min_{u_{i,t},\,l_{i,t}}\ \mathbb{E}_{\bm{z}_{i}}\Big[\Delta t\sum_{t=1}^{T}(\hat{u}_{i,t}-l_{i,t})^{2}+\rho_{i}(1-z_{i,T})\Big]+\sum_{t=1}^{T}\lambda_{t}^{k}l_{i,t} (19a)
\text{s.t.}\ (\bm{u}_{i},\bm{z}_{i},\bm{s}_{i},\bm{d}_{i},\bm{r}_{i})\in\mathcal{X}_{i};\ \ l_{i,t}\geq 0,\ \forall t\in[T]. (19b)
3:  Dual Variable Update: Each customer $i$ uploads $\bm{l}_{i}^{*}$ to the SA, and the SA updates the dual variable by
\lambda_{t}^{k+1}\leftarrow\lambda_{t}^{k}+\gamma\cdot\big(\sum_{i=1}^{N}l_{i,t}^{*}-L_{t}\big),\quad\forall t\in[T]. (20)
4:  Convergence Check: If $||\bm{\lambda}^{k+1}-\bm{\lambda}^{k}||\leq\epsilon$, terminate the algorithm. Otherwise, let $k\leftarrow k+1$ and go to Step 2.

Similar to (17), we can derive the equivalent analytic form (21) for the expectation term in (19a):

\Delta t\sum_{t=1}^{T}\bigg\{(u_{i,t}^{\mathrm{set}}-l_{i,t})^{2}+\Big[2(u_{i,t}^{\mathrm{set}}-l_{i,t})(u_{i,t}-u_{i,t}^{\mathrm{set}})+(u_{i,t}-u_{i,t}^{\mathrm{set}})^{2}\Big]\cdot\prod_{\tau=1}^{t-1}p_{\bm{\theta}_{i}}(\bm{w}_{i,\tau})\bigg\}-\rho_{i}\prod_{t=1}^{T}p_{\bm{\theta}_{i}}(\bm{w}_{i,t}). (21)

By substituting the expectation terms in (16a) and (19a) with their analytic forms (17) and (21), respectively, the local AC control models (16) and (19) become deterministic nonconvex optimization problems. Given the parameter $\bm{\theta}_{i}$, they can be solved efficiently via available nonlinear optimizer tools, such as the IPOPT solver [32]. For concise expression, we denote the above distributed solution methods together with the optimizer tools as an oracle

\mathcal{O}:\bm{\theta}_{i}\rightarrow\bm{u}_{i}^{*}, (22)

which generates the optimal $\bm{u}_{i}^{*}$ with the input of the parameter $\bm{\theta}_{i}$.
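A stripped-down sketch of the dual decomposition in Algorithm 2 is given below; the customer-level problem (19) is abstracted by a placeholder `local_solve` (a simple quadratic surrogate in $l_{i,t}$ only, standing in for the oracle (22)), so only the dual update (20) and the convergence check are faithful to the algorithm.

```python
import numpy as np

def dual_gradient(local_solve, L, N, gamma=1e-4, eps=1e-3, max_iter=500):
    """Dual gradient loop of Algorithm 2 for the coupling constraint sum_i l_{i,t} = L_t."""
    lam = np.zeros(len(L))                                            # dual variable lambda
    for k in range(max_iter):
        # parallel optimization: each customer returns its local tracking trajectory l_i*
        l_star = np.array([local_solve(i, lam) for i in range(N)])    # shape (N, T)
        lam_new = lam + gamma * (l_star.sum(axis=0) - L)              # dual update (20)
        if np.linalg.norm(lam_new - lam) <= eps:                      # convergence check
            return lam_new, l_star
        lam = lam_new
    return lam, l_star

def local_solve(i, lam, u_hat=1.0, dt=0.25):
    """Placeholder surrogate of (19): argmin_l dt*(u_hat - l)^2 + lam*l, clipped at l >= 0."""
    return np.maximum(u_hat - lam / (2 * dt), 0.0)

lam, l_star = dual_gradient(local_solve, L=np.full(12, 450.0), N=500, gamma=1e-4)
```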

IV-C Distributed Online DR Control Algorithm

Based on the TS framework, we develop the distributed online DR control algorithm as Algorithm 3 to learn customer opt-out behaviors and optimally control the AC power in a DR event. Since this online algorithm is implemented distributedly, we present Algorithm 3 from the perspective of an individual customer $i\in[N]$. The practical implementation of Algorithm 3 is illustrated in Figure 2.

Figure 2: Schematic of the distributed online DR control algorithm implemented on the SOC-2 problem. (For the SOC-1 problem, the communications of $\bm{\lambda}$ and $\bm{l}^{*}_{i}$ are not required.)
Algorithm 3 Distributed Online DR Control Algorithm.
1:  Input: Prior distribution $\mathcal{N}(\bm{\mu}_{i},\bm{\Sigma}_{i})$. Receive the DR objective and associated parameters from the SA.
2:  for $t=1$ to $T$ do
3:     Sample $\hat{\bm{\theta}}_{i}\sim\mathcal{N}(\bm{\mu}_{i},\bm{\Sigma}_{i})$.
4:     Optimization. Solve the SOC model (14) or (18) distributedly for the remaining time horizon $\{t,\cdots,T\}$, and obtain the optimal AC power control scheme using the oracle $\bm{u}^{*}_{i}\leftarrow\mathcal{O}(\hat{\bm{\theta}}_{i})$ in (22).
5:     Action. Implement the first control action $\bm{u}_{i}^{*}(1)$, and observe the customer opt-out outcome $z_{i,t}$. Collect the environmental factors $\hat{\bm{w}}_{i,t}$ at time $t$.
6:     Posterior Update. Initialize the variational parameter $\xi_{i}$ by
\xi_{i}\leftarrow\sqrt{\hat{\bm{w}}_{i,t}^{\top}\bm{\Sigma}_{i}\hat{\bm{w}}_{i,t}+(\hat{\bm{w}}_{i,t}^{\top}\bm{\mu}_{i})^{2}}. (23)
Iterate three times between the posterior update
\begin{cases}\hat{\bm{\Sigma}}_{i}^{-1}\leftarrow\bm{\Sigma}_{i}^{-1}+2|\ell(\xi_{i})|\,\hat{\bm{w}}_{i,t}\hat{\bm{w}}_{i,t}^{\top},\\ \ \hat{\bm{\mu}}_{i}\ \,\leftarrow\hat{\bm{\Sigma}}_{i}\left[\bm{\Sigma}_{i}^{-1}\bm{\mu}_{i}+(z_{i,t}-\tfrac{1}{2})\hat{\bm{w}}_{i,t}\right],\end{cases} (24)
where $\ell(\xi_{i}):=\big(1/2-1/(1+e^{-\xi_{i}})\big)/(2\xi_{i})$,
and the $\xi_{i}$ update
\xi_{i}\leftarrow\sqrt{\hat{\bm{w}}_{i,t}^{\top}\hat{\bm{\Sigma}}_{i}\hat{\bm{w}}_{i,t}+(\hat{\bm{w}}_{i,t}^{\top}\hat{\bm{\mu}}_{i})^{2}}. (25)
Then set $\bm{\Sigma}_{i}\leftarrow\hat{\bm{\Sigma}}_{i}$, $\bm{\mu}_{i}\leftarrow\hat{\bm{\mu}}_{i}$.
7:     Check Termination. If $z_{i,t}=0$, terminate the AC control and switch the AC operation mode back to the original customized setting.
8:  end for

In Algorithm 3, the unknown customer behavior parameter $\bm{\theta}_{i}$ is treated as a random variable, and we construct a Gaussian prior distribution $\mathcal{N}(\bm{\mu}_{i},\bm{\Sigma}_{i})$ for it based on historical information. At each time $t\in[T]$ of a DR event, $\hat{\bm{\theta}}_{i}$ is randomly sampled from the distribution for decision-making. Two key techniques used in Algorithm 3 are explained as follows.

1) To utilize the latest information and take future time slots into account, we employ the model predictive control (MPC) method in the optimization and action steps. Specifically, at each time $t$, the algorithm solves the SOC model (14) or (15) for the rest of the DR event to obtain the optimal AC control scheme $\bm{u}_{i}^{*}$, but only implements the first control action $\bm{u}_{i}^{*}(1)$. In addition, the latest predictions or estimations of the environmental factors, the updated thermal dynamics model, and the recalculated baseline AC power $u_{i,t}^{\mathrm{set}}$ can be adopted in the optimization step.

2) After observing the outcome pair $(z_{i,t},\hat{\bm{w}}_{i,t})$, the variational Bayesian inference approach introduced in [33] is applied to obtain the posterior distribution on $\bm{\theta}_{i}$ with the update rules (23)-(25). It is well known that Bayesian inference for the logistic regression model (11) is an intrinsically hard problem [34], and the exact posterior $\mathcal{P}(\bm{\theta}_{i}|z_{i,t},\hat{\bm{w}}_{i,t})$ is intractable to compute. Thus, we use the variational approach [33] for efficient Bayesian inference, which provides an accurate Gaussian approximation to the exact posterior with the closed form (24). The scalar $\xi_{i}$ in (23)-(25) is an intermediate parameter that affects the approximation accuracy; thus we alternate three times between the posterior update (24) and the $\xi_{i}$ update (25), which leads to an optimal approximated posterior. See reference [33] for details.
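The posterior update in Step 6 can be transcribed compactly as follows; this is a direct sketch of (23)-(25) under the variational approximation of [33] (the example prior and observation are hypothetical, and this is not the authors' code).

```python
import numpy as np

def variational_update(mu, Sigma, w_hat, z, n_iter=3):
    """Variational Bayesian update of the Gaussian posterior N(mu, Sigma) on theta_i
    after observing (w_hat, z), following (23)-(25); the prior stays fixed across iterations."""
    Sigma_inv = np.linalg.inv(Sigma)                                  # prior precision
    xi = np.sqrt(w_hat @ Sigma @ w_hat + (w_hat @ mu) ** 2)           # initialize xi, cf. (23)
    for _ in range(n_iter):
        ell = (0.5 - 1.0 / (1.0 + np.exp(-xi))) / (2.0 * xi)          # l(xi)
        Sigma_hat = np.linalg.inv(Sigma_inv + 2.0 * abs(ell) * np.outer(w_hat, w_hat))   # (24)
        mu_hat = Sigma_hat @ (Sigma_inv @ mu + (z - 0.5) * w_hat)                        # (24)
        xi = np.sqrt(w_hat @ Sigma_hat @ w_hat + (w_hat @ mu_hat) ** 2)                  # (25)
    return mu_hat, Sigma_hat                                          # set (mu, Sigma) <- hats

# example: update a standard-normal prior after one stay-in observation (z = 1)
d = 6
mu, Sigma = np.zeros(d), np.eye(d)
w_hat = np.array([1.0, 2.0, 1.5, 3.0, 5.0, 0.4])    # augmented environmental factors (1, w)
mu, Sigma = variational_update(mu, Sigma, w_hat, z=1)
```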

IV-D Performance Measurement

For online learning problems, the notion of “regret” and its variants are standard metrics defined to measure the performance of online learning and decision algorithms [35]. Accordingly, we denote $\bm{\theta}^{\star}:=(\bm{\theta}^{\star}_{i})_{i\in[N]}$ as the underlying true customer behavior parameter that the LSEs do not know but aim to learn. Then the regret of the proposed online DR control algorithm at the $m$-th DR event is defined as

\mathrm{regret}(m):=C^{\bm{\theta}^{\star}}_{m}(\bm{u}^{\mathrm{online}}_{m})-C^{\bm{\theta}^{\star}}_{m}(\bm{u}^{\star}_{m}), (26)

where $C^{\bm{\theta}^{\star}}_{m}(\cdot)$ denotes the objective function (i.e., (14a) in the SOC-1 model or (18a) in the modified SOC-2 model) under the true value $\bm{\theta}^{\star}$. $\bm{u}^{\mathrm{online}}_{m}$ denotes the AC control scheme generated by the proposed online algorithm, while $\bm{u}^{\star}_{m}$ is the optimal AC control scheme that minimizes the objective $C^{\bm{\theta}^{\star}}_{m}(\cdot)$. Thus $\mathrm{regret}(m)$ in (26) is always non-negative and measures the performance distance between the proposed online algorithm and the underlying best control scheme.

To evaluate the overall learning performance, we further define the cumulative regret up to the $M$-th DR event as

\mathrm{cumu\_regret}(M):=\sum_{m=1}^{M}\mathrm{regret}(m), (27)

which is simply the sum of $\mathrm{regret}(m)$ over the first $M$ DR events. Generally, a cumulative regret that is sublinear in $M$ is desired, i.e., $\mathrm{cumu\_regret}(M)/M\to 0$ as $M\to+\infty$, since it indicates that $\mathrm{regret}(m)\to 0$ as $m\to+\infty$. In other words, it means that the proposed online algorithm can eventually learn the customer behaviors well and make optimal AC control schemes as more and more DR events are experienced. We demonstrate the regret results of the proposed algorithm via numerical simulations in the next section.

V Numerical Simulations

In this section, we first test the performance of the linear thermal dynamics model and the GP model. Then, we implement the proposed distributed online DR control algorithm on the two SOC models.

V-A Indoor Thermal Dynamics Prediction

In this part, we compare the thermal dynamics prediction performance of the linear model (9) and the GP model (10). Real customer metering data, including indoor temperature and AC power, from ThinkEco (ThinkEco Inc. is a New York City-based energy efficiency and demand response solution company; http://www.thinkecoinc.com/) are used for model training and testing. The outdoor temperature data are procured from an open-access meteorological database (Iowa Environmental Mesonet [online]: https://mesonet.agron.iastate.edu/request/download.phtml?network=MA_ASOS). Specifically, we use the time series of data over 5 consecutive days with a resolution of 15 minutes to fit the two thermal dynamics models (9) and (10). The GPML toolbox [36] is applied to implement the GP model and optimize the hyperparameters. Then, the fitted models are tested on the time series of real data from the next 3 days for indoor temperature prediction.

Figure 3: Indoor temperature prediction results with Gaussian process and linear thermal dynamics models. (GP model: blue dotted curve; linear model: green curve; true indoor temperature: red dashed curve.)

The prediction results of one time step ahead (15 minutes) and three time steps ahead (45 minutes) are presented in Figure 3. The average prediction errors of the indoor temperature are $0.632^{\circ}$F and $0.737^{\circ}$F for the GP model in these two cases, and $0.846^{\circ}$F and $1.16^{\circ}$F for the linear model. It is observed that both the GP model (10) and the linear model (9) work well in modelling the thermal dynamics, and the GP model achieves better prediction accuracy.

V-B Learning and AC Control with SOC-1 Model

V-B1 Simulation Configuration

Each DR event lasts for 3 hours with the AC control period $\Delta t=15$ minutes, which implies a time length $T=12$. The AC capacity and drift limit are set as $\bar{u}_{i}=2$ kW and $\Delta u_{i}=1$ kW, and the AC setting temperature $s_{i}^{\mathrm{set}}$ is $72^{\circ}$F. As defined in Section IV-D, we associate each customer $i\in[N]$ with a true behavior parameter $\bm{\theta}_{i}^{\star}\in\mathbb{R}^{6}$ to simulate the opt-out outcomes, whose values are randomly generated but satisfy several basic rules to be reasonable. For example, if no DR control is implemented (default AC setting), the stay-in probability $p_{\bm{\theta}_{i}^{\star}}(\bm{w}_{i,t})$ should be very close to 1; if the indoor temperature $s_{i,t}$ reaches a high value such as $90^{\circ}$F, the stay-in probability should be very close to 0. The considered environmental factors include the indoor temperature $s_{i,t}$, accumulated thermal discomfort $d_{i,t}$, incentive credit $r_{i,t}$, outdoor temperature $s_{i,t}^{\mathrm{out}}$, and the time-varying electricity price, where the first three factors follow the dynamics (4)-(6), respectively, while the electricity price at each time is normalized and randomly generated from $\mathrm{Unif}(0,1)$. Besides, the IPOPT solver [32] is employed to solve the nonconvex optimal AC control models.

V-B2 Control and Learning Performance

Since the aggregation of all local optimal AC control schemes is an optimal solution to the SOC-1 model (14), we simulate the AC control and learning for a single customer over sequential DR events. Given the true parameter $\bm{\theta}_{i}^{\star}$, the optimal AC power trajectory $\bm{u}_{i}^{\star}$ can be computed by solving the local control model (16). Figure 4 illustrates the simulation results associated with the optimal AC control scheme $\bm{u}_{i}^{\star}$. It is seen that the stay-in probability is maintained close to 1 by the AC control scheme, which tends to keep customers comfortable so that they do not opt out, for the sake of long-term load reduction. Besides, there is a drop of AC power at the end of the DR event ($t=11,12$), leading to an increased indoor temperature and a decrease in the stay-in probability. Intuitively, that is because a last-minute opt-out will not affect the DR objective much, and thus a radical AC power reduction is conducted. This effect can be mitigated by increasing the penalty coefficient $\rho_{i}$.

Figure 4: Illustration of the optimal AC power control scheme $\bm{u}_{i}^{\star}$, stay-in probability $p_{\bm{\theta}_{i}^{\star}}(\bm{w}_{i,t})$, indoor temperature $s_{i,t}$, and outdoor temperature $s_{i,t}^{\mathrm{out}}$ for a customer, computed by solving model (16) with known $\bm{\theta}_{i}^{\star}$.

In practice, the true customer parameter $\bm{\theta}_{i}^{\star}$ is unknown, and we implement the proposed online algorithm to learn customer behaviors and make AC control decisions in sequential DR events. We compare the performance of the proposed algorithm with two baseline schemes: raising the AC setting temperature by $3^{\circ}$F and $5^{\circ}$F, respectively, which are typically used in real DR programs. The regret results for one customer over 200 DR events are shown in Figure 5. It is observed that the per-event $\mathrm{regret}(m)$ of the proposed algorithm decreases dramatically within the first tens of DR events, then converges to almost zero. As a result, the associated cumulative regret exhibits a clear sublinear trend, which verifies the learning efficiency of the proposed online algorithm. In contrast, the baseline schemes that simply raise the AC setting temperature without consideration of customer opt-out behaviors maintain high regret values. Besides, the average load energy reduction of one customer in a DR event is 1.13 kWh for the proposed algorithm, which is higher than those of the baseline schemes with 0.6 kWh and 0.82 kWh load reduction, respectively.

Figure 5: The regret comparison between the proposed Algorithm 3 and two baseline schemes (raising the AC setting temperature by $3^{\circ}$F and $5^{\circ}$F) for a typical customer under the SOC-1 model (14).

V-C Learning and AC Control with SOC-2 Model

We then apply the proposed online DR control algorithm to the SOC-2 model, where the simulation configurations are the same as in Section V-B. Since the SOC-2 problem involves coordination among all customers, we set the number of customers as $N=500$, and the global tracking target $L_{t}$ is randomly generated from $\mathrm{Unif}(450,550)$ kW.

V-C1 Distributed Solution

We apply Algorithm 2 to solve the modified SOC-2 model (18) distributedly over 500 customers. A diminishing step size $\gamma_{k}=\max(5\times 10^{-4}/\sqrt{k},\,1\times 10^{-4})$ is employed to speed up the convergence. For the case with given $\bm{\theta}_{i}^{\star}$, the convergence results are shown in Figure 6. It is seen that the distributed algorithm converges to the optimal value within tens of iterations. We note that the convergence curve in Figure 6 and the associated AC power are just intermediate computational values that are not executed in practice; only the final converged values are regarded as the AC power control scheme and are implemented. Figure 7 illustrates the simulation results associated with the converged optimal AC control scheme, including the optimal AC power with the local tracking target, the stay-in probability, and the indoor/outdoor temperature profiles for one customer.

Figure 6: Convergence results of the distributed algorithm (Algorithm 2) for solving the modified SOC-2 model (18).
Figure 7: Illustration of the optimal AC power control scheme $\bm{u}_{i}^{\star}$ with the local tracking target $\bm{l}_{i}^{\star}$, stay-in probability $p_{\bm{\theta}_{i}^{\star}}(\bm{w}_{i,t})$, indoor temperature $s_{i,t}$, and outdoor temperature $s_{i,t}^{\mathrm{out}}$ for a customer, computed by solving the modified SOC-2 model (18) with known $\bm{\theta}_{i}^{\star}$.

V-C2 Learning Performance

In practice, the true parameter $\bm{\theta}_{i}^{\star}$ is unknown; therefore, we implement the proposed online algorithm to learn customer opt-out behaviors and make AC control decisions with the modified SOC-2 model (18). The regret results of the proposed algorithm over 200 DR events with 500 customers are presented in Figure 8. A rapidly decreasing per-event regret and a sublinear cumulative regret are observed, which verify that the proposed algorithm learns the customer behaviors well and generates efficient AC control decisions. Besides, the per-event regret curve in Figure 8 exhibits non-monotonic variations and occasional spikes in the early learning stage (similarly in Figure 5). This is because the proposed algorithm follows the TS framework and draws a random sample $\hat{\bm{\theta}}_{i}$ for decision-making in each event, which introduces stochasticity and results in the non-monotonic variations. In addition, there is a small chance that the sample $\hat{\bm{\theta}}_{i}$ differs substantially from the true value $\bm{\theta}^{\star}_{i}$, leading to large spikes in the regret curve. However, as the observed customer opt-out outcomes accumulate, the posterior distribution of $\bm{\theta}_{i}$ gradually concentrates around the true value $\bm{\theta}_{i}^{\star}$, so that large spikes become less frequent in the later learning stage.
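Schematically, the TS loop described above can be sketched as follows: maintain a Gaussian posterior over $\bm{\theta}_{i}$, draw one sample per DR event, control with the sampled parameter, observe the binary stay-in/opt-out outcomes, and refresh the posterior via a Laplace approximation of the logistic likelihood. The planner, features, and posterior-update choice below are simplified stand-ins for the paper's Algorithm 3, shown only to illustrate the structure of the loop.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 2                                        # feature dimension of w_{i,t}
theta_true = np.array([2.0, -0.8])           # unknown true customer parameter (assumed)
mu, Sigma = np.zeros(d), 10.0 * np.eye(d)    # Gaussian prior on theta_i

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def plan_event(theta_hat, T=12):
    """Stand-in planner: returns the feature sequence w_{i,t} induced by the
    control scheme optimized for the sampled parameter (placeholder logic)."""
    discomfort = np.clip(2.0 + 0.1 * theta_hat[0], 0.5, 4.0)
    return np.stack([np.ones(T), discomfort * np.ones(T)], axis=1)

for m in range(200):                          # sequential DR events
    theta_hat = rng.multivariate_normal(mu, Sigma)   # 1) sample from the posterior
    W = plan_event(theta_hat)                        # 2) plan/control with the sample
    p = sigmoid(W @ theta_true)
    y = (rng.random(len(p)) < p).astype(float)       # 3) observe stay-in outcomes
    stop = np.argmin(y) if (y == 0).any() else len(y)
    W, y = W[: stop + 1], y[: stop + 1]              # keep data up to the opt-out time

    # 4) Laplace-approximate posterior update for the logistic likelihood:
    #    find the MAP of theta and use the negative Hessian as the precision.
    theta_map = mu.copy()
    prior_prec = np.linalg.inv(Sigma)
    for _ in range(20):                              # Newton iterations
        q = sigmoid(W @ theta_map)
        grad = W.T @ (y - q) - prior_prec @ (theta_map - mu)
        hess = -(W.T * (q * (1 - q))) @ W - prior_prec
        theta_map = theta_map - np.linalg.solve(hess, grad)
    mu, Sigma = theta_map, np.linalg.inv(-hess)      # approximate posterior
```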

Figure 8: The regret results of the proposed Algorithm 3 on the modified SOC-2 model (18) with 500 customers.

VI Conclusion

In this paper, we propose a distributed online DR control algorithm to learn customer behaviors and regulate AC loads for incentive-based residential DR. Based on the Thompson sampling framework, the proposed algorithm consists of an online Bayesian learning step and an offline distributed optimization step. Two DR objectives are considered: minimizing the total AC load and closely tracking a regulated power trajectory. Numerical simulations show that the distributed solution converges to the optimal value within tens of iterations, and that the regret of learning decreases rapidly on average as DR events proceed. Future work includes 1) identifying significant and effective environmental factors based on real user data, and 2) conducting practical DR experiments with the proposed algorithm and analyzing its performance in the field.

Appendix A Analytic Derivation for Expectation in (16a)

The expectation term in (16a) can be expanded as

\[
\mathbb{E}_{\bm{z}_{i}}\bigg[\sum_{t=1}^{T}\Delta t\left((u_{i,t}-u_{i,t}^{\mathrm{set}})z_{i,t-1}\right)-\rho_{i}z_{i,T}\bigg]+\Delta t\sum_{t=1}^{T}u_{i,t}^{\mathrm{set}}+\rho_{i}.
\]

The status transition (3) implies that the customer opt-out time $t_{i}^{\mathrm{opt}}$ follows a geometric distribution with the Bernoulli (stay-in) probability $p_{i,t}:=p_{\bm{\theta}_{i}}(\bm{w}_{i,t})$ given by (11). Thus, we can enumerate all possible realizations of $t_{i}^{\mathrm{opt}}$ together with their probabilities:

\[
\begin{cases}
\Delta t(u_{i,1}-u_{i,1}^{\mathrm{set}}), & \mathbb{P}(t_{i}^{\mathrm{opt}}=1)=1-p_{i,1}\\
\quad\cdots & \quad\cdots\\
\Delta t\sum_{t=1}^{T}(u_{i,t}-u_{i,t}^{\mathrm{set}}), & \mathbb{P}(t_{i}^{\mathrm{opt}}=T)=(1-p_{i,T})\prod_{t=1}^{T-1}p_{i,t}\\
\Delta t\sum_{t=1}^{T}(u_{i,t}-u_{i,t}^{\mathrm{set}})-\rho_{i}, & \mathbb{P}(t_{i}^{\mathrm{opt}}=\text{not})=\prod_{t=1}^{T}p_{i,t}
\end{cases}
\]

where “$t_{i}^{\mathrm{opt}}=\text{not}$” means that customer $i$ does not opt out at any time. The expectation term is then computed analytically by summing over all these cases, which leads to (17) plus the constant term $\Delta t\sum_{t=1}^{T}u_{i,t}^{\mathrm{set}}+\rho_{i}$.
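As a sanity check on this derivation, the short sketch below enumerates the realizations listed above for a small horizon and compares the result with the closed form obtained by collecting the coefficient of each $(u_{i,t}-u_{i,t}^{\mathrm{set}})$ term, namely $\Delta t\sum_{t=1}^{T}(u_{i,t}-u_{i,t}^{\mathrm{set}})\prod_{\tau=1}^{t-1}p_{i,\tau}-\rho_{i}\prod_{t=1}^{T}p_{i,t}$; the numerical values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
T, dt, rho = 4, 0.25, 2.0
u = rng.uniform(0.2, 0.8, T)        # controlled AC power (arbitrary values)
u_set = np.full(T, 1.0)             # baseline power under the default setpoint
p = rng.uniform(0.85, 0.99, T)      # stay-in probabilities p_{i,t}

# Enumerate the opt-out time t_opt = 1, ..., T and the "never opts out" case.
expectation = 0.0
for k in range(1, T + 1):           # customer opts out at period k
    prob = np.prod(p[: k - 1]) * (1 - p[k - 1])
    expectation += prob * dt * np.sum(u[:k] - u_set[:k])
expectation += np.prod(p) * (dt * np.sum(u - u_set) - rho)   # never opts out

# Closed form: coefficient of (u_t - u_set_t) is P(t_opt >= t) = prod_{tau<t} p_tau.
survival = np.concatenate(([1.0], np.cumprod(p[:-1])))
closed_form = dt * np.sum((u - u_set) * survival) - rho * np.prod(p)

assert np.isclose(expectation, closed_form)
```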

References

  • [1] Federal Energy Regulatory Commission, 2019 Assessment of Demand Response and Advanced Metering, Dec. 2019.
  • [2] U.S. Department of Energy, Benefit of Demand Response in Electricity Market and Recommendations for Achieving Them, Feb. 2006.
  • [3] U.S. Energy Information Administration, Electric Power Annual, USA, Oct. 18, 2019.
  • [4] A. R. Jordehi, “Optimisation of demand response in electric power systems, a review,” Renew. Sustain. Energy Rev., vol. 103, pp. 308-319, Apr. 2019.
  • [5] M. Muratori and G. Rizzoni, “Residential demand response: dynamic energy management and time-varying electricity pricing,” IEEE Trans. on Power Syst., vol. 31, no. 2, pp. 1108-1117, Mar. 2016.
  • [6] Q. Hu, F. Li, X. Fang and L. Bai, “A framework of residential demand aggregation with financial incentives,” IEEE Trans. on Smart Grid, vol. 9, no. 1, pp. 497-505, Jan. 2018.
  • [7] Y. Li and N. Li, “Mechanism design for reliability in demand response with uncertainty,” in Proc. Amer. Control Conf. (ACC), Seattle, WA, USA, May 2017, pp. 3400–3405.
  • [8] X. Xu, C. Chen, X. Zhu, and Q. Hu, “Promoting acceptance of direct load control programs in the United States: Financial incentive versus control option,” Energy, vol. 147, pp. 1278-1287, Mar. 2018.
  • [9] M. J. Fell, D. Shipworth, G. M. Huebner, C. A. Elwell, “Public acceptability of domestic demand-side response in Great Britain: The role of automation and direct load control”, Energy Res. Soc. Sci., vol. 9, pp. 72-84, Sep. 2015.
  • [10] X. Xu, A. Maki, C. Chen, B. Dong, and J. K. Day, “Investigating willingness to save energy and communication about energy use in the American workplace with the attitude-behavior-context model”, Energy Res. Soc. Sci., vol. 32, pp. 13-22, Oct. 2017.
  • [11] Q. Shi, C. Chen, A. Mammoli and F. Li, “Estimating the profile of incentive-based demand response (IBDR) by integrating technical models and social-behavioral factors,” IEEE Trans. on Smart Grid, vol. 11, no. 1, pp. 171-183, Jan. 2020.
  • [12] J. R. Vázquez-Canteli and Z. Nagy, “Reinforcement learning for demand response: A review of algorithms and modeling techniques,” Applied Energy, vol. 235, pp. 1072-1089, Feb. 2019.
  • [13] X. Chen, G. Qu, Y. Tang, S. Low and N. Li, “Reinforcement learning for decision-making and control in power systems: tutorial, review, and vision,” ArXiv Preprint, arXiv:2102.01168, 2021.
  • [14] Z. Wen, D. O’Neill and H. Maei, “Optimal demand response using device-based reinforcement learning,” IEEE Trans. on Smart Grid, vol. 6, no. 5, pp. 2312-2324, Sept. 2015.
  • [15] F. Alfaverh, M. Denaï, and Y. Sun, “Demand response strategy based on reinforcement learning and fuzzy reasoning for home energy management,” IEEE Access, vol. 8, pp. 39310-39321, 2020.
  • [16] R. Lu, S. H. Hong, and M. Yu, “Demand response for home energy management using reinforcement learning and artificial neural network,” IEEE Trans. on Smart Grid, vol. 10, no. 6, pp. 6629-6639, Nov. 2019.
  • [17] H. Li, Z. Wan, and H. He, “Real-time residential demand response,” IEEE Trans. Smart Grid, vol. 11, no. 5, pp. 4144-4154, Sep. 2020.
  • [18] F. Ruelens, B. J. Claessens, S. Vandael, and et al., “Residential demand response of thermostatically controlled loads using batch reinforcement learning,” IEEE Trans. on Smart Grid, vol. 8, no. 5, pp. 2149-2159, Sept. 2017.
  • [19] R. Lu, S. H. Hong, “Incentive-based demand response for smart grid with reinforcement learning and deep neural network,” Applied Energy, vol. 236, pp. 937-949, Feb. 2019.
  • [20] X. Chen, Y. Nie and N. Li, “Online residential demand response via contextual multi-armed bandits,” IEEE Contr. Syst. Lett., vol. 5, no. 2, pp. 433-438, Apr. 2021.
  • [21] Y. Li, Q. Hu, and N. Li, “A reliability-aware multi-armed bandit approach to learn and select users in demand response,” Automatica, vol. 119, Sept. 2020.
  • [22] C. E. Rasmussen, and C. K. I. Williams. Gaussian processes for machine learning. MIT Press, Cambridge, MA, 2006.
  • [23] S. Menard, Applied logistic regression analysis, vol. 106, Sage, 2002.
  • [24] D. J. Russo, B. V. Roy, A. Kazerouni, I. Osband, and Z. Wen, “A tutorial on Thompson sampling”, Found. Trends Mach. Learn., vol. 11, no. 1, pp. 1-96, 2018.
  • [25] A. Slivkins, “Introduction to multi-armed bandits,” arXiv preprint, arXiv:1904.07272, 2019.
  • [26] D. Russo, and B. V. Roy, “Learning to optimize via posterior sampling”, Math. Oper. Res., vol. 39, no. 4, pp. 1221-1243, Apr. 2014.
  • [27] Hall, C. Rasmussen, and J. Maciejowski, “Modelling and control of nonlinear systems using Gaussian processes with partial model information,” in Proc. IEEE Conf. on Decision and Control (CDC), pp. 5266–5271, Dec. 2012.
  • [28] D. P. Bertsekas, Dynamic Programming and Optimal Control: Volume I. 3rd Edition, Athena Scientific Belmont, MA, 2005.
  • [29] D. Liu, Y. Sun, Y. Qu, B. Li and Y. Xu, “Analysis and accurate prediction of user’s response behavior in incentive-based demand response,” IEEE Access, vol. 7, pp. 3170-3180, 2019.
  • [30] N. Li, L. Chen, and S. Low, “Optimal demand response based on utility maximization in power networks,” in Proc. IEEE PES General Meeting, 2011.
  • [31] I. Necoara, V. Nedelcu, “On linear convergence of a distributed dual gradient algorithm for linearly constrained separable convex problems,” Automatica, vol. 55, pp. 209-216, May 2015.
  • [32] IPOPT Solver. [Online]. Available: https://coin-or.github.io/Ipopt/.
  • [33] T. S. Jaakkola and M. I. Jordan, “A variational approach to Bayesian logistic regression models and their extensions,” in Sixth Intern. Workshop Artif. Intel. Stat., vol. 82, pp. 4, 1997.
  • [34] N. G. Polson, J. G. Scott, J. Windle, “Bayesian inference for logistic models using Pólya–Gamma latent variables,” J. Ameri. Stat. Assoc., vol. 108, no. 504, pp. 1339-1349, Dec. 2013.
  • [35] S. Shalev-Shwartz, “Online learning and online convex optimization,” Found. Trends Mach. Learn., vol. 4, no. 2, pp. 107–194, 2012.
  • [36] C. E. Rasmussen and H. Nickisch, “Gaussian processes for machine learning (gpml) toolbox,” J. Mach. Learn. Technol., vol. 11, pp. 3011-3015, Nov. 2010.