Online Learning and Distributed Control for Residential Demand Response
Abstract
This paper studies automated control methods for regulating air conditioner (AC) loads in incentive-based residential demand response (DR). The critical challenge is that customer responses to load adjustment are uncertain and unknown in practice. In this paper, we formulate the AC control problem in a DR event as a multi-period stochastic optimization that integrates the indoor thermal dynamics and the customer opt-out status transition. Specifically, machine learning techniques, including Gaussian process regression and logistic regression, are employed to learn the unknown thermal dynamics model and the customer opt-out behavior model, respectively. We consider two typical DR objectives for AC load control: 1) minimizing the total demand and 2) closely tracking a regulated power trajectory. Based on the Thompson sampling framework, we propose an online DR control algorithm that learns customer behaviors and makes real-time AC control decisions. This algorithm accounts for the influence of various environmental factors on customer behaviors and is implemented in a distributed fashion to preserve customer privacy. Numerical simulations demonstrate the control optimality and learning efficiency of the proposed algorithm.
Index Terms:
Online learning, uncertain customer behavior, distributed algorithm, incentive-based demand response.

I Introduction
Due to increasing renewable generation and growing peak load, electric power systems increasingly face deficiencies of reserve capacity. As a typical example, in mid-August 2019, the Texas grid experienced record electricity demand and a severe reserve emergency caused by a heat wave and reduced wind generation. The electricity price soared to $9/kWh at one point, and the Electric Reliability Council of Texas (ERCOT) issued a Level 1 Energy Emergency Alert to call upon voluntary energy conservation and all available generation sources [1]. To cope with such problems, demand response (DR) is an economical and sustainable solution that strategically motivates load adjustment from end users to meet the needs of power supply [2]. In particular, residential loads account for a large share of total electricity usage in the U.S. [3] and can release significant power flexibility to facilitate system operation through coordinated dispatch. Moreover, the widespread deployment of advanced meters, smart plugs, and built-in controllers enables remote monitoring and control of electric appliances with two-way communications between households and load service entities (LSEs). This makes it technically feasible to implement residential DR, and well-designed DR control algorithms are needed to fully exploit the potential flexibility.
The mechanisms for residential DR are mainly categorized as “price-based” and “incentive-based” [4]. Price-based DR programs [5] use various pricing schemes, such as time-of-use pricing, critical peak pricing, and real-time pricing, to influence and guide electricity usage. In incentive-based DR programs [6, 7], the LSEs recruit customers to adjust their load demands in DR events with financial incentives, e.g., cash, coupons, raffles, rebates, etc. A typical residential DR event consists of two periods: 1) the preparation period, when the LSEs anticipate an upcoming load peak or system emergency and call upon customers to participate with incentives (day-ahead or hours-ahead); 2) the load adjustment period, when the electric appliances of participating customers are controlled to achieve certain DR goals. Meanwhile, customers are allowed to opt out (e.g., by clicking the “opt out” button in a smartphone app) if unsatisfied and override the control commands from the LSEs. In the end, the LSEs pay customers according to their actual contributions. In practice, the load adjustment period usually lasts for a few hours, and the control cycle of electric appliances varies from 5 to 30 minutes depending on the actual DR setting, which is enabled by advanced meters with second or minute sampling rates. This paper focuses on the real-time control of electric appliances during the load adjustment period from the perspective of LSEs.
According to the investigations in [8, 9], offering the override (opt-out) option can greatly enhance the customers’ acceptance of direct load control, which is even more effective than financial incentives. Hence, customers are generally authorized to have the opt-out option in modern residential DR programs. However, the customer opt-out behaviors are uncertain and unknown to LSEs in practice, which brings significant challenges to the real-time DR control. References [8, 9, 10, 11] indicate that customer DR behaviors are influenced by individual preference and environmental factors. The individual preference relates to customers’ intrinsic socio-demographic characteristics, e.g. income, education, age, attitude to energy saving, etc. The environmental factors refer to real-time externalities such as electricity price, indoor temperature, offered incentive, weather conditions, etc. Without considering customers’ actual willingness, a blind DR control scheme may lead to high opt-out rates and inefficient load adjustment.
To address the uncertainty issue, data-driven learning techniques can be employed to learn customer DR behaviors from historical data and through online interactions and observations. Comprehensive reviews on the application of reinforcement learning (RL) for DR are provided in [12, 13]. References [14, 15, 16] design home energy management systems to optimally schedule electric appliances under time-varying electricity prices, where Q-learning is used to learn customer preferences and make rescheduling decisions. In [17], a real-time DR strategy is presented for optimal arrangement of home appliances based on deep RL and policy search algorithms, considering the uncertainty of the resident’s behavior, electricity price, and outdoor temperature. Reference [18] applies batch RL to dispatch thermostatically controlled loads (TCLs) with the exploitation of exogenous data for day-ahead scheduling. Reference [19] proposes an incentive-based DR algorithm using RL and deep neural networks to assist LSEs in designing the optimal incentive rates under uncertain energy prices and demands. In [20, 21], the multi-armed bandit method and its variants are adopted to select the right customers for DR participation in the preparation period to deal with unknown customer responses. However, for real-time load control in incentive-based DR, most existing works do not consider or overly simplify the uncertainty of customer behaviors, and the influence of various environmental factors is generally neglected. This causes significant mismatches between theory and practice. Hence, the development of real-time DR control algorithms that account for customer opt-out behaviors remains largely an open problem.
Contribution. This paper studies incentive-based residential DR programs that control air conditioner (AC) loads to optimize certain DR performance metrics, e.g., minimizing the total AC load or closely tracking a target power trajectory. To this end, we propose a novel framework that models the real-time DR control as a multi-period stochastic optimization problem integrating thermal dynamics and customer behavior transitions. In particular, Gaussian process (GP) regression [22] is adopted to build a non-parametric indoor thermal dynamics model from historical metering data, and logistic regression [23] is used to model the customer opt-out behaviors under the influence of environmental factors. Based on the Thompson sampling (TS) framework [24], we develop a distributed online DR control algorithm to learn the customer opt-out behaviors and make real-time AC power control decisions. The main merits of the proposed algorithm are summarized as follows:
1) The individual preferences of customers and time-varying environmental factors are taken into account, which improves the predictions of customer opt-out behaviors and leads to efficient AC control schemes.
2) The algorithm is implemented in a distributed manner and thus can be directly embedded in local household AC appliances, smart plugs, or smartphone apps. Moreover, the communication burdens are mitigated and customer privacy can be preserved.
3) Inheriting the merits of TS, the algorithm has a convenient decomposition structure of learning and optimization, and strikes an effective balance between exploration and exploitation in the online learning process.
The remainder of this paper is organized as follows: Section II provides a preliminary introduction on GP and TS. Section III presents the optimal AC control models with the learning techniques. Section IV develops the distributed online learning and AC control algorithm. Numerical tests are carried out in Section V, and conclusions are drawn in Section VI.
II Preliminaries on Learning Techniques
This section provides a preliminary introduction on the two key learning techniques used in this paper, i.e., Gaussian process and Thompson sampling.
II-A Gaussian Process
Gaussian process is a non-parametric supervised machine learning method [22] that has been widely used to model nonlinear system dynamics [27]. A formal definition of a GP over a function $f(\cdot)$ is that any finite number of function realizations are random variables that follow a joint Gaussian distribution, which is fully specified by the mean function $m(x)$ and the (kernel) covariance function $k(x, x')$. Consider learning an unknown function $f(\cdot)$ based on a training dataset of $N$ noisy observations, i.e., $\mathcal{D} = \{(x_j, y_j)\}_{j=1}^{N}$ with $y_j = f(x_j) + \epsilon_j$ and $\epsilon_j \sim \mathcal{N}(0, \sigma_n^2)$. It aims to infer the function value $f(x_*)$ for a new point $x_*$. Denote $X := [x_1, \cdots, x_N]$ and $y := [y_1, \cdots, y_N]^\top$. By the GP definition, $(y, f(x_*))$ are assumed to be random variables that follow a joint Gaussian distribution (with a zero-mean prior)

$\begin{bmatrix} y \\ f(x_*) \end{bmatrix} \sim \mathcal{N}\left( 0, \begin{bmatrix} K + \sigma_n^2 I & k_* \\ k_*^\top & k(x_*, x_*) \end{bmatrix} \right),$

where vector $k_* := [k(x_*, x_1), \cdots, k(x_*, x_N)]^\top$ and $K$ is the covariance matrix whose $(j, l)$-component is $k(x_j, x_l)$. Conditioning on the given observations $y$, it is known that the posterior distribution of $f(x_*)$ is also Gaussian, i.e., $f(x_*) \,|\, y \sim \mathcal{N}(\mu_*, \sigma_*^2)$, with the closed form

$\mu_* = k_*^\top (K + \sigma_n^2 I)^{-1} y,$  (1a)
$\sigma_*^2 = k(x_*, x_*) - k_*^\top (K + \sigma_n^2 I)^{-1} k_*.$  (1b)
Then the mean value (1a) can be used as the prediction of $f(x_*)$, and the variance (1b) provides a confidence estimate for this prediction. The merits of GP include: 1) GP is a non-parametric method that avoids the bias of model selection; 2) GP works well with small datasets; 3) GP can incorporate prior domain knowledge by defining priors on hyperparameters or using a particular covariance function. The major issue of GP is the computational complexity, which scales cubically with the number of observations, i.e., $\mathcal{O}(N^3)$.
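To make the posterior formulas (1a)-(1b) concrete, the following minimal sketch (not part of the paper; it assumes a squared-exponential kernel with fixed hyperparameters and Gaussian observation noise) computes the GP posterior mean and variance with plain NumPy:

```python
import numpy as np

def rbf_kernel(A, B, length=1.0, sigma_f=1.0):
    """Squared-exponential covariance k(x, x') between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_f ** 2 * np.exp(-0.5 * d2 / length ** 2)

def gp_posterior(X_train, y_train, X_test, noise=0.1, length=1.0, sigma_f=1.0):
    """Posterior mean and variance of f(X_test), i.e., equations (1a)-(1b)."""
    K = rbf_kernel(X_train, X_train, length, sigma_f) + noise ** 2 * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test, length, sigma_f)        # cross-covariances k_*
    K_ss = rbf_kernel(X_test, X_test, length, sigma_f)
    alpha = np.linalg.solve(K, y_train)                       # (K + sigma_n^2 I)^{-1} y
    mean = K_s.T @ alpha                                      # posterior mean (1a)
    var = np.diag(K_ss) - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)  # variance (1b)
    return mean, var

# Toy usage: learn y = sin(x) from 20 noisy samples and predict on a grid.
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(20, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(20)
mu, var = gp_posterior(X, y, np.linspace(0, 5, 50)[:, None])
```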
II-B Thompson Sampling
Thompson sampling [24] is a prominent Bayesian learning framework that was originally developed to solve the multi-armed bandit (MAB) problem [25] and can be extended to tackle other online learning problems. Consider a classical $T$-period MAB problem where an agent selects an action (called an “arm”) $a_t$ from the action set $\mathcal{A}$ at each time $t = 1, \cdots, T$. After taking action $a_t$, the agent observes an outcome $y_t$ that is randomly generated from a conditional distribution $q_\theta(\cdot \,|\, a_t)$, and then obtains a reward $r_t = r(y_t)$ with a known reward function $r(\cdot)$. The agent is initially uncertain about the parameter $\theta$ in $q_\theta$ but aims to maximize the total expected reward using the observation feedback. To achieve good performance, it is generally required to take actions with an effective balance between 1) exploring poorly-understood actions to gather new information that may improve future reward and 2) exploiting what is known to maximize the immediate reward.
TS is a straightforward online learning algorithm that strikes an effective balance between exploration and exploitation. As shown in Algorithm 1, TS treats the unknown $\theta$ as a random variable and represents the initial belief on $\theta$ with a prior distribution. At each time $t$, TS draws a random sample $\hat{\theta}_t$ from the current distribution (Step 3), then takes the optimal action $a_t$ that maximizes the expected reward based on the sample $\hat{\theta}_t$ (Step 4). After the outcome $y_t$ is observed, the Bayesian rule is applied to update the distribution over $\theta$ and obtain the posterior (Step 5).
The main features of the TS algorithm are listed below:
• As outcomes accumulate, the predefined prior distribution is washed out and the posterior converges to the true distribution or concentrates on the true value of $\theta$.
• The TS algorithm encourages exploration through the random sampling (Step 3). As the posterior distribution gradually concentrates, less exploration and more exploitation are performed, which leads to an effective balance.
• TS admits theoretical performance guarantees in terms of regret bounds for a broad class of online learning problems [26].
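As a concrete illustration of Steps 3-5 (not from the paper; it assumes a Bernoulli bandit with conjugate Beta priors, the textbook special case), the following sketch runs Thompson sampling on a toy multi-armed bandit:

```python
import numpy as np

def thompson_sampling_bernoulli(true_means, T=1000, seed=0):
    """Thompson sampling for a Bernoulli bandit with Beta(1, 1) priors."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    alpha = np.ones(n_arms)                        # 1 + observed successes per arm
    beta = np.ones(n_arms)                         # 1 + observed failures per arm
    rewards = []
    for _ in range(T):
        theta = rng.beta(alpha, beta)              # Step 3: sample from the posterior
        arm = int(np.argmax(theta))                # Step 4: act greedily on the sample
        r = float(rng.random() < true_means[arm])  # observe a Bernoulli outcome
        alpha[arm] += r                            # Step 5: Bayesian posterior update
        beta[arm] += 1.0 - r
        rewards.append(r)
    return np.array(rewards)

# The average reward approaches the best arm's mean as the posterior concentrates.
print(thompson_sampling_bernoulli([0.3, 0.5, 0.7], T=2000).mean())
```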
III Problem Formulation
Consider the residential DR program that controls AC power consumption for load adjustment, where a system aggregator (SA) interacts with $N$ residential customers over sequential DR events. Each DR event (to avoid confusion, in the following text we specifically refer to the load adjustment period when referring to a “DR event”) is formulated as a finite time horizon $\mathcal{T} := \{1, 2, \cdots, T\}$ with the control time gap $\Delta t$. Depending on the practical AC control cycle, $\Delta t$ may be different (e.g., 5 minutes or 10 minutes) across DR events (generally, the control time gap should be larger than the metering data collection period, which is enabled by current smart meters with second or minute sampling rates), and the total time length could also be different (e.g., 2 hours or 3 hours). The SA aims to learn customers’ opt-out behaviors and make real-time AC control decisions to optimize aggregate DR performance. In this paper, we focus on the control of household mini-split AC units, while the proposed method is applicable to heating, ventilation and air conditioning (HVAC) systems, space heaters, and other TCLs. Moreover, the proposed method works for both the AC heating and cooling cases, treated separately.
In this section, we first establish the AC control model for each individual customer during a DR event, then use Gaussian process regression and logistic regression to learn the unknown thermal dynamics and customer opt-out behaviors, respectively. With all these in place, two system-level optimization models with different DR objectives are built for the SA to generate optimal AC power control schemes over the $N$ customers.
III-A Customer-Side AC Control Model During A DR Event
III-A1 Decision Variable
During a DR event, denote $p_i^t$ as the AC power consumption of customer $i$ at time $t$, which is the decision variable and satisfies

$0 \le p_i^t \le \bar{p}_i, \quad |p_i^t - p_i^{t-1}| \le \Delta p_i, \quad \forall t \in \mathcal{T},$  (2)

where $\bar{p}_i$ is the rated AC power capacity and $\Delta p_i$ denotes the AC power drift limit that prevents dramatic changes. In (2), $p_i^t$ is continuously controllable, since it indeed denotes the average AC power during the time interval $[t, t+\Delta t)$, which can be realized by appropriately adjusting the AC cycling rate, i.e., the time ratio of the on status over $\Delta t$. Nevertheless, a discrete AC control model for customer $i$ with $p_i^t \in \{0, \bar{p}_i\}$ can be applied as well, where the AC unit switches between the on ($p_i^t = \bar{p}_i$) and off ($p_i^t = 0$) modes.
III-A2 Opt-Out Status and Transition
Denote the binary variable $s_i^t$ as the opt-out status of customer $i$ at time $t$, which equals $1$ if customer $i$ stays in and $0$ if it opts out. Initialize $s_i^1 = 1$ for all customers at the beginning of a DR event. As mentioned above, customer opt-out behaviors are influenced by various environmental factors. Accordingly, the binary opt-out statuses of each customer $i$ are modelled as a time series of random variables $(s_i^t)_{t \in \mathcal{T}}$ that are independent of other customers and follow the transition probability (3):

$\Pr\big(s_i^{t+1} = 0 \,\big|\, s_i^t = 0\big) = 1,$  (3a)
$\Pr\big(s_i^{t+1} = 1 \,\big|\, s_i^t = 1, e_i^t\big) = f_i(e_i^t).$  (3b)

Here, $e_i^t$ is the column vector that collects the environmental factors at time $t$, which will be elaborated in the next part. $f_i(\cdot)$ is the transition probability function that captures how environmental factors affect the customer opt-out behaviors.
[Figure 1: Illustration of the customer opt-out status transition (3).]
As illustrated in Figure 1, equation (3a) enforces the DR rule that once customer $i$ opts out at a certain time, this customer remains opted out for the rest of the current DR event. Equation (3b) indicates that if customer $i$ stays in at time $t$, this customer may remain stay-in at the next time $t+1$ with probability $f_i(e_i^t)$, or choose to opt out with probability $1 - f_i(e_i^t)$. We further explain the transition model (3) with the following remark.
Remark 1.
The opt-out status transition model (3) exhibits the Markov property, where the transition probability is a function of the environmental factors $e_i^t$ at time $t$. This facilitates the subsequent development and solution of the optimal AC control models, but does not sacrifice modeling generality, because one is free to choose suitable environmental factors so that all the useful information is captured in $e_i^t$ and the Markov property is preserved. Basically, by including all the necessary known information in an enlarged state at time $t$, which is known as state augmentation, any multi-period control problem can generally be modelled as a Markov decision process [28]. See the next part for the detailed definition and selection of the environmental factors.
III-A3 Environmental Factors
Based on the empirical investigations in [8, 9, 10, 11], we present below several key environmental factors that influence customers’ opt-out behaviors. In particular, the first three factors are affected by the AC control scheme, thus their dynamics models are introduced as well.
(Indoor Temperature). Denote $T_i^t$ as the indoor temperature of customer $i$ at time $t$. The associated thermal dynamics model can be formulated as

$T_i^{t+1} = F_i\big(T_i^t, p_i^t, T_{\mathrm{out}}^t\big), \quad \forall t \in \mathcal{T},$  (4)

where $T_{\mathrm{out}}^t$ is the outdoor temperature at time $t$, $T_i^1$ is the initial indoor temperature at the beginning of the DR event, and $F_i(\cdot)$ denotes the thermal dynamics function.
(Accumulated Thermal Discomfort). We define $D_i^t$ as the accumulated thermal discomfort of customer $i$ at time $t$, and let it follow the dynamics (5) with $D_i^0 = 0$:

$D_i^t = D_i^{t-1} + \big(\max\{T_i^t - T_i^{\mathrm{set}}, 0\}\big)^2 \Delta t, \quad \forall t \in \mathcal{T},$  (5)

where $T_i^{\mathrm{set}}$ denotes the default AC setting temperature of customer $i$. The operator $\max\{\cdot, 0\}$ takes the larger value between $T_i^t - T_i^{\mathrm{set}}$ and $0$, which means that only an indoor temperature higher than the setpoint causes thermal discomfort in the AC cooling case in summer. Besides, the quadratic form in (5) captures that the thermal discomfort increases faster as the temperature deviation becomes larger [29].
(Incentive Credit). Denote $r_i^t$ as the incentive credit offered to customer $i$ at time $t$. We consider the general incentive scheme (6), while other incentive schemes can be used as well:

$r_i^t = r_i^{\mathrm{b}} + c_1 \bar{p}_i\, t \Delta t + c_2 \sum_{\tau=1}^{t} \big(\hat{p}_i^\tau - p_i^\tau\big) \Delta t, \quad \forall t \in \mathcal{T}.$  (6)

In (6), the first term $r_i^{\mathrm{b}}$ is the base credit for DR participation. The second term is the stay-in bonus that is proportional to time and the AC power capacity $\bar{p}_i$ with coefficient $c_1$. The third term is the reimbursement for the actual load adjustment with credit coefficient $c_2$, where $\hat{p}_i^t$ is the associated AC power at time $t$ needed to maintain the setting temperature $T_i^{\mathrm{set}}$.
Essentially, $\hat{p}_i^t$ can be regarded as the baseline AC power consumption of customer $i$ when no DR control is implemented. With the thermal dynamics model (4) and given $T_{\mathrm{out}}^t$, one can compute $\hat{p}_i^t$ by solving equation (7):

$F_i\big(T_i^{\mathrm{set}}, \hat{p}_i^t, T_{\mathrm{out}}^t\big) = T_i^{\mathrm{set}},$  (7)

which follows from the definition of maintaining the default indoor setting temperature $T_i^{\mathrm{set}}$.
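As a quick illustration (a sketch, not from the paper, assuming a two-coefficient linear thermal model of the form used in Section III-B), equation (7) admits a closed-form baseline power:

$T_i^{\mathrm{set}} = T_i^{\mathrm{set}} + \alpha_i \big(T_{\mathrm{out}}^t - T_i^{\mathrm{set}}\big) + \beta_i \hat{p}_i^t \;\;\Longrightarrow\;\; \hat{p}_i^t = -\frac{\alpha_i}{\beta_i}\big(T_{\mathrm{out}}^t - T_i^{\mathrm{set}}\big),$

which is positive in the cooling case ($\beta_i < 0$ and $T_{\mathrm{out}}^t > T_i^{\mathrm{set}}$). For the non-parametric GP model, (7) can instead be solved numerically, e.g., by a one-dimensional root search over $\hat{p}_i^t$.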
Other key environmental factors that would influence customers’ opt-out behaviors include real-time electricity prices, weather conditions, duration of the DR event, fatigue effect, etc. These factors are treated as given parameters that can be obtained or predicted ahead of time. Accordingly, the vector $e_i^t$ in (3b) can be defined as the combination of the environmental factors mentioned above:

$e_i^t := \big(T_i^t,\; D_i^t,\; r_i^t,\; T_{\mathrm{out}}^t,\; \pi^t,\; \cdots\big),$  (8)

where $\pi^t$ denotes the real-time electricity price and “$\cdots$” collects the other factors.
Remark 2.
We note that the definition and selection of useful environmental factors are complex and tricky in practice. For instance, the definition of $e_i^t$ in (8) only contains the present status at time $t$, while past values and future predictions may also be included in $e_i^t$ to capture temporal dependence. This is related to the feature engineering problem in machine learning, which is expected to be conducted based on real data and to make a trade-off between complexity and effectiveness. Nevertheless, the proposed learning and AC control method is a general framework that is applicable to different choices of the environmental factors.
One critical issue for the residential AC control is that the thermal dynamics function $F_i(\cdot)$ in (4) and the customer opt-out behavior function $f_i(\cdot)$ in (3b) are generally unknown. To address this issue, learning techniques are used to estimate the unknown models with real data, as presented in the following two subsections.
III-B Learning for Thermal Dynamics Model
The practical implementation of AC control for residential DR is generally achieved through smart plugs or smart AC units with built-in controllers and sensors. These smart devices are able to measure, store, and communicate the temperature and AC power data in real time. Hence, the thermal dynamics model (4) can be estimated based on fine-grained historical measurement data. To this end, we provide the following two thermal model estimation schemes.
III-B1 Linear Model
Given a time series of historical indoor/outdoor temperature and AC power data, one can fit the classical linear thermal dynamics model (9) [30] and obtain the coefficients via linear regression:

$T_i^{t+1} = T_i^t + \alpha_i \big(T_{\mathrm{out}}^t - T_i^t\big) + \beta_i p_i^t,$  (9)

where coefficients $\alpha_i$ and $\beta_i$ specify the thermal characteristics of the room with respect to the ambient environment and the AC, respectively. A positive (negative) $\beta_i$ indicates that the AC works in the heating (cooling) mode.
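A minimal fitting sketch (assuming the two-coefficient form of (9) and aligned, equally spaced time series; the array names are illustrative) estimates $(\alpha_i, \beta_i)$ by ordinary least squares:

```python
import numpy as np

def fit_linear_thermal(T_in, p_ac, T_out):
    """Least-squares fit of the linear thermal model (9):
    T_in[t+1] - T_in[t] = alpha * (T_out[t] - T_in[t]) + beta * p_ac[t]."""
    X = np.column_stack([T_out[:-1] - T_in[:-1], p_ac[:-1]])
    y = np.diff(T_in)
    (alpha, beta), *_ = np.linalg.lstsq(X, y, rcond=None)
    return alpha, beta          # beta < 0 indicates the cooling mode

def predict_next(alpha, beta, T_now, p_now, T_out_now):
    """One-step-ahead indoor temperature prediction with the fitted coefficients."""
    return T_now + alpha * (T_out_now - T_now) + beta * p_now
```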
III-B2 Gaussian Process Model
An alternative scheme is to employ the Gaussian process method (introduced in Section II) to model the thermal dynamics as (10), which can capture the nonlinearity in the data pattern:
$T_i^{t+1} = k\big(x_i^t, X_h\big)^\top \big(K_h + \sigma_n^2 I\big)^{-1} y_h,$  (10)

where $x_i^t := \big(T_i^t, p_i^t, T_{\mathrm{out}}^t\big)$, and the notations with subscript $h$ denote the corresponding terms associated with the historical (training) dataset, as presented in (1a).
The main virtue of the linear model (9) lies in its simplicity of implementation and interpretability. In contrast, the non-parametric GP model (10) offers more modeling flexibility, can capture nonlinear relations, and avoids the bias of model class selection, at the cost of computational complexity. Besides, other suitable regression methods can be applied to model the thermal dynamics as well. The choice of model depends on the practical DR requirements on computational efficiency and modeling accuracy. The historical data above refer to the available datasets that have been collected by advanced meters before the DR event, thus the thermal dynamics model can be estimated in an offline fashion. Nevertheless, dynamic regression that uses the latest data to fit an updated model during the DR event is also applicable to further enhance the prediction accuracy.
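For reference, a short sketch of the non-parametric alternative (10) (not the paper's implementation, which uses the GPML toolbox; here scikit-learn's GaussianProcessRegressor is used as a stand-in, with an RBF kernel and a learned noise level):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel

def fit_gp_thermal(T_in, p_ac, T_out):
    """Fit T_in[t+1] = F(T_in[t], p_ac[t], T_out[t]) non-parametrically with a GP."""
    X = np.column_stack([T_in[:-1], p_ac[:-1], T_out[:-1]])
    y = T_in[1:]
    kernel = ConstantKernel() * RBF(length_scale=[1.0, 1.0, 1.0]) + WhiteKernel()
    return GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# One-step-ahead prediction with a confidence estimate (posterior standard deviation):
# gp = fit_gp_thermal(T_in, p_ac, T_out)
# mean, std = gp.predict(np.array([[24.0, 1.2, 33.0]]), return_std=True)
```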
III-C Learning for Customer Opt-Out Behaviors
Since the customer opt-out status is binary, logistic regression [23] is used to model the transition probability function $f_i(\cdot)$ in (3b), because the output of logistic regression is naturally a probability value within $(0, 1)$, and it is easy to implement and interpret. Moreover, logistic regression is compatible with the online Bayesian learning framework with efficient posterior update approaches (see Section IV-C for details). Accordingly, we formulate the transition probability function as the logistic model (11):

$f_i(e_i^t) = \frac{1}{1 + \exp\big(-(w_i^\top e_i^t + b_i)\big)},$  (11)

where $w_i$ is the weight vector describing how customer $i$ reacts to the environmental factors $e_i^t$, and the bias $b_i$ depicts the individual preference. Define $\theta_i := (w_i, b_i)$ and $\tilde{e}_i^t := (e_i^t, 1)$. Then the linear term in (11) becomes $\theta_i^\top \tilde{e}_i^t$. Without causing any confusion, we use $e_i^t$ and $\tilde{e}_i^t$ interchangeably.
As a consequence, the unknown information of customer $i$’s behaviors is summarized as the vector $\theta_i$, which can be estimated from the observations of $(e_i^t, s_i^{t+1})$ in DR events. In contrast to the thermal dynamics model learning, the observation data of customer opt-out statuses are not historically available but can only be obtained along with the real implementation of DR events. This leads to an online customer behavior learning and AC power control problem. Thus we employ the TS framework to develop the online AC control algorithm in Section IV to effectively balance exploration and exploitation.
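The following sketch (not from the paper; it assumes $\theta_i$ stacks the weights $w_i$ with the intercept $b_i$ as its last entry, matching the augmented feature $\tilde{e}_i^t = (e_i^t, 1)$, and the convention that status 1 means stay-in) evaluates the logistic model (11) and simulates the opt-out transition (3) over one DR event:

```python
import numpy as np

def stay_in_prob(theta, e):
    """Logistic stay-in probability (11); theta's last entry is the intercept b."""
    z = np.dot(theta, np.append(e, 1.0))
    return 1.0 / (1.0 + np.exp(-z))

def simulate_opt_out(theta, env_factors, rng=None):
    """Simulate the opt-out status transition (3) over one DR event.
    env_factors: array of shape (T, d), the environmental vectors e^1, ..., e^T."""
    rng = rng or np.random.default_rng()
    s, statuses = 1, []                       # the customer stays in initially
    for e in env_factors:
        if s == 1 and rng.random() > stay_in_prob(theta, e):
            s = 0                             # (3b): opt out with probability 1 - f(e)
        statuses.append(s)                    # (3a): once opted out, stays out
    return np.array(statuses)
```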
III-D System-Level Optimal AC Control Models
In a typical DR setting, once a customer opts out, the AC unit is automatically switched back to the default operation mode with the original customer-set temperature $T_i^{\mathrm{set}}$. Taking the opt-out status into account, the actual AC power consumption $P_i^t$ can be formulated as

$P_i^t = s_i^t\, p_i^t + \big(1 - s_i^t\big)\, \hat{p}_i^t,$  (12)

which equals $p_i^t$ if customer $i$ stays in ($s_i^t = 1$) or $\hat{p}_i^t$ if it opts out ($s_i^t = 0$). Denote $p_i := (p_i^t)_{t \in \mathcal{T}}$, $s_i := (s_i^t)_{t \in \mathcal{T}}$, $e_i := (e_i^t)_{t \in \mathcal{T}}$, $P_i := (P_i^t)_{t \in \mathcal{T}}$, and $x_i := (p_i, s_i, e_i)$.
To simplify the expression, we reformulate the AC control constraints (2), the opt-out status transition (3), (8), (11), and the dynamics of environmental factors (5), (6), (9) or (10), for customer $i$, as the following compact form (13):

$x_i \in \mathcal{F}_i(\theta_i),$  (13)

where $\mathcal{F}_i(\theta_i)$ denotes the corresponding feasible set. Then two system-level optimal control (SOC) models, i.e., (14) and (15), with different DR goals are established for the SA to solve for optimal AC power control schemes over the $N$ customers.
1) SOC-1 model (14) aims to reduce as much AC load as possible in a DR event, which can be used to flatten load peaks or mitigate a reserve deficiency emergency:

$\min_{x}\; \mathbb{E}\left[\sum_{i=1}^{N}\left(\sum_{t=1}^{T} P_i^t\, \Delta t + \sigma\,\big(1 - s_i^{T+1}\big)\right)\right]$  (14a)
$\text{s.t.}\;\; x_i \in \mathcal{F}_i(\theta_i), \quad \forall i = 1, \cdots, N,$  (14b)

where objective (14a) minimizes the expected total AC energy consumption over the DR event, plus an opt-out penalty term at the last time step. $\sigma$ is the penalty coefficient that can be tuned to balance load reduction and opt-out outcomes. $\mathbb{E}[\cdot]$ denotes the expectation taken over the randomness of the customer opt-out statuses $s_i^t$. Constraint (14b) collects the counterparts of (13) for all $N$ customers.
2) SOC-2 model (15) aims to closely track a regulated power trajectory $G := (G^t)_{t \in \mathcal{T}}$, which is determined by the upper-level power dispatch or the DR market:

$\min_{x}\; \mathbb{E}\left[\sum_{t=1}^{T}\left(\sum_{i=1}^{N} P_i^t - G^t\right)^2 + \sigma \sum_{i=1}^{N}\big(1 - s_i^{T+1}\big)\right]$  (15a)
$\text{s.t.}\;\; x_i \in \mathcal{F}_i(\theta_i), \quad \forall i = 1, \cdots, N.$  (15b)

Objective (15a) minimizes the expected total squared power tracking deviation from the global target $G$, plus the same opt-out penalty term defined in (14a). Constraint (15b) is the same as (14b).
Remark 3.
The penalty term in the objectives (14a) and (15a) serves as the final state cost in a finite-horizon control planning problem, which is used to restrict the last control action $p_i^T$. Without this penalty term, the last control action would be too radical, with no regard for the opt-out outcome at time $T+1$, and would lead to frequent final opt-outs. Besides, this penalty term is a useful tool for the SA to make a trade-off between the DR goals and the customer opt-out results through adjusting the coefficient $\sigma$.
The two SOC models (14) and (15) are indeed discrete-time finite-horizon control planning problems, which are in the form of nonconvex stochastic optimization, and the stochasticity results from the probabilistic opt-out status transition (3). We develop the distributed solution methods for the SOC models (14) and (15) in the next section.
IV Distributed Solution and Algorithm Design
For the real-time AC control in a DR event, we pursue a distributed implementation such that
1) the control algorithm can be directly embedded in local home electric appliances or smartphone apps;
2) heavy communication burdens between the SA and households are avoided during the DR event;
3) the private information of customers can be protected.
In this section, we propose the distributed solution methods for the SOC models (14) and (15), then develop the distributed online AC control algorithm based on the TS framework.
IV-A Distributed Solution of SOC-1 Model (14)
Since the opt-out status transition of one customer is assumed to be independent of other customers in (3), objective (14a) in the SOC-1 model has no substantial coupling among different customers. Hence, the SOC-1 model (14) can be equivalently decomposed into $N$ local problems, i.e., model (16) for each customer $i$:

$\min_{x_i}\; \mathbb{E}\left[\sum_{t=1}^{T} P_i^t\, \Delta t + \sigma\,\big(1 - s_i^{T+1}\big)\right]$  (16a)
$\text{s.t.}\;\; x_i \in \mathcal{F}_i(\theta_i).$  (16b)

The sum of objectives (16a) over all customers is essentially objective (14a) in the SOC-1 model, and constraint (16b) is the individual version of (14b) for customer $i$.
The local model (16) is a stochastic optimization with the expectation over $s_i$ in the objective. Since $s_i$ follows the transition (3) with the probability function (11), we can derive the analytic form of the expectation in (16a), which leads to expression (17):

$\sum_{t=1}^{T}\left(\prod_{\tau=1}^{t-1} f_i(e_i^\tau)\right)\big(p_i^t - \hat{p}_i^t\big)\, \Delta t \;+\; \sigma\left(1 - \prod_{\tau=1}^{T} f_i(e_i^\tau)\right).$  (17)

Expression (17) only differs from the expectation in (16a) by the constant term $\sum_{t=1}^{T} \hat{p}_i^t\, \Delta t$, thus they are equivalent in optimization. See Appendix A for the detailed derivation.
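A small numeric sketch of this evaluation (not from the paper; it assumes the customer is surely in at the first step and that the stay-in probabilities govern the transitions between consecutive steps) computes the expected AC energy of a candidate control trajectory:

```python
import numpy as np

def expected_energy(p, p_base, f, dt=1.0 / 12.0):
    """Expected AC energy over a DR event under the transition model (3).
    p      : planned AC power while the customer stays in          (length T)
    p_base : baseline power after an opt-out, cf. (12)             (length T)
    f      : stay-in probabilities f(e^t) from (11)                (length T)
    dt     : control interval in hours (e.g., 5 minutes = 1/12 h)"""
    q = np.concatenate(([1.0], np.cumprod(f[:-1])))   # prob. of still being in at step t
    expected_power = q * p + (1.0 - q) * p_base       # expectation of (12) at each step
    return dt * expected_power.sum()

# Toy check: lowering the planned power reduces the expected energy only as long as
# the stay-in probabilities f (which depend on comfort) do not drop too much.
print(expected_energy(np.array([0.5, 0.5, 0.5]), np.array([1.5, 1.5, 1.5]),
                      np.array([0.95, 0.9, 0.85])))
```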
IV-B Distributed Solution of SOC-2 Model (15)
The objective (15a) in the SOC-2 model has coupling among different customers due to tracking the global power trajectory $G$. To solve this problem distributedly, we introduce a local tracking trajectory $G_i := (G_i^t)_{t \in \mathcal{T}}$ for each customer $i$, with $\sum_{i=1}^{N} G_i^t = G^t$ for all $t \in \mathcal{T}$. Then we substitute $G^t$ by $\sum_{i=1}^{N} G_i^t$ in objective (15a), so that (15a) can be approximated by the decomposable form (18a), which takes the form of a sum over customers. (The approximation is made by dropping the cross terms from the expansion of (15a); these terms are expected to be relatively small and thus negligible.) As a result, the SOC-2 model (15) is modified as (18):
$\min_{x, G}\; \sum_{i=1}^{N} \mathbb{E}\left[\sum_{t=1}^{T}\big(P_i^t - G_i^t\big)^2 + \sigma\,\big(1 - s_i^{T+1}\big)\right]$  (18a)
$\text{s.t.}\;\; x_i \in \mathcal{F}_i(\theta_i), \quad \forall i = 1, \cdots, N,$  (18b)
$\sum_{i=1}^{N} G_i^t = G^t, \quad \forall t \in \mathcal{T},$  (18c)

where each local tracking trajectory $G_i$ is also a decision variable.
Consequently, the only substantial coupling among customers in the modified SOC-2 model (18) is the equality constraint (18c). Therefore, we can introduce the dual variable $\lambda := (\lambda^t)_{t \in \mathcal{T}}$ for the equality constraint (18c) and employ the dual gradient algorithm [31] to solve the modified SOC-2 model (18) in a distributed manner. The specific distributed solution method is presented as Algorithm 2. The implementation of Algorithm 2 needs the two-way communication of $\lambda$ and $G_i$ between the SA and every customer in each iteration. Due to the simple structure with only one equality coupling constraint (18c), Algorithm 2 can converge quickly with appropriate step sizes, which is verified by our simulations.
In each iteration of Algorithm 2, given the latest dual variable $\lambda$, each customer $i$ solves the local problem (19); then the SA updates the dual variable via the gradient step (20) with step size $\alpha_k$:

$\min_{x_i, G_i}\; \mathbb{E}\left[\sum_{t=1}^{T}\big(P_i^t - G_i^t\big)^2 + \sigma\,\big(1 - s_i^{T+1}\big)\right] + \sum_{t=1}^{T} \lambda^t G_i^t$  (19a)
$\text{s.t.}\;\; x_i \in \mathcal{F}_i(\theta_i),$  (19b)

$\lambda^t \leftarrow \lambda^t + \alpha_k \left(\sum_{i=1}^{N} G_i^t - G^t\right), \quad \forall t \in \mathcal{T}.$  (20)
Similar to (17), we can derive the equivalent analytic form (21) for the expectation term in (19a):
$\sum_{t=1}^{T}\left[q_i^t\big(p_i^t - G_i^t\big)^2 + \big(1 - q_i^t\big)\big(\hat{p}_i^t - G_i^t\big)^2\right] + \sigma\big(1 - q_i^{T+1}\big), \quad \text{where } q_i^t := \prod_{\tau=1}^{t-1} f_i(e_i^\tau).$  (21)
By substituting the expectation terms in (16a) and (19a) with their analytic forms (17) and (21), respectively, the local AC control models (16) and (19) become deterministic nonconvex optimization problems. Given parameter $\theta_i$, they can be solved efficiently via available nonlinear optimizer tools, such as the IPOPT solver [32]. For concise expression, we denote the above distributed solution methods together with the optimizer tools as an oracle

$x_i^* = \mathcal{O}_i(\theta_i),$  (22)

which generates the optimal solution $x_i^*$ with the input of parameter $\theta_i$.
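The coordination loop of Algorithm 2 can be sketched as follows (not the paper's code; `local_solve` is a placeholder for each customer's oracle applied to the local problem (19), returning the local tracking trajectory $G_i$ for the current dual prices, and the diminishing step size rule is an assumption consistent with the simulations in Section V):

```python
import numpy as np

def dual_gradient_coordination(local_solve, n_customers, G_target,
                               alpha0=1.0, iters=100, tol=1e-3):
    """Dual gradient coordination for the modified SOC-2 model (18).
    local_solve(i, lam) -> local tracking trajectory G_i (length T) computed by
    customer i for the current dual variable lam; G_target is the global trajectory G."""
    lam = np.zeros(len(G_target))              # dual variable of coupling constraint (18c)
    for k in range(1, iters + 1):
        # SA broadcasts lam; each customer returns its local trajectory G_i
        G_local = np.array([local_solve(i, lam) for i in range(n_customers)])
        residual = G_local.sum(axis=0) - G_target      # violation of sum_i G_i^t = G^t
        lam = lam + (alpha0 / np.sqrt(k)) * residual   # dual gradient step (20)
        if np.linalg.norm(residual) < tol:
            break
    return lam, G_local
```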
IV-C Distributed Online DR Control Algorithm
Based on the TS framework, we develop the distributed online DR control algorithm as Algorithm 3 to learn customer opt-out behaviors and optimally control the AC power in a DR event. Since this online algorithm is implemented distributedly, we present Algorithm 3 from the perspective of an individual customer $i$. The practical implementation of Algorithm 3 is illustrated in Figure 2.
[Figure 2: Practical implementation architecture of the proposed online DR control algorithm (Algorithm 3).]
The variational posterior update rules used in Algorithm 3 are

$\lambda(\xi) = \frac{\tanh(\xi / 2)}{4 \xi},$  (23)
$S_{\mathrm{pos}}^{-1} = S_{\mathrm{pri}}^{-1} + 2 \lambda(\xi)\, \tilde{e}_i^t (\tilde{e}_i^t)^\top, \qquad m_{\mathrm{pos}} = S_{\mathrm{pos}}\left(S_{\mathrm{pri}}^{-1} m_{\mathrm{pri}} + \big(s_i^{t+1} - \tfrac{1}{2}\big)\, \tilde{e}_i^t\right),$  (24)
$\xi^2 = (\tilde{e}_i^t)^\top \big(S_{\mathrm{pos}} + m_{\mathrm{pos}} m_{\mathrm{pos}}^\top\big)\, \tilde{e}_i^t,$  (25)

where $\mathcal{N}(m_{\mathrm{pri}}, S_{\mathrm{pri}})$ and $\mathcal{N}(m_{\mathrm{pos}}, S_{\mathrm{pos}})$ denote the Gaussian belief on $\theta_i$ before and after the update, respectively.
In Algorithm 3, the unknown customer behavior parameter $\theta_i$ is treated as a random variable, and we construct a Gaussian prior distribution for it based on historical information. At each time $t$ of a DR event, a sample $\hat{\theta}_i$ is randomly drawn from the current distribution for decision-making. Two key techniques used in Algorithm 3 are explained as follows.
1) To utilize the latest information and take future time slots into account, we employ the model predictive control (MPC) method in the optimization and action steps. Specifically, at each time $t$, it solves the SOC model (14) or (15) for the rest of the DR event to obtain the optimal AC control scheme, but only implements the first control action. In addition, the latest predictions or estimations of the environmental factors, the updated thermal dynamics model, and the recalculated baseline AC power can be adopted in the optimization step.
2) After observing the outcome pair $(\tilde{e}_i^t, s_i^{t+1})$, the variational Bayesian inference approach introduced in [33] is applied to obtain the posterior distribution on $\theta_i$ with the update rules (23)-(25). It is well known that Bayesian inference for the logistic regression model (11) is an intrinsically hard problem [34], and the exact posterior is intractable to compute. Thus we use the variational approach [33] for efficient Bayesian inference, which provides an accurate Gaussian approximation to the exact posterior with the closed form (24). The scalar $\xi$ in (23)-(25) is an intermediate parameter that affects the approximation accuracy; thus we alternate three times between the posterior update (24) and the $\xi$ update (25), which leads to an optimized approximate posterior. See reference [33] for details.
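The sketch below (not the paper's code) illustrates one DR event of this procedure from a single customer's perspective: a Thompson sampling draw from the Gaussian belief, an MPC action through a placeholder `solve_mpc` interface, and the variational update of [33]; the `observe` callback and the convention of stopping the update after an opt-out are assumptions for illustration.

```python
import numpy as np

def jj_lambda(xi):
    """Variational coefficient lambda(xi) in (23)."""
    return np.tanh(xi / 2.0) / (4.0 * xi)

def variational_update(m, S, x, y, n_inner=3):
    """Gaussian approximate posterior over theta after one logistic observation
    y in {0, 1} with feature x, alternating (24) and (25) a few times."""
    S_inv = np.linalg.inv(S)
    xi = 1.0
    for _ in range(n_inner):
        S_new = np.linalg.inv(S_inv + 2.0 * jj_lambda(xi) * np.outer(x, x))  # (24)
        m_new = S_new @ (S_inv @ m + (y - 0.5) * x)                          # (24)
        xi = np.sqrt(x @ (S_new + np.outer(m_new, m_new)) @ x)               # (25)
    return m_new, S_new

def run_dr_event(m, S, env_seq, solve_mpc, observe, rng=None):
    """One DR event: sample theta (TS), act via MPC, observe, update the belief.
    solve_mpc(theta, t) and observe(p_t) are placeholder interfaces."""
    rng = rng or np.random.default_rng()
    for t, e_t in enumerate(env_seq):
        theta = rng.multivariate_normal(m, S)     # Thompson sampling draw
        p_t = solve_mpc(theta, t)                 # solve (16)/(19), keep the first action
        s_next = observe(p_t)                     # stay-in (1) or opt-out (0) feedback
        m, S = variational_update(m, S, np.append(e_t, 1.0), s_next)
        if s_next == 0:
            break                                 # AC reverts to the default setting
    return m, S
```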
IV-D Performance Measurement
For online learning problems, the notion of “regret” and its variants are standard metrics defined to measure the performance of online learning and decision algorithms [35]. Accordingly, we denote $\theta^{\mathrm{true}}$ as the underlying true customer behavior parameter that the LSEs do not know but aim to learn. Then the regret of the proposed online DR control algorithm at the $k$-th DR event is defined as

$R_k := J_k\big(x_k^{\mathrm{TS}};\, \theta^{\mathrm{true}}\big) - J_k\big(x_k^{*};\, \theta^{\mathrm{true}}\big),$  (26)

where $J_k(\cdot\,; \theta^{\mathrm{true}})$ denotes the objective function (i.e., (14a) in the SOC-1 model or (18a) in the modified SOC-2 model) under the true value $\theta^{\mathrm{true}}$. $x_k^{\mathrm{TS}}$ denotes the AC control scheme generated by the proposed online algorithm, while $x_k^{*}$ is the optimal AC control scheme that minimizes the objective $J_k(\cdot\,; \theta^{\mathrm{true}})$. Thus $R_k$ in (26) is always non-negative and measures the performance distance between the proposed online algorithm and the underlying best control scheme.
To evaluate the overall learning performance, we further define the cumulative regret until the $K$-th DR event as

$\mathrm{CR}_K := \sum_{k=1}^{K} R_k,$  (27)

which is simply the sum of $R_k$ over the first $K$ DR events. Generally, a cumulative regret that is sublinear in $K$ is desired, i.e., $\mathrm{CR}_K / K \to 0$ as $K \to \infty$, since it indicates that $R_k \to 0$ as $k \to \infty$. In other words, it means that the proposed online algorithm can eventually learn the customer behaviors well and make optimal AC control schemes as more and more DR events are experienced. We demonstrate the regret results of the proposed algorithm via numerical simulations in the next section.
V Numerical Simulations
In this section, we first test the performance of the linear thermal dynamics model and the GP model. Then, we implement the proposed distributed online DR control algorithm on the two SOC models.
V-A Indoor Thermal Dynamics Prediction
In this part, we compare the thermal dynamics prediction performance of the linear model (9) and the GP model (10). Real customer metering data, including indoor temperature and AC power, from ThinkEco (ThinkEco Inc. is a New York City-based energy efficiency and demand response solution company; http://www.thinkecoinc.com/) are used for model training and testing. The outdoor temperature data are procured from an open-access meteorological database (Iowa Environmental Mesonet: https://mesonet.agron.iastate.edu/request/download.phtml?network=MA_ASOS). Specifically, we use the time series of data over 5 consecutive days with a resolution of 15 minutes to fit the two thermal dynamics models (9) and (10). The GPML toolbox [36] is applied to implement the GP model and optimize the hyperparameters. Then, the fitted models are tested on the time series of real data in the next 3 days for indoor temperature prediction.
[Figure 3: Indoor temperature prediction results of the GP model and the linear model, one time step (15 minutes) and three time steps (45 minutes) ahead.]
The prediction results of one time step ahead (15 minutes) and three time steps ahead (45 minutes) are presented in Figure 3. The average prediction errors of the indoor temperature are and for the GP model in these two cases, and and for the linear model. It is observed that both the GP model (10) and the linear model (9) work well in the thermal dynamics modelling, and the GP model achieves better prediction accuracy.
V-B Learning and AC Control with SOC-1 Model
V-B1 Simulation Configuration
Each DR event lasts for 3 hours with the AC control period minutes, which implies a time length . The AC capacity and drift limit are set as kW and kW, and the AC setting temperature is . As defined in Section IV-D, we associate each customer with a true behavior parameter $\theta^{\mathrm{true}}$ to simulate the opt-out outcomes, whose values are randomly generated but satisfy several basic rules to be reasonable. For example, if no DR control is implemented (default AC setting), the stay-in probability should be very close to 1; if the indoor temperature reaches a high value such as , the stay-in probability should be very close to 0. The considered environmental factors include the indoor temperature $T_i^t$, accumulated thermal discomfort $D_i^t$, incentive credit $r_i^t$, outdoor temperature $T_{\mathrm{out}}^t$, and a time-varying electricity price, where the first three factors follow the dynamics (4)-(6), respectively, while the electricity price at each time is normalized and randomly generated. Besides, the IPOPT solver [32] is employed to solve the nonconvex optimal AC control models.
V-B2 Control and Learning Performance
Since the aggregation of all local optimal AC control schemes is an optimal solution to the SOC-1 model (14), we simulate the AC control and learning for a single customer over sequential DR events. Given the true parameter $\theta^{\mathrm{true}}$, the optimal AC power trajectory can be computed by solving the local control model (16). Figure 4 illustrates the simulation results associated with this optimal AC control scheme. It is seen that the stay-in probability is maintained close to 1 by the AC control scheme, which tends to keep customers comfortable and not opting out, for the sake of long-term load reduction. Besides, there is a drop of AC power at the end of the DR event, leading to an increased indoor temperature and a decrease in the stay-in probability. Intuitively, that is because a last-minute opt-out will not affect the DR objective much, and thus a radical AC power reduction is conducted. This effect can be mitigated by increasing the penalty coefficient $\sigma$.
[Figure 4: Simulation results of the optimal AC control scheme for one customer.]
In practice, the true customer parameter $\theta^{\mathrm{true}}$ is unknown, and we implement the proposed online algorithm to learn customer behaviors and make AC control decisions in sequential DR events. We compare the performance of the proposed algorithm with two baseline schemes that raise the AC setting temperature by two different fixed amounts, which are typically used in real DR programs. The regret results for one customer over 200 DR events are shown in Figure 5. It is observed that the per-event regret $R_k$ of the proposed algorithm decreases dramatically within the first tens of DR events, then converges to almost zero. As a result, the associated cumulative regret exhibits a clear sublinear trend, which verifies the learning efficiency of the proposed online algorithm. In contrast, the baseline schemes that simply raise the AC setting temperature without consideration of customer opt-out behaviors maintain high regret values. Besides, the average load energy reduction of one customer in a DR event is higher for the proposed algorithm than for the two baseline schemes.
[Figure 5: Regret results of the proposed algorithm and the baseline schemes for one customer over 200 DR events.]
V-C Learning and AC Control with SOC-2 Model
We then apply the proposed online DR control algorithm to the SOC-2 model, where the simulation configurations are the same as in Section V-B. Since the SOC-2 problem involves coordination among all customers, we set the customer number to $N = 500$, and the global tracking target $G^t$ is randomly generated within a prescribed range (kW).
V-C1 Distributed Solution
We apply Algorithm 2 to solve the modified SOC-2 model (18) distributedly over the 500 customers. A diminishing step size is employed to speed up the convergence. For the case with given $\theta$, the convergence results are shown in Figure 6. It is seen that the distributed algorithm converges to the optimal value within tens of iterations. We note that the convergence curve in Figure 6 and the associated AC power values are intermediate computational results that are not executed in practice; only the final converged values are taken as the AC power control scheme and implemented. Figure 7 illustrates the simulation results associated with the converged optimal AC control scheme, including the optimal AC power with the local tracking target, the stay-in probability, and the indoor/outdoor temperature profiles for one customer.
[Figure 6: Convergence of the distributed Algorithm 2 for the modified SOC-2 model over 500 customers.]
[Figure 7: Simulation results of the converged optimal AC control scheme for one customer: AC power with the local tracking target, stay-in probability, and indoor/outdoor temperature profiles.]
V-C2 Learning Performance
In practice, the true parameter $\theta^{\mathrm{true}}$ is unknown; therefore, we implement the proposed online algorithm to learn customer opt-out behaviors and make AC control decisions with the modified SOC-2 model (18). The regret results of the proposed algorithm for 200 DR events over 500 customers are presented in Figure 8. A rapidly decreasing regret and a sublinear cumulative regret are observed, which verify that the proposed algorithm can learn the customer behaviors well and generate efficient AC control decisions. Besides, for the per-event regret curve in Figure 8, non-monotonic variations and occasional spikes are observed in the early learning stage (similarly in Figure 5). That is because the proposed algorithm follows the TS framework and draws a random sample $\hat{\theta}$ for decision-making at each time, which results in non-monotonic variations due to the stochasticity. In addition, there is a small chance that the sample $\hat{\theta}$ is quite different from the true value $\theta^{\mathrm{true}}$, leading to large spikes in the regret curve. However, as the observed customer opt-out outcomes accumulate, the distribution on $\theta$ gradually concentrates on the true value $\theta^{\mathrm{true}}$, so that there are fewer large spikes in the later learning stage.
[Figure 8: Regret results of the proposed algorithm with the SOC-2 model for 200 DR events over 500 customers.]
VI Conclusion
In this paper, we propose a distributed online DR control algorithm to learn customer behaviors and regulate AC loads for incentive-based residential DR. Based on the Thompson sampling framework, the proposed algorithm consists of an online Bayesian learning step and an offline distributed optimization step. Two DR objectives, i.e., minimizing total AC loads and closely tracking a regulated power trajectory, are considered. The numerical simulations show that the distributed solution converges to the optimal value within tens of iterations, and that the learning regret decreases rapidly on average as DR events are implemented. Future work includes 1) identifying significant and effective environmental factors based on real user data, and 2) conducting practical DR experiments with the proposed algorithm and analyzing its practical performance.
Appendix A Analytic Derivation for Expectation in (16a)
The expectation term in (16a) can be expanded as

$\mathbb{E}\left[\sum_{t=1}^{T} P_i^t\, \Delta t + \sigma\,\big(1 - s_i^{T+1}\big)\right] = \sum_{t=1}^{T} \mathbb{E}\big[s_i^t\, p_i^t + (1 - s_i^t)\, \hat{p}_i^t\big]\, \Delta t + \sigma\,\big(1 - \Pr(s_i^{T+1} = 1)\big).$

The status transition (3) implies that the customer opt-out time actually follows a geometric-type distribution with the Bernoulli probabilities (11). Thus, we can enumerate all the possible realizations with different opt-out times $t_o$ (the first time at which $s_i^{t_o} = 0$) and their probabilities:

$\Pr(t_o = t) = \left(\prod_{\tau=1}^{t-2} f_i(e_i^\tau)\right)\big(1 - f_i(e_i^{t-1})\big), \;\; t = 2, \cdots, T+1; \qquad \Pr(t_o = \infty) = \prod_{\tau=1}^{T} f_i(e_i^\tau),$

where “$t_o = \infty$” means customer $i$ does not opt out at any time within the DR event. Then, the expectation term is computed analytically by summing up all these cases, which leads to (17) plus the constant term $\sum_{t=1}^{T} \hat{p}_i^t\, \Delta t$.
References
- [1] Federal Energy Regulatory Commission, 2019 Assessment of Demand Response and Advanced Metering, Dec. 2019.
- [2] U.S. Department of Energy, Benefits of Demand Response in Electricity Markets and Recommendations for Achieving Them, Feb. 2006.
- [3] U.S. Energy Information Administration, Electric Power Annual, USA, Oct. 18, 2019.
- [4] A. R. Jordehi, “Optimisation of demand response in electric power systems, a review,” Renew. Sustain. Energy Rev., vol. 103, pp. 308-319, Apr. 2019.
- [5] M. Muratori and G. Rizzoni, “Residential demand response: dynamic energy management and time-varying electricity pricing,” IEEE Trans. on Power Syst., vol. 31, no. 2, pp. 1108-1117, Mar. 2016.
- [6] Q. Hu, F. Li, X. Fang and L. Bai, “A framework of residential demand aggregation with financial incentives,” IEEE Trans. on Smart Grid, vol. 9, no. 1, pp. 497-505, Jan. 2018.
- [7] Y. Li and N. Li, “Mechanism design for reliability in demand response with uncertainty,” in Proc. Amer. Control Conf. (ACC), Seattle, WA, USA, May 2017, pp. 3400–3405.
- [8] X. Xu, C. Chen, X. Zhu, and Q. Hu, “Promoting acceptance of direct load control programs in the United States: Financial incentive versus control option,” Energy, vol. 147, pp. 1278-1287, Mar. 2018.
- [9] M. J. Fell, D. Shipworth, G. M. Huebner, C. A. Elwell, “Public acceptability of domestic demand-side response in Great Britain: The role of automation and direct load control”, Energy Res. Soc. Sci., vol. 9, pp. 72-84, Sep. 2015.
- [10] X. Xu, A. Maki, C. Chen, B. Dong, and J. K. Day, “Investigating willingness to save energy and communication about energy use in the American workplace with the attitude-behavior-context model,” Energy Res. Soc. Sci., vol. 32, pp. 13-22, Oct. 2017.
- [11] Q. Shi, C. Chen, A. Mammoli and F. Li, “Estimating the profile of incentive-based demand response (IBDR) by integrating technical models and social-behavioral factors,” IEEE Trans. on Smart Grid, vol. 11, no. 1, pp. 171-183, Jan. 2020.
- [12] J. R. Vázquez-Canteli and Z. Nagy, “Reinforcement learning for demand response: A review of algorithms and modeling techniques,” Applied Energy, vol. 235, pp. 1072-1089, Feb. 2019.
- [13] X. Chen, G. Qu, Y. Tang, S. Low and N. Li, “Reinforcement learning for decision-making and control in power systems: tutorial, review, and vision,” ArXiv Preprint, arXiv:2102.01168, 2021.
- [14] Z. Wen, D. O’Neill and H. Maei, “Optimal demand response using device-based reinforcement learning,” IEEE Trans. on Smart Grid, vol. 6, no. 5, pp. 2312-2324, Sept. 2015.
- [15] F. Alfaverh, M. Denaï, and Y. Sun, “Demand response strategy based on reinforcement learning and fuzzy reasoning for home energy management,” IEEE Access, vol. 8, pp. 39310-39321, 2020.
- [16] R. Lu, S. H. Hong, and M. Yu, “Demand response for home energy management using reinforcement learning and artificial neural network,” IEEE Trans. on Smart Grid, vol. 10, no. 6, pp. 6629-6639, Nov. 2019.
- [17] H. Li, Z. Wan, and H. He, “Real-time residential demand response,” IEEE Trans. Smart Grid, vol. 11, no. 5, pp. 4144-4154, Sep. 2020.
- [18] F. Ruelens, B. J. Claessens, S. Vandael, and et al., “Residential demand response of thermostatically controlled loads using batch reinforcement learning,” IEEE Trans. on Smart Grid, vol. 8, no. 5, pp. 2149-2159, Sept. 2017.
- [19] R. Lu, S. H. Hong, “Incentive-based demand response for smart grid with reinforcement learning and deep neural network,” Applied Energy, vol. 236, pp. 937-949, Feb. 2019.
- [20] X. Chen, Y. Nie and N. Li, “Online residential demand response via contextual multi-armed bandits,” IEEE Contr. Syst. Lett., vol. 5, no. 2, pp. 433-438, Apr. 2021.
- [21] Y. Li, Q. Hu, and N. Li, “A reliability-aware multi-armed bandit approach to learn and select users in demand response,” Automatica, vol. 119, Sept. 2020.
- [22] C. E. Rasmussen, and C. K. I. Williams. Gaussian processes for machine learning. MIT Press, Cambridge, MA, 2006.
- [23] S. Menard, Applied logistic regression analysis, vol. 106, Sage, 2002.
- [24] D. J. Russo, B. V. Roy, A. Kazerouni, I. Osband, and Z. Wen, “A tutorial on Thompson sampling”, Found. Trends Mach. Learn., vol. 11, no. 1, pp. 1-96, 2018.
- [25] A. Slivkins, “Introduction to multi-armed bandits,” arXiv preprint, arXiv:1904.07272, 2019.
- [26] D. Russo, and B. V. Roy, “Learning to optimize via posterior sampling”, Math. Oper. Res., vol. 39, no. 4, pp. 1221-1243, Apr. 2014.
- [27] Hall, C. Rasmussen, and J. Maciejowski, “Modelling and control of nonlinear systems using Gaussian processes with partial model information,” in Proc. IEEE Conf. on Decision and Control (CDC), pp. 5266–5271, Dec. 2012.
- [28] D. P. Bertsekas, Dynamic Programming and Optimal Control: Volume I. 3rd Edition, Athena Scientific Belmont, MA, 2005.
- [29] D. Liu, Y. Sun, Y. Qu, B. Li and Y. Xu, “Analysis and accurate prediction of user’s response behavior in incentive-based demand response,” IEEE Access, vol. 7, pp. 3170-3180, 2019.
- [30] N. Li, L. Chen, and S. Low, “Optimal demand response based on utility maximization in power networks,” in Proc. IEEE PES General Meeting, 2011.
- [31] I. Necoara, V. Nedelcu, “On linear convergence of a distributed dual gradient algorithm for linearly constrained separable convex problems,” Automatica, vol. 55, pp. 209-216, May 2015.
- [32] IPOPT Solver. [Online]. Available: https://coin-or.github.io/Ipopt/.
- [33] T. S. Jaakkola and M. I. Jordan, “A variational approach to Bayesian logistic regression models and their extensions,” in Sixth Intern. Workshop Artif. Intel. Stat., vol. 82, p. 4, 1997.
- [34] N. G. Polson, J. G. Scott, J. Windle, “Bayesian inference for logistic models using Pólya–Gamma latent variables,” J. Ameri. Stat. Assoc., vol. 108, no. 504, pp. 1339-1349, Dec. 2013.
- [35] S. Shalev-Shwartz, “Online learning and online convex optimization,” Found. Trends Mach. Learn., vol. 4, no. 2, pp. 107–194, 2012.
- [36] C. E. Rasmussen and H. Nickisch, “Gaussian processes for machine learning (gpml) toolbox,” J. Mach. Learn. Technol., vol. 11, pp. 3011-3015, Nov. 2010.