
Yulai Zhao ([email protected])
Princeton University

Optimizing the Performative Risk under
Weak Convexity Assumptions

Abstract

In performative prediction, a predictive model impacts the distribution that generates future data, a phenomenon ignored in classical supervised learning. In this closed-loop setting, the natural measure of performance, called the performative risk ($\mathrm{PR}$), captures the expected loss incurred by a predictive model after deployment. The core difficulty of using the performative risk as an optimization objective is that the data distribution itself depends on the model parameters. This dependence is governed by the environment and is not under the control of the learner. As a consequence, even the choice of a convex loss function can result in a highly non-convex $\mathrm{PR}$ minimization problem. Prior work has identified a pair of general conditions on the loss and the mapping from model parameters to distributions that imply convexity of the performative risk. In this paper, we relax these assumptions and focus on obtaining weaker notions of convexity, without sacrificing the amenability of the $\mathrm{PR}$ minimization problem to iterative optimization methods.

1 Introduction

Predictions in the real world are often performative. This means that predictions, because they are made public or because they inform downstream actions, influence the distribution of the features or the target variable they aim to predict, thereby shifting the data distribution the model has been trained on. Prominent examples include stock price prediction, credit scoring, and predictive policing.

As a result, the performance of a model should not be measured with respect to a fixed underlying distribution, as in classical supervised learning, but instead on the very distribution the model induces. This has been conceptualized by Perdomo et al. (2020) through the framework of performative prediction. This dynamic setting gives rise to new and interesting optimization challenges: the objective function in performative prediction is the risk incurred by the learner on the distribution induced after deploying the model:

\[ \mathrm{PR}(\theta) \coloneqq \mathbb{E}_{z\sim\mathcal{D}(\theta)}\,\ell(z;\theta), \qquad \theta\in\Theta, \tag{1} \]

where $\Theta$ is a closed and convex set in $\mathbb{R}^d$. $\mathrm{PR}$ is referred to as the performative risk and measures the loss of a predictor $\theta$ on the distribution $\mathcal{D}(\theta)$ resulting from its deployment. We focus on finding the minima of the performative risk (1), known as performative optimal points. The main challenge of performative risk minimization from an optimization perspective is that the dependence of the performative risk on the model parameters is two-fold: the parameters appear both in the distribution $\mathcal{D}$ and in the loss $\ell$. To decouple these two dependencies, we define the decoupled performative risk

\[ \mathrm{DPR}(\theta_1,\theta_2) \coloneqq \mathbb{E}_{z\sim\mathcal{D}(\theta_1)}\,\ell(z;\theta_2) \]

to denote the loss of $\theta_2$ measured over the distribution $\mathcal{D}(\theta_1)$. According to this definition, we have $\mathrm{PR}(\theta)=\mathrm{DPR}(\theta,\theta)$.
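To make these definitions concrete, here is a minimal Monte Carlo sketch (our own illustration, not from the paper) for a hypothetical one-dimensional location family $\mathcal{D}(\theta)=\mathcal{N}(a\theta,1)$ with squared loss, where the performativity strength $a$ is a constant we pick for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical environment: D(theta) = N(a*theta, 1) with squared loss
# l(z; theta) = (z - theta)^2.  The constant `a` models the strength of
# performativity and is our own illustrative choice.
a = 0.5

def sample(theta, n=200_000):
    return rng.normal(a * theta, 1.0, size=n)

def loss(z, theta):
    return (z - theta) ** 2

def dpr(theta1, theta2, n=200_000):
    """Monte Carlo estimate of DPR(theta1, theta2) = E_{z ~ D(theta1)} l(z; theta2)."""
    return loss(sample(theta1, n), theta2).mean()

def pr(theta, n=200_000):
    """PR(theta) = DPR(theta, theta): loss measured on the induced distribution."""
    return dpr(theta, theta, n)

print(pr(1.0))        # approx (a*1 - 1)^2 + 1 = 1.25 for a = 0.5
print(dpr(2.0, 0.0))  # approx (a*2 - 0)^2 + 1 = 2.0
```

For this family, $\mathrm{DPR}(\theta_1,\theta_2)=(a\theta_1-\theta_2)^2+1$ in closed form, which the Monte Carlo estimates match up to sampling error.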

Optimizing the performative risk directly is challenging because the structure of the objective $\mathrm{PR}(\theta)=\mathbb{E}_{z\sim\mathcal{D}(\theta)}\ell(z;\theta)$ is not under the control of the learner and can be highly non-convex. Thus, to tackle performative risk minimization by building on classical tools from the optimization literature, we need to better understand the structure of the objective function and how properties of the distribution map and the loss interact.

As an important first step, Miller et al. (2021) identified a natural set of properties of the loss function and model-induced distribution shift under which the performative risk is convex. In particular, they showed that linear dependence of the distribution map on the model parameters $\theta$, together with strong convexity of the loss function $\ell$, implies global convexity of $\mathrm{PR}$. In this work, we are interested in studying problems beyond this specific setting, and we address the following question:

How and under what conditions can we optimize the performative risk?

In particular, we study how to optimize the performative risk through first-order information and analyze its suboptimality gap. Noting that convexity can be relaxed to weaker conditions while preserving the same convergence guarantees of gradient-based optimizers, in Section 3 we show that weaker conditions on $\mathrm{DPR}$ still imply desirable local properties of $\mathrm{PR}$. These properties enjoy generality since they do not rely on global structural restrictions. In Section 4, we discuss what information about $\theta_{\mathrm{PO}}$ we can obtain when starting from a stable point. The underlying idea is to make use of previous retraining methods (Perdomo et al., 2020; Mendler-Dünner et al., 2020) that converge to $\theta_{\mathrm{PS}}$, while the ultimate goal remains reaching $\theta_{\mathrm{PO}}$.

2 Background

Performativity is a phenomenon widely recognized in finance (Clarke, 2012), economics (Callon et al., 2007; Cochoy et al., 2010), and other social sciences (Roberts, 2009). Following the seminal work by Perdomo et al. (2020), performativity has increasingly gained attention from the machine learning community. Performative distribution shifts constitute an instance of a more general distribution shift problem in which the shift is endogenous and entirely caused by the deployed predictive model. In other words, the deployment of a predictive model represents an intervention on the population (Mendler-Dünner et al., 2022). The resulting causal feedback effects invalidate the standard assumption of supervised learning that there is a static, model-independent data distribution describing the population, surfacing as distribution shifts.

2.1 Key concepts

We first give an overview of two existing concepts characterizing whether $\theta$ is a desirable solution of performative prediction.

Performative stability.

Performative prediction can be viewed as a game between the decision-maker and the population that responds to the deployed model. The first concept of optimality in this context is performative stability

\[ \theta_{\mathrm{PS}} = \operatorname*{arg\,min}_{\theta} \mathrm{DPR}(\theta_{\mathrm{PS}},\theta). \tag{2} \]

The expected loss over $\mathcal{D}(\theta_{\mathrm{PS}})$ is minimized at $\theta_{\mathrm{PS}}$. As such, stable points constitute fixed points of so-called retraining methods, which are heuristically adopted to deal with distribution shifts. More specifically, at the $t$-th iteration, these methods find a set of parameters that minimizes the risk on the previously generated distribution $\mathcal{D}(\theta_t)$. This can be formally described as $\theta_{t+1}=\arg\min_{\theta\in\Theta}\mathrm{DPR}(\theta_t,\theta)$. Several works have studied the convergence of population-based and stochastic retraining methods to stable points in the standard framework (Perdomo et al., 2020; Mendler-Dünner et al., 2020; Drusvyatskiy and Xiao, 2022) as well as in state-dependent extensions (Brown et al., 2022; Li and Wai, 2022; Harris et al., 2021).
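As a toy illustration of the retraining fixed point, consider a hypothetical Gaussian location family $\mathcal{D}(\theta)=\mathcal{N}(a\theta,1)$ with squared loss (our own construction): each retraining step has the closed form $\theta_{t+1}=a\theta_t$, the mean of $\mathcal{D}(\theta_t)$, so the iterates contract to the stable point $\theta_{\mathrm{PS}}=0$ whenever $|a|<1$.

```python
# Sketch of repeated retraining (repeated risk minimization) for the
# hypothetical family D(theta) = N(a*theta, 1) with squared loss.
# argmin_x E_{z ~ D(theta)} (z - x)^2 is the mean a*theta, so each
# retraining step is available in closed form.  `a` is our own choice.
a = 0.5

def retrain_step(theta):
    """theta_{t+1} = argmin_x DPR(theta_t, x); here simply the mean of D(theta_t)."""
    return a * theta

theta = 10.0
for _ in range(50):
    theta = retrain_step(theta)

# For |a| < 1 the iterates contract toward the stable point theta_PS = 0,
# which satisfies theta_PS = argmin_x DPR(theta_PS, x).
print(theta)  # numerically indistinguishable from 0
```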

Performative optimality.

An alternative solution concept is performative optimality. Performative optimal points describe models that achieve the minimal loss on the distribution they induce; they are globally optimal in the following sense:

\[ \theta_{\mathrm{PO}} = \operatorname*{arg\,min}_{\theta\in\Theta} \mathrm{DPR}(\theta,\theta). \]

It follows from the definition that $\mathrm{PR}(\theta_{\mathrm{PO}})\leq\mathrm{PR}(\theta_{\mathrm{PS}})$, while in general, stable points are not necessarily optimal and optimal points are not necessarily stable. Furthermore, the two solutions $\theta_{\mathrm{PS}}$ and $\theta_{\mathrm{PO}}$ can be arbitrarily far apart from each other; see (Perdomo et al., 2020; Miller et al., 2021). Thus, retraining does not, in general, serve as a reliable heuristic for finding optimal points. Nonetheless, we focus on finding performative optima in this work. Only a handful of works study approaches for finding optimal points. The first to tackle performative optimality was Miller et al. (2021). They proposed an iterative optimization method for minimizing the performative risk directly by first learning a model of the distribution map and then applying a gradient-based procedure to the resulting estimate of the performative risk. To guarantee the global convergence of this approach, they studied structural assumptions under which the performative risk is convex. Around the same time, Izzo et al. (2021) studied performative gradient descent, an algorithm that directly estimates the gradient of the performative risk. Technically, they assumed the existence of an estimator $\hat{f}(\theta)$ so that one can use samples from the distribution $\mathcal{D}(\theta)$ to infer $\theta$, an assumption we will revisit later in this manuscript. To prove convergence to optimal points, they directly assume convexity of the performative risk, an assumption that is hard to verify.

A different approach has recently been taken by Jagadeesan et al. (2022), who leveraged tools from the bandit literature to minimize regret under performativity without making structural assumptions on the performative risk beyond weak regularity assumptions. Technically, they use data collected from experiments to build performative confidence bounds that characterize the regret of unexplored parameters. Their method achieves sublinear regret and is the first to find optimal points for non-convex performative risk functions. However, a significant drawback of such a bandit approach is the need for global exploration, i.e., one needs to explore the entire parameter space to find the minimizer, which is often not feasible in practical applications. Therefore, the focus of this work is on iterative descent methods that gradually approach optimal points.

2.2 Assumptions on the loss function

The loss function in performative prediction takes the same role as the loss in supervised learning. We discuss classical assumptions from the optimization literature, such as strong convexity and smoothness, on the loss function.

Assumption 1 (Smoothness)

The loss $\ell(\cdot;\cdot)$ is $\beta$-smooth if for all $z,z',\theta$,

\[ \|\nabla_{\theta}\ell(z;\theta)-\nabla_{\theta}\ell(z';\theta)\|_2 \leq \beta\|z-z'\|_2. \]
Assumption 2 (Strong convexity)

The loss $\ell(\cdot;\cdot)$ is $\gamma$-strongly convex in $\theta$ if for all $\theta,\theta',z$,

\[ \ell(z;\theta')-\ell(z;\theta) \geq \nabla_{\theta}\ell(z;\theta)^{\top}(\theta'-\theta)+\frac{\gamma}{2}\|\theta-\theta'\|_2^2. \]

Note that this condition reduces to convexity when $\gamma=0$.

These two assumptions characterize the dependence of the loss function on the model parameters $\theta$, and they are assumed to hold for any realization of $z$. We adopt the smoothness assumption in this work and discuss the strong convexity assumption as well as weaker notions of convexity.

In addition, we will need a regularity assumption relating changes in the data distribution to changes in the loss function. We therefore adopt the Lipschitzness condition w.r.t. $z$, which is common when dealing with performative distribution shifts (Perdomo et al., 2020; Miller et al., 2021), namely:

Assumption 3 (Lipschitzness)

The loss $\ell(\cdot;\cdot)$ is $L$-Lipschitz in $z$ if for all $z,z',\theta$,

\[ |\ell(z;\theta)-\ell(z';\theta)| \leq L\|z-z'\|_2. \]

2.3 Characterizing the performative shift

Since performative effects are not under the control of the learner and are typically unknown, we need assumptions on these distribution shifts for studying the problem and establishing convergence guarantees of algorithmic approaches. In contrast to classical supervised learning where optimization techniques are commonly adopted under conditions on the loss function, we need additional assumptions for performative optimization. This observation is at the core of the main difficulty in performative prediction and the art is to couple conditions on the loss function with conditions on the distribution shift to obtain favorable properties of the resulting performative risk minimization problem.

We shall first introduce the concept of Wasserstein distance, then we proceed to the key sensitivity assumption that relates changes in the model parameters to changes in the distribution.

Definition 2.1 (Wasserstein Distance).

The Wasserstein-1 distance, also called Earth-Mover (EM) distance, is defined as

\[ W(P_r,P_g)=\inf_{\gamma\in\Pi(P_r,P_g)}\mathbb{E}_{(x,y)\sim\gamma}\big[\|x-y\|\big], \tag{3} \]

where $\Pi(P_r,P_g)$ denotes the set of all joint distributions $\gamma(x,y)$ whose marginals are respectively $P_r$ and $P_g$.

Intuitively, $\gamma(x,y)$ indicates how much mass must be transported from $x$ to $y$ in order to transform the distribution $P_r$ into the distribution $P_g$. The EM distance is then the “cost” of the optimal transport plan. The infimum in Eq. 3 is in general intractable. On the other hand, the Kantorovich-Rubinstein duality tells us that

\[ W(P_r,P_g)=\sup_{\|f\|_L\leq 1}\left[\int f(x)\,dP_r(x)-\int f(x)\,dP_g(x)\right]. \]

This allows us to bound the difference between the expectations of a Lipschitz function under two different distributions by their Wasserstein distance.
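As a quick numerical sanity check (our own construction), the duality-based bound $|\mathbb{E}_P f-\mathbb{E}_Q f|\leq L\cdot W(P,Q)$ for an $L$-Lipschitz $f$ can be observed on one-dimensional samples, where the optimal coupling simply matches sorted order statistics:

```python
import numpy as np

rng = np.random.default_rng(1)

def w1_1d(x, y):
    """Empirical Wasserstein-1 distance between two equal-size 1-D samples.

    In one dimension the optimal transport plan matches sorted samples,
    so W1 reduces to the mean absolute difference of order statistics.
    """
    return np.abs(np.sort(x) - np.sort(y)).mean()

f = np.abs  # a 1-Lipschitz test function (L = 1), chosen for illustration
x = rng.normal(0.0, 1.0, 100_000)  # samples from P
y = rng.normal(1.5, 1.0, 100_000)  # samples from Q, shifted by 1.5

gap = abs(f(x).mean() - f(y).mean())
w = w1_1d(x, y)
print(gap, w)  # the expectation gap stays below W1, which is approx 1.5 here
```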

Sensitivity.

One commonly used assumption is a weak regularity condition that posits Lipschitzness of the distribution shift relative to changes in the model parameters. Intuitively, if decisions are made according to similar predictive models, then the resulting distributions should also be close. More precisely, the sensitivity condition introduced by Perdomo et al. (2020) relates changes in the model parameters, measured in Euclidean distance, to changes in the distribution, measured in Wasserstein distance.

Assumption 4 (Sensitivity)

The distribution map $\mathcal{D}(\cdot)$ is $\epsilon$-sensitive if for all $\theta,\theta'$,

\[ W(\mathcal{D}(\theta),\mathcal{D}(\theta'))\leq\epsilon\|\theta-\theta'\|_2. \]

A prominent example of a distribution shift that satisfies the sensitivity assumption is strategic classification with quadratic costs (Perdomo et al., 2020). Strategic classification microfounds performative prediction with the best response model for individual actions (Jagadeesan et al., 2021).
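For intuition, a location family gives a simple case where sensitivity holds with an explicit constant: translating $\mathcal{N}(a\theta,1)$ moves it by exactly $|a|\,|\theta-\theta'|$ in Wasserstein distance, so the map is $\epsilon$-sensitive with $\epsilon=|a|$. A minimal empirical sketch (all choices ours):

```python
import numpy as np

rng = np.random.default_rng(2)

# Sensitivity check for a hypothetical 1-D location family D(theta) = N(a*theta, 1):
# translating a distribution by delta moves it by exactly |delta| in W1,
# so W(D(t), D(t')) = |a| * |t - t'| and the map is eps-sensitive with eps = |a|.
a = 0.5

def w1_1d(x, y):
    # 1-D empirical W1: the optimal coupling matches sorted samples.
    return np.abs(np.sort(x) - np.sort(y)).mean()

def sample(theta, n=100_000):
    return rng.normal(a * theta, 1.0, n)

t, tp = 1.0, 3.0
est = w1_1d(sample(t), sample(tp))
bound = abs(a) * abs(t - tp)  # eps * |t - t'| with eps = |a| = 0.5
print(est, bound)             # est approx 1.0, bound = 1.0
```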

Mixture dominance.

Below we present a stronger, more structural assumption on the distribution shift, introduced by Miller et al. (2021), which guarantees convexity of $\mathrm{DPR}(\theta',\theta)$ w.r.t. the first argument $\theta'$.

Assumption 5 (Mixture dominance)

A loss and distribution map satisfy the mixture dominance condition if $\mathrm{DPR}(\theta',\theta)$ is convex w.r.t. $\theta'$; that is, for all $\theta,\theta'$,

\[ \mathbb{E}_{z\sim\mathcal{D}(\theta)}\big[\ell(z;\theta)\,\nabla_{\theta}\log p_{\theta}(z)\big]^{\top}(\theta'-\theta)\leq\mathrm{DPR}(\theta',\theta)-\mathrm{PR}(\theta). \]

It has been demonstrated by Miller et al. (2021) that this assumption is satisfied in performative environments by the well-known location-scale distribution family, a family of distributions formed by translation and re-scaling of a standard family member. For example, let the underlying distribution be $\mathcal{D}_{\theta}$; a data point $z_{\theta}\sim\mathcal{D}_{\theta}$ satisfies $z_{\theta}=z_0+A\theta$, where the matrix $A$ is an unknown parameter and $z_0$ is a zero-mean random variable. Here the additive term $A\theta$ represents the effect of performativity. Standard examples include Gaussian, Cauchy, and uniform distributions. These families have found wide application in economics (Wong and Ma, 2008) and statistics (Hazra et al., 2017). Examples of location families can also be found in strategic classification (Hardt et al., 2016).

3 Beyond convexity of the performative risk

In studying the convergence of gradient-based methods to performative optima, previous works on performative prediction have focused on settings where the performative risk is convex. While some works, such as Izzo et al. (2021), directly start from a priori hard-to-verify assumptions, the notable work by Miller et al. (2021) provides conditions on the loss function and the distribution map under which global convexity of $\mathrm{PR}$ emerges. More specifically, they show that strong convexity of the loss function, together with Assumptions 4-5 controlling the performative effects on distributions, is sufficient to establish (strong) convexity of the performative risk. They obtain a zeroth-order algorithm that provably converges to globally performative optimal points ($\theta_{\mathrm{PO}}$) (Miller et al., 2021), which avoids the generally difficult computation of performative gradients.

In the following, we take a step further by demonstrating that the strong convexity assumption on the loss function can be relaxed while preserving the amenability of the performative risk minimization problem to gradient-based optimization. We build on the existing literature on first-order optimization that showed that convexity can be relaxed to weaker conditions while maintaining the same convergence guarantees of gradient-based optimizers.

3.1 Weaker notions of strong convexity

In the optimization literature, strongly convex objectives play an important role, as they can be optimized by gradient-descent methods at a linear rate. In the context of finding performatively stable points, this linear rate can be maintained by retraining methods in the presence of weak performative shifts (Perdomo et al., 2020).

There are various works in the optimization literature that aim to relax the strong-convexity condition on the objective function while maintaining favorable convergence properties. Examples include error bounds (EB) (Luo and Tseng, 1993; Drusvyatskiy and Lewis, 2018), essential strong convexity (ESC) (Strömberg, 2011; Liu et al., 2014), weak strong-convexity (WSC) (Ma et al., 2016; Necoara et al., 2019), restricted secant inequality (RSI) (Zhang and Yin, 2013; Zhang, 2017), restricted strong-convexity (RSC) (Negahban and Wainwright, 2012; Zhang and Yin, 2013), Polyak-Lojasiewicz (PL) inequality (Polyak, 1963; Loizou et al., 2021) and quadratic growth (QG) condition (Anitescu, 2000; Drusvyatskiy and Lewis, 2018). The relations between these conditions have been studied in (Liu and Wright, 2015; Karimi et al., 2016; Bolte et al., 2017; Zhang, 2017). In particular, for a smooth function with Lipschitz-continuous gradients,  Karimi et al. (2016) showed the following chain of implications:

\[ (\mathrm{SC})\rightarrow(\mathrm{ESC})\rightarrow(\mathrm{WSC})\rightarrow(\mathrm{RSI})\rightarrow(\mathrm{EB})\equiv(\mathrm{PL})\rightarrow(\mathrm{QG}). \tag{4} \]

This implication chain is widely used in numerical optimization (Wang et al., 2017; Nouiehed et al., 2019; Vaswani et al., 2019).
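To illustrate why the weaker end of this chain is still useful for first-order methods, the sketch below runs plain gradient descent on a standard non-convex PL example, $f(x)=x^2+3\sin^2 x$ (cf. Karimi et al., 2016), and still reaches the global minimum:

```python
import numpy as np

# Gradient descent on a non-convex function satisfying the PL inequality:
# f(x) = x^2 + 3 sin^2(x).  PL guarantees linear convergence of f to its
# global minimum f* = 0 even though f is not convex (f'' dips below zero).
def f(x):
    return x ** 2 + 3.0 * np.sin(x) ** 2

def grad(x):
    return 2.0 * x + 3.0 * np.sin(2.0 * x)

L = 8.0               # smoothness constant: sup |f''| = sup |2 + 6 cos(2x)| = 8
x = 3.0               # start in a region where f is locally non-convex
for _ in range(100):
    x -= grad(x) / L  # standard 1/L step size

print(x, f(x))        # both converge to the global minimizer/value 0
```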

We build on this work to analyze performative prediction from an optimization perspective. Previous works assumed that the loss $\ell$ satisfies strong convexity (SC) globally, which is the strongest condition in the implication chain of Eq. 4. Our contribution is to show that weaker conditions on $\mathrm{DPR}$ still imply desirable local properties of $\mathrm{PR}$. Specifically, we prove that assuming WSC and RSI of $\mathrm{DPR}$ is sufficient to establish weak convexity and RSI of $\mathrm{PR}$, respectively, following similar steps to those taken in Miller et al. (2021). These local conditions only need to hold around the minimizer and do not posit strong structural restrictions on the overall landscape such as SC, thus making the assumptions more realistic in practice.

3.2 Establishing weak convexity of the performative risk

First, we introduce weak strong convexity (Karimi et al., 2016; Necoara et al., 2019) in order to relax the strong convexity condition. Throughout this paper, we follow the convention that $x_p$ denotes the projection of $x$ onto the optimal solution set $\mathcal{X}^*$.

Definition 3.1 (WSC).

A function $f:\mathbb{R}^d\to\mathbb{R}$ is $\mu$-WSC if for all $x$ we have

\[ f^* \geq f(x)+\langle\nabla f(x),x_p-x\rangle+\frac{\mu}{2}\|x_p-x\|^2, \tag{5} \]

where $x_p\in\mathcal{X}^*=\arg\min_x f(x)$ and $f^*=f(x_p)$.

We translate this definition to the decoupled performative risk as follows:

Assumption 6

For the performative optimum $\theta_{\mathrm{PO}}$ and its induced distribution $\mathcal{D}=\mathcal{D}(\theta_{\mathrm{PO}})$, suppose the optimal solution for minimizing $\mathrm{DPR}(\theta_{\mathrm{PO}},\cdot)$ is $\theta^*$. We say $\mathrm{DPR}(\theta_{\mathrm{PO}},\cdot)$ satisfies $\mu$-WSC if for any $\theta\in\Theta$ it holds that

\[ \mathrm{DPR}(\theta_{\mathrm{PO}},\theta^*)\geq\mathrm{DPR}(\theta_{\mathrm{PO}},\theta)+\nabla_{\theta}\mathrm{DPR}(\theta_{\mathrm{PO}},\theta)^{\top}(\theta^*-\theta)+\frac{\mu}{2}\|\theta^*-\theta\|^2. \tag{6} \]

This definition requires the minimizer of the expected loss on the distribution $\mathcal{D}$ to be unique: $\theta^*=\operatorname*{arg\,min}_{\theta\in\Theta}\mathrm{DPR}(\theta_{\mathrm{PO}},\theta)$. This is a mild assumption used by previous work on repeated optimization in performative environments (Mendler-Dünner et al., 2020; Drusvyatskiy and Xiao, 2022; Wood et al., 2021). Intuitively, we posit a local property on the structure of $\mathrm{DPR}$ over the specific distribution $\mathcal{D}$ induced by $\theta_{\mathrm{PO}}$; the condition is stated over the domain of the model parameter $\theta$. We note that this condition is already weaker than the strong convexity Assumption 2, which is assumed to hold for every data point $z$ (Miller et al., 2021) and thus depends heavily on the structure of the loss function.

Equipped with Assumption 6, we are ready to state the following new theorem.

Theorem 3.2.

Under Assumptions 1 and 4-6, suppose the performative optimum is also performatively stable. Then the performative risk is guaranteed to be weakly convex if $\frac{\mu}{2\beta}\geq\epsilon$.

We now justify the assumption that $\theta_{\mathrm{PO}}$ is also performatively stable. As mentioned, optimal points and stable points do not generally imply each other; therefore, simply adopting retraining (e.g., Perdomo et al. (2020)) does not return global optima. Nevertheless, the local properties of $\mathrm{PR}(\theta)$ under this assumption are helpful in determining the optimality of a performatively stable point. In the next section, we derive further results on what information about $\theta_{\mathrm{PO}}$ can be obtained at a given stable point without this assumption.

Proof 3.3 (Proof sketch).

Technically, our derivation starts from Eq. 6. After expanding the inequality around the minimizer, we split $\nabla\mathrm{PR}$ into two terms: the gradient of the loss function, $\mathbb{E}_{z\sim\mathcal{D}(\theta)}\nabla_{\theta}\ell(z;\theta)$, and the gradient with respect to the varying distribution, $\mathbb{E}_{z\sim\mathcal{D}(\theta)}\ell(z;\theta)\nabla_{\theta}\log p_{\theta}(z)$, where $p_{\theta}$ is the probability density function of $\mathcal{D}(\theta)$. We then use the sensitivity Assumption 4 to bound the first term by the quadratic distance $\|\theta_{\mathrm{PO}}-\theta\|^2$ and use the mixture dominance Assumption 5 to control the second term.

The $\frac{\mu}{2\beta}$ term in the theorem is a sharp threshold for weak convexity of the performative risk, which aligns with the observations in (Perdomo et al., 2020; Miller et al., 2021). See details in Appendix A.
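The two-term split in the proof sketch can be checked numerically on a hypothetical Gaussian location family, where both the score function and $\nabla\mathrm{PR}$ are available in closed form (this is our own illustrative construction, not the paper's algorithm):

```python
import numpy as np

rng = np.random.default_rng(3)

# For D(theta) = N(a*theta, 1) with loss (z - theta)^2:
#   grad PR(theta) = E[ grad_theta l(z; theta) ]                  (loss term)
#                  + E[ l(z; theta) * grad_theta log p_theta(z) ] (distribution term)
# Here grad_theta log p_theta(z) = a * (z - a*theta), and the closed form is
# grad PR(theta) = 2 * (a*theta - theta) * (a - 1).  All constants are ours.
a, theta, n = 2.0, 1.0, 400_000
z = rng.normal(a * theta, 1.0, n)

loss_term = np.mean(-2.0 * (z - theta))            # E[grad_theta l(z; theta)]
score = a * (z - a * theta)                        # grad_theta log p_theta(z)
dist_term = np.mean((z - theta) ** 2 * score)      # score-function estimator

estimate = loss_term + dist_term
closed_form = 2.0 * (a * theta - theta) * (a - 1.0)  # = 2.0 for these constants
print(estimate, closed_form)
```

The score-function (REINFORCE-style) second term is exactly the "gradient on the varying distribution" in the split above; it is high-variance, which is why a large sample size is used.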

3.3 Establishing RSI of the performative risk

The second condition we consider is the Restricted Secant Inequality (Zhang and Yin, 2013), defined as follows:

Definition 3.4 (RSI).

A function $f:\mathbb{R}^d\to\mathbb{R}$ satisfies the $\mu$-Restricted Secant Inequality (RSI) if for all $x$ we have

\[ \langle\nabla f(x),x-x_p\rangle\geq\mu\|x_p-x\|^2. \tag{7} \]

Again, $x_p$ is the projection of $x$ onto the optimal solution set $\mathcal{X}^*$.

Remark 3.5.

A function is restricted strongly convex (RSC) if it is convex and also satisfies RSI.

Assumption 7

For the performative optimum $\theta_{\mathrm{PO}}$ and its induced distribution $\mathcal{D}$, suppose the optimal solution for minimizing $\mathrm{DPR}(\theta_{\mathrm{PO}},\cdot)$ is unique and denoted $\theta^*$. We say $\mathrm{DPR}(\theta_{\mathrm{PO}},\cdot)$ satisfies $\mu$-RSI if for any $\theta$ it holds that

\[ \langle\nabla_{\theta}\mathrm{DPR}(\theta_{\mathrm{PO}},\theta),\theta-\theta^*\rangle\geq\mu\|\theta^*-\theta\|^2. \tag{8} \]
Theorem 3.6.

Under Assumptions 3-5 and 7, suppose the performative optimum is also performatively stable. Then $\mathrm{PR}(\theta)$ satisfies RSI.

Note that, compared to WSC, the RSI condition is weaker and thus generalizes more easily in first-order optimization. See details in Appendix A. We conjecture that specific implementations discussed by Negahban and Wainwright (2012); Zhang and Yin (2013); Zhang (2017) could be extended to optimize $\mathrm{PR}$ in this setting.

4 When are local properties sufficient for stability and optimality?

As mentioned above, existing methods in the literature that seek stable points do not have global optimality guarantees. Meanwhile, current methods aiming to optimize $\mathrm{PR}$ directly require structural assumptions around the performative optima that may not hold in real settings.

In this section, we investigate relations between stable points and optimal points. We relate to previous work on the convergence of retraining methods to stable points (Mendler-Dünner et al., 2020; Drusvyatskiy and Xiao, 2022; Perdomo et al., 2020; Wood et al., 2021; Bianchin et al., 2021) and argue that if stable points offer a good approximation of optimal points, then they can serve as good starting points to ensure convergence to optimal points with only local assumptions on the performative risk.

Suppose the Lipschitzness condition on the loss (cf. Assumption 3) holds throughout this section; it is standard and mild in the learning theory and optimization literature. For any $\theta,\theta'$ in the parameter domain, we can quantify the gap between $\mathrm{PR}(\theta)$ and $\mathrm{PR}(\theta')$ as

\[
\begin{aligned}
\mathrm{PR}(\theta') &= \mathrm{DPR}(\theta',\theta') \\
&\geq \mathrm{DPR}(\theta,\theta') - L\,W(\mathcal{D}(\theta),\mathcal{D}(\theta')) \\
&= \mathrm{PR}(\theta) + [\mathrm{DPR}(\theta,\theta')-\mathrm{DPR}(\theta,\theta)] - L\,W(\mathcal{D}(\theta),\mathcal{D}(\theta')).
\end{aligned}
\]

The second term on the right-hand side measures the difference in function value between $\theta$ and $\theta'$ over the distribution $\mathcal{D}(\theta)$.

In the remainder of this section, we pick $\theta$ to be a stable point (i.e., $\theta=\arg\min_x\mathrm{DPR}(\theta,x)$) so that we can work with the suboptimality gap of the function. We first rewrite the above inequality as

\[ \mathrm{PR}(\theta') \geq \mathrm{PR}(\theta) + \Delta_{\theta}(\theta') - L\,W(\mathcal{D}(\theta),\mathcal{D}(\theta')), \tag{9} \]

where $\Delta_{\theta}(\theta'):=\mathrm{DPR}(\theta,\theta')-\mathrm{PR}(\theta)\geq 0$ is the suboptimality of $\theta'$ measured over $\mathcal{D}(\theta)$.

Proposition 4.1.

Under Assumption 3, if

\[ \Delta_{\theta}(\theta')\geq L\,W(\mathcal{D}(\theta),\mathcal{D}(\theta')), \tag{10} \]

it holds that

\[ \mathrm{PR}(\theta)\leq\mathrm{PR}(\theta'). \]

The proof follows directly from Eq. 9. Moreover, when Eq. 10 holds for all $\theta'\in\Theta$, we have $\mathrm{PR}(\theta)\leq\mathrm{PR}(\theta')$ for all $\theta'\in\Theta$, and thus $\theta$ is performatively optimal by definition. Intuitively, Eq. 10 relates changes in the loss to changes in the distribution.

Starting from θ\theta.

Suppose we have reached a performatively stable point $\theta$. To reach $\theta_{\mathrm{PO}}$, we consider two directions. (1) How good is $\mathrm{PR}(\theta)$, i.e., what is the gap between $\mathrm{PR}(\theta)$ and the optimal performative risk (denoted $\mathrm{PR}^*$)? (2) What geometric information about $\theta_{\mathrm{PO}}$ can we obtain at $\theta$? When the Euclidean distance between them is small, $\theta$ serves as a good approximation of $\theta_{\mathrm{PO}}$. We believe both directions are promising, since they make use of the large body of work focused on reaching $\theta_{\mathrm{PS}}$.

Below, we give three preliminary examples of how some general structural conditions on the loss function and distribution map can lead to useful results along these two directions.

We first bound the suboptimality of $\theta$ under Lipschitzness and a boundedness condition on the shift.

Example 4.2.

Assume performative shifts are bounded by an absolute constant $B$, i.e., for all $\theta,\theta'$,

\[ W(\mathcal{D}(\theta),\mathcal{D}(\theta'))\leq B. \]

Then we have the following bound characterizing the suboptimality of $\theta$:

\[ \mathrm{PR}(\theta)-\mathrm{PR}^* \leq LB. \tag{11} \]
Proof 4.3.

Since $\theta$ is stable, $\Delta_{\theta}(\theta')\geq 0$ for all $\theta'$. From Eq. 9 we then have $\mathrm{PR}(\theta')\geq\mathrm{PR}(\theta)-LB$ for all $\theta'$; taking the infimum over $\theta'$ gives Eq. 11.
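As a self-contained numeric illustration of Eq. 11 (with our own toy choices of loss, distribution map, and parameter set, picked so that stable and optimal points differ):

```python
import numpy as np

# Toy problem where stable and optimal points differ:
# loss l(z; theta) = z*theta + theta^2/2, D(theta) = N(a*theta, 1),
# Theta = [-1, 1], a = -1.  Then DPR(t1, t2) = a*t1*t2 + t2^2/2, so
# argmin_t2 DPR(t1, t2) = -a*t1 = t1: every point is stable.  Meanwhile
# PR(theta) = (a + 1/2)*theta^2 = -theta^2/2 is minimized at the boundary.
a = -1.0
thetas = np.linspace(-1.0, 1.0, 2001)

def pr(theta):
    return (a + 0.5) * theta ** 2

theta_ps = 0.0                 # one (far-from-optimal) stable point
pr_star = pr(thetas).min()     # optimal performative risk: -0.5 at theta = +/-1

L = 1.0                        # Lipschitz const of l in z over Theta: max |theta|
B = abs(a) * 2.0               # max Wasserstein shift: |a| * diam(Theta)
gap = pr(theta_ps) - pr_star   # suboptimality of the stable point, = 0.5

print(gap, L * B)              # 0.5 <= 2.0, consistent with Eq. 11
assert gap <= L * B
```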

The next two examples show that $\theta_{\mathrm{PO}}$ cannot be far from $\theta$.

Example 4.4.

Assume that (1) performative shifts are bounded by an absolute constant $B$ and (2) $\Delta_{\theta}(\theta')$ satisfies quadratic growth, i.e., $\Delta_{\theta}(\theta')\geq\gamma\|\theta-\theta'\|^2$. Then the performative optimal point $\theta_{\mathrm{PO}}$ satisfies

\[ \|\theta_{\mathrm{PO}}-\theta\|\leq\sqrt{\frac{LB}{\gamma}}. \]
Proof 4.5.

We prove the claim by contradiction. Suppose the performative optimal point θPO{\theta_{\mathrm{PO}}} satisfies θPOθ>LB/γ\|{\theta_{\mathrm{PO}}}-\theta\|>\sqrt{\nicefrac{{LB}}{{\gamma}}}. The quadratic growth then shows

Δθ(θPO)>LBLW(𝒟(θ),𝒟(θPO)).\Delta_{\theta}({\theta_{\mathrm{PO}}})>LB\geq LW({\mathcal{D}}(\theta),{\mathcal{D}}({\theta_{\mathrm{PO}}})).

Substituting into Eq. 9 yields PR(θPO)>PR(θ)\mathrm{PR}({\theta_{\mathrm{PO}}})>\mathrm{PR}(\theta), which contradicts the optimality of θPO{\theta_{\mathrm{PO}}}.
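As a numerical illustration of Example 4.4, consider a toy one-dimensional performative problem of our own construction (not from the paper): loss ℓ(z;t)=zt+λt²/2 and map 𝒟(θ)=𝒩(μ₀+ϵθ,1). At the stable point, Δ(t)=(λ/2)(t−θPS)², so quadratic growth holds with γ=λ/2.

```python
import math

# Illustrative 1-D toy model (our construction, not from the paper):
# l(z; t) = z*t + (lam/2)*t^2, D(theta) = N(mu0 + eps*theta, 1).
mu0, lam, eps = 1.0, 1.0, 0.2
theta_ps = -mu0 / (lam + eps)      # performatively stable point
theta_po = -mu0 / (2 * eps + lam)  # performatively optimal point

# At theta_ps: Delta(t) = DPR(theta_ps, t) - PR(theta_ps) = (lam/2)*(t - theta_ps)^2,
# so quadratic growth holds with gamma = lam/2.
gamma = lam / 2
# On [-1, 0], which contains both points, the loss is 1-Lipschitz in z and
# W1(D(theta), D(theta')) = eps*|theta - theta'| <= eps =: B.
L, B = 1.0, eps

dist = abs(theta_po - theta_ps)
bound = math.sqrt(L * B / gamma)
print(dist <= bound)               # the distance bound of Example 4.4
```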

Example 4.6.

Under Assumption 4, if Δθ(θ)\Delta_{\theta}(\theta^{\prime}) satisfies quadratic growth, then the performative optimal point θPO{\theta_{\mathrm{PO}}} satisfies

θPOθLϵγ.\|{\theta_{\mathrm{PO}}}-\theta\|\leq\frac{L\epsilon}{\gamma}.
Proof 4.7.

We prove the claim by contradiction. Suppose the performative optimal point θPO{\theta_{\mathrm{PO}}} satisfies θPOθ>Lϵ/γ\|{\theta_{\mathrm{PO}}}-\theta\|>\nicefrac{{L\epsilon}}{{\gamma}}. The quadratic growth then shows Δθ(θPO)γθPOθ2>LϵθPOθ\Delta_{\theta}({\theta_{\mathrm{PO}}})\geq\gamma\|{\theta_{\mathrm{PO}}}-\theta\|^{2}>L\epsilon\|{\theta_{\mathrm{PO}}}-\theta\|. At the same time, the sensitivity assumption 4 shows

W(𝒟(θ),𝒟(θPO))ϵθθPO.W({\mathcal{D}}(\theta),{\mathcal{D}}({\theta_{\mathrm{PO}}}))\leq\epsilon\|\theta-{\theta_{\mathrm{PO}}}\|.

Therefore we have Δθ(θPO)>LW(𝒟(θ),𝒟(θPO))\Delta_{\theta}({\theta_{\mathrm{PO}}})>LW({\mathcal{D}}(\theta),{\mathcal{D}}({\theta_{\mathrm{PO}}})). Substituting this relation into Eq. 9 yields PR(θPO)>PR(θ)\mathrm{PR}({\theta_{\mathrm{PO}}})>\mathrm{PR}(\theta), which contradicts the optimality of θPO{\theta_{\mathrm{PO}}}.
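Example 4.6 can likewise be checked numerically on a toy one-dimensional model of our own construction (not from the paper): loss ℓ(z;t)=zt+λt²/2 with map 𝒟(θ)=𝒩(μ₀+ϵθ,1), which is ϵ-sensitive since the W1 distance equals ϵ|θ−θ′|.

```python
# Illustrative 1-D toy model (our construction, not from the paper):
# l(z; t) = z*t + (lam/2)*t^2, D(theta) = N(mu0 + eps*theta, 1),
# which is eps-sensitive since W1(D(theta), D(theta')) = eps * |theta - theta'|.
mu0, lam, eps = 1.0, 1.0, 0.2
theta_ps = -mu0 / (lam + eps)      # performatively stable point
theta_po = -mu0 / (2 * eps + lam)  # performatively optimal point

gamma = lam / 2   # quadratic-growth constant of Delta at theta_ps in this model
L = 1.0           # Lipschitz constant of the loss in z on [-1, 0]

dist = abs(theta_po - theta_ps)
bound = L * eps / gamma
print(dist <= bound)               # the distance bound of Example 4.6
```

Note that in this instance the sensitivity-based bound Lϵ/γ is tighter than the boundedness-based bound of Example 4.4.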

To conclude, in this section we attempt to make use of stable points for finding optimal points. Although θPS{\theta_{\mathrm{PS}}} and θPO{\theta_{\mathrm{PO}}} do not imply each other, as discussed above, we aim to reveal when and how stable points can serve as starting points for finding optimal points. In such scenarios, previous work on the convergence of retraining methods to stable points remains valuable.

Concretely, we have shown several conditions under which stable points offer a good approximation of optimal points. Then, with only local assumptions on PR\mathrm{PR}, one can start from a stable point in an optimization procedure and eventually converge to an optimal point.

Remark 4.8.

An interesting special case where stable points are optimal, without requiring any global regularity, is when the Bayes error of every induced distribution is the same. This means all minima lie on one level set; hence, no other point can achieve a smaller loss.

5 Conclusion

This paper studies optimization aspects of the performative risk minimization problem.

First, we relax the standard strong convexity assumption on the loss function required in prior work to show convergence of iterative optimization methods in performative prediction. In particular, we study two weaker conditions on DPR\mathrm{DPR}, WSC and RSI, both of which are sufficient to establish weak convexity of the performative risk under suitable assumptions on the distribution map. Our work takes a first step towards importing refined assumptions from the classical optimization literature into performative prediction.

Second, we contribute by raising interesting questions about the meaning of local regularity assumptions in the context of performative prediction. If stable points can be found heuristically and serve as good approximations of global optima, then local regularity of PR\mathrm{PR} can be sufficient for finding optimal points. We provide some bounds on the distance between optimal and stable points, but it remains to better understand how the two relate in practical settings.

As for future work, we note that a very interesting and important topic to study is the PL condition, which is among the weakest assumptions in the implication chain and is ubiquitous in over-parameterized neural networks (Zhou et al., 2021).

Suppose the following μ\mu-PL condition holds:

12θDPR(θ,θ)2μ(DPR(θ,θ)PR(θ))=μΔθ(θ).\frac{1}{2}\left\|\nabla_{\theta^{\prime}}\mathrm{DPR}(\theta,\theta^{\prime})\right\|^{2}\geq\mu(\mathrm{DPR}(\theta,\theta^{\prime})-\mathrm{PR}(\theta))=\mu\Delta_{\theta}(\theta^{\prime}).

Naturally, we expand the RHS through Eq. 9:

θDPR(θ,θ)22μ\displaystyle\frac{\left\|\nabla_{\theta^{\prime}}\mathrm{DPR}(\theta,\theta^{\prime})\right\|^{2}}{2\mu} PR(θ)PR(θ)LW(𝒟(θ),𝒟(θ)).\displaystyle\geq\mathrm{PR}(\theta^{\prime})-\mathrm{PR}(\theta)-LW({\mathcal{D}}(\theta),{\mathcal{D}}(\theta^{\prime})).
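For intuition, the μ-PL condition above can be verified on a toy model of our own construction (not from the paper): with ℓ(z;t)=zt+λt²/2 and 𝒟(θ)=𝒩(μ₀+ϵθ,1), DPR(θ,·) is λ-strongly convex and hence PL with μ=λ.

```python
import random

# Illustrative toy model (our construction): DPR(theta, t) = (mu0 + eps*theta)*t + (lam/2)*t^2.
# DPR(theta, .) is lam-strongly convex, so it is PL with mu = lam relative to its minimum;
# since PR(theta) = DPR(theta, theta) >= min_t DPR(theta, t), the condition with
# Delta_theta(t) = DPR(theta, t) - PR(theta) holds a fortiori.
mu0, lam, eps = 1.0, 1.0, 0.2
mu = lam

def DPR(theta, t):
    return (mu0 + eps * theta) * t + (lam / 2) * t ** 2

def grad_t(theta, t):
    return (mu0 + eps * theta) + lam * t

random.seed(0)
ok = all(
    0.5 * grad_t(th, t) ** 2 >= mu * (DPR(th, t) - DPR(th, th)) - 1e-12
    for th, t in ((random.uniform(-2, 2), random.uniform(-2, 2)) for _ in range(1000))
)
print(ok)  # the mu-PL inequality holds at all sampled points
```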

Building on this, we raise several interesting questions related to finding performative optima.

  1.

    Understanding when and how (e.g., under which structural properties of the loss function or for which natural sets of distributions) it holds that W(𝒟(θ),𝒟(θ))CθDPR(θ,θ)2W({\mathcal{D}}(\theta),{\mathcal{D}}(\theta^{\prime}))\leq C\|\nabla_{\theta^{\prime}}\mathrm{DPR}(\theta,\theta^{\prime})\|^{2}, where CC is a certain constant. Since this condition characterizes local properties of DPR\mathrm{DPR} near performative stable points, it could be more common in practice. Notice that the current performative prediction literature relies heavily on the sensitivity relation (cf. Assumption 4).

  2.

    What is the impact of data pre-processing steps on performative shifts? Can they influence performativity effects? For example, can data whitening result in distributions with favorable properties, such as belonging to a location-scale family?

Acknowledgements

The author gratefully acknowledges Aurelien Lucchi and Celestine Mendler-Dünner for numerous helpful discussions and advice during the internship at ETH Zürich.

References

  • Anitescu (2000) Mihai Anitescu. Degenerate nonlinear programming with a quadratic growth condition. SIAM Journal on Optimization, 10(4):1116–1135, 2000.
  • Bianchin et al. (2021) Gianluca Bianchin, Miguel Vaquero, Jorge Cortes, and Emiliano Dall’Anese. Online stochastic optimization for unknown linear systems: Data-driven synthesis and controller analysis. arXiv preprint arXiv:2108.13040, 2021.
  • Bolte et al. (2017) Jérôme Bolte, Trong Phong Nguyen, Juan Peypouquet, and Bruce W Suter. From error bounds to the complexity of first-order descent methods for convex functions. Mathematical Programming, 165(2):471–507, 2017.
  • Brown et al. (2022) Gavin Brown, Shlomi Hod, and Iden Kalemaj. Performative Prediction in a Stateful World. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151, pages 6045–6061. PMLR, 2022.
  • Callon et al. (2007) Michel Callon et al. What does it mean to say that economics is performative. Do economists make markets, pages 311–357, 2007.
  • Clarke (2012) Chris Clarke. Financial engineering, not economic photography: Popular discourses of finance and the layered performances of the sub-prime crisis. Journal of Cultural Economy, 5(3):261–278, 2012.
  • Cochoy et al. (2010) Franck Cochoy, Martin Giraudeau, and Liz McFall. Performativity, economics and politics: An overview. Journal of Cultural Economy, 3(2):139–146, 2010.
  • Drusvyatskiy and Lewis (2018) Dmitriy Drusvyatskiy and Adrian S Lewis. Error bounds, quadratic growth, and linear convergence of proximal methods. Mathematics of Operations Research, 43(3):919–948, 2018.
  • Drusvyatskiy and Xiao (2022) Dmitriy Drusvyatskiy and Lin Xiao. Stochastic optimization with decision-dependent distributions. Mathematics of Operations Research, 2022.
  • Hardt et al. (2016) Moritz Hardt, Nimrod Megiddo, Christos Papadimitriou, and Mary Wootters. Strategic classification. In Proceedings of the 2016 ACM conference on innovations in theoretical computer science, pages 111–122, 2016.
  • Harris et al. (2021) Keegan Harris, Hoda Heidari, and Steven Z Wu. Stateful strategic regression. Advances in Neural Information Processing Systems, 34:28728–28741, 2021.
  • Hazra et al. (2017) Nil Kamal Hazra, Mithu Rani Kuiti, Maxim Finkelstein, and Asok K Nanda. On stochastic comparisons of maximum order statistics from the location-scale family of distributions. Journal of Multivariate Analysis, 160:31–41, 2017.
  • Izzo et al. (2021) Zachary Izzo, Lexing Ying, and James Zou. How to learn when data reacts to your model: performative gradient descent. In International Conference on Machine Learning, pages 4641–4650. PMLR, 2021.
  • Jagadeesan et al. (2021) Meena Jagadeesan, Celestine Mendler-Dünner, and Moritz Hardt. Alternative microfoundations for strategic classification. In International Conference on Machine Learning, pages 4687–4697. PMLR, 2021.
  • Jagadeesan et al. (2022) Meena Jagadeesan, Tijana Zrnic, and Celestine Mendler-Dünner. Regret Minimization with Performative Feedback. In International Conference on Machine Learning, pages 9760–9785. PMLR, 2022.
  • Karimi et al. (2016) Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.
  • Li and Wai (2022) Qiang Li and Hoi-To Wai. State dependent performative prediction with stochastic approximation. In International Conference on Artificial Intelligence and Statistics, pages 3164–3186. PMLR, 2022.
  • Liu and Wright (2015) Ji Liu and Stephen J Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM Journal on Optimization, 25(1):351–376, 2015.
  • Liu et al. (2014) Ji Liu, Steve Wright, Christopher Ré, Victor Bittorf, and Srikrishna Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. In International Conference on Machine Learning, pages 469–477. PMLR, 2014.
  • Loizou et al. (2021) Nicolas Loizou, Sharan Vaswani, Issam Hadj Laradji, and Simon Lacoste-Julien. Stochastic polyak step-size for sgd: An adaptive learning rate for fast convergence. In International Conference on Artificial Intelligence and Statistics, pages 1306–1314. PMLR, 2021.
  • Luo and Tseng (1993) Zhi-Quan Luo and Paul Tseng. Error bounds and convergence analysis of feasible descent methods: a general approach. Annals of Operations Research, 46(1):157–178, 1993.
  • Ma et al. (2016) Chenxin Ma, Rachael Tappenden, and Martin Takáč. Linear convergence of randomized feasible descent methods under the weak strong convexity assumption. The Journal of Machine Learning Research, 17(1):8138–8161, 2016.
  • Mendler-Dünner et al. (2020) Celestine Mendler-Dünner, Juan C. Perdomo, Tijana Zrnic, and Moritz Hardt. Stochastic Optimization for Performative Prediction. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, 2020.
  • Mendler-Dünner et al. (2022) Celestine Mendler-Dünner, Frances Ding, and Yixin Wang. Predicting from Predictions. arXiv preprint arXiv:2208.07331, 2022.
  • Miller et al. (2021) John P Miller, Juan C Perdomo, and Tijana Zrnic. Outside the Echo Chamber: Optimizing the Performative Risk. In International Conference on Machine Learning, pages 7710–7720. PMLR, 2021.
  • Necoara et al. (2019) Ion Necoara, Yu Nesterov, and Francois Glineur. Linear convergence of first order methods for non-strongly convex optimization. Mathematical Programming, 175(1):69–107, 2019.
  • Negahban and Wainwright (2012) Sahand Negahban and Martin J Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. The Journal of Machine Learning Research, 13(1):1665–1697, 2012.
  • Nouiehed et al. (2019) Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D Lee, and Meisam Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods. Advances in Neural Information Processing Systems, 32, 2019.
  • Perdomo et al. (2020) Juan Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt. Performative prediction. In International Conference on Machine Learning, pages 7599–7609. PMLR, 2020.
  • Polyak (1963) Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal vychislitel’noi matematiki i matematicheskoi fiziki, 3(4):643–653, 1963.
  • Roberts (2009) Brian Roberts. Performative social science: A consideration of skills, purpose and context. Historical Social Research/Historische Sozialforschung, pages 307–353, 2009.
  • Strömberg (2011) Thomas Strömberg. Duality between Fréchet differentiability and strong convexity. Positivity, 15(3):527–536, 2011.
  • Vaswani et al. (2019) Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of sgd for over-parameterized models and an accelerated perceptron. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1195–1204. PMLR, 2019.
  • Wang et al. (2017) Di Wang, Minwei Ye, and Jinhui Xu. Differentially private empirical risk minimization revisited: Faster and more general. Advances in Neural Information Processing Systems, 30, 2017.
  • Wong and Ma (2008) Wing-Keung Wong and Chenghu Ma. Preferences over location-scale family. Economic Theory, 37(1):119–146, 2008.
  • Wood et al. (2021) Killian Wood, Gianluca Bianchin, and Emiliano Dall’Anese. Online projected gradient descent for stochastic optimization with decision-dependent distributions. IEEE Control Systems Letters, 6:1646–1651, 2021.
  • Zhang (2017) Hui Zhang. The restricted strong convexity revisited: analysis of equivalence to error bound and quadratic growth. Optimization Letters, 11(4):817–833, 2017.
  • Zhang and Yin (2013) Hui Zhang and Wotao Yin. Gradient methods for convex minimization: better rates under weaker conditions. arXiv preprint arXiv:1303.4645, 2013.
  • Zhou et al. (2021) Mo Zhou, Rong Ge, and Chi Jin. A local convergence theory for mildly over-parameterized two-layer neural network. In Conference on Learning Theory, pages 4577–4632. PMLR, 2021.

Appendix A Proofs for Section 3

Proofs for Theorem 3.2

Proof A.1.

At θPO{\theta_{\mathrm{PO}}}, since the performative optimum is also performatively stable, we know θ=θPO\theta^{*}={\theta_{\mathrm{PO}}}.

Using the WSC property, we have

𝔼z𝒟(θPO)(z;θPO)𝔼z𝒟(θPO)(z;θ)+θDPR(θPO,θ),θPOθ+μ2θPOθ2.\mathbb{E}_{z\sim{\mathcal{D}}({\theta_{\mathrm{PO}}})}\ell(z;{\theta_{\mathrm{PO}}})\geq\mathbb{E}_{z\sim{\mathcal{D}}({\theta_{\mathrm{PO}}})}\ell(z;\theta)+\langle\nabla_{\theta}\mathrm{DPR}({\theta_{\mathrm{PO}}},\theta),{\theta_{\mathrm{PO}}}-\theta\rangle+\frac{\mu}{2}\|{\theta_{\mathrm{PO}}}-\theta\|^{2}. (12)

Our goal is to characterize the function class of PR(θ)=𝔼z𝒟(θ)[l(z;θ)]\mathrm{PR}(\theta)=\mathbb{E}_{z\sim{\mathcal{D}}(\theta)}[l(z;\theta)].

We expect it to be weakly convex w.r.t. θ\theta, i.e.,

PR(θPO)PR(θ)+PR(θ),θPOθ,\mathrm{PR}({\theta_{\mathrm{PO}}})\geq\mathrm{PR}(\theta)+\langle\nabla\mathrm{PR}(\theta),{\theta_{\mathrm{PO}}}-\theta\rangle, (13)

or using the definition of PR(θ)\mathrm{PR}(\theta):

𝔼z𝒟(θPO)[l(z;θPO)]𝔼z𝒟(θ)[l(z;θ)]+𝔼z𝒟(θ)[l(z;θ)],θPOθ.\mathbb{E}_{z\sim{\mathcal{D}}({\theta_{\mathrm{PO}}})}[l(z;{\theta_{\mathrm{PO}}})]\geq\mathbb{E}_{z\sim{\mathcal{D}}(\theta)}[l(z;\theta)]+\langle\nabla\mathbb{E}_{z\sim{\mathcal{D}}(\theta)}[l(z;\theta)],{\theta_{\mathrm{PO}}}-\theta\rangle. (14)

First, we separate the gradient of PR\mathrm{PR} into

θPR(θ)=𝔼z𝒟(θ)θ(z;θ)1+𝔼z𝒟(θ)(z;θ)θlogpθ(z)2.\displaystyle\nabla_{\theta}\mathrm{PR}(\theta)=\overbrace{\mathbb{E}_{z\sim{\mathcal{D}}(\theta)}\nabla_{\theta}\ell(z;\theta)}^{\nabla_{1}}+\overbrace{\mathbb{E}_{z\sim{\mathcal{D}}(\theta)}\ell(z;\theta)\nabla_{\theta}\log{p_{\theta}(z)}}^{\nabla_{2}}. (15)

We have

PR(θPO)PR(θ)+PR(θ),θPOθ\displaystyle\qquad\mathrm{PR}({\theta_{\mathrm{PO}}})\geq\mathrm{PR}(\theta)+\langle\nabla\mathrm{PR}(\theta),{\theta_{\mathrm{PO}}}-\theta\rangle
(i)DPR(θPO,θ)PR(θ)+(θPOθ)𝔼z𝒟(θPO)(z;θ),θPOθμ2θPOθ2\displaystyle\stackrel{{\scriptstyle(i)}}{{\Leftarrow}}\mathrm{DPR}({\theta_{\mathrm{PO}}},\theta)\geq\mathrm{PR}(\theta)+\nabla^{\top}({\theta_{\mathrm{PO}}}-\theta)-\langle\mathbb{E}_{z\sim{\mathcal{D}}({\theta_{\mathrm{PO}}})}\nabla\ell(z;\theta),{\theta_{\mathrm{PO}}}-\theta\rangle-\frac{\mu}{2}\|{\theta_{\mathrm{PO}}}-\theta\|^{2}
DPR(θPO,θ)PR(θ)+2(θPOθ)+1𝔼z𝒟(θPO)(z;θ),θPOθμ2θPOθ2\displaystyle\Leftrightarrow\mathrm{DPR}({\theta_{\mathrm{PO}}},\theta)\geq\mathrm{PR}(\theta)+\nabla_{2}^{\top}({\theta_{\mathrm{PO}}}-\theta)+\langle\nabla_{1}-\mathbb{E}_{z\sim{\mathcal{D}}({\theta_{\mathrm{PO}}})}\nabla\ell(z;\theta),{\theta_{\mathrm{PO}}}-\theta\rangle-\frac{\mu}{2}\|{\theta_{\mathrm{PO}}}-\theta\|^{2}
(ii)DPR(θPO,θ)PR(θ)+2(θPOθ)+(βϵμ2)θPOθ2\displaystyle\stackrel{{\scriptstyle(ii)}}{{\Leftarrow}}\mathrm{DPR}({\theta_{\mathrm{PO}}},\theta)\geq\mathrm{PR}(\theta)+\nabla_{2}^{\top}({\theta_{\mathrm{PO}}}-\theta)+(\beta\epsilon-\frac{\mu}{2})\|{\theta_{\mathrm{PO}}}-\theta\|^{2}
(iii)μ2βϵ\displaystyle\stackrel{{\scriptstyle(iii)}}{{\Leftarrow}}\frac{\mu}{2}\geq\beta\epsilon
  (i).

    Here we use: PR(θPO)DPR(θPO,θ)+𝔼z𝒟(θPO)(z;θ),θPOθ+μ2θPOθ2\mathrm{PR}({\theta_{\mathrm{PO}}})\geq\mathrm{DPR}({\theta_{\mathrm{PO}}},\theta)+\langle\mathbb{E}_{z\sim{\mathcal{D}}({\theta_{\mathrm{PO}}})}\nabla\ell(z;\theta),{\theta_{\mathrm{PO}}}-\theta\rangle+\frac{\mu}{2}\|{\theta_{\mathrm{PO}}}-\theta\|^{2}.

  (ii).

    This follows from the ϵ\epsilon-sensitivity of the distribution map 𝒟{\mathcal{D}}.

  (iii).

    Mixture dominance implies: DPR(θPO,θ)PR(θ)+2(θPOθ)\mathrm{DPR}({\theta_{\mathrm{PO}}},\theta)\geq\mathrm{PR}(\theta)+\nabla_{2}^{\top}({\theta_{\mathrm{PO}}}-\theta).

We conclude that we only need weak strong convexity of DPR\mathrm{DPR}, local mixture dominance, and local distribution sensitivity to guarantee that PR\mathrm{PR} is weakly convex toward θPO{\theta_{\mathrm{PO}}}, instead of requiring strong convexity everywhere.

Proofs for Theorem 3.6

Proof A.2.

At θPO{\theta_{\mathrm{PO}}}, since the performative optimum is also performatively stable, we know θ=θPO\theta^{*}={\theta_{\mathrm{PO}}}.

Using the RSI property, we have

θDPR(θPO,θ),θθPOμθPOθ2.\langle\nabla_{\theta}\mathrm{DPR}({\theta_{\mathrm{PO}}},\theta),\theta-{\theta_{\mathrm{PO}}}\rangle\geq\mu\|{\theta_{\mathrm{PO}}}-\theta\|^{2}.

We now derive backwards: we want to show that PR\mathrm{PR} satisfies RSI, i.e.,

θPR(θ),θθPOμθPOθ2\displaystyle\qquad\langle\nabla_{\theta}\mathrm{PR}(\theta),\theta-{\theta_{\mathrm{PO}}}\rangle\geq\mu^{\prime}\|{\theta_{\mathrm{PO}}}-\theta\|^{2}
0μθPOθ2+1,θPOθ+2,θPOθ\displaystyle\Leftrightarrow 0\geq\mu^{\prime}\|{\theta_{\mathrm{PO}}}-\theta\|^{2}+\langle\nabla_{1},{\theta_{\mathrm{PO}}}-\theta\rangle+\langle\nabla_{2},{\theta_{\mathrm{PO}}}-\theta\rangle
0μθPOθ2+1θDPR(θPO,θ),θPOθ+θDPR(θPO,θ),θPOθ+2,θPOθ\displaystyle\Leftrightarrow 0\geq\mu^{\prime}\|{\theta_{\mathrm{PO}}}-\theta\|^{2}+\langle\nabla_{1}-\nabla_{\theta}\mathrm{DPR}({\theta_{\mathrm{PO}}},\theta),{\theta_{\mathrm{PO}}}-\theta\rangle+\langle\nabla_{\theta}\mathrm{DPR}({\theta_{\mathrm{PO}}},\theta),{\theta_{\mathrm{PO}}}-\theta\rangle+\langle\nabla_{2},{\theta_{\mathrm{PO}}}-\theta\rangle
(i)0(μ+βϵ)θPOθ2+θDPR(θPO,θ),θPOθ+2,θPOθ\displaystyle\stackrel{{\scriptstyle(i)}}{{\xLeftarrow{}}}0\geq(\mu^{\prime}+\beta\epsilon)\|{\theta_{\mathrm{PO}}}-\theta\|^{2}+\langle\nabla_{\theta}\mathrm{DPR}({\theta_{\mathrm{PO}}},\theta),{\theta_{\mathrm{PO}}}-\theta\rangle+\langle\nabla_{2},{\theta_{\mathrm{PO}}}-\theta\rangle
(ii)0(μ+βϵ)θPOθ2+DPR(θPO,θ),θPOθ+DPR(θPO,θ)PR(θ)\displaystyle\stackrel{{\scriptstyle(ii)}}{{\xLeftarrow{}}}0\geq(\mu^{\prime}+\beta\epsilon)\|{\theta_{\mathrm{PO}}}-\theta\|^{2}+\langle\nabla\mathrm{DPR}({\theta_{\mathrm{PO}}},\theta),{\theta_{\mathrm{PO}}}-\theta\rangle+\mathrm{DPR}({\theta_{\mathrm{PO}}},\theta)-\mathrm{PR}(\theta)
(iii)0(μ+βϵ+Lϵ)max(θPOθ,θPOθ2)+DPR(θPO,θ),θPOθ\displaystyle\stackrel{{\scriptstyle(iii)}}{{\xLeftarrow{}}}0\geq(\mu^{\prime}+\beta\epsilon+L\epsilon)\max(\|{\theta_{\mathrm{PO}}}-\theta\|,\|{\theta_{\mathrm{PO}}}-\theta\|^{2})+\langle\nabla\mathrm{DPR}({\theta_{\mathrm{PO}}},\theta),{\theta_{\mathrm{PO}}}-\theta\rangle

where:

  (i).

    This follows from the ϵ\epsilon-sensitivity of the distribution map 𝒟{\mathcal{D}}.

  (ii).

    From mixture dominance we know DPR(θPO,θ)PR(θ)+2,θPOθ\mathrm{DPR}({\theta_{\mathrm{PO}}},\theta)\geq\mathrm{PR}(\theta)+\langle\nabla_{2},{\theta_{\mathrm{PO}}}-\theta\rangle.

  (iii).

    This follows from the LL-Lipschitzness of the loss function w.r.t. zz.

We distinguish two cases. 1) If θPOθ2θPOθ\|{\theta_{\mathrm{PO}}}-\theta\|^{2}\leq\|{\theta_{\mathrm{PO}}}-\theta\| (close to the optimum), then

DPR(θPO,θ),θθPOμθPOθ2(μ+βϵ+Lϵ)θPOθ\displaystyle\underbrace{\langle\nabla\mathrm{DPR}({\theta_{\mathrm{PO}}},\theta),\theta-{\theta_{\mathrm{PO}}}\rangle}_{\geq\mu\|{\theta_{\mathrm{PO}}}-\theta\|^{2}}\geq(\mu^{\prime}+\beta\epsilon+L\epsilon)\|{\theta_{\mathrm{PO}}}-\theta\| (16)

Tracing back: assuming DPR\mathrm{DPR} is μ\mu-RSI, our goal is achieved if

μ=μθPOθ(β+L)ϵ0\displaystyle\mu^{\prime}=\mu\|{\theta_{\mathrm{PO}}}-\theta\|-(\beta+L)\epsilon\geq 0 (17)

2) If θPOθθPOθ2\|{\theta_{\mathrm{PO}}}-\theta\|\leq\|{\theta_{\mathrm{PO}}}-\theta\|^{2}, then

DPR(θPO,θ),θθPOμθPOθ2(μ+βϵ+Lϵ)θPOθ2\displaystyle\langle\nabla\mathrm{DPR}({\theta_{\mathrm{PO}}},\theta),\theta-{\theta_{\mathrm{PO}}}\rangle\geq\mu\|{\theta_{\mathrm{PO}}}-\theta\|^{2}\geq(\mu^{\prime}+\beta\epsilon+L\epsilon)\|{\theta_{\mathrm{PO}}}-\theta\|^{2} (18)

Tracing back again: assuming DPR\mathrm{DPR} is μ\mu-RSI, our goal is achieved if

μ=μ(β+L)ϵ0\displaystyle\mu^{\prime}=\mu-(\beta+L)\epsilon\geq 0 (19)

This completes the proof.
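As a small numerical sanity check of this RSI-type conclusion, in a toy one-dimensional model of our own construction (not from the paper; loss ℓ(z;t)=zt+λt²/2, map 𝒟(θ)=𝒩(μ₀+ϵθ,1)), PR is an exact quadratic and satisfies the RSI inequality with a positive constant μ′:

```python
import random

# Illustrative 1-D toy model (our own construction): l(z; t) = z*t + (lam/2)*t^2,
# D(theta) = N(mu0 + eps*theta, 1), so PR(theta) = mu0*theta + (eps + lam/2)*theta^2.
mu0, lam, eps = 1.0, 1.0, 0.2
theta_po = -mu0 / (2 * eps + lam)         # minimizer of PR

def grad_PR(theta):
    return mu0 + (2 * eps + lam) * theta  # = (2*eps + lam) * (theta - theta_po)

mu_prime = 2 * eps + lam                  # RSI constant of PR in this model
random.seed(1)
ok = all(
    grad_PR(th) * (th - theta_po) >= mu_prime * (th - theta_po) ** 2 - 1e-12
    for th in (random.uniform(-3, 3) for _ in range(1000))
)
print(ok)  # <grad PR(theta), theta - theta_PO> >= mu' * ||theta - theta_PO||^2
```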