
Diffusion Model for Data-Driven Black-Box Optimization

⋄ Equal contribution, listed in alphabetical order. ‡ Co-correspondence. This paper is a journal version of Yuan et al. (2023).

Zihao Li1,⋄  Hui Yuan1,⋄  Kaixuan Huang1  Chengzhuo Ni1
Yinyu Ye2  Minshuo Chen1,‡  Mengdi Wang1,‡
1Princeton University  2Stanford University
Abstract

Generative AI has redefined artificial intelligence, enabling the creation of innovative content and customized solutions that drive business practices into a new era of efficiency and creativity. In this paper, we focus on diffusion models, a powerful generative AI technology, and investigate their potential for black-box optimization over complex structured variables. Consider the practical scenario where one wants to optimize some structured design in a high-dimensional space, based on massive unlabeled data (representing design variables) and a small labeled dataset. We study two practical types of labels: 1) noisy measurements of a real-valued reward function and 2) human preference based on pairwise comparisons. The goal is to generate new designs that are near-optimal and preserve the designed latent structures. Our proposed method reformulates the design optimization problem into a conditional sampling problem, which allows us to leverage the power of diffusion models for modeling complex distributions. In particular, we propose a reward-directed conditional diffusion model, to be trained on the mixed data, for sampling a near-optimal solution conditioned on high predicted rewards. Theoretically, we establish sub-optimality error bounds for the generated designs. The sub-optimality gap nearly matches the optimal guarantee in off-policy bandits, demonstrating the efficiency of reward-directed diffusion models for black-box optimization. Moreover, when the data admits a low-dimensional latent subspace structure, our model efficiently generates high-fidelity designs that closely respect the latent structure. We provide empirical experiments validating our model in decision-making and content-creation tasks.

1 Introduction

Generative AI has revolutionized the field of artificial intelligence by introducing capabilities that allow for the creation of new, original content — from realistic images and music to sophisticated text and code — paving the way for unprecedented innovations across industries such as entertainment, marketing, and business management (Kshetri et al., 2023; Ooi et al., 2023). In particular, diffusion models stand out as a powerful and universal generative AI technology that serves as the backbone for Stable Diffusion, DALLE 3, Sora, etc. (Ho et al., 2020; Song et al., 2021; Rombach et al., 2022; Liu et al., 2024; Esser et al., 2024). Diffusion models, known as a family of score-matching generative models, have demonstrated state-of-the-art performance in various domains, such as image generation (Rombach et al., 2022; Ramesh et al., 2022; Balaji et al., 2022) and audio generation, with fascinating potential in broader domains, including text modeling (Austin et al., 2021; Li et al., 2022), reinforcement learning (Janner et al., 2022; Ajay et al., 2023; Pearce et al., 2023; Liang et al., 2023) and protein structure modeling (Lee et al., 2023b). Diffusion models are trained to sample from the training data's distribution and generate a new sample through a sequential denoising process. The denoising network (a.k.a. score network) s(x,t) approximates the score function \nabla\log p_{t}(x) (Song et al., 2020a, b), and controls the behavior of diffusion models. Any control information c can be incorporated as an additional input to the score network s(x,c,t) during training and inference (Ramesh et al., 2022; Zhang and Agrawala, 2023).

In this paper, we focus on capitalizing on diffusion models for data-driven black-box optimization, where the goal is to generate new solutions that optimize an unknown objective function. Black-box optimization problems, also known as model-based optimization in machine learning, cover various application domains including computational biology (Watson et al., 2023; Guo et al., 2023), reinforcement learning (Ajay et al., 2022; Levine et al., 2020; Jin et al., 2021), and management science, e.g., pricing (Bu et al., 2023; Homburg et al., 2019; Wang et al., 2021) and recommendation systems (Pan and Wu, 2020; Pan et al., 2019). In these applications, the data is high-dimensional and the unknown objective function is nonconvex, posing challenges to searching for good solutions. To further complicate matters, plausible solutions, e.g., molecule structures in drug discovery (Barrera et al., 2016), are often high-dimensional with latent geometric structures – they lie on a low-dimensional manifold (Kumar and Levine, 2020). This additionally poses a critical requirement for black-box optimization: we need to capture the data's geometric structure to avoid suggesting unrealistic solutions that deviate severely from the original data domain.

Data-driven black-box optimization differs from conventional optimization problems, as it exposes knowledge of the objective function only via a pre-collected dataset (Kumar and Levine, 2020; Brookes et al., 2019; Fu and Levine, 2021), while additional interactions with the objective function are prohibited. This excludes the possibility of sequentially searching for good solutions, e.g., by gradient ascent, aided by feedback from the objective function at each iteration. Instead, we reformulate data-driven black-box optimization as sampling from a conditional distribution. This novel perspective is illustrated in Figure 1, where the objective function value serves as the conditioning variable. We learn such a conditional distribution from the pre-collected dataset using diffusion models, and near-optimal solutions are then generated by conditioning on large objective values. Meanwhile, the generated solutions are expected to respect the data's latent structure, since the data distribution encodes the corresponding data geometry.

Figure 1: Generative AI for black-box optimization via conditional sampling. We convert the problem of black-box optimization into the problem of sampling from a conditional distribution learned from a pre-collected dataset.

The subtlety of this method, however, lies in that generating new solutions potentially conflicts with the training process: diffusion models are trained to generate solutions similar to the training distribution, whereas optimizing the objective function drives the model to deviate from the training distribution. In other words, the model needs to both “interpolate” and “extrapolate”. A higher value of the conditioning provides a stronger signal that guides the diffusion model towards higher objective function values, while the increasing distribution shift (i.e., the mismatch between the training data and the targeted generated data; see Figure 2(a)(b)) may hurt the quality of the generated data.

In this paper, we term various objective functions as reward functions and we introduce reward-directed conditional diffusion models for data-driven black-box optimization, with both the real-valued rewards and human preference. We show that reward-directed diffusion models enjoy strong theoretical guarantees and appealing empirical implementations. The theoretical crux to the success of our method centers around the following questions:

How to provably estimate the reward-conditioned distribution in black-box optimization via diffusion?
How to properly choose high reward guidance in diffusion models for data generation, so as to ensure reward improvement among generated data?

Our Approach.

To answer the questions above, we consider a semi-supervised learning setting, where we are given a small annotated dataset \mathcal{D}_{\rm label} and a massive unlabeled dataset \mathcal{D}_{\rm unlabel}. We study two forms of rewards in \mathcal{D}_{\rm label}:

  • (Real-valued reward) The dataset consists of data and reward pairs, i.e., the reward is a real-valued noise-perturbed version of the underlying ground-truth reward;

  • (Human preference) The dataset consists of triples, each containing two comparable data points and a binary preference label. The preference label indicates which of the two data points is likely to have a higher underlying reward.

Our semi-supervised formulation spans a wide range of applications. For example, in protein design problems, there is a large collection of unlabeled structures, while only a small fraction underwent expensive wet-lab tests. We remark that our method can also be applied to learning scenarios, where only labeled data is available.

Our proposed method begins with estimating the reward function using \mathcal{D}_{\rm label} and leveraging the estimator for pseudo-labeling on \mathcal{D}_{\rm unlabel}. Then we train a reward-conditioned diffusion model using the pseudo-labeled data. In real-world applications, there are other ways to incorporate the knowledge from the massive dataset \mathcal{D}_{\rm unlabel}, e.g., fine-tuning a pre-trained model (Ouyang et al., 2022; Zhang and Agrawala, 2023). We focus on the pseudo-labeling approach, as it provides a cleaner formulation and exposes the error dependency on data size and distribution shift. The underlying intuition and message are applicable to other semi-supervised approaches; see the experiments in Section 7.2.

To model the data's intrinsic low-dimensional structure from a theoretical standpoint, we consider data x to have a latent linear representation. Specifically, we assume x=Az for some unknown matrix A with orthonormal columns and a latent variable z. The latent variable often has a smaller dimension, reflecting the fact that practical datasets are often low-dimensional (Gong et al., 2019; Tenenbaum et al., 2000; Pope et al., 2021). The representation matrix A should be learned to promote sample efficiency and generation quality (Chen et al., 2023a). Our theoretical analysis reveals an intricate interplay between reward guidance, distribution shift, and implicit representation learning.

Contributions.

Our results are summarized as follows.

  • (1)

    We show that the reward-conditioned diffusion model implicitly learns the latent subspace representation of x. Consequently, the model provably generates high-fidelity data that stays close to the subspace (Theorem 5.4).

  • (2)

    Given a target reward value, we analyze the sub-optimality of reward-directed generation, measured by the difference between the target value and the average reward of the generated data. In the case of a linear reward model, we show that the regret mimics the off-policy regret of linear bandits with full knowledge of the subspace feature. In other words, the reward-conditioned generation can be viewed as a form of off-policy bandit learning in the latent feature space. We further instantiate the real-valued reward and human preference setups, and prove that our algorithm achieves a converging sub-optimality in both cases (Theorems 6.3 and 6.9).

  • (3)

    We further extend our theory to nonparametric settings where reward prediction and the training of diffusion models are parameterized by general function classes, which covers the widely adopted ReLU neural networks used in real-world implementations (Section 6.2).

  • (4)

    We provide numerical experiments on synthetic data, text-to-image generation, and offline reinforcement learning to support our theory (Section 7).

To the best of our knowledge, our results present the first statistical theory for conditional diffusion models and provable reward improvement guarantees for reward-directed generation.

Figure 2: Illustration of distribution shifts in samples and reward, as well as encoder-decoder score networks. When performing reward-directed conditional generation, (a) the distribution of the generated data shifts, but still stays close to the feasible data support; (b) the distribution of the rewards for the generated data shifts and the average reward improves; (c) the score network for reward-directed conditional diffusion adopts an encoder-decoder structure.

2 Related Work

Our work lies at the intersection of three streams of research: guided diffusion, the theory of diffusion models, and black-box optimization.

2.1 Guided Diffusion Models

Diffusion models successfully incorporate diverse guidance in practical applications. For image generation, guiding the backward diffusion process towards higher log probabilities predicted by a classifier (which can be viewed as the reward signal) leads to improved sample quality, where the classifier can either be externally trained, i.e., classifier guidance (Dhariwal and Nichol, 2021) or implicitly specified by a conditioned diffusion model, i.e., classifier-free guidance (Ho and Salimans, 2022). Classifier-free guidance has become a standard technique in the state-of-the-art text-to-image diffusion models (Rombach et al., 2022; Ramesh et al., 2022; Balaji et al., 2022). Other types of guidance are also explored in Nichol et al. (2021); Graikos et al. (2022); Bansal et al. (2023).

Similar ideas have been explored in sequence modeling problems. In offline reinforcement learning, Decision Diffuser (Ajay et al., 2023) is a diffusion model trained on offline trajectories that can be conditioned to generate new trajectories with high returns, satisfying certain safety constraints, or composing skills. For discrete generation, Diffusion LM (Li et al., 2022) manages to train diffusion models on discrete text space with an additional embedding layer and a rounding step. The authors further show that gradients of any classifier can be incorporated to control and guide the text generation. This appealing empirical performance raises strong curiosity about the theoretical underpinnings.

2.2 Theory of Diffusion Models

The theoretical foundations of diffusion models have been developing rapidly over the past two years. A line of work studies diffusion models from a sampling perspective (De Bortoli et al., 2021; Albergo et al., 2023; Block et al., 2020; Lee et al., 2022a; Chen et al., 2022b; Lee et al., 2022b; Chen et al., 2023c, b; Benton et al., 2023; Montanari and Wu, 2023). When assuming access to a score estimator that accurately approximates the ground-truth score function in L^{\infty} or L^{2} norm, Chen et al. (2022b); Lee et al. (2023a) provide polynomial convergence guarantees for score-based diffusion models. De Bortoli (2022) further studies diffusion models under the manifold hypothesis.

Recently, Chen et al. (2023a); Oko et al. (2023); Mei and Wu (2023); Wibisono et al. (2024) provide an end-to-end analysis of diffusion models, leveraging both sampling and statistical tools. In particular, they develop score estimation and distribution estimation guarantees for diffusion models. Statistical sample complexities are established and explain the success of diffusion models by their minimax optimality and adaptivity to intrinsic data structures. These results largely motivate our theory, whereas we are the first to consider conditional score matching and a statistical analysis of conditional diffusion models.

2.3 Black-Box Optimization

In this paper, we concentrate on black-box optimization using a pre-collected dataset, while additional first/second-order information cannot be accessed beyond the given dataset. This is sometimes termed offline black-box or zeroth-order optimization (Fu and Levine, 2021; Shahriari et al., 2015; Luenberger et al., 2021), and is different from the widely studied setting where active interactions with the objective function are allowed (Peters and Schaal, 2007; Snoek et al., 2014; Hebbal et al., 2019; Snoek et al., 2012). To solve black-box optimization without active function interactions, Kumar and Levine (2020) propose a model inversion network that learns a stochastic inverse mapping from a reward value to the data instances. Then they generate new instances corresponding to the best possible reward. Such a method demonstrates empirical success, leaving theoretical understanding unexplored. Notably, Krishnamoorthy et al. (2023) also empirically study offline black-box optimization with diffusion models and classifier-free guidance.

Our proposed method and analysis are related to offline bandit/RL theory (Munos and Szepesvári, 2008; Liu et al., 2018; Chen and Jiang, 2019; Fan et al., 2020; Jin et al., 2021; Nguyen-Tang et al., 2021; Brandfonbrener et al., 2021). In particular, our theory extensively deals with distribution shift in the pre-collected dataset via class-restricted divergence measures, which are commonly adopted in offline RL (Duan et al., 2020; Xie et al., 2021; Ji et al., 2022). Moreover, our established sub-optimality bound of the generated data contains an error term that coincides with off-policy linear bandits. However, our analysis goes far beyond the scope of bandit/RL.

3 Problem Setup

Recall that we are presented with a pre-collected dataset for solving data-driven black-box optimization. The massive unlabeled dataset \mathcal{D}_{\rm unlabel} contains only feature vectors \{x_{i}\}_{i=1}^{n_{1}}, where n_{1} denotes the sample size of \mathcal{D}_{\rm unlabel}. For the labeled dataset \mathcal{D}_{\rm label}, we study real-valued rewards and human preferences. In both cases, there exists a ground-truth reward function f^{*} evaluating the quality of any data point x.

\bullet Real-valued reward. In this setting, we assume access to a label y_{i} for each x_{i}. In such a case,

\mathcal{D}_{\rm label}=\{(x_{i},y_{i})\}_{i=1}^{n_{2}},

where n_{2} is the sample size, x_{i} is randomly sampled, and y_{i} is a noisy observation of the reward:

y_{i}=f^{*}(x_{i})+\xi_{i}\quad\text{for}\quad\xi_{i}\sim{\sf N}(0,\sigma^{2}). (3.1)

Here, \sigma^{2}>0 is the variance of the observation noise. Such a setting corresponds to the scenario where we have a noisy labeler, who can “inform” us how good the generated data point is.

\bullet Human preference. In many real-life applications, a direct evaluation of the reward can be impractical and highly biased, e.g., \mathbb{E}[\xi_{i}]\neq 0 in (3.1). Existing methods such as Schlett et al. (2022) and Žiga Babnik et al. (2023) resort to biological or visual expressions to assess data quality. Nonetheless, these methods are limited when it comes to aligning the generation process with human-desired properties such as aesthetic quality and stylistic requirements (Wang et al., 2023). Thus, in many of these applications, extensive reward engineering or case-by-case training is required, which tends to be costly and time-consuming. Fortunately, recent revolutionary success in language models suggests that human preference is a sensible resource for handling hard-to-perceive abstract rewards (Ouyang et al., 2022; Lee et al., 2021; Wang et al., 2022; Wu et al., 2021; Christiano et al., 2017). In such scenarios, a human expert expresses his/her preference between a pair of randomly sampled data points. The human preference reflects the ground-truth reward value, yet circumvents direct queries on absolute reward values. Formally, our labeled dataset becomes

\mathcal{D}_{\rm label}=\{(x_{i}^{(1)},x_{i}^{(2)},u_{i})\}_{i=1}^{n_{2}},

where x^{(1)}_{i} and x^{(2)}_{i} are i.i.d. randomly sampled and u_{i}\in\{x_{i}^{(1)},x_{i}^{(2)}\} is the preferred instance.

We model the preference as a probabilistic outcome depending on the ground-truth reward of the data x. In particular, we consider the widely adopted Bradley-Terry model (Zhu et al., 2023; Ouyang et al., 2022), where we have

{\mathbb{P}}(u\mid x^{(1)},x^{(2)})=\frac{\exp(f^{*}(u))}{\exp(f^{*}(x^{(1)}))+\exp(f^{*}(x^{(2)}))}. (3.2)

Based on the preference information, we can estimate the underlying reward function and further utilize the estimator for inference. Since there is no need to perform real-valued reward engineering, using preference data reduces the reward tuning efforts and leads to outstanding empirical performance (Ouyang et al., 2022; Zhu et al., 2023; Li et al., 2023; Stiennon et al., 2020; Glaese et al., 2022; Zhang et al., 2023).
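To make the preference model concrete, the following minimal sketch simulates pairwise comparison labels under the Bradley-Terry model (3.2) for a hypothetical linear reward f^{*}(x)=\langle\theta^{*},x\rangle; all variable names are illustrative rather than part of our method.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n2 = 16, 1000
theta_star = rng.normal(size=D)
theta_star /= np.linalg.norm(theta_star)   # hypothetical ground-truth linear reward f*(x) = <theta*, x>

x1 = rng.normal(size=(n2, D))              # first item of each comparison pair
x2 = rng.normal(size=(n2, D))              # second item of each comparison pair
f1, f2 = x1 @ theta_star, x2 @ theta_star

# Bradley-Terry: P(u = x1 | x1, x2) = exp(f*(x1)) / (exp(f*(x1)) + exp(f*(x2)))
p_first = 1.0 / (1.0 + np.exp(-(f1 - f2)))
u_is_first = rng.random(n2) < p_first      # u_i = x1_i with probability p_first, else x2_i
```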

We assume without loss of generality that \mathcal{D}_{\rm label} and \mathcal{D}_{\rm unlabel} are independent. In both datasets, x is i.i.d. sampled from an unknown population distribution P_{x}. In our subsequent analysis, we focus on the case where P_{x} is supported on a latent subspace, meaning that the raw data x admits a low-dimensional representation, as in the following assumption.

Assumption 3.1 .

The data sampling distribution P_{x} is supported on a low-dimensional linear subspace, i.e., x=Az for an unknown A\in\mathbb{R}^{D\times d} with orthonormal columns, where z\in\mathbb{R}^{d} is a latent variable.

Note that our setup covers the full-dimensional setting as a special case when d=D. Yet the case of d<D is much more interesting, as practical datasets are rich in intrinsic geometric structures (Gong et al., 2019; Pope et al., 2021; Tenenbaum et al., 2000; Kumar and Levine, 2020). Furthermore, the representation matrix A may encode critical constraints on the generated data. For example, in protein design, the generated samples need to be similar to natural proteins and abide by the rules of biology; otherwise they easily fail to stay stable, leading to substantial reward decay (Lee et al., 2023b). In those applications, off-support data may be risky and suffer from a large degradation of reward. Based on this motivation, we further assume that the ground-truth reward has the following decomposition.

Assumption 3.2 .

We assume that f^{*}(x)=g^{*}(x_{\parallel})+h^{*}(\|x_{\perp}\|_{2}). Here h^{*} is always nonpositive and x=x_{\parallel}+x_{\perp}, where x_{\parallel} is the projection of x onto the subspace spanned by the columns of A, and x_{\perp} is the orthogonal complement.

Assumption 3.2 states that the reward can be decomposed into two terms: 1) a function that depends only on the projection of x onto the subspace represented by A, and 2) a nonpositive penalty term that measures how far the data deviates from the subspace spanned by A.
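As a small illustration of Assumptions 3.1 and 3.2, the sketch below projects a point onto the column span of A and evaluates a reward of the assumed form g^{*}(x_{\parallel})+h^{*}(\|x_{\perp}\|_{2}); the linear g^{*} and the quadratic penalty h^{*} are illustrative choices, not prescribed by the assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
D, d = 16, 4
A, _ = np.linalg.qr(rng.normal(size=(D, d)))   # orthonormal columns spanning the latent subspace
beta_star = rng.normal(size=d)
beta_star /= np.linalg.norm(beta_star)         # so that theta* = A beta* has unit norm

def reward(x, c=1.0):
    x_par = A @ (A.T @ x)                      # x_parallel: projection of x onto span(A)
    x_perp = x - x_par                         # x_perp: orthogonal residual
    g = beta_star @ (A.T @ x)                  # g*(x_par) = <beta*, A^T x>, an illustrative linear choice
    h = -c * np.linalg.norm(x_perp) ** 2       # h*: nonpositive and decreasing in ||x_perp||_2 (illustrative)
    return g + h
```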

4 Reward-Directed Generation via Conditional Diffusion Models

In this section, we develop a conditional diffusion model-based method to generate high-fidelity samples with desired properties. In real-world applications such as image/text generation and protein design, one often has access to abundant unlabeled data but only a relatively small amount of labeled data. This motivates us to consider a semi-supervised learning setting.

Notation: P_{xy} denotes the ground-truth joint distribution of x and its label y; P_{x} is the marginal distribution of x. Any pair of data in \mathcal{D}_{\rm label} follows P_{xy} and any data point in \mathcal{D}_{\rm unlabel} follows P_{x}. P is used to denote a distribution and p denotes its corresponding density. P(x\mid y=a) and P(x,y=a) are the conditional and joint distributions derived from P_{xy}. Similarly, we use the notation P_{x\widehat{y}}, P(x\mid\widehat{y}=a) for the joint and conditional distributions of (x,\widehat{y}), where \widehat{y} is predicted by the learned reward model. Also, we denote a distribution generated by diffusion by \widehat{P} (density \widehat{p}), followed by the same argument in parentheses as the true distribution it approximates; e.g., \widehat{P}(x\mid y=a) is generated as an approximation of P(x\mid y=a).

4.1 Meta Algorithm

Algorithm 1 Reward-Conditioned Generation via Diffusion Model (RCGDM)
1:  Input: Datasets \mathcal{D}_{\rm unlabel}, \mathcal{D}_{\rm label}, target reward value a, early-stopping time t_{0}, noise level \nu. (Note: in the following pseudo-code, \phi_{t}(x) is the Gaussian density and \eta is the step size of the discretized backward SDE; see Section 4.2 for elaborations on conditional diffusion.)
2:  Reward Learning: Estimate the reward function by
\widehat{f}\in\mathop{\mathrm{argmin}}_{f\in\mathcal{F}}{L(f,\mathcal{D}_{\rm label})}, (4.1)
where L is a loss that depends on the dataset setup (see Section 3) and \mathcal{F} is a function class.
3:  Pseudo labeling: Use the learned function \widehat{f} to evaluate the unlabeled data \mathcal{D}_{\rm unlabel} and augment it with pseudo labels: \widetilde{\mathcal{D}}=\{(x_{j},\widehat{y}_{j}=\widehat{f}(x_{j})+\xi_{j})\}_{j=1}^{n_{1}} for \xi_{j}\overset{\text{i.i.d.}}{\sim}{\sf N}(0,\nu^{2}).
4:  Conditional score matching: Minimize over s\in{\mathcal{S}} ({\mathcal{S}} constructed as in (4.8)) on the dataset \widetilde{\mathcal{D}} via
\widehat{s}\in\mathop{\mathrm{argmin}}_{s\in{\mathcal{S}}}\int_{t_{0}}^{T}\widehat{\mathbb{E}}_{(x,\widehat{y})\in\widetilde{\mathcal{D}}}\mathbb{E}_{x^{\prime}\sim{\sf N}(\alpha(t)x,h(t)I_{D})}\left[\|\nabla_{x^{\prime}}\log\phi_{t}(x^{\prime}|x)-s(x^{\prime},\widehat{y},t)\|_{2}^{2}\right]\mathrm{d}t. (4.2)
5:  Conditioned generation: Use the estimated score \widehat{s}(\cdot,a,\cdot) to sample from the backward SDE:
\mathrm{d}\widetilde{X}_{t}^{y,\Leftarrow}=\left[\frac{1}{2}\widetilde{X}_{t}^{y,\Leftarrow}+\widehat{s}(\widetilde{X}_{k\eta}^{y,\Leftarrow},a,T-k\eta)\right]\mathrm{d}t+\mathrm{d}\overline{W}_{t}\quad\text{for}\quad t\in[k\eta,(k+1)\eta]. (4.3)
6:  Return: Generated population \widehat{P}(\cdot|\widehat{y}=a), learned subspace representation V contained in \widehat{s}.

We present our main algorithm in Algorithm 1; see also the graphical illustration in Figure 3. In order to generate novel samples with both high fidelity and high rewards, we propose Reward-Conditioned Generation via Diffusion Models (RCGDM). Specifically, we learn the reward function \widehat{f} from the dataset \mathcal{D}_{\rm label} via regression, where L(f,\mathcal{D}_{\rm label}) is the corresponding loss function. In the real-valued reward setting, we specify

L(f,\mathcal{D}_{\rm label})=\sum_{i=1}^{n_{2}}(y_{i}-f(x_{i}))^{2}+\lambda\rho(f),

where \rho(f) is a regularization term. When f is a linear function of x, this amounts to ridge regression. In the human preference setting, we specify

L(f,\mathcal{D}_{\rm label})=-\sum_{i=1}^{n_{2}}\big{\{}f(u_{i})-\log\big{(}\exp(f(x_{i}^{(1)}))+\exp\big{(}f(x_{i}^{(2)})\big{)}\big{)}\big{\}},

which is essentially a maximum likelihood objective (written as a negative log-likelihood to be minimized). We then use \widehat{f} to augment the unlabeled data \mathcal{D}_{\rm unlabel} with “pseudo labeling” and additive noise, i.e., \widetilde{\mathcal{D}}=\{(x_{j},\widehat{y}_{j}=\widehat{f}(x_{j})+\xi_{j})\}_{j=1}^{n_{1}} with \xi_{j}\sim{\sf N}(0,\nu^{2}) of a small variance \nu^{2}. Here, we add the noise \xi_{j} merely for technical reasons in the proof. We denote the joint distribution of (x,\widehat{y}) as P_{x\widehat{y}}. Next, we train a conditional diffusion model using the augmented dataset \widetilde{\mathcal{D}}. If we specify a target value of the reward, for example letting \widehat{y}=a, we can generate conditioned samples from the distribution \widehat{P}(x|\widehat{y}=a) by backward diffusion.
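For concreteness, here is a minimal sketch of Steps 2-3 of Algorithm 1 in the real-valued reward, linear setting: ridge regression for reward learning, followed by pseudo-labeling of the unlabeled data. The function and variable names are hypothetical; the resulting pseudo-labeled pairs would then feed into the conditional score matching step.

```python
import numpy as np

def ridge_reward(X_label, y_label, lam=1.0):
    """Step 2 (real-valued reward, linear f): ridge regression estimate of theta."""
    D = X_label.shape[1]
    return np.linalg.solve(X_label.T @ X_label + lam * np.eye(D), X_label.T @ y_label)

def pseudo_label(X_unlabel, theta_hat, nu=0.1):
    """Step 3: augment unlabeled data with noisy pseudo labels y_hat = f_hat(x) + xi, xi ~ N(0, nu^2)."""
    xi = np.random.default_rng(2).normal(scale=nu, size=X_unlabel.shape[0])
    return X_unlabel @ theta_hat + xi
```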

Figure 3: Overview of solving black-box optimization using reward-directed conditional diffusion models. We estimate the reward function from the labeled dataset. Then we compute the estimated reward for each instance in the unlabeled dataset. Next, we train a reward-conditioned diffusion model using the pseudo-labeled data, and generate high reward new instances under proper target reward values.

4.2 Training of Conditional Diffusion Model

We provide details about the training and sampling of conditional diffusion in Algorithm 1 (Line 4: conditional score matching, and Line 5: conditional generation). In Algorithm 1, the conditional diffusion model is learned from \widetilde{\mathcal{D}}=\{(x_{j},\widehat{y}_{j}=\widehat{f}(x_{j})+\xi_{j})\}_{j=1}^{n_{1}}, where (x,\widehat{y})\sim P_{x\widehat{y}}. For simplicity, until the end of this section we use y instead of \widehat{y} to denote the conditioning variable. The diffusion model aims to approximate the conditional distribution P(x\mid\widehat{y}).

4.3 Conditional Score Matching

The workflow of conditional diffusion models is nearly identical to that of unconditional diffusion models reviewed in Chen et al. (2023a). A major difference is that we learn a conditional score \nabla\log p_{t}(x|y) instead of the unconditional one. Here p_{t} denotes the marginal density function at time t of the following forward O-U process,

\mathrm{d}X_{t}^{y}=-\frac{1}{2}g(t)X_{t}^{y}\mathrm{d}t+\sqrt{g(t)}\mathrm{d}W_{t}\quad\text{with}\quad X_{0}^{y}\sim P_{0}(x|y)~{}\text{and}~{}t\in(0,T], (4.4)

where similarly T is a terminal time, (W_{t})_{t\geq 0} is a Wiener process, and the initial distribution P_{0}(x|y) is induced by the (x,\widehat{y})-pair distribution P_{x\widehat{y}}. Note that the noise is only added on x but not on y. Throughout the paper, we consider g(t)=1 for simplicity. We denote by P_{t}(x_{t}|y) the distribution of X_{t}^{y}, let p_{t}(x_{t}|y) be its density, and let P_{t}(x_{t},y) be the corresponding joint distribution, shorthanded as P_{t}. A key step is to estimate the unknown \nabla\log p_{t}(x_{t}|y) through denoising score matching (Song et al., 2020b). A conceptual way is to minimize the following quadratic loss over a concept class {\mathcal{S}}:

\mathop{\mathrm{argmin}}_{s\in{\mathcal{S}}}\int_{0}^{T}\mathbb{E}_{(x_{t},y)\sim P_{t}}\left[\|\nabla\log p_{t}(x_{t}|y)-s(x_{t},y,t)\|_{2}^{2}\right]\mathrm{d}t. (4.5)

Unfortunately, the loss in (4.5) is intractable since \nabla\log p_{t}(x_{t}|y) is unknown. Inspired by Hyvärinen and Dayan (2005) and Vincent (2011), we choose the new objective (4.2) and show their equivalence in the following proposition.

Proposition 4.1 (Score Matching Objective for Implementation).

For any t>0 and score estimator s, there exists a constant C_{t} independent of s such that

\mathbb{E}_{(x_{t},y)\sim P_{t}}\left[\|\nabla\log p_{t}(x_{t}|y)-s(x_{t},y,t)\|_{2}^{2}\right]
=\mathbb{E}_{(x,y)\sim P_{x\widehat{y}}}\mathbb{E}_{x^{\prime}\sim{\sf N}(\alpha(t)x,h(t)I_{D})}\left[\|\nabla_{x^{\prime}}\log\phi_{t}(x^{\prime}|x)-s(x^{\prime},y,t)\|_{2}^{2}\right]+C_{t}, (4.6)

where \nabla_{x^{\prime}}\log\phi_{t}(x^{\prime}|x)=-\frac{x^{\prime}-\alpha(t)x}{h(t)} and \phi_{t}(x^{\prime}|x) is the density of {\sf N}(\alpha(t)x,h(t)I_{D}) with \alpha(t)=\exp(-t/2) and h(t)=1-\exp(-t).

The proof of Proposition 4.1 can be found in Appendix A.1. Equation (4.6) allows an efficient implementation, since P_{x\widehat{y}} can be approximated by the empirical data distribution in \widetilde{\mathcal{D}} and x^{\prime} is easy to sample. Integrating (4.6) over time t leads to the practical conditional score matching objective

\mathop{\mathrm{argmin}}_{s\in{\mathcal{S}}}\int_{t_{0}}^{T}\widehat{\mathbb{E}}_{(x,y)\sim P_{x\widehat{y}}}\mathbb{E}_{x^{\prime}\sim{\sf N}(\alpha(t)x,h(t)I_{D})}\left[\|\nabla_{x^{\prime}}\log\phi_{t}(x^{\prime}|x)-s(x^{\prime},y,t)\|_{2}^{2}\right]\mathrm{d}t, (4.7)

where t_{0}>0 is an early-stopping time to stabilize the training (Song and Ermon, 2020; Vahdat et al., 2021) and \widehat{\mathbb{E}} denotes the empirical expectation.
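As an illustration of how (4.7) is evaluated in practice, the following PyTorch-style sketch computes one stochastic estimate of the conditional score matching loss, using \alpha(t)=\exp(-t/2) and h(t)=1-\exp(-t) as above. Here score_net is a hypothetical score network taking (x', y, t), with y and t supplied as column vectors; this is a sketch under those assumptions, not the paper's implementation.

```python
import torch

def conditional_score_matching_loss(score_net, x, y, t0=1e-3, T=1.0):
    """One Monte Carlo estimate of the conditional score matching objective (4.7)."""
    t = t0 + (T - t0) * torch.rand(x.shape[0], 1)   # diffusion times sampled from [t0, T]
    alpha, h = torch.exp(-t / 2), 1.0 - torch.exp(-t)
    eps = torch.randn_like(x)
    x_noisy = alpha * x + h.sqrt() * eps            # x' ~ N(alpha(t) x, h(t) I_D)
    target = -eps / h.sqrt()                        # equals grad_{x'} log phi_t(x' | x)
    pred = score_net(x_noisy, y, t)
    return ((pred - target) ** 2).sum(dim=1).mean()
```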

Constructing a function class {\mathcal{S}} adaptive to the data structure is beneficial for learning the conditional score. In the same spirit as Chen et al. (2023a), we propose the following score network architecture (see Figure 2(c) for an illustration):

{\mathcal{S}}=\bigg{\{}\mathbf{s}_{V,\psi}(x,y,t)=\frac{1}{h(t)}(V\cdot\psi(V^{\top}x,y,t)-x):~{}V\in\mathbb{R}^{D\times d},~{}\psi\in\Psi:\mathbb{R}^{d+1}\times[t_{0},T]\to\mathbb{R}^{d}~{}\bigg{\}}, (4.8)

with V being any D\times d matrix with orthonormal columns and \Psi a customizable function class. This design has a linear encoder-decoder structure, catering to the latent subspace structure in the data. The term -\frac{1}{h(t)}x is included as a shortcut connection.
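A possible instantiation of the class (4.8) is sketched below in PyTorch, with \psi an MLP and V a trainable D\times d matrix (initialized with orthonormal columns; the sketch does not enforce orthonormality during training). Shapes and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EncoderDecoderScore(nn.Module):
    """Sketch of the score class (4.8): s_{V,psi}(x, y, t) = (V psi(V^T x, y, t) - x) / h(t)."""
    def __init__(self, D, d, hidden=128):
        super().__init__()
        self.V = nn.Parameter(torch.linalg.qr(torch.randn(D, d))[0])  # D x d, orthonormal at init only
        self.psi = nn.Sequential(
            nn.Linear(d + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, d),
        )

    def forward(self, x, y, t):                                # x: (B, D); y, t: (B, 1)
        h = 1.0 - torch.exp(-t)                                # h(t) = 1 - e^{-t}
        z = self.psi(torch.cat([x @ self.V, y, t], dim=1))     # psi(V^T x, y, t)
        return (z @ self.V.T - x) / h                          # linear decoder plus the -x/h(t) shortcut
```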

4.4 Conditioned Generation

Sampling from the model is realized by running a discretized backward process with step size \eta>0, described as follows:

\mathrm{d}\widetilde{X}_{t}^{y,\Leftarrow}=\left[\frac{1}{2}\widetilde{X}_{t}^{y,\Leftarrow}+\widehat{s}(\widetilde{X}_{k\eta}^{y,\Leftarrow},y,T-k\eta)\right]\mathrm{d}t+\mathrm{d}\overline{W}_{t}\quad\text{for}\quad t\in[k\eta,(k+1)\eta], (Eqn. (4.3) revisited)

initialized with \widetilde{X}_{0}^{y,\Leftarrow}\sim{\sf N}(0,I_{D}), where \overline{W}_{t} is a reversed Wiener process. Note that in (4.3), the unknown conditional score \nabla\log p_{t}(x|y) is substituted by \widehat{s}(x,y,t).
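For illustration, a simple Euler-Maruyama discretization of the backward SDE (4.3) with the conditioning fixed at y=a might look as follows; score_net is the hypothetical conditional score network from the sketches above, and all step sizes are illustrative.

```python
import torch

@torch.no_grad()
def conditional_sample(score_net, a, D, n_samples=64, T=1.0, t0=1e-3, eta=1e-2):
    """Discretized backward SDE (4.3), conditioned on y = a, early-stopped at t0."""
    x = torch.randn(n_samples, D)                             # initialize from N(0, I_D)
    y = torch.full((n_samples, 1), float(a))                  # target reward value as the condition
    t = T
    while t > t0:
        s = score_net(x, y, torch.full((n_samples, 1), t))    # estimated conditional score at noise level t
        x = x + eta * (0.5 * x + s) + eta ** 0.5 * torch.randn_like(x)
        t -= eta
    return x
```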

Remark 4.2 (Alternative methods).

In Algorithm 1 we train the conditional diffusion model via conditional score matching (Line 4). This approach is suitable when we have access to the unlabeled dataset and need to train a brand-new diffusion model from scratch. Practically, we can utilize a pre-trained diffusion model of the unlabeled data directly and incorporate the reward information to the pre-trained model. Popular methods in this direction include classifier-based guidance (Dhariwal and Nichol, 2021), fine-tuning (Zhang and Agrawala, 2023), and self-distillation (Song et al., 2023); all sharing a similar spirit with Algorithm 1.

5 Statistical Analysis of Reward-Directed Conditional Diffusion

In this section, we analyze the reward-conditioned diffusion model given by Algorithm 1 for data-driven optimization. In particular, we are interested in two properties of the generated samples: 1) the reward levels of these new samples and 2) their level of fidelity, i.e., how much the new samples deviate from the latent subspace. To measure these properties, we adopt two sets of metrics. Firstly, the sub-optimality gap quantifies the gap between the average reward of the generated samples and the target reward value. Secondly, the subspace angle and an expected l_{2} deviation quantify the sample fidelity. In the sequel, we analyze these performance metrics, relating them to the reward learning and diffusion model training steps in Algorithm 1.

5.1 Sub-Optimality Decomposition

We introduce the performance metric for a generated distribution with respect to a certain target reward value. Let y^{*} be a target reward value and P be a distribution learned by Algorithm 1 (we omit the hat notation on P for simplicity). We define the sub-optimality of P as

\texttt{SubOpt}(P;y^{*})=y^{*}-\mathbb{E}_{x\sim P}[f^{*}(x)],

which measures the gap between the expected reward of x\sim P and the target value y^{*}. In the language of bandit learning, this gap can be viewed as a form of off-policy sub-optimality.
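Given generated samples and access to the ground-truth reward (e.g., in simulation), this quantity can be estimated by a simple Monte Carlo average; the sketch below uses hypothetical placeholders f_star and X_gen.

```python
import numpy as np

def suboptimality(X_gen, y_star, f_star):
    """Monte Carlo estimate of SubOpt(P; y*) = y* - E_{x ~ P}[f*(x)] from generated samples."""
    return y_star - np.mean([f_star(x) for x in X_gen])
```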

We introduce the following proposition, which decomposes the sub-optimality \texttt{SubOpt}(P;y^{*}). Specifically, we show that \texttt{SubOpt}(P;y^{*}=a) consists of three components: an off-policy bandit regret that comes from the estimation error of the reward function, and on-support and off-support errors that come from estimating conditional distributions with diffusion.

Proposition 5.1.

Set the target reward value y^{*}=a. For a generated distribution P of Algorithm 1, it holds that

\texttt{SubOpt}(P;y^{*}=a)\leq\underbrace{\mathbb{E}_{x\sim P_{a}}\left[\left|\widehat{f}(x)-g^{*}(x)\right|\right]}_{\mathcal{E}_{1}:\textrm{off-policy bandit regret}}+\underbrace{\left|\mathbb{E}_{x\sim P_{a}}[g^{*}(x_{\parallel})]-\mathbb{E}_{x\sim P}[g^{*}(x_{\parallel})]\right|}_{\mathcal{E}_{2}:\textrm{on-support diffusion error}}+\underbrace{\mathbb{E}_{x\sim P}[h^{*}(\|x_{\perp}\|_{2})]}_{\mathcal{E}_{3}:\textrm{off-support diffusion error}}.

The proof is provided in Appendix B.3.1. Error \mathcal{E}_{1} corresponds to the reward function estimation error, while errors \mathcal{E}_{2} and \mathcal{E}_{3} are induced by the training of the diffusion model.

5.2 Sample Fidelity

In this section, we analyze the fidelity of the conditional generation process specified by Algorithm 1. Specifically, we assess its ability to capture the underlying representation, and how much the data generated by the conditional diffusion model deviates from the ground-truth representation. Recall that Algorithm 1 has two outputs: the generated distribution \widehat{P}(\cdot|\widehat{y}=a) and the learned representation matrix V. We use the notations \widehat{P}_{a}:=\widehat{P}(\cdot|\widehat{y}=a) (generated distribution) and P_{a}:=P(\cdot|\widehat{y}=a) (target distribution) for better clarity in the rest of the presentation. We define two metrics

\angle({V},{A})=\|VV^{\top}-AA^{\top}\|_{\rm F}^{2}\text{\quad and \quad}\mathbb{E}_{x\sim\widehat{P}_{a}}[\|x_{\perp}\|_{2}]. (5.1)

Here, \angle({V},{A}) is defined for matrices V and A, where A is the matrix encoding the ground-truth subspace. Clearly, \angle({V},{A}) measures the difference between the column spans of V and A, which is also known as the subspace angle. The term \mathbb{E}_{x\sim\widehat{P}_{a}}[\|x_{\perp}\|_{2}] is the expected l_{2} distance between x and the subspace.
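Both fidelity metrics in (5.1) are straightforward to compute from the learned V and generated samples; a minimal sketch with illustrative variable names is given below.

```python
import numpy as np

def subspace_angle(V, A):
    """The subspace angle ||V V^T - A A^T||_F^2 from (5.1)."""
    return np.linalg.norm(V @ V.T - A @ A.T, ord="fro") ** 2

def mean_off_subspace_norm(X_gen, A):
    """Monte Carlo estimate of E[||x_perp||_2] over generated samples (rows of X_gen)."""
    X_perp = X_gen - X_gen @ A @ A.T
    return np.linalg.norm(X_perp, axis=1).mean()
```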

We impose the following commonly adopted light-tail assumption (Wainwright, 2019) on the latent variable z for quantifying sample fidelity.

Assumption 5.2 .

The latent variable z follows a distribution P_{z} with density p_{z}, such that there exist constants B,C_{1},C_{2} verifying p_{z}(z)\leq(2\pi)^{-d/2}C_{1}\exp\left(-C_{2}\|z\|_{2}^{2}/2\right) whenever \|z\|_{2}>B. Moreover, the ground-truth score is realizable, i.e., \nabla\log p_{t}(x\mid\widehat{y})\in{\mathcal{S}}.

We remark that the realizability condition is for technical convenience, which can be removed in our nonparametric extension in Section 6.2. Assumption 5.2 is also mild and covers arbitrary sub-Gaussian latent data distributions, which include the following Gaussian assumption as a special case.

Assumption 5.3 .

The latent variable z\sim{\sf N}(0,\Sigma) with a covariance matrix \Sigma satisfying \lambda_{\min}I_{d}\preceq\Sigma\preceq\lambda_{\max}I_{d} for \lambda_{\min}\leq\lambda_{\max}\leq 1 and \lambda_{\min} being positive.

Now we present sample fidelity guarantees.

Theorem 5.4 (Subspace Fidelity of Generated Data).

Under Assumption 3.1, if Assumption 5.2 holds with c_{0}I_{d}\preceq\mathbb{E}_{z\sim P_{z}}\left[zz^{\top}\right] for some constant c_{0}>0, then with high probability on the data, it holds that

\angle({V},{A})=\widetilde{\mathcal{O}}\left(\frac{1}{c_{0}}\sqrt{\frac{\mathcal{N}({\mathcal{S}},1/n_{1})D}{n_{1}}}\right) (5.2)

with \mathcal{N}({\mathcal{S}},1/n_{1}) being the log covering number of the function class {\mathcal{S}} in Equation (4.8). Moreover, with Assumption 5.3 and D\geq d^{2}, we have \mathcal{N}({\mathcal{S}},1/n_{1})=\mathcal{O}((d^{2}+Dd)\log(Ddn_{1})) and thus \angle({V},{A})=\widetilde{\mathcal{O}}(\frac{1}{\lambda_{\min}}\sqrt{\frac{Dd^{2}+D^{2}d}{n_{1}}}).

The proof is provided in Appendix B.2. In Theorem 5.4, we provide a fidelity guarantee for the learned subspace V. Specifically, when the covariance of the underlying representation z is invertible, we prove that the error in subspace learning converges at a parametric rate.

Remark 5.5.

We make the following remarks for Theorem 5.4.

  1. Guarantee (5.2) applies to general distributions with light tails, as assumed in Assumption 5.2.

  2. Guarantee (6.1) ensures high fidelity of the generated data, in the sense of staying close to the subspace, when we have access to a large unlabeled dataset.

  3. Guarantee (6.1) shows that \mathbb{E}_{x\sim\widehat{P}_{a}}[\|x_{\perp}\|_{2}] scales up as t_{0} increases, which aligns with the dynamics of the backward process: samples concentrate around the learned subspace as t_{0} goes to 0. Taking t_{0}\to 0, \mathbb{E}_{x\sim\widehat{P}_{a}}[\|x_{\perp}\|_{2}] decays at rate O(n_{1}^{-\frac{1}{4}}). However, taking t_{0}\to 0 is not ideal for achieving a high reward of x; we choose the best trade-off for t_{0} later in Theorem 6.3.

6 Optimization Theory

In this section, we provide optimization guarantees by instantiating our performance metrics. We consider two theoretical settings: the parametric and nonparametric settings. In the parametric setting, we focus on a linear reward model with off-support penalties. The reward observation in 𝒟label\mathcal{D}_{\rm label} covers both the real-valued and human preference cases. Through detailed analysis, we establish handy guidance on how to choose a proper target reward value for ensuring quality improvement. We then extend our results to nonparametric settings, where a ReLU neural network is used for reward estimation. We then present the sub-optimality guarantee in the corresponding context.

6.1 Results under Real-Valued Reward

In this subsection, we consider the case when we have access to a noisy reward label for every x_{i} in \mathcal{D}_{\rm label}, i.e., \mathcal{D}_{\rm label}=\{(x_{i},y_{i})\}_{i\in[n_{2}]}, where y_{i} is sampled according to (3.1). Recall that in Assumption 3.2, the ground-truth reward has the decomposition f^{*}(x)=g^{*}(x_{\parallel})+h^{*}(x_{\perp}), where x_{\parallel} is the projection of x onto the subspace A, and x_{\perp} is the part orthogonal to A. We consider two setups for g^{*}(x_{\parallel}): 1) the parametric setting, where g^{*}(x_{\parallel}) is a linear function of x_{\parallel}; 2) the nonparametric setting, where g^{*}(x_{\parallel}) is a general Hölder continuous function, and we use a deep ReLU neural network for reward estimation.

Parametric Setting.

For the parametric setup, we make the following assumption, which states that the reward function consists of two parts: a linear function of x_{\parallel} and a penalty term whose magnitude increases as x deviates from the representation subspace.

Assumption 6.1 .

We assume that g^{*}(x_{\parallel})=(\theta^{*})^{\top}x_{\parallel}, where \theta^{*}=A\beta^{*} for some \beta^{*}\in\mathbb{R}^{d} and \|\theta^{*}\|_{2}=\|\beta^{*}\|_{2}=1. The penalty h^{*}(x_{\perp}) is always nonpositive and decreases with respect to \|x_{\perp}\|_{2} with h^{*}(0)=0.

Assumption 6.1 adopts a simple linear reward model, which will be generalized in Section 6.2. We now present Theorem 6.2, which is a corollary of Theorem 5.4 and characterizes the deviation of the generated samples from the ground-truth representation subspace.

Theorem 6.2.

Assume that the conditions in Theorem 5.4 hold. Further under Assumption 6.1, it holds that

\mathbb{E}_{x\sim\widehat{P}_{a}}[\|x_{\perp}\|_{2}]=\mathcal{O}\left(\sqrt{t_{0}D}+\sqrt{\angle({V},{A})}\cdot\sqrt{\frac{a^{2}}{\|\beta^{*}\|_{\Sigma}}+d}\right), (6.1)

where \beta^{*} is the ground-truth parameter of the linear model and t_{0} is the early-stopping time of the diffusion model.

The proof is provided in Appendix B.2. We observe that a larger target reward a leads to more severe subspace deviation, since more aggressive extrapolation is needed to match the target reward. Therefore, maintaining high sample fidelity prevents us from setting a too large.

On the other hand, we examine the sub-optimality bound. We begin with real-valued rewards, where a labeled dataset \mathcal{D}_{\rm label}=\{(x_{i},y_{i})\}_{i=1}^{n_{2}} is given. In this case, we perform ridge regression to estimate the linear reward function in Step 2 of Algorithm 1, i.e.,

\mathop{\mathrm{argmin}}_{\theta}~{}\sum_{i=1}^{n_{2}}(\theta^{\top}x_{i}-y_{i})^{2}+\lambda\|\theta\|_{2}^{2}

for a positive coefficient \lambda. We now characterize the sub-optimality bound in the following theorem.

Theorem 6.3 (Off-policy Regret of Generated Samples).

Suppose Assumptions 3.1, 5.3 and 6.1 hold. We choose \lambda=1, t_{0}=\left((Dd^{2}+D^{2}d)/n_{1}\right)^{1/6} and \nu=1/\sqrt{D}. With high probability, running Algorithm 1 with a target reward value a gives rise to

\texttt{SubOpt}(\widehat{P}_{a};y^{*}=a)
\leq\underbrace{\sqrt{\operatorname{Tr}(\widehat{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}})}\cdot\mathcal{O}\left(\sqrt{\frac{d\log n_{2}}{n_{2}}}\right)}_{\mathcal{E}_{1}:\textrm{off-policy bandit regret}}+\underbrace{{\tt DistroShift}(a)\cdot\left(d^{2}D+D^{2}d\right)^{1/6}{n_{1}}^{-1/6}\cdot a}_{\mathcal{E}_{2}:\textrm{on-support diffusion error}}
+\underbrace{\mathbb{E}_{\widehat{P}_{a}}[h^{*}(x_{\perp})]}_{\mathcal{E}_{3}:\textrm{off-support diffusion error}}, (6.2)

where \widehat{\Sigma}_{\lambda}:=\frac{1}{n_{2}}(X^{\top}X+\lambda I_{D}), with X the stacked data matrix of \mathcal{D}_{\rm label}, and \Sigma_{P_{a}}=\mathbb{E}_{P_{a}}[xx^{\top}].

The proof is provided in Appendix B.3.

Implications and Discussions:

1). The bounds in Theorems 6.3 and 6.9 decompose the suboptimality gap into two separate sources of error: error from reward learning (\mathcal{E}_{1}) and error from the diffusion model (\mathcal{E}_{2} and \mathcal{E}_{3}).

2). The dependence of \mathcal{E}_{1} on d shows that the diffusion model learns a low-dimensional representation of x, reducing the ambient dimension D to the smaller latent dimension d. This can be seen from \operatorname{Tr}(\widehat{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}})\leq\mathcal{O}\left(\frac{a^{2}}{\|\beta^{*}\|_{\Sigma}}+d\right) when n_{2}=\Omega(\frac{1}{\lambda_{\min}}). Similarly, we can prove that \operatorname{Tr}(\widetilde{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}})\leq\mathcal{O}\left(\frac{a^{2}}{\|\beta^{*}\|_{\Sigma}}+d\right) when n_{2}=\Omega(\frac{1}{\lambda_{\min}}). Such a result shows that even without an observable reward, we can still obtain an off-policy bandit regret that matches the case with an observable reward dataset.

3). If we ignore the diffusion errors, the suboptimality gap resembles the standard regret of off-policy bandit learning in a d-dimensional feature subspace (Jin et al., 2021, Section 3.2; Nguyen-Tang et al., 2021; Brandfonbrener et al., 2021).

4). It is also worth mentioning that \mathcal{E}_{2} and \mathcal{E}_{3} depend on t_{0}, and that by taking t_{0}=\left((Dd^{2}+D^{2}d)/n_{1}\right)^{1/6} one gets a good trade-off in \mathcal{E}_{2}.

5). On-support diffusion error entangles with distribution shift in complicated ways. We show

\mathcal{E}_{2}=\left({\tt DistroShift}(a)\cdot\left(d^{2}D+D^{2}d\right)^{1/6}{n_{1}}^{-1/6}\cdot a\right),

where {\tt DistroShift}(a) quantifies the distribution shift depending on different reward values. In the special case where the latent covariance matrix \Sigma is known, we can quantify the distribution shift as {\tt DistroShift}(a)=\mathcal{O}(a\vee d). We observe an interesting phase shift. When a<d, the training data have sufficient coverage with respect to the generated distribution \widehat{P}_{a}. Therefore, the on-support diffusion error has a lenient linear dependence on a. However, when a>d, the data coverage is poor and \mathcal{E}_{2} becomes quadratic in a, which quickly amplifies.

6). When generated samples deviate away from the latent subspace, the reward may substantially degrade, as determined by the nature of h^{*}.

6.2 Extension to Nonparametric Setting

Our theoretical analysis, in its full generality, extends to using general nonparametric function approximation for both the reward and score functions. Built upon the insights from the linear setting, we provide an analysis of the nonparametric reward and general data distribution setting. We generalize Assumption 6.1 to the following.

Assumption 6.4 .

We assume that g^{*}(x_{\parallel}) is \alpha-Hölder continuous for \alpha\geq 1. Moreover, g^{*} has a bounded Hölder norm, i.e., \|g^{*}\|_{\mathcal{H}^{\alpha}}\leq 1.

Hölder continuity is widely studied in the nonparametric statistics literature (Györfi et al., 2002; Tsybakov, 2008). Under Assumption 6.4, we use nonparametric regression for estimating f^{*}. Specifically, we specialize (4.1) in Algorithm 1 to

\widehat{f}_{\theta}\in\mathop{\mathrm{argmin}}_{f_{\theta}\in\mathcal{F}}\frac{1}{2n_{2}}\sum_{i=1}^{n_{2}}(f_{\theta}(x_{i})-y_{i})^{2},

where \mathcal{F} is chosen to be a class of neural networks. Hyperparameters of \mathcal{F} will be chosen properly in Theorem 6.7.

Our theory also considers generic sampling distributions of x. Since x lies in a low-dimensional subspace, this translates to a sampling distribution assumption on the latent variable z.

Assumption 6.5 .

The latent variable z follows a distribution P_{z} with density p_{z}, such that there exist constants B,C_{1},C_{2} verifying p_{z}(z)\leq(2\pi)^{-d/2}C_{1}\exp\left(-C_{2}\|z\|_{2}^{2}/2\right) whenever \|z\|_{2}>B. Moreover, there exists a positive constant c_{0} such that c_{0}I_{d}\preceq\mathbb{E}_{z\sim P_{z}}\left[zz^{\top}\right].

Assumption 6.5 says that P_{z} has a light tail; it also covers distributions with compact support. Furthermore, we assume that the curated data (x,\widehat{y}) induces Lipschitz conditional scores. Motivated by Chen et al. (2023a), we show that the linear subspace structure in x leads to a conditional score decomposition \nabla\log p_{t}(x|\widehat{y})=s_{\parallel}(x,\widehat{y},t)+s_{\perp}(x,\widehat{y},t), where s_{\parallel} is the on-support score and s_{\perp} is the orthogonal score. The formal statement and proof of this decomposition are deferred to Appendix E. We make the following assumption on s_{\parallel}.

Assumption 6.6 .

The on-support conditional score function s_{\parallel}(x,\widehat{y},t) is Lipschitz with respect to (x,\widehat{y}) for any t\in(0,T], i.e., there exists a constant C_{\rm lip} such that for any x,\widehat{y} and x^{\prime},\widehat{y}^{\prime}, it holds that

\|s_{\parallel}(x,\widehat{y},t)-s_{\parallel}(x^{\prime},\widehat{y}^{\prime},t)\|_{2}\leq C_{\rm lip}\|x-x^{\prime}\|_{2}+C_{\rm lip}|\widehat{y}-\widehat{y}^{\prime}|.

A Lipschitz score is commonly adopted in existing works (Chen et al., 2022b; Lee et al., 2023a). Yet Assumption 6.6 only requires Lipschitz continuity of the on-support score, which matches the weak regularity conditions in Lee et al. (2023a); Chen et al. (2023a). We then choose a score network architecture similar to that in the linear reward setting, except that the inner mapping is replaced by a nonlinear network. Recall that the linear encoder and decoder estimate the representation matrix A.

We consider feedforward networks with ReLU activation functions as the concept classes \mathcal{F} and {\mathcal{S}} for nonparametric regression and conditional score matching. Given an input x, such a network computes

f_{\rm NN}(x)=W_{L}\sigma(\dots\sigma(W_{1}x+b_{1})\dots)+b_{L}, (6.3)

where W_{i} and b_{i} are weight matrices and intercepts, respectively. We then define a class of neural networks as

{\rm NN}(L,M,J,K,\kappa)=\Big{\{}f:f~{}\text{in the form of (6.3) with $L$ layers and width bounded by}~{}M,
\sup_{x}\|f(x)\|_{2}\leq K,\ \max\{\|b_{i}\|_{\infty},\|W_{i}\|_{\infty}\}\leq\kappa~{}\text{for}~{}i=1,\dots,L,~{}\text{and}~{}\sum_{i=1}^{L}\big{(}\|W_{i}\|_{0}+\|b_{i}\|_{0}\big{)}\leq J\Big{\}}.
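For intuition, a member of NN(L, M, J, K, \kappa) can be instantiated as a plain ReLU MLP; the sketch below builds such a network in PyTorch but does not enforce the norm and sparsity constraints (K, \kappa, J), which are devices for the analysis.

```python
import torch.nn as nn

def relu_mlp(in_dim, out_dim, width, depth):
    """A depth-`depth`, width-`width` ReLU network of the form (6.3)."""
    layers, dim = [], in_dim
    for _ in range(depth - 1):
        layers += [nn.Linear(dim, width), nn.ReLU()]
        dim = width
    layers.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*layers)
```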

For the conditional score network, we additionally impose a Lipschitz continuity requirement, i.e., \|f(x)-f(y)\|_{2}\leq c_{\rm lip}\|x-y\|_{2} for some Lipschitz coefficient c_{\rm lip}. Now we are ready to bound \texttt{SubOpt}(\widehat{P}_{a};y^{*}=a) in Theorem 6.7 in terms of the nonparametric regression error, the score matching error, and the distribution shifts in both regression and score matching.

Theorem 6.7.

Suppose Assumptions 3.1, 6.4, 6.5 and 6.6 hold. Let \delta(n)=\frac{d\log\log n}{\log n}. With properly chosen \mathcal{F} and {\mathcal{S}}, with high probability, running Algorithm 1 with a target reward value a and early-stopping time t_{0}=\left(n_{1}^{-\frac{2-2\delta(n_{1})}{d+6}}+Dn_{1}^{-\frac{d+4}{d+6}}\right)^{\frac{1}{3}} gives rise to \angle({V},{A})\leq\widetilde{\mathcal{O}}\left(\frac{1}{c_{0}}\left(n_{1}^{-\frac{2-2\delta(n_{1})}{d+6}}+Dn_{1}^{-\frac{d+4}{d+6}}\right)\right) and

\texttt{SubOpt}(\widehat{P}_{a};y^{*}=a)
\leq\underbrace{\sqrt{{\mathcal{T}}(P(x|\widehat{y}=a),P_{x};\bar{\mathcal{F}})}\cdot\widetilde{\mathcal{O}}\left(n_{2}^{-\frac{\alpha-\delta(n_{2})}{2\alpha+d}}+D/n_{2}\right)}_{\mathcal{E}_{1}}
+\underbrace{\left(\sqrt{\frac{{\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}})}{c_{0}}}\cdot\|g^{*}\|_{\infty}+\sqrt{M(a)}\right)\cdot\widetilde{\mathcal{O}}\left(\left(n_{1}^{-\frac{2-2\delta(n_{1})}{d+6}}+Dn_{1}^{-\frac{d+4}{d+6}}\right)^{\frac{1}{3}}\right)}_{\mathcal{E}_{2}}
+\underbrace{\mathbb{E}_{x\sim\widehat{P}_{a}}[h^{*}(x_{\perp})]}_{\mathcal{E}_{3}},

where M(a):=\mathbb{E}_{z\sim\mathbb{P}(a)}[\|z\|^{2}_{2}] and

\bar{\mathcal{F}}:=\{|f^{*}(x)-f(x)|^{2}:f\in\mathcal{F}\},\quad\bar{{\mathcal{S}}}=\left\{\frac{1}{T-t_{0}}\int_{t_{0}}^{T}\mathbb{E}_{x_{t}\mid x}\|\nabla\log p_{t}(x_{t}\mid y)-s(x_{t},y,t)\|_{2}^{2}\mathrm{d}t:s\in{\mathcal{S}}\right\},

and \mathcal{E}_{3} penalizes the component of \widehat{P}_{a} that is off the subspace. The function classes \mathcal{F} and {\mathcal{S}} are chosen as \mathcal{F}={\rm NN}(L_{f},M_{f},J_{f},K_{f},\kappa_{f}) with

L_{f}=\mathcal{O}(\log n_{2}),\ M_{f}=\mathcal{O}\left(n_{2}^{-\frac{d}{d+2\alpha}}(\log n_{2})^{d/2}\right),\ J_{f}=\mathcal{O}\left(n_{2}^{-\frac{d}{d+2\alpha}}(\log n_{2})^{d/2+1}\right),
K_{f}=1,\ \kappa_{f}=\mathcal{O}\left(\sqrt{\log n_{2}}\right),

and 𝒮=NN(Ls,Ms,Js,Ks,κs){\mathcal{S}}={\rm NN}(L_{s},M_{s},J_{s},K_{s},\kappa_{s}) with

Ls=𝒪(logn1+d),Ms=𝒪(dd/2n1d+2d+6(logn1)d/2),Js=𝒪(dd/2n1d+2d+6(logn1)d/2+1)\displaystyle L_{s}=\mathcal{O}(\log n_{1}+d),\ M_{s}=\mathcal{O}\left(d^{d/2}n_{1}^{-\frac{d+2}{d+6}}(\log n_{1})^{d/2}\right),\ J_{s}=\mathcal{O}\left(d^{d/2}n_{1}^{-\frac{d+2}{d+6}}(\log n_{1})^{d/2+1}\right)
Ks=𝒪(dlog(dn1)),κs=𝒪(dlog(n1d)).\displaystyle\hskip 108.405ptK_{s}=\mathcal{O}\left(d\log(dn_{1})\right),\ \kappa_{s}=\mathcal{O}\left(\sqrt{d\log(n_{1}d)}\right).

Moreover, 𝒮{\mathcal{S}} is also Lipschitz with respect to (x,y)(x,y) and the Lipschitz coefficient is clip=𝒪(10dClip)c_{\rm lip}=\mathcal{O}\left(10dC_{\rm lip}\right).

The proof is provided in Appendix D.2. Here the \delta(n) term accounts for the unbounded domain of x and is negligible when n is large. As a consequence, we prove that under a nonparametric setting, Algorithm 1 enjoys an \widetilde{\mathcal{O}}\big{(}\texttt{DistroShift}(a)\cdot\big{(}n_{2}^{-\frac{\alpha}{2\alpha+d}}+n_{1}^{-\frac{2}{3(d+6)}}\big{)}\big{)} suboptimality bound when the neural networks are properly chosen.

6.3 Results under Human Preference

In this section, we study the setup where real-valued reward yy is not observable, and the labeled dataset consists of pairwise comparisons, i.e., 𝒟label={xi(1),xi(2),ui}i=1n2\mathcal{D}_{\rm label}=\{x_{i}^{(1)},x_{i}^{(2)},u_{i}\}_{i=1}^{n_{2}}, where uiu_{i} is the human preference between {xi(1),xi(2)}\{x_{i}^{(1)},x_{i}^{(2)}\} for every i[n2]i\in[n_{2}]. Recall that we model human preference by the Bradley-Terry model, and the choice probability is explicitly given in (3.2):

(ux(1),x(2))=exp(f(u))exp(f(x(1)))+exp(f(x(2))).\displaystyle{\mathbb{P}}(u\mid x^{(1)},x^{(2)})=\frac{\exp(f^{*}(u))}{\exp(f^{*}(x^{(1)}))+\exp(f^{*}(x^{(2)}))}.

Under such a model, changing the ground truth ff^{*} by a constant does not change the human choice probability. Thus the value of ff^{*} is in general unidentifiable. For identifiability, we make the following assumption:

Assumption 6.8 .

Suppose the ground truth reward ff^{*} is linear as in Assumption 6.1. To identify θ\theta^{*} with human preference, we assume that i=1Dθi=0\sum_{i=1}^{D}\theta_{i}^{*}=0.

In the rest of this section, we denote the space of parameter θ\theta as Θ={θDθ21,i=1Dθi=0}\Theta=\{\theta\in\mathbb{R}^{D}\mid\|\theta\|_{2}\leq 1,\sum_{i=1}^{D}\theta_{i}=0\}. We remark that Assumption 6.8 is imposed only for technical reasons and does not compromise the generality of our analysis. Such identification assumptions are widely made in previous works on choice models (Zhu et al., 2023; Aguirregabiria and Mira, 2010; Bajari et al., 2015; Chernozhukov et al., 2022; Fan et al., 2023).

We learn the underlying ground truth reward function by maximum likelihood estimation under linear function approximation. Specifically, by the human choice model (3.2), we apply the following minimization in Step 2 of Algorithm 1:

argminθΘ1n2i=1n2{θuilog(exp(θxi(1))+exp(θxi(2)))}.\displaystyle\mathop{\mathrm{argmin}}_{\theta\in\Theta}~{}-\frac{1}{n_{2}}\sum_{i=1}^{n_{2}}\bigg{\{}\theta^{\top}u_{i}-\log\bigg{(}\exp(\theta^{\top}x_{i}^{(1)})+\exp(\theta^{\top}x_{i}^{(2)})\bigg{)}\bigg{\}}.
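For illustration, the minimization above can be carried out as follows (a minimal SciPy sketch of our own; the function names and the use of the SLSQP solver are our choices rather than part of Algorithm 1). The constraints encode \Theta=\{\theta:\|\theta\|_{2}\leq 1,\sum_{i}\theta_{i}=0\}.

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(theta, x1, x2, u):
    """Negative Bradley-Terry log-likelihood; row i of x1/x2 is the pair, u[i] is the preferred item."""
    s1, s2 = x1 @ theta, x2 @ theta                      # linear rewards theta^T x
    return -np.mean(u @ theta - np.logaddexp(s1, s2))    # log(exp(s1) + exp(s2)) via logaddexp

def fit_preference_mle(x1, x2, u):
    D = x1.shape[1]
    cons = [{"type": "eq",   "fun": lambda th: th.sum()},        # sum_i theta_i = 0
            {"type": "ineq", "fun": lambda th: 1.0 - th @ th}]   # ||theta||_2 <= 1
    res = minimize(neg_log_likelihood, np.zeros(D), args=(x1, x2, u),
                   method="SLSQP", constraints=cons)
    return res.x                                          # theta_hat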

Then we establish the following sub-optimality guarantee with human preference.

Theorem 6.9.

Suppose Assumptions 3.1, 5.3, 6.1 and 6.8 hold. We set t0=((Dd2+D2d)/n1)1/6t_{0}=\big{(}(Dd^{2}+D^{2}d)/n_{1}\big{)}^{1/6} and ν=1/D\nu=1/\sqrt{D}. Then for an arbitrary λ>0\lambda>0, with high probability, running Algorithm 1 with a target reward value aa, we have

SubOpt(P^a;y=a)\displaystyle\texttt{SubOpt}(\widehat{P}_{a};y^{*}=a) Tr(Σ~λ1ΣPa)𝒪(d+logn2+λn2)1:off-policy bandit regret\displaystyle\leq\underbrace{\sqrt{\operatorname{Tr}(\widetilde{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}})}\cdot\mathcal{O}\left(\sqrt{\frac{d+\log n_{2}+\lambda}{n_{2}}}\right)}_{\mathcal{E}_{1}:\textrm{off-policy bandit regret}}
+𝙳𝚒𝚜𝚝𝚛𝚘𝚂𝚑𝚒𝚏𝚝(a)(d2D+D2d)1/6n11/6a2:on-support diffusion error+𝔼P^a[h(x)]3:off-support diffusion error,\displaystyle\quad+\underbrace{{\tt DistroShift}(a)\cdot\left(d^{2}D+D^{2}d\right)^{1/6}{n_{1}}^{-1/6}\cdot a}_{\mathcal{E}_{2}:\textrm{on-support diffusion error}}+\underbrace{\mathbb{E}_{\widehat{P}_{a}}[h^{*}(\|x_{\perp}\|)]}_{\mathcal{E}_{3}:\textrm{off-support diffusion error}}, (6.4)

where ΣPa:=𝔼Pa[xx]\Sigma_{P_{a}}:=\mathbb{E}_{P_{a}}[xx^{\top}] and Σ~λ:=1n2(i=1n2(xi(1)xi(2))(xi(1)xi(2))+λID)\widetilde{\Sigma}_{\lambda}:=\frac{1}{n_{2}}\big{(}\sum_{i=1}^{n_{2}}(x_{i}^{(1)}-x_{i}^{(2)})(x_{i}^{(1)}-x_{i}^{(2)})^{\top}+\lambda I_{D}\big{)}.

The proof is provided in Appendix B.3.

Implications and Discussions:

1). Similar to the real-valued label setup, the dependence of \mathcal{E}_{1} on d shows that the diffusion model learns a low-dimensional representation of x, reducing the ambient dimension D to the smaller latent dimension d. We can prove that \operatorname{Tr}(\widetilde{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}})\leq\mathcal{O}\left(\frac{a^{2}}{\|{\beta}^{*}\|_{\Sigma}}+d\right) when n_{2}=\Omega(\frac{1}{\lambda_{\min}}). Such a result shows that even without an observable real-valued reward, we can still obtain an off-policy bandit regret that matches the case with an explicitly labeled dataset.
2). By setting t0=((Dd2+D2d)/n1)1/6t_{0}=\left((Dd^{2}+D^{2}d)/n_{1}\right)^{1/6}, we still have

2=(𝙳𝚒𝚜𝚝𝚛𝚘𝚂𝚑𝚒𝚏𝚝(a)(d2D+D2d)1/6n11/6a),\displaystyle\mathcal{E}_{2}=\left({\tt DistroShift}(a)\cdot\left(d^{2}D+D^{2}d\right)^{1/6}{n_{1}}^{-1/6}\cdot a\right),

where {\tt DistroShift}(a) quantifies the distribution shift, which depends on the target reward value. In the special case where the latent covariance matrix \Sigma is known, we can quantify the distribution shift as {\tt DistroShift}(a)=\mathcal{O}(a\vee d). 3). Learning from human preferences is almost as good as learning from an explicitly labeled dataset. Specifically, our results in the human preference setup match those in the setup with observable real-valued labels. We also claim that Algorithm 1 can provably improve the reward relative to the dataset samples. If n_{2}=o(n_{1}^{1/3}) and the off-support diffusion error is negligible, then, denoting the average reward in \mathcal{D}_{\rm label} by s and selecting a=O(s+\frac{d}{n_{2}}\sqrt{\log n_{2}}), we have \mathbb{E}_{\widehat{P}_{a}}[f^{*}(x)]\geq s with high probability. To the best of our knowledge, this is the first theoretical attempt to understand the reward improvement of conditional diffusion. These results suggest a connection between diffusion model theory and off-policy bandit learning, which is an interesting direction for future research. See proofs in Appendix B.3.

7 Numerical Experiments

We now present numerical results in both simulated and real data environments to showcase the performance and validate our established theories.

7.1 Synthetic Data

We consider a data-driven product design optimization problem. The goal is to generate new designs that attain high rewards from customers, based on pre-collected customer feedback. Let x\in\mathbb{R}^{D} denote the product design. We consider x to have a linear subspace structure, i.e., x=Az for some unknown matrix A and latent variable z\in\mathbb{R}^{d}. We set D=64 and d=16. The representation matrix A is randomly sampled with orthonormal columns. The latent variable z follows the standard Gaussian distribution {\sf N}(0,I_{d}). For each product design, we associate a reward function of the form in Assumption 6.1. In particular, for the linear part, we generate \beta^{*} by uniformly sampling from the unit sphere and then fixing it. The penalty term is chosen as h^{*}(x)=-5\|x\|_{2}^{2}.
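A minimal numpy sketch of this data-generating process is given below (our own illustration; we assume the linear part of the reward acts on the latent variable through \beta^{*}, the sample size n is our own choice, and the training designs lie exactly on the subspace so the penalty term vanishes on the data).

import numpy as np

rng = np.random.default_rng(0)
D, d, n = 64, 16, 100_000

# representation matrix A with orthonormal columns, sampled once and then fixed
A, _ = np.linalg.qr(rng.standard_normal((D, d)))        # A^T A = I_d

# latent z ~ N(0, I_d) and design x = A z
z = rng.standard_normal((n, d))
x = z @ A.T

# reward: linear part with beta* drawn uniformly from the unit sphere, plus penalty -5 ||x_perp||^2
beta_star = rng.standard_normal(d)
beta_star /= np.linalg.norm(beta_star)
x_perp = x - x @ (A @ A.T)                              # off-support component (zero for x = A z)
reward = z @ beta_star - 5.0 * np.sum(x_perp ** 2, axis=1)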

To implement Algorithm 1, we use ridge regression to estimate the reward function and train a diffusion model using a 1-dimensional version of the UNet (Ronneberger et al., 2015) to learn the score function. More training configurations and code are deferred to Appendix F.
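The reward-estimation step of Algorithm 1 can be sketched as follows (our own illustration; lam and nu are placeholder hyperparameters). The estimator has the ridge form used in the analysis (cf. (B.6) in Appendix B.3), and the pseudo-labels \widehat{y}=\widehat{f}(x)+\xi with \xi\sim{\sf N}(0,\nu^{2}) are then used to train the conditional score network.

import numpy as np

def ridge_reward_estimate(X_label, y_label, lam=1.0):
    """theta_hat = (X^T X + lam I)^{-1} X^T y, cf. the ridge estimator (B.6)."""
    D = X_label.shape[1]
    return np.linalg.solve(X_label.T @ X_label + lam * np.eye(D), X_label.T @ y_label)

def pseudo_label(theta_hat, X_unlabel, nu, rng):
    """Pseudo-label the unlabeled designs: y_hat = f_hat(x) + xi, xi ~ N(0, nu^2)."""
    return X_unlabel @ theta_hat + nu * rng.standard_normal(len(X_unlabel))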

Figure 4 shows the average reward of the generated samples under different target reward values. We also plot the distribution shift and off-support deviation in terms of the l2l_{2}-norm distance from the support. For small target reward values, the generated average reward almost scales linearly with the target value, which is consistent with the theory as the distribution shift remains small for these target values. The generated reward begins to decrease as we further increase the target reward value, and the reason is twofold. Firstly, the off-support deviation of the generated samples becomes large in this case, which prevents the generated reward from further going up. Secondly, the distribution shift increases rapidly as we further increase the target value. In Figure 5, we show the distribution of the rewards in the generated samples. As we increase the target reward value, the generated rewards become less concentrated and are shifted to the left of the target value, which is also due to the distribution shift and off-support deviation.

Figure 4: Quality of generated samples as the target reward value increases. Left: average reward of the generation; Middle: distribution shift; Right: off-support deviation. The error bar is computed as 2 times the standard deviation over 5 runs.
Figure 5: Shifting reward distribution of the generated population.

7.2 Directed Text-to-Image Generation

Next, we empirically verify our theory through directed text-to-image generation. Instead of training a diffusion model from scratch, we use Stable Diffusion v1.5 (Rombach et al., 2022), pre-trained on the LAION dataset (Schuhmann et al., 2022). Stable Diffusion operates on the latent space of its Variational Auto-Encoder and can incorporate text conditions. We show that by training a reward model we can further guide the Stable Diffusion model to generate images with desired properties.

Ground-truth Reward Model. We start from an ImageNet (Deng et al., 2009) pre-trained ResNet-18 (He et al., 2016) model and replace the final prediction layer with a randomly initialized linear layer with scalar output. We then use this model as the ground-truth reward model. To investigate what this randomly generated reward model encodes, we draw random samples and manually inspect the images with high and low rewards. The ground-truth reward model appears to favor colorful and vivid natural scenes over monochrome and dull images; see Appendix F for sample images.

Labeled Dataset. We use the ground-truth reward model to compute a scalar output for each instance in the CIFAR-10 (Krizhevsky et al., 2009) training dataset and perturb the output by adding Gaussian noise drawn from \mathcal{N}(0,0.01). We use the images and the corresponding outputs as the training dataset.

Reward-network Training. To avoid adding an additional input to the diffusion model and tuning new parameters, we introduce a new network \mu_{\theta} and approximate p_{t}(y|x_{t}) by {\sf N}(\mu_{\theta}(x_{t}),\sigma^{2}). For simplicity, we treat \sigma^{2} as a tunable hyperparameter. We share network parameters across different noise levels t, so \mu_{\theta} takes no additional input t. We train \mu_{\theta} by minimizing the expected KL divergence between p_{t}(y|x_{t}) and {\sf N}(\mu_{\theta}(x_{t}),\sigma^{2}):

\mathbb{E}_{t}\mathbb{E}_{x_{t}}\Big{[}\mathrm{KL}(p_{t}(y|x_{t})\,\|\,{\sf N}(\mu_{\theta}(x_{t}),\sigma^{2}))\Big{]}=\mathbb{E}_{t}\mathbb{E}_{(x_{t},y)\sim p_{t}}\frac{\|y-\mu_{\theta}(x_{t})\|_{2}^{2}}{2\sigma^{2}}+\mathrm{Constant}.

Equivalently, we train the reward model μθ\mu_{\theta} to predict the noisy reward yy from the noisy inputs xtx_{t}. Also, notice that the minimizers of the objective do not depend on the choice of σ2\sigma^{2}.
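In PyTorch-style pseudocode, one training step of the reward network amounts to the following (our own sketch; mu_theta denotes the reward network described above and alphas_cumprod the standard DDPM noise schedule, both hypothetical names):

import torch

def reward_net_loss(mu_theta, x0, y, alphas_cumprod):
    """Predict the noisy reward y from the noised input x_t; equivalent to the KL objective
    above up to the constant factor 1/(2 sigma^2)."""
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * torch.randn_like(x0)  # forward corruption
    # mu_theta takes no time input: parameters are shared across noise levels t
    return ((mu_theta(x_t).view(-1) - y.view(-1)) ** 2).mean()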

Reward-network-based Directed Diffusion. To perform reward-directed conditional diffusion, observe that xlogpt(x|y)=xlogpt(x)+xlogpt(y|x)\nabla_{x}\log p_{t}(x|y)=\nabla_{x}\log p_{t}(x)+\nabla_{x}\log p_{t}(y|x), and pt(y|x)exp(yμθ(x)222σ2)p_{t}(y|x)\propto\exp\Big{(}-\frac{\|y-\mu_{\theta}(x)\|_{2}^{2}}{2\sigma^{2}}\Big{)}. Therefore,

xlogpt(y|x)=1/σ2x[12yμθ(x)22].\nabla_{x}\log p_{t}(y|x)=-1/\sigma^{2}\cdot\nabla_{x}\Big{[}\frac{1}{2}\|y-\mu_{\theta}(x)\|_{2}^{2}\Big{]}.

In our implementation, we compute the gradient by back-propagation through \mu_{\theta} and incorporate this gradient guidance into each denoising step of the DDIM sampler (Song et al., 2021), following equation (14) of Dhariwal and Nichol (2021). We see that 1/\sigma^{2} corresponds to the weight of the gradient guidance relative to the unconditional score. In the sequel, we refer to 1/\sigma^{2} as the “guidance level” and y as the “target value”.
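A simplified sketch of how the guidance enters one denoising step is given below (our own code following the classifier-guidance form of equation (14) in Dhariwal and Nichol (2021); eps_model and mu_theta are placeholder names for the Stable Diffusion noise predictor and the reward network).

import torch

def guided_noise_prediction(eps_model, mu_theta, x_t, t, y_target, guidance, sqrt_one_minus_abar_t):
    """Add the reward guidance to the unconditional noise prediction; `guidance` plays the role of 1/sigma^2."""
    x = x_t.detach().requires_grad_(True)
    log_py_unscaled = -0.5 * ((y_target - mu_theta(x)) ** 2).sum()
    grad = torch.autograd.grad(log_py_unscaled, x)[0]     # grad_x log p_t(y|x) = guidance * grad
    eps = eps_model(x_t, t)                               # unconditional noise prediction
    # classifier-guidance style update: eps_hat = eps - sqrt(1 - abar_t) * grad_x log p_t(y|x)
    return eps - sqrt_one_minus_abar_t * guidance * grad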

Quantitative Results. We vary 1/\sigma^{2} in \{25,50,100,200,400\} and y in \{1,2,4,8,16\}. For each combination, we generate 100 images with the text prompt “A nice photo” and calculate the mean and the standard deviation of the predicted rewards and the ground-truth rewards. The results are plotted in Figure 6. From the plot, we see similar effects of increasing the target value y at different guidance levels 1/\sigma^{2}. A larger target value puts more weight on the guidance signal \nabla_{x}\mu_{\theta}(x), which successfully drives the generated images towards higher predicted rewards, but suffers more from the distribution shift between the training distribution and the reward-conditioned distribution, which renders larger gaps between the predicted rewards and the ground-truth rewards. To optimally choose a target value, we must trade off these two counteracting effects.

Figure 6: The predicted rewards and the ground-truth rewards of the generated images. At each guidance level, increasing the target y successfully directs the generation towards higher predicted rewards, but also increases the error induced by the distribution shift. The reported baseline is the expected ground-truth reward for undirected generations.

Qualitative Results. To qualitatively test the effects of the reward conditioning, we generate a set of images with increasing target values y under different text prompts and investigate the visual properties of the produced images. We isolate the effect of reward conditioning by fixing all the randomness during the generation process, so the generated images have similar semantic layouts. After hyper-parameter tuning, we find that setting 1/\sigma^{2}=100 and y\in\{2,4,6,8,10\} achieves good results across different text prompts and random seeds. We pick out typical examples and summarize the results in Figure 7, which demonstrates that as we increase the target value, the generated images become more colorful at the expense of some degradation in image quality.

Figure 7: The effects of reward-directed diffusion. Increasing the target value directs the images to be more colorful and vivid at the cost of some degradation in image quality. Leftmost: without reward conditioning. Second to last: target value y=2,4,6,8,10. The guidance level 1/\sigma^{2} is fixed to 100. The text prompts are “A cat with a glass of water” and “An astronaut on the horseback”.

7.3 Offline High-Reward Trajectory Generation in Reinforcement Learning

Lastly, we demonstrate using our proposed reward-directed conditional diffusion model for direct high-reward trajectory generation in reinforcement learning problems. We replicate the implementation of Decision Diffuser (Ajay et al., 2023) under the (Med-Expert, Hopper) setting. In Decision Diffuser, the RL trajectory and the final reward are jointly modeled with a conditional diffusion model. The policy is given by first performing reward-directed conditional diffusion to sample a trajectory of high reward and then extracting action sequences from the trajectory using a trained inverse dynamics model. We plot the mean and standard deviation of the total returns (averaged across 10 independent episodes) vs. the target rewards in Figure 8. The theoretical maximum of the reward is 400; therefore, trajectories with rewards greater than 400 are never seen during training. We observe that when we increase the target reward beyond 320, the actual total reward decreases. This is consistent with our theory: as we increase the reward guidance signal, the conditioning effect becomes stronger, but so does the distribution-shift effect. We also observe that reward-directed conditional generation yields comparable, and even marginally better, performance than state-of-the-art methods such as Trajectory Transformer (Janner et al., 2021) and MoReL (Kidambi et al., 2020).

Figure 8: Total reward vs. target reward for Decision Diffuser in the (Med-Expert, Hopper) setting.

References

  • Abbasi-Yadkori et al. (2011) Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011.
  • Aguirregabiria and Mira (2010) Victor Aguirregabiria and Pedro Mira. Dynamic discrete choice structural models: A survey. Journal of Econometrics, 156(1):38–67, 2010.
  • Ajay et al. (2022) Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657, 2022.
  • Ajay et al. (2023) Anurag Ajay, Yilun Du, Abhi Gupta, Joshua B. Tenenbaum, Tommi S. Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision making? In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=sP1fo2K9DFG.
  • Albergo et al. (2023) Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023.
  • Austin et al. (2021) Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021.
  • Bajari et al. (2015) Patrick Bajari, Victor Chernozhukov, Han Hong, and Denis Nekipelov. Identification and efficient semiparametric estimation of a dynamic discrete game. Technical report, National Bureau of Economic Research, 2015.
  • Balaji et al. (2022) Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
  • Bansal et al. (2023) Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Universal guidance for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 843–852, 2023.
  • Barrera et al. (2016) Luis A Barrera, Anastasia Vedenko, Jesse V Kurland, Julia M Rogers, Stephen S Gisselbrecht, Elizabeth J Rossin, Jaie Woodard, Luca Mariani, Kian Hong Kock, Sachi Inukai, et al. Survey of variation in human transcription factors reveals prevalent dna binding changes. Science, 351(6280):1450–1454, 2016.
  • Benton et al. (2023) Joe Benton, Valentin De Bortoli, Arnaud Doucet, and George Deligiannidis. Linear convergence bounds for diffusion models via stochastic localization. arXiv preprint arXiv:2308.03686, 2023.
  • Block et al. (2020) Adam Block, Youssef Mroueh, and Alexander Rakhlin. Generative modeling with denoising auto-encoders and langevin sampling. arXiv preprint arXiv:2002.00107, 2020.
  • Brandfonbrener et al. (2021) David Brandfonbrener, William Whitney, Rajesh Ranganath, and Joan Bruna. Offline contextual bandits with overparameterized models. In International Conference on Machine Learning, pages 1049–1058. PMLR, 2021.
  • Brookes et al. (2019) David Brookes, Hahnbeom Park, and Jennifer Listgarten. Conditioning by adaptive sampling for robust design. In International conference on machine learning, pages 773–782. PMLR, 2019.
  • Bu et al. (2023) Jinzhi Bu, David Simchi-Levi, and Li Wang. Offline pricing and demand learning with censored data. Management Science, 69(2):885–903, 2023.
  • Chen and Jiang (2019) Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pages 1042–1051. PMLR, 2019.
  • Chen et al. (2020) Minshuo Chen, Yu Bai, Jason D Lee, Tuo Zhao, Huan Wang, Caiming Xiong, and Richard Socher. Towards understanding hierarchical learning: Benefits of neural representations. Advances in Neural Information Processing Systems, 33:22134–22145, 2020.
  • Chen et al. (2022a) Minshuo Chen, Haoming Jiang, Wenjing Liao, and Tuo Zhao. Nonparametric regression on low-dimensional manifolds using deep relu networks: Function approximation and statistical recovery. Information and Inference: A Journal of the IMA, 11(4):1203–1253, 2022a.
  • Chen et al. (2023a) Minshuo Chen, Kaixuan Huang, Tuo Zhao, and Mengdi Wang. Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data. arXiv preprint arXiv:2302.07194, 2023a.
  • Chen et al. (2022b) Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. arXiv preprint arXiv:2209.11215, 2022b.
  • Chen et al. (2023b) Sitan Chen, Sinho Chewi, Holden Lee, Yuanzhi Li, Jianfeng Lu, and Adil Salim. The probability flow ode is provably fast. arXiv preprint arXiv:2305.11798, 2023b.
  • Chen et al. (2023c) Sitan Chen, Giannis Daras, and Alex Dimakis. Restoration-degradation beyond linear diffusions: A non-asymptotic analysis for ddim-type samplers. In International Conference on Machine Learning, pages 4462–4484. PMLR, 2023c.
  • Chernozhukov et al. (2022) Victor Chernozhukov, Juan Carlos Escanciano, Hidehiko Ichimura, Whitney K Newey, and James M Robins. Locally robust semiparametric estimation. Econometrica, 90(4):1501–1535, 2022.
  • Christiano et al. (2017) Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
  • De Bortoli (2022) Valentin De Bortoli. Convergence of denoising diffusion models under the manifold hypothesis. arXiv preprint arXiv:2208.05314, 2022.
  • De Bortoli et al. (2021) Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34:17695–17709, 2021.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • Dhariwal and Nichol (2021) Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
  • Duan et al. (2020) Yaqi Duan, Zeyu Jia, and Mengdi Wang. Minimax-optimal off-policy evaluation with linear function approximation. In International Conference on Machine Learning, pages 2701–2709. PMLR, 2020.
  • Esser et al. (2024) Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024.
  • Fan et al. (2020) Jianqing Fan, Zhaoran Wang, Yuchen Xie, and Zhuoran Yang. A theoretical analysis of deep q-learning. In Learning for Dynamics and Control, pages 486–489. PMLR, 2020.
  • Fan et al. (2023) Jianqing Fan, Zhipeng Lou, Weichen Wang, and Mengxin Yu. Spectral ranking inferences based on general multiway comparisons. arXiv preprint arXiv:2308.02918, 2023.
  • Fu and Levine (2021) Justin Fu and Sergey Levine. Offline model-based optimization via normalized maximum likelihood estimation. arXiv preprint arXiv:2102.07970, 2021.
  • Glaese et al. (2022) Amelia Glaese, Nat McAleese, Maja Trebacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375, 2022.
  • Gong et al. (2019) Sixue Gong, Vishnu Naresh Boddeti, and Anil K Jain. On the intrinsic dimensionality of image representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3987–3996, 2019.
  • Graikos et al. (2022) Alexandros Graikos, Nikolay Malkin, Nebojsa Jojic, and Dimitris Samaras. Diffusion models as plug-and-play priors. arXiv preprint arXiv:2206.09012, 2022.
  • Guo et al. (2023) Zhiye Guo, Jian Liu, Yanli Wang, Mengrui Chen, Duolin Wang, Dong Xu, and Jianlin Cheng. Diffusion models in bioinformatics and computational biology. Nature Reviews Bioengineering, pages 1–19, 2023.
  • Györfi et al. (2002) László Györfi, Michael Köhler, Adam Krzyżak, and Harro Walk. A distribution-free theory of nonparametric regression, volume 1. Springer, 2002.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Hebbal et al. (2019) Ali Hebbal, Loic Brevault, Mathieu Balesdent, El-Ghazali Talbi, and Nouredine Melab. Bayesian optimization using deep gaussian processes. arXiv preprint arXiv:1905.03350, 2019.
  • Ho and Salimans (2022) Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • Homburg et al. (2019) Christian Homburg, Karin Lauer, and Arnd Vomberg. The multichannel pricing dilemma: Do consumers accept higher offline than online prices? International Journal of Research in Marketing, 36(4):597–612, 2019.
  • Hsu et al. (2011) Daniel Hsu, Sham M. Kakade, and Tong Zhang. A tail inequality for quadratic forms of subgaussian random vectors, 2011.
  • Hyvärinen and Dayan (2005) Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.
  • Janner et al. (2021) Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. Advances in neural information processing systems, 34:1273–1286, 2021.
  • Janner et al. (2022) Michael Janner, Yilun Du, Joshua Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. In International Conference on Machine Learning, 2022.
  • Ji et al. (2022) Xiang Ji, Minshuo Chen, Mengdi Wang, and Tuo Zhao. Sample complexity of nonparametric off-policy evaluation on low-dimensional manifolds using deep networks. arXiv preprint arXiv:2206.02887, 2022.
  • Jin et al. (2021) Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline rl? In International Conference on Machine Learning, pages 5084–5096. PMLR, 2021.
  • Kidambi et al. (2020) Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. Morel: Model-based offline reinforcement learning. Advances in neural information processing systems, 33:21810–21823, 2020.
  • Krishnamoorthy et al. (2023) Siddarth Krishnamoorthy, Satvik Mehul Mashkaria, and Aditya Grover. Diffusion models for black-box optimization. In International Conference on Machine Learning, pages 17842–17857. PMLR, 2023.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Kshetri et al. (2023) Nir Kshetri, Yogesh K Dwivedi, Thomas H Davenport, and Niki Panteli. Generative artificial intelligence in marketing: Applications, opportunities, challenges, and research agenda, 2023.
  • Kumar and Levine (2020) Aviral Kumar and Sergey Levine. Model inversion networks for model-based optimization. Advances in Neural Information Processing Systems, 33:5126–5137, 2020.
  • Lee et al. (2022a) Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence for score-based generative modeling with polynomial complexity. arXiv preprint arXiv:2206.06227, 2022a.
  • Lee et al. (2022b) Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence of score-based generative modeling for general data distributions. arXiv preprint arXiv:2209.12381, 2022b.
  • Lee et al. (2023a) Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence of score-based generative modeling for general data distributions. In International Conference on Algorithmic Learning Theory, pages 946–985. PMLR, 2023a.
  • Lee et al. (2023b) Jin Sub Lee, Jisun Kim, and Philip M. Kim. Proteinsgm: Score-based generative modeling for de novo protein design. bioRxiv, 2023b. doi: 10.1101/2022.07.13.499967. URL https://www.biorxiv.org/content/early/2023/02/04/2022.07.13.499967.
  • Lee et al. (2021) Kimin Lee, Laura Smith, and Pieter Abbeel. Pebble: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training, 2021.
  • Levine et al. (2020) Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Li et al. (2022) Xiang Li, John Thickstun, Ishaan Gulrajani, Percy S Liang, and Tatsunori B Hashimoto. Diffusion-lm improves controllable text generation. Advances in Neural Information Processing Systems, 35:4328–4343, 2022.
  • Li et al. (2023) Zihao Li, Zhuoran Yang, and Mengdi Wang. Reinforcement learning with human feedback: Learning dynamic choices via pessimism. arXiv preprint arXiv:2305.18438, 2023.
  • Liang et al. (2023) Zhixuan Liang, Yao Mu, Mingyu Ding, Fei Ni, Masayoshi Tomizuka, and Ping Luo. Adaptdiffuser: Diffusion models as adaptive self-evolving planners. arXiv preprint arXiv:2302.01877, 2023.
  • Liu et al. (2018) Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. Advances in Neural Information Processing Systems, 31, 2018.
  • Liu et al. (2024) Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177, 2024.
  • Luenberger et al. (2021) David G Luenberger, Yinyu Ye, et al. Linear and nonlinear programming, volume 2. Springer, 2021.
  • Mei and Wu (2023) Song Mei and Yuchen Wu. Deep networks as denoising algorithms: Sample-efficient learning of diffusion models in high-dimensional graphical models. arXiv preprint arXiv:2309.11420, 2023.
  • Montanari and Wu (2023) Andrea Montanari and Yuchen Wu. Posterior sampling from the spiked models via diffusion processes. arXiv preprint arXiv:2304.11449, 2023.
  • Munos and Szepesvári (2008) Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(5), 2008.
  • Nakada and Imaizumi (2020) Ryumei Nakada and Masaaki Imaizumi. Adaptive approximation and generalization of deep neural network with intrinsic dimensionality. The Journal of Machine Learning Research, 21(1):7018–7055, 2020.
  • Nguyen-Tang et al. (2021) Thanh Nguyen-Tang, Sunil Gupta, A Tuan Nguyen, and Svetha Venkatesh. Offline neural contextual bandits: Pessimism, optimization and generalization. arXiv preprint arXiv:2111.13807, 2021.
  • Nichol et al. (2021) Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  • Oko et al. (2023) Kazusato Oko, Shunta Akiyama, and Taiji Suzuki. Diffusion models are minimax optimal distribution estimators. In ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2023. URL https://openreview.net/forum?id=6961CeTSFA.
  • Ooi et al. (2023) Keng-Boon Ooi, Garry Wei-Han Tan, Mostafa Al-Emran, Mohammed A Al-Sharafi, Alexandru Capatina, Amrita Chakraborty, Yogesh K Dwivedi, Tzu-Ling Huang, Arpan Kumar Kar, Voon-Hsien Lee, et al. The potential of generative artificial intelligence across disciplines: Perspectives and future directions. Journal of Computer Information Systems, pages 1–32, 2023.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  • Pan and Wu (2020) Yuchen Pan and Desheng Wu. A novel recommendation model for online-to-offline service based on the customer network and service location. Journal of Management Information Systems, 37(2):563–593, 2020.
  • Pan et al. (2019) Yuchen Pan, Desheng Wu, Cuicui Luo, and Alexandre Dolgui. User activity measurement in rating-based online-to-offline (o2o) service recommendation. Information Sciences, 479:180–196, 2019.
  • Pearce et al. (2023) Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Sergio Valcarcel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann, and Sam Devlin. Imitating human behaviour with diffusion models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Pv1GPQzRrC8.
  • Peters and Schaal (2007) Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th international conference on Machine learning, pages 745–750, 2007.
  • Pope et al. (2021) Phillip Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The intrinsic dimension of images and its impact on learning. arXiv preprint arXiv:2104.08894, 2021.
  • Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
  • Schlett et al. (2022) Torsten Schlett, Christian Rathgeb, Olaf Henniger, Javier Galbally, Julian Fierrez, and Christoph Busch. Face image quality assessment: A literature survey. ACM Computing Surveys, 54(10s):1–49, January 2022. ISSN 1557-7341. doi: 10.1145/3507901. URL http://dx.doi.org/10.1145/3507901.
  • Schuhmann et al. (2022) Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
  • Shahriari et al. (2015) Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015.
  • Snoek et al. (2012) Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. Advances in neural information processing systems, 25, 2012.
  • Snoek et al. (2014) Jasper Snoek, Kevin Swersky, Rich Zemel, and Ryan Adams. Input warping for bayesian optimization of non-stationary functions. In International Conference on Machine Learning, pages 1674–1682. PMLR, 2014.
  • Song et al. (2021) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=St1giarCHLP.
  • Song and Ermon (2020) Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. Advances in neural information processing systems, 33:12438–12448, 2020.
  • Song et al. (2020a) Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence, pages 574–584. PMLR, 2020a.
  • Song et al. (2020b) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b.
  • Song et al. (2023) Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023.
  • Stiennon et al. (2020) Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
  • Tenenbaum et al. (2000) Joshua B Tenenbaum, Vin de Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. science, 290(5500):2319–2323, 2000.
  • Tsybakov (2008) Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 1st edition, 2008. ISBN 0387790519.
  • Vahdat et al. (2021) Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. Advances in Neural Information Processing Systems, 34:11287–11302, 2021.
  • Vincent (2011) Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
  • Wainwright (2019) Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019.
  • Wang et al. (2021) Ningning Wang, Ting Zhang, Xiaowei Zhu, and Peimiao Li. Online-offline competitive pricing with reference price effect. Journal of the Operational Research Society, 72(3):642–653, 2021.
  • Wang et al. (2022) Xiaofei Wang, Kimin Lee, Kourosh Hakhamaneshi, Pieter Abbeel, and Michael Laskin. Skill preferences: Learning to extract and execute robotic skills from human feedback. In Conference on Robot Learning, pages 1259–1268. PMLR, 2022.
  • Wang et al. (2023) Zhizhong Wang, Lei Zhao, and Wei Xing. Stylediffusion: Controllable disentangled style transfer via diffusion models, 2023.
  • Watson et al. (2023) Joseph L Watson, David Juergens, Nathaniel R Bennett, Brian L Trippe, Jason Yim, Helen E Eisenach, Woody Ahern, Andrew J Borst, Robert J Ragotte, Lukas F Milles, et al. De novo design of protein structure and function with rfdiffusion. Nature, 620(7976):1089–1100, 2023.
  • Wibisono et al. (2024) Andre Wibisono, Yihong Wu, and Kaylee Yingxi Yang. Optimal score estimation via empirical bayes smoothing. arXiv preprint arXiv:2402.07747, 2024.
  • Wu et al. (2021) Jeff Wu, Long Ouyang, Daniel M Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862, 2021.
  • Xie et al. (2021) Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, and Alekh Agarwal. Bellman-consistent pessimism for offline reinforcement learning. Advances in neural information processing systems, 34:6683–6694, 2021.
  • Yuan et al. (2023) Hui Yuan, Kaixuan Huang, Chengzhuo Ni, Minshuo Chen, and Mengdi Wang. Reward-directed conditional diffusion: Provable distribution estimation and reward improvement. arXiv preprint arXiv:2307.07055, 2023.
  • Zhang and Agrawala (2023) Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
  • Zhang et al. (2023) Yangwenhui Zhang, Hong Qian, Xiang Shu, and Aimin Zhou. High-dimensional dueling optimization with preference embedding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 11280–11288, 2023.
  • Zhu et al. (2023) Banghua Zhu, Jiantao Jiao, and Michael I Jordan. Principled reinforcement learning with human feedback from pairwise or kk-wise comparisons. arXiv preprint arXiv:2301.11270, 2023.
  • Žiga Babnik et al. (2023) Žiga Babnik, Peter Peer, and Vitomir Štruc. Diffiqa: Face image quality assessment using denoising diffusion probabilistic models, 2023.

Appendix A Conditional Score Matching and Generation

A.1 Proof of Proposition 4.1

For any t\geq 0, it holds that \nabla_{x_{t}}\log p_{t}(x_{t}\mid y)=\nabla_{x_{t}}\log p_{t}(x_{t},y) since the gradient is taken w.r.t. x_{t} only. Plugging in this identity and expanding the squared norm on the left-hand side gives

𝔼(xt,y)Pt[xtlogpt(xt,y)s(xt,y,t)22]=𝔼(xt,y)Pt[s(xt,y,t)222xtlogpt(xt,y),s(xt,y,t)]+C.\displaystyle\mathbb{E}_{(x_{t},y)\sim P_{t}}\left[\|\nabla_{x_{t}}\log p_{t}(x_{t},y)-s(x_{t},y,t)\|_{2}^{2}\right]=\mathbb{E}_{(x_{t},y)\sim P_{t}}\big{[}\|s(x_{t},y,t)\|_{2}^{2}-2\langle\nabla_{x_{t}}\log p_{t}(x_{t},y),s(x_{t},y,t)\rangle\big{]}+C.

Then it suffices to prove

𝔼(xt,y)Pt[xtlogpt(xt,y),s(xt,y,t)]=𝔼(x,y)Pxy^𝔼x𝖭(α(t)x,h(t)I)[xϕt(xx),s(x,y,t)]\mathbb{E}_{(x_{t},y)\sim P_{t}}\left[\langle\nabla_{x_{t}}\log p_{t}(x_{t},y),s(x_{t},y,t)\rangle\right]=\mathbb{E}_{(x,y)\sim P_{x\widehat{y}}}\mathbb{E}_{x^{\prime}\sim{\sf N}(\alpha(t)x,h(t)I)}\left[\langle\nabla_{x^{\prime}}\phi_{t}(x^{\prime}\mid x),s(x^{\prime},y,t)\rangle\right]

Using integration by parts to rewrite the inner product we have

𝔼(xt,y)Pt[xtlogpt(xt,y),s(xt,y,t)]\displaystyle\mathbb{E}_{(x_{t},y)\sim P_{t}}\left[\langle\nabla_{x_{t}}\log p_{t}(x_{t},y),s(x_{t},y,t)\rangle\right] =pt(xt,y)xtlogpt(xt,y),s(xt,y,t)𝑑xt𝑑y\displaystyle=\int p_{t}(x_{t},y)\langle\nabla_{x_{t}}\log p_{t}(x_{t},y),s(x_{t},y,t)\rangle dx_{t}dy
=xtpt(xt,y),s(xt,y,t)𝑑xt𝑑y\displaystyle=\int\langle\nabla_{x_{t}}p_{t}(x_{t},y),s(x_{t},y,t)\rangle dx_{t}dy
=pt(xt,y)div(s(xt,y,t))𝑑xt𝑑y,\displaystyle=-\int p_{t}(x_{t},y)\operatorname{div}(s(x_{t},y,t))dx_{t}dy,

where \phi_{t}(x^{\prime}|x) denotes the density of {\sf N}(\alpha(t)x,h(t)I_{D}) with \alpha(t)=\exp(-t/2) and h(t)=1-\exp(-t). Then

pt(xt,y)div(s(xt,y,t))𝑑xt𝑑y\displaystyle-\int p_{t}(x_{t},y)\operatorname{div}(s(x_{t},y,t))dx_{t}dy =𝔼(x,y)Pxy^ϕt(xx)div(s(x,y,t))𝑑x\displaystyle=-\mathbb{E}_{(x,y)\sim P_{x\widehat{y}}}\int\phi_{t}(x^{\prime}\mid x)\operatorname{div}(s(x^{\prime},y,t))dx^{\prime}
=𝔼(x,y)Pxy^xϕt(xx),s(x,y,t)𝑑x\displaystyle=\mathbb{E}_{(x,y)\sim P_{x\widehat{y}}}\int\langle\nabla_{x^{\prime}}\phi_{t}(x^{\prime}\mid x),s(x^{\prime},y,t)\rangle dx^{\prime}
=𝔼(x,y)Pxy^𝔼x𝖭(α(t)x,h(t)I)[xϕt(xx),s(x,y,t)].\displaystyle=\mathbb{E}_{(x,y)\sim P_{x\widehat{y}}}\mathbb{E}_{x^{\prime}\sim{\sf N}(\alpha(t)x,h(t)I)}\left[\langle\nabla_{x^{\prime}}\phi_{t}(x^{\prime}\mid x),s(x^{\prime},y,t)\rangle\right].

Appendix B Omitted Proofs in Section 5

Additional Notations: We follow the notations in the main paper along with some additional ones. We use P_{t}^{LD}(z) to denote the low-dimensional distribution on z corrupted by diffusion noise. Formally, p_{t}^{LD}(z^{\prime})=\int\phi_{t}(z^{\prime}|z)p_{z}(z)\mathrm{d}z with \phi_{t}(\cdot|z) being the density of {\sf N}(\alpha(t)z,h(t)I_{d}). We write P^{LD}_{t_{0}}(z\mid\widehat{f}(Az)=a) for the corresponding conditional distribution given \widehat{f}(Az)=a at time t_{0}, with shorthand P^{LD}_{t_{0}}(a). We also abbreviate P_{z}(z\mid\widehat{f}(Az)=a) as P^{LD}(a). In our theorems, \mathcal{O} hides constant factors and higher-order terms in n_{1}^{-1} and n_{2}^{-1}, and \widetilde{\mathcal{O}} further hides logarithmic terms and may also hide factors in d.

B.1 Parametric Conditional Score Matching Error

The theorems presented in Section 5 are built upon the conditional score estimation error, which has been studied in Chen et al. [2023a] for general distributions; in Lemma B.1 we provide a new bound specific to our setting, where the true score is linear in the input (x_{t},\widehat{y}) due to the Gaussian design. Despite the linearity of the score in the Gaussian case, we emphasize that matching the score in (4.2) is not simply linear regression, as {\mathcal{S}} contains an encoder-decoder structure for estimating the matrix A to reduce dimension (see §E for the construction of {\mathcal{S}} and more proof details).

In the following lemma, we first present a general result for the case where the true score is within {\mathcal{S}}, which is constructed as a parametric function class. The score matching error is then bounded in terms of \mathcal{N}({\mathcal{S}},1/n_{1}), the \log covering number of {\mathcal{S}} (recall that n_{1} is the size of \mathcal{D}_{\rm unlabel}). Instantiating this general result, we derive the score matching error for the Gaussian case by upper bounding \mathcal{N}({\mathcal{S}},1/n_{1}) in this special case.

Lemma B.1.

Under Assumption 3.1, if logpt(xy)𝒮\nabla\log p_{t}(x\mid y)\in{\mathcal{S}}, where

𝒮={𝐬V,ψ(x,y,t)=1h(t)(Vψ(Vx,y,t)x)\displaystyle{\mathcal{S}}=\bigg{\{}\mathbf{s}_{V,\psi}(x,y,t)=\frac{1}{h(t)}(V\cdot\psi(V^{\top}x,y,t)-x) :VD×d,ψΨ:d+1×[t0,T]d},\displaystyle:~{}V\in\mathbb{R}^{D\times d},~{}\psi\in\Psi:\mathbb{R}^{d+1}\times[t_{0},T]\to\mathbb{R}^{d}~{}\bigg{\}}, ((4.8) revisited)

with Ψ\Psi parametric. Then for δ0\delta\geq 0, with probability 1δ1-\delta, the square score matching error is bounded by ϵdiff2=𝒪(1t0𝒩(𝒮,1/n1)(d2D)log1δn1)\epsilon^{2}_{diff}=\mathcal{O}\left(\frac{1}{t_{0}}\sqrt{\frac{\mathcal{N}({\mathcal{S}},1/n_{1})(d^{2}\vee D)\log\frac{1}{\delta}}{n_{1}}}\right), i.e.,

1Tt0t0T𝔼(xt,y)Pt[logpt(xt|y)s^(xt,y,t)22]dtϵdiff2,\frac{1}{T-t_{0}}\int_{t_{0}}^{T}\mathbb{E}_{(x_{t},y)\sim P_{t}}\left[\|\nabla\log p_{t}(x_{t}|y)-\widehat{s}(x_{t},y,t)\|_{2}^{2}\right]\mathrm{d}t\leq\epsilon_{diff}^{2}, (B.1)

The proof of this lemma can be found in Appendix E. Recall that P_{t} comes from P_{x\widehat{y}} by noising x at time t in the forward process. Under Assumption 5.3 and given \widehat{f}(x)=\widehat{\theta}^{\top}x and \widehat{y}=\widehat{f}(x)+\xi,\xi\sim{\sf N}(0,\nu^{2}), the score function \nabla\log p_{t}(x\mid\widehat{y}) to approximate is linear in x and \widehat{y}. When it is approximated by {\mathcal{S}} with \Psi linear, \mathcal{N}({\mathcal{S}},1/n_{1})=\mathcal{O}((d^{2}+Dd)\log(Ddn_{1})).
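For intuition, the parameterization (4.8) of the conditional score network can be written as a small module (a PyTorch-style sketch with our own names; here \psi is a generic two-layer network, whereas the linear-\Psi case discussed above would replace it with a linear map):

import torch
import torch.nn as nn

class ConditionalScoreNet(nn.Module):
    """s_{V,psi}(x, y, t) = (V psi(V^T x, y, t) - x) / h(t), cf. (4.8)."""
    def __init__(self, D, d, hidden=128):
        super().__init__()
        self.V = nn.Parameter(torch.randn(D, d) / D ** 0.5)   # learned subspace matrix V
        self.psi = nn.Sequential(nn.Linear(d + 2, hidden), nn.SiLU(), nn.Linear(hidden, d))

    def forward(self, x, y, t):
        h_t = 1.0 - torch.exp(-t)                              # h(t) = 1 - e^{-t}
        latent = x @ self.V                                     # encoder: V^T x
        inp = torch.cat([latent, y[:, None], t[:, None]], dim=-1)
        return (self.psi(inp) @ self.V.T - x) / h_t[:, None]    # decoder: V psi(.) - x, scaled by 1/h(t)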

To provide fidelity and reward guarantees for \widehat{P}_{a}, the generated distribution of x conditioned on \widehat{y}=a, we need the following lemma. It provides a subspace recovery guarantee between V (the score matching output) and A (the ground truth), as well as a distance bound between the distributions P_{a} and \widehat{P}_{a}, given score matching error \epsilon_{diff}.

Note PaP_{a} and P^a\widehat{P}_{a} are over xx, which admits an underlying low-dimensional structure x=Azx=Az. Thus we measure distance between PaP_{a} and P^a\widehat{P}_{a} by defining

Definition B.2.

We define TV(P^a):=dTV(Pt0LD(zf^(Az)=a),(UV)#P^a)TV(\widehat{P}_{a}):=\texttt{d}_{\rm TV}\left(P^{LD}_{t_{0}}(z\mid\widehat{f}(Az)=a),(U^{\top}V^{\top})_{\#}\widehat{P}_{a}\right) with notations:

  • \texttt{d}_{\rm TV}(\cdot,\cdot) is the TV distance between two distributions.

  • fPf_{\sharp}P denotes a push-forward measure, i.e., for any measurable Ω\Omega, (fP)(Ω)=P(f1(Ω))(f_{\sharp}P)(\Omega)=P(f^{-1}(\Omega))

  • (V)#P^a(V^{\top})_{\#}\widehat{P}_{a} pushes generated P^a\widehat{P}_{a} forward to the low dimensional subspace using learned subspace matrix VV. UU is an orthonormal matrix of dimension dd.

  • P^{LD}_{t_{0}}(z\mid\widehat{f}(Az)=a) is close to (A^{\top})_{\#}P_{a}, with t_{0} accounting for the early stopping in the backward process.

We note that there is a distribution shift between the training and the generated data, which has a profound impact on the generative performance. We quantify the influence of distribution shift by the following class restricted divergence measure.

Definition B.3.

Distribution shift between two arbitrary distributions P1P_{1} and P2P_{2} restricted under function class \mathcal{L} is defined as

\textstyle{\mathcal{T}}(P_{1},P_{2};\mathcal{L})=\sup_{l\in\mathcal{L}}\mathbb{E}_{x\sim P_{1}}[l(x)]/\mathbb{E}_{x\sim P_{2}}[l(x)].

Quantities of the form in Definition B.3 are widely used in the bandit and RL literature [Munos and Szepesvári, 2008, Liu et al., 2018, Chen and Jiang, 2019, Fan et al., 2020].

Lemma B.4.

Suppose the squared score matching error in (B.1) is upper bounded by \epsilon_{diff}^{2}, and P_{z} satisfies Assumption 5.2 with c_{0}I_{d}\preceq\mathbb{E}_{z\sim P_{z}}\left[zz^{\top}\right]. Then, for x\sim\widehat{P}_{a} and \angle({V},{A}):=\|VV^{\top}-AA^{\top}\|^{2}_{\rm F}, it is guaranteed that

(IDVV)x𝖭(0,Λ),Λct0ID,\displaystyle(I_{D}-VV^{\top})x\sim{\sf N}(0,\Lambda),\quad\Lambda\prec ct_{0}I_{D}, (B.2)
(V,A)=𝒪~(t0c0ϵdiff2).\displaystyle\angle({V},{A})=\widetilde{\mathcal{O}}\left(\frac{t_{0}}{c_{0}}\cdot\epsilon^{2}_{diff}\right). (B.3)

In addition,

TV(P^a)=𝒪~(𝒯(P(x,y^=a),Pxy^;𝒮¯)c0ϵdiff).TV(\widehat{P}_{a})=\widetilde{\mathcal{O}}\left(\sqrt{\frac{{\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}})}{c_{0}}}\cdot\epsilon_{diff}\right). (B.4)

where \bar{{\mathcal{S}}}=\left\{\frac{1}{T-t_{0}}\int_{t_{0}}^{T}\mathbb{E}_{x_{t}\mid x}\|\nabla\log p_{t}(x_{t}\mid y)-s(x_{t},y,t)\|_{2}^{2}\mathrm{d}t:s\in{\mathcal{S}}\right\}. Here TV(\widehat{P}_{a}) and {\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}}) are defined in Definitions B.2 and B.3, respectively. The proof of Lemma B.4 is in Appendix C.2.

B.2 Proof of Theorem 5.4 and Theorem 6.2

Proof of Theorem 5.4: bounding (V,A)\angle({V},{A}). By Lemma 3 of Chen et al. [2023a], we have

(V,A)=𝒪(t0c0ϵdiff2)\angle({V},{A})={\mathcal{O}}\left(\frac{t_{0}}{c_{0}}\cdot\epsilon^{2}_{diff}\right)

when the latent z satisfies Assumption 5.2 and c_{0}I_{d}\preceq\mathbb{E}_{z\sim P_{z}}\left[zz^{\top}\right]. Therefore, by (B.1), we have with high probability that

(V,A)=𝒪~(1c0𝒩(𝒮,1/n1)(Dd2)n1).\angle({V},{A})=\widetilde{\mathcal{O}}\left(\frac{1}{c_{0}}\sqrt{\frac{\mathcal{N}({\mathcal{S}},1/n_{1})(D\vee d^{2})}{n_{1}}}\right).

When Assumption 5.3 holds, plugging in c0=λminc_{0}=\lambda_{\min} and 𝒩(𝒮,1/n1)=𝒪((d2+Dd)log(Ddn1))\mathcal{N}({\mathcal{S}},1/n_{1})=\mathcal{O}((d^{2}+Dd)\log(Ddn_{1})), it gives

(V,A)=𝒪~(1λmin(Dd2)d2+(Dd2)Ddn1),\angle({V},{A})=\widetilde{\mathcal{O}}\left(\frac{1}{\lambda_{\min}}\sqrt{\frac{(D\vee d^{2})d^{2}+(D\vee d^{2})Dd}{n_{1}}}\right),

where 𝒪~\widetilde{\mathcal{O}} hides logarithmic terms. When D>d2D>d^{2}, which is often the case in practical applications, we have

(V,A)=𝒪~(1λminDd2+D2dn1).\displaystyle\angle({V},{A})=\widetilde{\mathcal{O}}\left(\frac{1}{\lambda_{\min}}\sqrt{\frac{Dd^{2}+D^{2}d}{n_{1}}}\right).

Proof of Theorem 6.2: bounding \mathbb{E}_{x\sim\widehat{P}_{a}}[\|x_{\perp}\|_{2}]. By the definition of x_{\perp}, i.e., x_{\perp}=(I_{D}-AA^{\top})x,

𝔼xP^a[x2]=𝔼xP^a[(IDAA)x2]𝔼xP^a[(IDAA)x22].\mathbb{E}_{x\sim\widehat{P}_{a}}[\|x^{\perp}\|_{2}]=\mathbb{E}_{x\sim\widehat{P}_{a}}[\|(I_{D}-AA^{\top})x\|_{2}]\leq\sqrt{\mathbb{E}_{x\sim\widehat{P}_{a}}[\|(I_{D}-AA^{\top})x\|_{2}^{2}]}.

Score matching returns VV as an approximation of AA, then

(IDAA)x2\displaystyle\|(I_{D}-AA^{\top})x\|_{2} (IDVV)x2+(VVAA)x2,\displaystyle\leq\|(I_{D}-VV^{\top})x\|_{2}+\|(VV^{\top}-AA^{\top})x\|_{2},
𝔼xP^a[(IDAA)x22]\displaystyle\mathbb{E}_{x\sim\widehat{P}_{a}}[\|(I_{D}-AA^{\top})x\|^{2}_{2}] 2𝔼xP^a[(IDVV)x22]+2𝔼xP^a[(VVAA)x22],\displaystyle\leq 2\mathbb{E}_{x\sim\widehat{P}_{a}}[\|(I_{D}-VV^{\top})x\|^{2}_{2}]+2\mathbb{E}_{x\sim\widehat{P}_{a}}[\|(VV^{\top}-AA^{\top})x\|^{2}_{2}],

where by (B.2) in Lemma B.4 we have

(IDVV)x𝖭(𝟢,Λ),Λ𝖼𝗍𝟢𝖨(I_{D}-VV^{\top})x\sim\sf N(0,\Lambda),\quad\Lambda\prec ct_{0}I

for some constant c0c\geq 0. Thus

𝔼xP^a[(IDVV)x22]=Tr(Λ)ct0D.\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|(I_{D}-VV^{\top})x\|^{2}_{2}\right]=\operatorname{Tr}(\Lambda)\leq ct_{0}D. (B.5)

On the other hand,

(VVAA)x22VVAAop2x22VVAAF2x22,\|(VV^{\top}-AA^{\top})x\|^{2}_{2}\leq\|VV^{\top}-AA^{\top}\|^{2}_{op}\|x\|^{2}_{2}\leq\|VV^{\top}-AA^{\top}\|^{2}_{F}\|x\|^{2}_{2},

where \|VV^{\top}-AA^{\top}\|^{2}_{F} has an upper bound as in (B.3) and \mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|x\|^{2}_{2}\right] is bounded in Lemma C.3 by

\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|x\|^{2}_{2}\right]=\mathcal{O}\left(ct_{0}D+M(a)\cdot(1+TV(\widehat{P}_{a}))\right).

with M(a)=O(a2βΣ+d)M(a)=O\left(\frac{a^{2}}{\|{\beta}^{*}\|_{\Sigma}}+d\right).

Therefore, combining the bounds above, we have

𝔼xP^a[x2]\displaystyle\mathbb{E}_{x\sim\widehat{P}_{a}}[\|x^{\perp}\|_{2}] 2𝔼xP^a[(IDVV)x22]+2𝔼xP^a[(VVAA)x22]\displaystyle\leq\sqrt{2\mathbb{E}_{x\sim\widehat{P}_{a}}[\|(I_{D}-VV^{\top})x\|^{2}_{2}]+2\mathbb{E}_{x\sim\widehat{P}_{a}}[\|(VV^{\top}-AA^{\top})x\|^{2}_{2}]}
ct0D+2(V,A)𝔼xP^a[x22]\displaystyle\leq c^{\prime}\sqrt{t_{0}D}+2\sqrt{\angle({V},{A})}\cdot\sqrt{\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|x\|^{2}_{2}\right]}
=𝒪(t0D+(V,A)M(a)).\displaystyle=\mathcal{O}\left(\sqrt{t_{0}D}+\sqrt{\angle({V},{A})}\cdot\sqrt{M(a)}\right).

\mathcal{O} hides a multiplicative constant and the terms \sqrt{\angle({V},{A})t_{0}D} and \sqrt{\angle({V},{A})M(a)TV(\widehat{P}_{a})}, which have higher powers of n_{1}^{-1} than the leading term.

B.3 Proof of Theorem 6.3 and 6.9

The proofs of Theorem 6.3 and Theorem 6.9 are provided in this section, which breaks down into three parts: the suboptimality decomposition, bounding \mathcal{E}_{1} by relating it to offline bandits for the two dataset types, and bounding \mathcal{E}_{2} together with the distribution shift in diffusion.

B.3.1 SubOpt(P^a;y=a)\texttt{SubOpt}(\widehat{P}_{a};y^{*}=a) Decomposition.

Recall notations P^a:=P^(|y^=a)\widehat{P}_{a}:=\widehat{P}(\cdot|\widehat{y}=a) (generated distribution) and Pa:=P(|y^=a)P_{a}:=P(\cdot|\widehat{y}=a) (target distribution) and f(x)=g(x)+h(x)f^{*}(x)=g^{*}(x_{\parallel})+h^{*}(x_{\perp}). 𝔼xP^a[f(x)]\mathbb{E}_{x\sim\widehat{P}_{a}}[f^{\star}(x)] can be decomposed into 3 terms:

𝔼xP^a[f(x)]\displaystyle\mathbb{E}_{x\sim\widehat{P}_{a}}[f^{\star}(x)]\geq 𝔼xPa[f(x)]|𝔼xP^a[f(x)]𝔼xPa[f(x)]|\displaystyle\mathbb{E}_{x\sim P_{a}}[f^{*}(x)]-\left|\mathbb{E}_{x\sim\widehat{P}_{a}}[f^{*}(x)]-\mathbb{E}_{x\sim P_{a}}[f^{*}(x)]\right|
\displaystyle\geq 𝔼xPa[f^(x)]𝔼xPa[|f^(x)f(x)|]|𝔼xP^a[f(x)]𝔼xPa[f(x)]|\displaystyle\mathbb{E}_{x\sim P_{a}}[\widehat{f}(x)]-\mathbb{E}_{x\sim P_{a}}\left[\left|\widehat{f}(x)-f^{*}(x)\right|\right]-\left|\mathbb{E}_{x\sim\widehat{P}_{a}}[f^{*}(x)]-\mathbb{E}_{x\sim P_{a}}[f^{*}(x)]\right|
\displaystyle\geq 𝔼xPa[f^(x)]𝔼xPa[|f^(x)g(x)|]1|𝔼xPa[g(x)]𝔼xP^a[g(x)]|2𝔼xP^a[h(x)]3,\displaystyle\mathbb{E}_{x\sim P_{a}}[\widehat{f}(x)]-\underbrace{\mathbb{E}_{x\sim P_{a}}\left[\left|\widehat{f}(x)-g^{*}(x)\right|\right]}_{\mathcal{E}_{1}}-\underbrace{\left|\mathbb{E}_{x\sim P_{a}}[g^{*}(x_{\parallel})]-\mathbb{E}_{x\sim\widehat{P}_{a}}[g^{*}(x_{\parallel})]\right|}_{\mathcal{E}_{2}}-\underbrace{\mathbb{E}_{x\sim\widehat{P}_{a}}[h^{*}(x_{\perp})]}_{\mathcal{E}_{3}},

where \mathbb{E}_{x\sim P_{a}}[\widehat{f}(x)]=a by the definition of P_{a} (letting the label noise \nu\to 0), and we use x=x_{\parallel} and f^{*}(x)=g^{*}(x) when x\sim P_{a}. Therefore

SubOpt(P^a;y=a)\displaystyle\texttt{SubOpt}(\widehat{P}_{a};y^{*}=a) =a𝔼xP^a[f(x)]\displaystyle=a-\mathbb{E}_{x\sim\widehat{P}_{a}}[f^{\star}(x)]
𝔼xPa[|(θ^θ)x|]1+|𝔼xPa[g(x)]𝔼xP^a[g(x)]|2+𝔼xP^a[h(x)]3.\displaystyle\leq\underbrace{\mathbb{E}_{x\sim P_{a}}\left[\left|(\widehat{\theta}-\theta^{*})^{\top}x\right|\right]}_{\mathcal{E}_{1}}+\underbrace{\left|\mathbb{E}_{x\sim P_{a}}[g^{*}(x_{\parallel})]-\mathbb{E}_{x\sim\widehat{P}_{a}}[g^{*}(x_{\parallel})]\right|}_{\mathcal{E}_{2}}+\underbrace{\mathbb{E}_{x\sim\widehat{P}_{a}}[h^{*}(x_{\perp})]}_{\mathcal{E}_{3}}.

\mathcal{E}_{1} comes from regression: it is the prediction/generalization error of \widehat{f} on P_{a}, which is independent of any distribution-estimation error incurred by the diffusion model. \mathcal{E}_{2} and \mathcal{E}_{3} do not involve the regression estimate \widehat{f}, so they are independent of the prediction error in \widehat{f} used for pseudo-labeling. \mathcal{E}_{2} measures the disparity between \widehat{P}_{a} and P_{a} on the subspace support, and \mathcal{E}_{3} measures the off-subspace component of the generated \widehat{P}_{a}.

B.3.2 Bounding 1\mathcal{E}_{1} under the Labeled Dataset Setup.

For every x_{i}\in\mathcal{D}_{\rm label}, y_{i}=f^{*}(x_{i})+\epsilon_{i}=g^{*}(x_{i})+\epsilon_{i}. Thus, trained on \mathcal{D}_{\rm label}, the prediction model \widehat{f} essentially approximates g^{*}. Estimating \theta^{*} with ridge regression on \mathcal{D}_{\rm label} gives \widehat{f}(x)=\widehat{\theta}^{\top}x with

θ^=(XX+λI)1X(Xθ+η),\widehat{\theta}=\left(X^{\top}X+\lambda I\right)^{-1}X^{\top}\left(X\theta^{*}+\eta\right), (B.6)

where X=(x1,,xi,,xn2)X^{\top}=(x_{1},\cdots,x_{i},\cdots,x_{n_{2}}) and η=(ϵ1,,ϵi,,ϵn2)\eta=(\epsilon_{1},\cdots,\epsilon_{i},\cdots,\epsilon_{n_{2}}).
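
For concreteness, the ridge estimate in (B.6) can be formed directly from the stacked data; a minimal numpy sketch on synthetic subspace data (the dimensions, noise level, and variable names are illustrative assumptions, not part of the analysis) is:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, n2, lam, sigma = 16, 3, 500, 1.0, 0.1

A, _ = np.linalg.qr(rng.standard_normal((D, d)))   # orthonormal columns, A^T A = I_d
theta_star = A @ rng.standard_normal(d)            # true parameter, supported on the subspace
theta_star /= np.linalg.norm(theta_star)

Z = rng.standard_normal((n2, d))                   # latent covariates z_i
X = Z @ A.T                                        # rows are x_i = A z_i
y = X @ theta_star + sigma * rng.standard_normal(n2)

# Ridge estimate matching (B.6): theta_hat = (X^T X + lam I)^{-1} X^T y
theta_hat = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
print(np.linalg.norm(theta_hat - theta_star))      # small for moderate n2
```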

Lemma B.5.

Under Assumptions 3.1 and 6.1, and given \epsilon_{i}\sim{\sf N}(0,\sigma^{2}), define V_{\lambda}:=X^{\top}X+\lambda I, \widehat{\Sigma}_{\lambda}:=\frac{1}{n_{2}}V_{\lambda}, and \Sigma_{P_{a}}:=\mathbb{E}_{x\sim P_{a}}xx^{\top} the (uncentered) covariance matrix of P_{a}. Taking \lambda=1, with high probability

1Tr(Σ^λ1ΣPa)𝒪(dlogn2)n2.\mathcal{E}_{1}\leq\sqrt{\operatorname{Tr}(\widehat{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}})}\cdot\frac{\mathcal{O}\left(\sqrt{d\log n_{2}}\right)}{\sqrt{n_{2}}}. (B.7)

Proof of Lemma B.5 is in Appendix C.3.

Lemma B.6.

Under Assumptions 3.1, 6.1 and 5.3, when \lambda=1, the shift of P_{a} from the empirical marginal distribution of x in the dataset is bounded by

Tr(Σ^λ1ΣPa)𝒪(a2βΣ+d).\operatorname{Tr}(\widehat{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}})\leq\mathcal{O}\left(\frac{a^{2}}{\|\beta^{*}\|_{\Sigma}}+d\right). (B.8)

provided n_{2}=\Omega(\max\{\frac{1}{\lambda_{\min}},\frac{d}{\|\beta^{*}\|^{2}_{\Sigma}}\}).

Proof of Lemma B.6 is in Appendix C.4.

B.3.3 Bounding 1\mathcal{E}_{1} under the Human Preference Setup

Recall that in the human preference setup, we use maximum likelihood estimation (MLE) to learn the underlying parameter, yielding \widehat{\theta}. For all x_{i}^{(j)}\in\mathcal{D}_{\rm label}, j\in\{1,2\}, we have f^{*}(x_{i}^{(j)})=g^{*}(x_{i}^{(j)}). Therefore, by Assumption 6.1, the MLE solves the following objective:

\displaystyle\widehat{\theta}=\operatorname{argmin}_{\theta\in\Theta}-\frac{1}{n}\sum_{i=1}^{n}\bigg\{u_{i}^{\top}\theta-\log\bigg(\exp(\theta^{\top}x_{i}^{(1)})+\exp(\theta^{\top}x_{i}^{(2)})\bigg)\bigg\}. (B.9)
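
The objective (B.9) is a Bradley–Terry-type negative log-likelihood and can be minimized with standard first-order methods; a minimal numpy sketch using plain gradient descent (the synthetic preference data, step size, and iteration count are illustrative assumptions) is:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 5
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)

x1 = rng.standard_normal((n, d))
x2 = rng.standard_normal((n, d))
# u_i is the preferred item; preferences follow a Bradley-Terry model under theta_star
p1 = 1.0 / (1.0 + np.exp(-(x1 - x2) @ theta_star))
pick1 = rng.random(n) < p1
u = np.where(pick1[:, None], x1, x2)

def neg_loglik(theta):
    # the objective in (B.9)
    return -np.mean(u @ theta - np.logaddexp(x1 @ theta, x2 @ theta))

def grad(theta):
    s1, s2 = x1 @ theta, x2 @ theta
    w1 = np.exp(s1 - np.logaddexp(s1, s2))[:, None]      # softmax weight on x^(1)
    return -np.mean(u - (w1 * x1 + (1.0 - w1) * x2), axis=0)

theta = np.zeros(d)
for _ in range(2000):                                     # plain gradient descent
    theta -= 0.5 * grad(theta)
print(neg_loglik(theta), np.linalg.norm(theta - theta_star))
```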

We are now ready to present our results.

Lemma B.7.

With probability at least 1δ1-\delta, we have

1O(κ2d+log1/δn+λTr(𝔼Pa[xx](Σλ~)1))\displaystyle\mathcal{E}_{1}\leq O\bigg{(}\sqrt{\kappa^{2}\cdot\frac{d+\log 1/\delta}{n}+\lambda}\cdot\sqrt{\operatorname{Tr}(\mathbb{E}_{P_{a}}[xx^{\top}](\widetilde{\Sigma_{\lambda}})^{-1})}\bigg{)}

where \kappa=\big(1+\exp(\sqrt{\log n/\delta})\big)^{2}.

Proof of Lemma B.7 is in Appendix C.5.

B.3.4 Bounding 2\mathcal{E}_{2} and the Distribution Shift in Diffusion

Lemma B.8.

Under Assumptions 3.1, 6.1 and 5.3, when t_{0}=\left((Dd^{2}+D^{2}d)/n_{1}\right)^{1/6}, it holds that

2=𝒪~(𝒯(P(x,y^=a),Pxy^;𝒮¯)λmin(Dd2+D2dn1)16a).\displaystyle\mathcal{E}_{2}=\widetilde{\mathcal{O}}\left(\sqrt{\frac{{\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}})}{\lambda_{\min}}}\cdot\left(\frac{Dd^{2}+D^{2}d}{n_{1}}\right)^{\frac{1}{6}}\cdot a\right).

Proof of Lemma B.8 can be found in Appendix C.6.

Note that {\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}}) depends on a and measures the distribution shift between the desired distribution P(x,\widehat{y}=a) and the data distribution P_{x\widehat{y}}. To understand this dependence on a, in what follows we abbreviate {\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}}) as \texttt{DistroShift}^{2}(a) and derive an upper bound on it in one special case of the problem.

Distribution Shift

In the special case where the covariance \Sigma of z is known and \|A-V\|_{2}^{2}=\mathcal{O}\left(\|AA^{\top}-VV^{\top}\|_{\rm F}^{2}\right), we showcase a bound on the distribution shift in \mathcal{E}_{2}, as promised in the discussion following Theorem 6.3. We have

DistroShift2(a)=𝔼Px,y^=a[(x,y;s^)]𝔼Pxy^[(x,y;s^)],\displaystyle\texttt{DistroShift}^{2}(a)=\frac{\mathbb{E}_{P_{x,\widehat{y}=a}}[\ell(x,y;\widehat{s})]}{\mathbb{E}_{P_{x\widehat{y}}}[\ell(x,y;\widehat{s})]},

where (x,y;s^)=1Tt0t0T𝔼x|xxlogϕt(x|x)s^(x,y,t)22dt\ell(x,y;\widehat{s})=\frac{1}{T-t_{0}}\int_{t_{0}}^{T}\mathbb{E}_{x^{\prime}|x}\|\nabla_{x^{\prime}}\log\phi_{t}(x^{\prime}|x)-\widehat{s}(x^{\prime},y,t)\|_{2}^{2}\mathrm{d}t. By Proposition 4.1, it suffices to bound

DistroShift2(a)=𝔼Px,y^=a[t0Tlogpt(x,y)s^(x,y,t)22dt]𝔼Pxy^[t0Tlogpt(x,y)s^(x,y,t)22dt].\displaystyle\texttt{DistroShift}^{2}(a)=\frac{\mathbb{E}_{P_{x,\widehat{y}=a}}[\int_{t_{0}}^{T}\|\nabla\log p_{t}(x,y)-\widehat{s}(x,y,t)\|_{2}^{2}\mathrm{d}t]}{\mathbb{E}_{P_{x\widehat{y}}}[\int_{t_{0}}^{T}\|\nabla\log p_{t}(x,y)-\widehat{s}(x,y,t)\|_{2}^{2}\mathrm{d}t]}.

We expand the difference logpt(x,y)s^(x,y,t)22\|\nabla\log p_{t}(x,y)-\widehat{s}(x,y,t)\|_{2}^{2} by

logpt(x,y)s^(x,y,t)22\displaystyle\|\nabla\log p_{t}(x,y)-\widehat{s}(x,y,t)\|_{2}^{2} 2h2(t)[(AV)Bt(Ax+ν2yθ)22+VBt(AV)x22]\displaystyle\leq\frac{2}{h^{2}(t)}\Big{[}\|(A-V)B_{t}(A^{\top}x+\nu^{-2}y\theta)\|_{2}^{2}+\|VB_{t}(A-V)^{\top}x\|_{2}^{2}\Big{]}
2h2(t)[AV22Bt(Ax+ν2yθ)22+AV22x22]\displaystyle\leq\frac{2}{h^{2}(t)}\Big{[}\|A-V\|_{2}^{2}\|B_{t}(A^{\top}x+\nu^{-2}y\theta)\|_{2}^{2}+\|A-V\|_{2}^{2}\|x\|_{2}^{2}\Big{]}
2h2(t)AV22(3x22+y2),\displaystyle\leq\frac{2}{h^{2}(t)}\|A-V\|_{2}^{2}(3\|x\|_{2}^{2}+y^{2}),

where we recall B_{t} is defined in (E) and, in the last inequality, we use (a+b)^{2}\leq 2a^{2}+2b^{2}. In the case where the covariance matrix \Sigma is known, i.e., B_{t} is known, we also assume that the matrix V directly matches A without rotation. Then by Chen et al. [2023a, Lemmas 3 and 17], we have \|A-V\|_{2}^{2}=\mathcal{O}\left(\|AA^{\top}-VV^{\top}\|_{\rm F}^{2}\right)=\mathcal{O}\left(t_{0}/c_{0}\cdot\mathbb{E}_{P_{x\widehat{y}}}[\ell(x,y;\widehat{s})]\right). To this end, we only need to find \mathbb{E}_{P_{x|\widehat{y}=a}}[\|x\|_{2}^{2}]. Since we consider on-support x, which can be represented as x=Az, we have \|x\|_{2}=\|z\|_{2}. Thus, we only need the conditional distribution of z given \widehat{y}=a. Fortunately, (z,\widehat{y}) is jointly Gaussian, with mean 0 and covariance

[ΣΣβ^β^Σβ^Σβ^+ν2].\displaystyle\begin{bmatrix}\Sigma&\Sigma\widehat{\beta}\\ \widehat{\beta}^{\top}\Sigma&\widehat{\beta}^{\top}\Sigma\widehat{\beta}+\nu^{2}\end{bmatrix}.

Consequently, the conditional distribution of z|y^=az|\widehat{y}=a is still Gaussian, with mean Σβ^a/(β^Σβ^+ν2)\Sigma\widehat{\beta}a/(\widehat{\beta}^{\top}\Sigma\widehat{\beta}+\nu^{2}) and covariance ΣΣβ^β^Σ/(β^Σβ^+ν2)\Sigma-\Sigma\widehat{\beta}\widehat{\beta}^{\top}\Sigma/(\widehat{\beta}^{\top}\Sigma\widehat{\beta}+\nu^{2}). Hence, we have

𝔼Pz|y^=a[z22]=1(β^Σβ^+ν2)2((a2β^Σβ^ν2)β^Σ2β^)+Tr(Σ)=𝒪(a2d).\displaystyle\mathbb{E}_{P_{z|\widehat{y}=a}}[\|z\|_{2}^{2}]=\frac{1}{(\widehat{\beta}^{\top}\Sigma\widehat{\beta}+\nu^{2})^{2}}\left((a^{2}-\widehat{\beta}^{\top}\Sigma\widehat{\beta}-\nu^{2})\widehat{\beta}^{\top}\Sigma^{2}\widehat{\beta}\right)+{\rm Tr}(\Sigma)=\mathcal{O}\left(a^{2}\vee d\right).

We integrate over t in the numerator of \texttt{DistroShift}(a) to obtain \mathbb{E}_{P_{x|\widehat{y}=a}}[\int_{t_{0}}^{T}\|\nabla\log p_{t}(x,y)-\widehat{s}(x,y,t)\|_{2}^{2}\mathrm{d}t]=\mathcal{O}\left((a^{2}\vee d)\frac{1}{c_{0}}\mathbb{E}_{P_{x\widehat{y}}}[\ell(x,y;\widehat{s})]\right). Noting the cancellation between the numerator and the denominator, we conclude

DistroShift(a)=𝒪(1c0(ad)).\displaystyle\texttt{DistroShift}(a)=\mathcal{O}\left(\frac{1}{c_{0}}(a\vee\sqrt{d})\right).

Since d is a natural upper bound on \sqrt{d} and viewing c_{0} as a constant, we have \texttt{DistroShift}(a)=\mathcal{O}\left(a\vee d\right) as desired.

Appendix C Supporting Lemmas and Proofs

C.1 Supporting Lemmas

Lemma C.1.

The estimated subspace VV satisfies

VUAF=𝒪(d32(V,A))\|VU-A\|_{F}=\mathcal{O}\left(d^{\frac{3}{2}}\sqrt{\angle({V},{A})}\right) (C.1)

for some orthogonal matrix Ud×dU\in\mathbb{R}^{d\times d}.

Proof of Lemma C.1 is in Appendix C.7.

Lemma C.2.

Suppose P_{1} and P_{2} are two distributions over \mathbb{R}^{d} and m is a function defined on \mathbb{R}^{d}; then \left|\mathbb{E}_{z\sim P_{1}}[m(z)]-\mathbb{E}_{z\sim P_{2}}[m(z)]\right| can be bounded in terms of \texttt{d}_{\rm TV}(P_{1},P_{2}). Specifically, when P_{1} and P_{2} are Gaussians and m(z)=\|z\|^{2}_{2}:

\mathbb{E}_{z\sim P_{1}}[\|z\|^{2}_{2}]=\mathcal{O}\left(\mathbb{E}_{z\sim P_{2}}[\|z\|^{2}_{2}](1+\texttt{d}_{\rm TV}(P_{1},P_{2}))\right). (C.2)

When P1P_{1} and P2P_{2} are Gaussians and m(z)=z2m(z)=\|z\|_{2}:

|𝔼zP1[z2]𝔼zP2[z2]|=𝒪((𝔼zP1[z22]+𝔼zP2[z22])dTV(P1,P2)).\left|\mathbb{E}_{z\sim P_{1}}[\|z\|_{2}]-\mathbb{E}_{z\sim P_{2}}[\|z\|_{2}]\right|=\mathcal{O}\left(\left(\sqrt{\mathbb{E}_{z\sim P_{1}}[\|z\|_{2}^{2}]}+\sqrt{\mathbb{E}_{z\sim P_{2}}[\|z\|_{2}^{2}]}\right)\cdot\texttt{d}_{\rm TV}(P_{1},P_{2})\right). (C.3)

Proof of Lemma C.2 is in Appendix C.8.

Lemma C.3.

We compute 𝔼zPLD(a)[z22],𝔼zPt0LD(a)[z22],𝔼xP^a[x22],𝔼z(UV)#P^a[z22]\mathbb{E}_{z\sim P^{LD}(a)}[\left\|z\right\|_{2}^{2}],\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}[\|z\|_{2}^{2}],\mathbb{E}_{x\sim\widehat{P}_{a}}[\|x\|_{2}^{2}],\mathbb{E}_{z\sim(U^{\top}V^{\top})_{\#}\widehat{P}_{a}}[\|z\|_{2}^{2}] in this Lemma.

𝔼zPLD(a)[z22]=β^Σ2β^(β^Σ2+ν2)2a2+trace(ΣΣβ^(β^Σβ^+ν2)1β^Σ).\mathbb{E}_{z\sim P^{LD}(a)}[\left\|z\right\|_{2}^{2}]=\frac{\widehat{\beta}^{\top}\Sigma^{2}\widehat{\beta}}{\left(\|\widehat{\beta}\|_{\Sigma}^{2}+\nu^{2}\right)^{2}}a^{2}+\operatorname{trace}(\Sigma-\Sigma\widehat{\beta}\left(\widehat{\beta}^{\top}\Sigma\widehat{\beta}+\nu^{2}\right)^{-1}\widehat{\beta}^{\top}\Sigma). (C.4)

Let M(a):=𝔼zPLD(a)[z22]M(a):=\mathbb{E}_{z\sim P^{LD}(a)}[\left\|z\right\|^{2}_{2}], which has an upper bound M(a)=O(a2βΣ+d)M(a)=O\left(\frac{a^{2}}{\|{\beta}^{*}\|_{\Sigma}}+d\right).

𝔼zPt0LD(a)[z22]M(a)+t0d.\displaystyle\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}[\|z\|^{2}_{2}]\leq M(a)+t_{0}d. (C.5)
𝔼xP^a[x22]𝒪(ct0D+M(a)(1+TV(P^a)).\displaystyle\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|x\|^{2}_{2}\right]\leq\mathcal{O}\left(ct_{0}D+M(a)\cdot(1+TV(\widehat{P}_{a})\right). (C.6)

Proof of Lemma C.3 is in Appendix C.9.

C.2 Proof of Lemma B.4

The first two assertions (B.2) and (B.3) are consequences of Chen et al. [2023a, Theorem 3, items 1 and 3]. To show (B.4), we first note that the conditional score matching error under distribution shift is

𝒯(P(x,y^=a),Pxy^;𝒮¯)ϵdiff2,\displaystyle{\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}})\cdot\epsilon_{diff}^{2},

where 𝒯(P(x,y^=a),Pxy^;𝒮¯){\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}}) accounts for the distribution shift as in the parametric case (Lemma B.8). Then we apply Chen et al. [2023a, Theorem 3, item 2] to conclude

TV(P^a)=𝒪~(𝒯(P(x,y^=a),Pxy^;𝒮¯)c0ϵdiff).\displaystyle TV(\widehat{P}_{a})=\widetilde{\mathcal{O}}\left(\sqrt{\frac{{\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}})}{c_{0}}}\cdot\epsilon_{diff}\right).

The proof is complete.

C.3 Proof of Lemma B.5

Given

\displaystyle\mathcal{E}_{1}=\mathbb{E}_{P_{a}}\left|x^{\top}(\theta^{*}-\widehat{\theta})\right|\leq\mathbb{E}_{P_{a}}\|x\|_{V_{\lambda}^{-1}}\cdot\|\theta^{*}-\widehat{\theta}\|_{V_{\lambda}},

it suffices to prove

\displaystyle\mathbb{E}_{P_{a}}\|x\|_{V_{\lambda}^{-1}}\leq\sqrt{\operatorname{trace}(V_{\lambda}^{-1}\Sigma_{P_{a}})}; (C.7)
\displaystyle\|\theta^{*}-\widehat{\theta}\|_{V_{\lambda}}\leq\mathcal{O}\left(\sqrt{d\log n_{2}}\right), (C.8)

where the second inequality is to be proven with high probability w.r.t. the randomness in \mathcal{D}_{label}. For (C.7), \mathbb{E}_{P_{a}}\|x\|_{V_{\lambda}^{-1}}\leq\sqrt{\mathbb{E}_{P_{a}}x^{\top}V_{\lambda}^{-1}x}=\sqrt{\mathbb{E}_{P_{a}}\operatorname{trace}(V_{\lambda}^{-1}xx^{\top})}=\sqrt{\operatorname{trace}(V_{\lambda}^{-1}\mathbb{E}_{P_{a}}xx^{\top})}.

For (C.8), what is new to prove compared to a classic bandit derivation is the d dependence instead of D, due to the linear subspace structure in x. From the closed-form solution of \widehat{\theta}, we have

θ^θ=Vλ1XηλVλ1θ.\widehat{\theta}-\theta^{*}=V_{\lambda}^{-1}X^{\top}\eta-\lambda V_{\lambda}^{-1}\theta^{*}. (C.9)

Therefore,

θθ^VλXηVλ1+λθVλ1,\|\theta^{*}-\widehat{\theta}\|_{V_{\lambda}}\leq\|X^{\top}\eta\|_{V_{\lambda}^{-1}}+\lambda\|\theta^{*}\|_{V_{\lambda}^{-1}}, (C.10)

where λθVλ1λθ2λ\lambda\|\theta^{*}\|_{V_{\lambda}^{-1}}\leq\sqrt{\lambda}\|\theta^{*}\|_{2}\leq\sqrt{\lambda} and

XηVλ12\displaystyle\|X^{\top}\eta\|_{V_{\lambda}^{-1}}^{2} =ηX(XX+λID)1Xη\displaystyle=\eta^{\top}X\left(X^{\top}X+\lambda I_{D}\right)^{-1}X^{\top}\eta
=ηXX(XX+λIn2)1η.\displaystyle=\eta^{\top}XX^{\top}\left(XX^{\top}+\lambda I_{n_{2}}\right)^{-1}\eta.

Let Z^{\top}=(z_{1},\cdots,z_{i},\cdots,z_{n_{2}}) be such that Az_{i}=x_{i}; then it holds that X=ZA^{\top} and XX^{\top}=ZA^{\top}AZ^{\top}=ZZ^{\top}, thus

XηVλ12\displaystyle\|X^{\top}\eta\|_{V_{\lambda}^{-1}}^{2} =ηXX(XX+λIn2)1η\displaystyle=\eta^{\top}XX^{\top}\left(XX^{\top}+\lambda I_{n_{2}}\right)^{-1}\eta
=ηZZ(ZZ+λIn2)1η\displaystyle=\eta^{\top}ZZ^{\top}\left(ZZ^{\top}+\lambda I_{n_{2}}\right)^{-1}\eta
=ηZ(ZZ+λId)1Zη\displaystyle=\eta^{\top}Z\left(Z^{\top}Z+\lambda I_{d}\right)^{-1}Z^{\top}\eta
=Zη(ZZ+λId)1.\displaystyle=\|Z^{\top}\eta\|_{\left(Z^{\top}Z+\lambda I_{d}\right)^{-1}}.
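
The dimension reduction above only uses the push-through identity X\left(X^{\top}X+\lambda I_{D}\right)^{-1}X^{\top}=Z\left(Z^{\top}Z+\lambda I_{d}\right)^{-1}Z^{\top} for X=ZA^{\top} with A^{\top}A=I_{d}; a quick numerical sanity check (the sizes below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
D, d, n2, lam = 12, 3, 40, 1.0

A, _ = np.linalg.qr(rng.standard_normal((D, d)))   # A^T A = I_d
Z = rng.standard_normal((n2, d))
X = Z @ A.T                                        # x_i = A z_i
eta = rng.standard_normal(n2)

# ||X^T eta||^2 in the V_lambda^{-1} norm vs ||Z^T eta||^2 in the (Z^T Z + lam I_d)^{-1} norm
lhs = (X.T @ eta) @ np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ eta)
rhs = (Z.T @ eta) @ np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ eta)
print(lhs, rhs)                                    # agree up to floating-point error
```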

With probability 1δ1-\delta, zi2d+dlog(2n2δ):=L2,i[n2]\|z_{i}\|^{2}\leq d+\sqrt{d\log\left(\frac{2n_{2}}{\delta}\right)}:=L^{2},\forall i\in[n_{2}]. Then applying Abbasi-Yadkori et al. [2011, Theorem 1] gives rise to

Zη(ZZ+λId)12log(2/δ)+dlog(1+n2L2/(λd))\|Z^{\top}\eta\|_{\left(Z^{\top}Z+\lambda I_{d}\right)^{-1}}\leq\sqrt{2\log(2/\delta)+d\log(1+n_{2}L^{2}/(\lambda d))}

with probability 1-\delta/2. Combining these bounds and plugging in \lambda=1 and L^{2}=d+\sqrt{d\log\left(\frac{2n_{2}}{\delta}\right)}, we have with high probability

θθ^Vλ=𝒪(dlog(n2log(n2)))=𝒪(dlogn2+12dlog(logn2))=𝒪(dlogn2).\|\theta^{*}-\widehat{\theta}\|_{V_{\lambda}}=\mathcal{O}\left(\sqrt{d\log\left(n_{2}\sqrt{\log(n_{2})}\right)}\right)=\mathcal{O}\left(\sqrt{d\log n_{2}+\frac{1}{2}d\log(\log n_{2})}\right)=\mathcal{O}\left(\sqrt{d\log n_{2}}\right).

C.4 Proof of Lemma B.6

Recall the definition of Σ^λ\widehat{\Sigma}_{\lambda} and ΣPa\Sigma_{P_{a}} that

Σ^λ\displaystyle\widehat{\Sigma}_{\lambda} =1n2XX+λn2ID,\displaystyle=\frac{1}{n_{2}}X^{\top}X+\frac{\lambda}{n_{2}}I_{D},
ΣPa\displaystyle\Sigma_{P_{a}} =𝔼xPa[xx],\displaystyle=\mathbb{E}_{x\sim P_{a}}\left[xx^{\top}\right],

where X is the stacked data matrix supported on \mathcal{A}, P_{a} is also supported on \mathcal{A}, and \mathcal{A} is the subspace encoded by the matrix A. The following lemma shows that it is equivalent to measure \operatorname{trace}(\widehat{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}}) on the subspace \mathcal{A}.

Lemma C.4.

For any P.S.D. matrices Σ1,Σ2d×d\Sigma_{1},\Sigma_{2}\in\mathbb{R}^{d\times d} and AD×dA\in\mathbb{R}^{D\times d} such that AA=IdA^{\top}A=I_{d}, we have

Tr((λID+AΣ1A)1AΣ2A)=Tr((λId+Σ1)1Σ2).\displaystyle\mathrm{Tr}\left((\lambda I_{D}+A\Sigma_{1}A^{\top})^{-1}A\Sigma_{2}A^{\top}\right)=\mathrm{Tr}\left(\left(\lambda I_{d}+\Sigma_{1}\right)^{-1}\Sigma_{2}\right).
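
Lemma C.4 is a deterministic linear-algebra identity, so it can be checked directly; a minimal numpy sketch with random PSD \Sigma_{1},\Sigma_{2} and a random orthonormal A (all sizes here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
D, d, lam = 10, 4, 0.7

A, _ = np.linalg.qr(rng.standard_normal((D, d)))        # A^T A = I_d
B1 = rng.standard_normal((d, d)); S1 = B1 @ B1.T        # PSD Sigma_1
B2 = rng.standard_normal((d, d)); S2 = B2 @ B2.T        # PSD Sigma_2

lhs = np.trace(np.linalg.solve(lam * np.eye(D) + A @ S1 @ A.T, A @ S2 @ A.T))
rhs = np.trace(np.linalg.solve(lam * np.eye(d) + S1, S2))
print(lhs, rhs)                                         # equal up to floating-point error
```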

The lemma above allows us to abuse notations Σ^λ\widehat{\Sigma}_{\lambda} and ΣPa\Sigma_{P_{a}} in the following way while keeping the same trace(Σ^λ1ΣPa)\operatorname{trace}(\widehat{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}}) value:

Σ^λ\displaystyle\widehat{\Sigma}_{\lambda} =1n2ZZ+λn2Id,\displaystyle=\frac{1}{n_{2}}Z^{\top}Z+\frac{\lambda}{n_{2}}I_{d},
ΣPa\displaystyle\Sigma_{P_{a}} =𝔼zPLD(a)[zz],\displaystyle=\mathbb{E}_{z\sim P^{LD}(a)}\left[zz^{\top}\right],

where Z^{\top}=(z_{1},\cdots,z_{i},\cdots,z_{n_{2}}) is such that Az_{i}=x_{i}, and recall the notation P^{LD}(a)=P_{z}\left(z\mid\widehat{f}(Az)=a\right).

Given z\sim{\sf N}(\mu,\Sigma), as a proof artifact, let \widehat{f}(x)=\widehat{\theta}^{\top}x+\xi with \xi\sim{\sf N}(0,\nu^{2}), where we let \nu\to 0 in the end. Letting \widehat{\beta}=A^{\top}\widehat{\theta}\in\mathbb{R}^{d}, the pair (z,\widehat{f}(Az)) has the joint distribution

(z,f^)𝖭([μβ^μ],[ΣΣβ^β^Σβ^Σβ^+ν2]).(z,\widehat{f})\sim{\sf N}\left(\begin{bmatrix}\mu\\ \widehat{\beta}^{\top}\mu\end{bmatrix},\begin{bmatrix}\Sigma&\Sigma\widehat{\beta}\\ \widehat{\beta}^{\top}\Sigma&\widehat{\beta}^{\top}\Sigma\widehat{\beta}+\nu^{2}\end{bmatrix}\right). (C.11)

Then we have the conditional distribution zf^(Az)=az\mid\widehat{f}(Az)=a following

Pz(zf^(Az)=a)=𝖭(μ+Σβ^(β^Σβ^+ν2)1(aβ^μ),Γ)P_{z}\left(z\mid\widehat{f}(Az)=a\right)={\sf N}\left(\mu+\Sigma\widehat{\beta}\left(\widehat{\beta}^{\top}\Sigma\widehat{\beta}+\nu^{2}\right)^{-1}(a-\widehat{\beta}^{\top}\mu),\Gamma\right) (C.12)

with Γ:=ΣΣβ^(β^Σβ^+ν2)1β^Σ\Gamma:=\Sigma-\Sigma\widehat{\beta}\left(\widehat{\beta}^{\top}\Sigma\widehat{\beta}+\nu^{2}\right)^{-1}\widehat{\beta}^{\top}\Sigma.

When μ=0\mu=0, we compute trace(Σ^λ1ΣPa)\operatorname{trace}(\widehat{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}}) as

trace(Σ^λ1ΣPa)\displaystyle\operatorname{trace}(\widehat{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}}) =trace(Σ^λ1Σβ^β^Σ(β^Σ2+ν2)2a2)+trace(Σ^λ1Γ)\displaystyle=\operatorname{trace}\left(\widehat{\Sigma}_{\lambda}^{-1}\frac{\Sigma\widehat{\beta}\widehat{\beta}^{\top}\Sigma}{\left(\|\widehat{\beta}\|_{\Sigma}^{2}+\nu^{2}\right)^{2}}a^{2}\right)+\operatorname{trace}\left(\widehat{\Sigma}_{\lambda}^{-1}\Gamma\right)
=trace(β^ΣΣ^λ1Σβ^(β^Σ2+ν2)2a2)+trace(Σ^λ1Σ)trace(Σ^λ1Σβ^β^Σβ^Σ2+ν2)\displaystyle=\operatorname{trace}\left(\frac{\widehat{\beta}^{\top}\Sigma\widehat{\Sigma}_{\lambda}^{-1}\Sigma\widehat{\beta}}{\left(\|\widehat{\beta}\|_{\Sigma}^{2}+\nu^{2}\right)^{2}}a^{2}\right)+\operatorname{trace}\left(\widehat{\Sigma}_{\lambda}^{-1}\Sigma\right)-\operatorname{trace}\left(\widehat{\Sigma}_{\lambda}^{-1}\frac{\Sigma\widehat{\beta}\widehat{\beta}^{\top}\Sigma}{\|\widehat{\beta}\|_{\Sigma}^{2}+\nu^{2}}\right)
\displaystyle\leq\operatorname{trace}\left(\frac{\Sigma^{1/2}\widehat{\beta}\widehat{\beta}^{\top}\Sigma^{1/2}\Sigma^{1/2}\widehat{\Sigma}_{\lambda}^{-1}\Sigma^{1/2}}{\left(\|\widehat{\beta}\|_{\Sigma}^{2}+\nu^{2}\right)^{2}}a^{2}\right)+\operatorname{trace}\left(\widehat{\Sigma}_{\lambda}^{-1}\Sigma\right)
ΣΣ^λ1Σopβ^Σ2(β^Σ2+ν2)2a2+trace(Σ12Σ^λ1Σ12)\displaystyle\leq\frac{\|\Sigma\widehat{\Sigma}_{\lambda}^{-1}\Sigma\|_{op}\cdot\|\widehat{\beta}\|_{\Sigma}^{2}}{\left(\|\widehat{\beta}\|_{\Sigma}^{2}+\nu^{2}\right)^{2}}\cdot a^{2}+\operatorname{trace}\left(\Sigma^{\frac{1}{2}}\widehat{\Sigma}_{\lambda}^{-1}\Sigma^{\frac{1}{2}}\right)

By Lemma 3 in Chen et al. [2020], it holds that

Σ12Σ^λ1Σ12Id2O(1λminn2).\|\Sigma^{\frac{1}{2}}\widehat{\Sigma}_{\lambda}^{-1}\Sigma^{\frac{1}{2}}-I_{d}\|_{2}\leq O\left(\frac{1}{\sqrt{\lambda_{\min}n_{2}}}\right). (C.13)

Therefore,

trace(Σ^λ1ΣPa)\displaystyle\operatorname{trace}(\widehat{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}}) 1+1λminn2β^Σ2a2+O(d(1+1λminn2)).\displaystyle\leq\frac{1+\frac{1}{\sqrt{\lambda_{\min}n_{2}}}}{\|\widehat{\beta}\|_{\Sigma}^{2}}\cdot a^{2}+O\left(d\left(1+\frac{1}{\sqrt{\lambda_{\min}n_{2}}}\right)\right).

Then, what is left is to lower bound \|\widehat{\beta}\|_{\Sigma}=\|\widehat{\theta}\|_{A\Sigma A^{\top}}\geq\|\theta^{*}\|_{A\Sigma A^{\top}}-\|\widehat{\theta}-\theta^{*}\|_{A\Sigma A^{\top}}, by the triangle inequality. On one hand,

θAΣA=βΣ.\|\theta^{*}\|_{A\Sigma A^{\top}}=\|\beta^{*}\|_{\Sigma}. (C.14)

On the other hand,

θ^θAΣA\displaystyle\|\widehat{\theta}-\theta^{*}\|_{A\Sigma A^{\top}} =𝒪(θ^θΣ^λ)\displaystyle=\mathcal{O}\left(\|\widehat{\theta}-\theta^{*}\|_{\widehat{\Sigma}_{\lambda}}\right)
=𝒪(θ^θVλn2)\displaystyle=\mathcal{O}\left(\frac{\|\widehat{\theta}-\theta^{*}\|_{V_{\lambda}}}{\sqrt{n_{2}}}\right)
=𝒪(dlog(n2)n2).\displaystyle=\mathcal{O}\left(\sqrt{\frac{d\log(n_{2})}{n_{2}}}\right).

with high probability. Thus when n2=Ω(dβΣ2)n_{2}=\Omega(\frac{d}{\|\beta^{*}\|^{2}_{\Sigma}})

β^Σ12βΣ.\|\widehat{\beta}\|_{\Sigma}\geq\frac{1}{2}\|\beta^{*}\|_{\Sigma}.

Therefore

trace(Σ^λ1ΣPa)𝒪(1+1λminn2βΣa2+d(1+1λminn2))=𝒪(a2βΣ+d).\operatorname{trace}(\widehat{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}})\leq\mathcal{O}\left(\frac{1+\frac{1}{\sqrt{\lambda_{\min}n_{2}}}}{\|\beta^{*}\|_{\Sigma}}\cdot a^{2}+d\left(1+\frac{1}{\sqrt{\lambda_{\min}n_{2}}}\right)\right)\\ =\mathcal{O}\left(\frac{a^{2}}{\|\beta^{*}\|_{\Sigma}}+d\right). (C.15)

when n2=Ω(max{1λmin,dβΣ2})n_{2}=\Omega(\max\{\frac{1}{\lambda_{\min}},\frac{d}{\|\beta^{*}\|^{2}_{\Sigma}}\}).

C.5 Proof of Lemma B.7

Denote the loss function in Equation (B.9) by l(θ)l(\theta). By simple algebra we have

\displaystyle\nabla l(\theta)=\frac{1}{n_{2}}\sum_{i=1}^{n_{2}}\bigg\{\frac{\exp(\theta^{\top}x_{i}^{(1)})\cdot(x_{i}^{(1)}-u_{i})+\exp(\theta^{\top}x_{i}^{(2)})\cdot(x_{i}^{(2)}-u_{i})}{\exp(\theta^{\top}x_{i}^{(1)})+\exp(\theta^{\top}x_{i}^{(2)})}\bigg\},
2l(θ)=1n2i=1n2{exp((xi(1)+xi(2))θ)(exp(θxi(1))+exp(θxi(2)))2(xi(1)xi(2))(xi(1)xi(2))},\displaystyle\nabla^{2}l(\theta)=\frac{1}{n_{2}}\sum_{i=1}^{n_{2}}\bigg{\{}\frac{\exp((x_{i}^{(1)}+x_{i}^{(2)})^{\top}\theta)}{(\exp(\theta^{\top}x_{i}^{(1)})+\exp(\theta^{\top}x_{i}^{(2)}))^{2}}\cdot(x_{i}^{(1)}-x_{i}^{(2)})(x_{i}^{(1)}-x_{i}^{(2)})^{\top}\bigg{\}},

holds for all θd\theta\in\mathbb{R}^{d}. By Assumption 5.3 and Hoeffding’s inequality, we have

\mathbb{P}\big(\max_{i\in[n_{2}]}\big\{|\theta^{*\top}x_{i}^{(1)}|,|\theta^{*\top}x_{i}^{(2)}|\big\}\leq\sqrt{\log(2n_{2}/\delta)}\big)\geq 1-\delta.

In the following proof, we will condition on this event. We have

2l(θ)\displaystyle\nabla^{2}l(\theta^{*}) =1n2i=1n2(11+exp((xi(1)xi(2))θ)11+exp((xi(2)xi(1))θ))(xi(1)xi(2))(xi(1)xi(2))\displaystyle=\frac{1}{n_{2}}\sum_{i=1}^{n_{2}}\bigg{(}\frac{1}{1+\exp((x_{i}^{(1)}-x_{i}^{(2)})^{\top}\theta^{*})}\cdot\frac{1}{1+\exp((x_{i}^{(2)}-x_{i}^{(1)})^{\top}\theta^{*})}\bigg{)}\cdot(x_{i}^{(1)}-x_{i}^{(2)})(x_{i}^{(1)}-x_{i}^{(2)})^{\top}
1n21(1+exp(log2n2/δ))2i=1n(xi(1)xi(2))(xi(1)xi(2)).\displaystyle\succeq\frac{1}{n_{2}}\cdot\frac{1}{(1+\exp(\sqrt{\log 2n_{2}/\delta}))^{2}}\sum_{i=1}^{n}(x_{i}^{(1)}-x_{i}^{(2)})(x_{i}^{(1)}-x_{i}^{(2)})^{\top}.

In the following we adopt the notation \Sigma_{x}=\frac{1}{n}\sum_{i=1}^{n}(x_{i}^{(1)}-x_{i}^{(2)})(x_{i}^{(1)}-x_{i}^{(2)})^{\top}; then \nabla^{2}l(\theta^{*})\succeq 1/\kappa\cdot\Sigma_{x}. To bound \mathbb{E}_{x\sim P_{a}}[|\widehat{f}(x)-f^{*}(x)|], we have the following inequality:

𝔼xPa[|f^(x)f(x)|]\displaystyle\mathbb{E}_{x\sim P_{a}}[|\widehat{f}(x)-f^{*}(x)|] =𝔼xPa[|x(θ^θ)|]\displaystyle=\mathbb{E}_{x\sim P_{a}}[|x^{\top}(\widehat{\theta}-\theta^{*})|]
=𝔼z(A)Pa[|z(Aθ^β)|]\displaystyle=\mathbb{E}_{z\sim(A^{\top})_{\sharp}P_{a}}[|z^{\top}(A^{\top}\widehat{\theta}-\beta)|]
𝔼z(A)Pa[z(Σz+λId)1](i)Δz(Σz+λId)1(ii)\displaystyle\leq\underbrace{\mathbb{E}_{z\sim(A^{\top})_{\sharp}P_{a}}[\|z\|_{(\Sigma_{z}+\lambda I_{d})^{-1}}]}_{\operatorname{(i)}}\cdot\underbrace{\|\Delta_{z}\|_{(\Sigma_{z}+\lambda I_{d})^{-1}}}_{\operatorname{(ii)}}

Here Δz=Aθ^β\Delta_{z}=A^{\top}\widehat{\theta}-\beta. We also denote Δx=θ^θ\Delta_{x}=\widehat{\theta}-\theta^{*}. To bound (i) note that by Jensen’s inequality, we have

\mathbb{E}_{z\sim(A^{\top})_{\sharp}P_{a}}[\|z\|_{(\Sigma_{z}+\lambda I_{d})^{-1}}]\leq\sqrt{\mathbb{E}_{z\sim(A^{\top})_{\sharp}P_{a}}\big[\|z\|^{2}_{(\Sigma_{z}+\lambda I_{d})^{-1}}\big]}=\sqrt{\operatorname{Tr}\big(\mathbb{E}_{z\sim(A^{\top})_{\sharp}P_{a}}[zz^{\top}](\Sigma_{z}+\lambda I_{d})^{-1}\big)}.

Next we bound (ii). It is easy to prove that

l(θ^)l(θ)l(θ)(θ^θ)12κΔxΣx2.l(\widehat{\theta})-l(\theta^{*})-\nabla l(\theta^{*})^{\top}(\widehat{\theta}-\theta^{*})\geq\frac{1}{2\kappa}\|\Delta_{x}\|_{\Sigma_{x}}^{2}.

By the definition of the MLE, we have l(\widehat{\theta})-l(\theta^{*})\leq 0. Therefore

12κΔxΣx2\displaystyle\frac{1}{2\kappa}\|\Delta_{x}\|_{\Sigma_{x}}^{2} |l(θ)(θ^θ)|\displaystyle\leq|\nabla l(\theta^{*})^{\top}(\widehat{\theta}-\theta^{*})|
|l(θ)(AAθ^θ)|+|l(θ)(IAA)(θ^θ)|.\displaystyle\leq|\nabla l(\theta^{*})^{\top}(AA^{\top}\widehat{\theta}-\theta^{*})|+|\nabla l(\theta^{*})^{\top}(I-AA^{\top})(\widehat{\theta}-\theta^{*})|. (C.16)

Recall that

\nabla l(\theta^{*})=\frac{1}{n_{2}}\sum_{i=1}^{n_{2}}\frac{\exp(\theta^{*\top}x_{i}^{(1)})\cdot(x_{i}^{(1)}-u_{i})+\exp(\theta^{*\top}x_{i}^{(2)})\cdot(x_{i}^{(2)}-u_{i})}{\exp(\theta^{*\top}x_{i}^{(1)})+\exp(\theta^{*\top}x_{i}^{(2)})},

since xi(1),xi(2)x_{i}^{(1)},x_{i}^{(2)} are in the image set of AA, there exists αd\alpha\in\mathbb{R}^{d} such that l(θ)=Aα\nabla l(\theta^{*})=A\alpha, and we have

l(θ)(IAA)=αA(IAA)=0.\nabla l(\theta^{*})^{\top}(I-AA^{\top})=\alpha^{\top}A^{\top}(I-AA^{\top})=0.

Therefore we have

12κΔxΣx2\displaystyle\frac{1}{2\kappa}\|\Delta_{x}\|_{\Sigma_{x}}^{2} |l(θ)(AAθ^θ)|\displaystyle\leq|\nabla l(\theta^{*})^{\top}(AA^{\top}\widehat{\theta}-\theta^{*})|
=|(Al(θ))Δz|\displaystyle=|(A^{\top}\nabla l(\theta^{*}))^{\top}\Delta_{z}|
Al(θ)(Σz+λId)1ΔzΣz+λId\displaystyle\leq\|A^{\top}\nabla l(\theta^{*})\|_{(\Sigma_{z}+\lambda I_{d})^{-1}}\cdot\|\Delta_{z}\|_{\Sigma_{z}+\lambda I_{d}} (C.17)

By AA(xi(1)xi(2))=(xi(1)xi(2))AA^{\top}(x_{i}^{(1)}-x_{i}^{(2)})=(x_{i}^{(1)}-x_{i}^{(2)}), we have

ΔxΣx2=ΔxAΣzAΔx=ΔzΣzΔz=ΔzΣz2,\displaystyle\|\Delta_{x}\|_{\Sigma_{x}}^{2}=\Delta_{x}^{\top}A\Sigma_{z}A^{\top}\Delta_{x}=\Delta_{z}^{\top}\Sigma_{z}\Delta_{z}=\|\Delta_{z}\|_{\Sigma_{z}}^{2},

plugging this into (C.17), we have

12κΔzΣz+λId2\displaystyle\frac{1}{2\kappa}\|\Delta_{z}\|_{\Sigma_{z}+\lambda I_{d}}^{2} 12κ(ΔzΣz2+λΔz22)\displaystyle\leq\frac{1}{2\kappa}(\|\Delta_{z}\|_{\Sigma_{z}}^{2}+\lambda\|\Delta_{z}\|_{2}^{2})
Al(θ)(Σz+λId)1ΔzΣz+λId+λ2κΔz22,\displaystyle\leq\|A^{\top}\nabla l(\theta^{*})\|_{(\Sigma_{z}+\lambda I_{d})^{-1}}\cdot\|\Delta_{z}\|_{\Sigma_{z}+\lambda I_{d}}+\frac{\lambda}{2\kappa}\|\Delta_{z}\|_{2}^{2},

By \|\widehat{\theta}\|_{2}\leq 1, we have \|\Delta_{z}\|_{2}=\|A^{\top}\widehat{\theta}-\beta\|_{2}\leq\|\widehat{\theta}-\theta^{*}\|_{2}\leq 2, and therefore

ΔzΣz+λId22κAl(θ)(Σz+λId)1ΔzΣz+λId+2λ.\displaystyle\|\Delta_{z}\|_{\Sigma_{z}+\lambda I_{d}}^{2}\leq 2\kappa\cdot\|A^{\top}\nabla l(\theta^{*})\|_{(\Sigma_{z}+\lambda I_{d})^{-1}}\cdot\|\Delta_{z}\|_{\Sigma_{z}+\lambda I_{d}}+{2\lambda}. (C.18)

To derive an upper bound for ΔzΣz+λId\|\Delta_{z}\|_{\Sigma_{z}+\lambda I_{d}}, we need to upper bound Al(θ)(Σz+λId)1\|A^{\top}\nabla l(\theta^{*})\|_{(\Sigma_{z}+\lambda I_{d})^{-1}}. By Bernstein’s inequality for sub-Gaussian random variables in quadratic form (e.g. see Theorem 2.1 in Hsu et al. [2011]), we have

Al(θ)(Σz+λId)1cd+log1/δn\|A^{\top}\nabla l(\theta^{*})\|_{(\Sigma_{z}+\lambda I_{d})^{-1}}\leq c\cdot\sqrt{\frac{d+\log 1/\delta}{n}}

for some constant cc with probability at least 1δ1-\delta. Then we have

ΔzΣz+λId22cκd+log1/δnΔzΣz+λId+2λ,\|\Delta_{z}\|_{\Sigma_{z}+\lambda I_{d}}^{2}\leq 2c\cdot\kappa\cdot\sqrt{\frac{d+\log 1/\delta}{n}}\cdot\|\Delta_{z}\|_{\Sigma_{z}+\lambda I_{d}}+{2\lambda},

solving this quadratic inequality in \|\Delta_{z}\|_{\Sigma_{z}+\lambda I_{d}} (if u^{2}\leq bu+c for u\geq 0, then u\leq b+\sqrt{c}) gives

(ii)O(κ2d+log1/δn+λ).\operatorname{(ii)}\leq O\bigg{(}\sqrt{\kappa^{2}\cdot\frac{d+\log 1/\delta}{n}+\lambda}\bigg{)}.

Combining (i) and (ii), we prove that

1=𝔼xPa[|f^(x)f(x)|]O(κ2d+log1/δn+λTr(𝔼Pa[zz](Σz+λId)1)).\mathcal{E}_{1}=\mathbb{E}_{x\sim P_{a}}[|\widehat{f}(x)-f^{*}(x)|]\leq O\bigg{(}\sqrt{\kappa^{2}\cdot\frac{d+\log 1/\delta}{n}+\lambda}\cdot\sqrt{\operatorname{Tr}(\mathbb{E}_{P_{a}}[zz^{\top}](\Sigma_{z}+\lambda I_{d})^{-1})}\bigg{)}.

By Lemma C.4, we have

1O(κ2d+log1/δn+λTr(𝔼Pa[xx](Σx+λID)1)),\displaystyle\mathcal{E}_{1}\leq O\bigg{(}\sqrt{\kappa^{2}\cdot\frac{d+\log 1/\delta}{n}+\lambda}\cdot\sqrt{\operatorname{Tr}(\mathbb{E}_{P_{a}}[xx^{\top}](\Sigma_{x}+\lambda I_{D})^{-1})}\bigg{)},

recall that \widetilde{\Sigma}_{\lambda}:=\frac{1}{n_{2}}(\sum_{i=1}^{n_{2}}(x_{i}^{(1)}-x_{i}^{(2)})(x_{i}^{(1)}-x_{i}^{(2)})^{\top}+\lambda I); resetting \lambda to \lambda/n_{2}, we conclude the proof.

C.6 Proof of Lemma B.8

Recall the definition of g(x)g(x) that

g(x)=θAAx,g(x)={\theta^{*}}^{\top}AA^{\top}x,

note that g(x)=θxg(x)={\theta^{*}}^{\top}x when xx is supported on 𝒜\mathcal{A}. Thus,

|𝔼xPa[g(x)]𝔼xP^a[g(x)]|\displaystyle\left|\mathbb{E}_{x\sim P_{a}}[g(x)]-\mathbb{E}_{x\sim\widehat{P}_{a}}[g(x)]\right|
=\displaystyle= |𝔼xPa[θx]𝔼xP^a[θAAx]|\displaystyle\left|\mathbb{E}_{x\sim P_{a}}[{\theta^{*}}^{\top}x]-\mathbb{E}_{x\sim\widehat{P}_{a}}[{\theta^{*}}^{\top}AA^{\top}x]\right|
\displaystyle\leq |𝔼xPa[θx]𝔼xP^a[θVVx]|+|𝔼xP^a[θVVx]𝔼xP^a[θAAx]|e1,\displaystyle\left|\mathbb{E}_{x\sim P_{a}}[{\theta^{*}}^{\top}x]-\mathbb{E}_{x\sim\widehat{P}_{a}}[{\theta^{*}}^{\top}VV^{\top}x]\right|+\underbrace{\left|\mathbb{E}_{x\sim\widehat{P}_{a}}[{\theta^{*}}^{\top}VV^{\top}x]-\mathbb{E}_{x\sim\widehat{P}_{a}}[{\theta^{*}}^{\top}AA^{\top}x]\right|}_{e_{1}},

where

e1\displaystyle e_{1} =|𝔼xP^a[θ(VVAA)x]|\displaystyle=\left|\mathbb{E}_{x\sim\widehat{P}_{a}}[{\theta^{*}}^{\top}\left(VV^{\top}-AA^{\top}\right)x]\right|
\displaystyle\leq\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\left\|(VV^{\top}-AA^{\top})x\right\|_{2}\right]
VVAAF𝔼xP^a[x22].\displaystyle\leq\|VV^{\top}-AA^{\top}\|_{F}\cdot\sqrt{\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\left\|x\right\|^{2}_{2}\right]}.

Using the notations P^{LD}(a)=P(z\mid\widehat{f}(Az)=a) and P^{LD}_{t_{0}}(a)=P_{t_{0}}(z\mid\widehat{f}(Az)=a), we have

|𝔼xPa[θx]𝔼xP^a[θVVx]|\displaystyle\left|\mathbb{E}_{x\sim P_{a}}[{\theta^{*}}^{\top}x]-\mathbb{E}_{x\sim\widehat{P}_{a}}[{\theta^{*}}^{\top}VV^{\top}x]\right|
=\displaystyle= |𝔼zP(zf^(Az)=a)[θAz]𝔼z(UV)#P^a[θVUz]|\displaystyle\left|\mathbb{E}_{z\sim P(z\mid\widehat{f}(Az)=a)}[{\theta^{*}}^{\top}Az]-\mathbb{E}_{z\sim(U^{\top}V^{\top})_{\#}\widehat{P}_{a}}[{\theta^{*}}^{\top}VUz]\right|
\displaystyle\leq |𝔼zPt0LD(a)[θAz]𝔼z(UV)#P^a[θVUz]|+|𝔼zPt0LD(a)[θAz]𝔼zPLD(a)[θAz]|e2,\displaystyle\left|\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}[{\theta^{*}}^{\top}Az]-\mathbb{E}_{z\sim(U^{\top}V^{\top})_{\#}\widehat{P}_{a}}[{\theta^{*}}^{\top}VUz]\right|+\underbrace{\left|\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}[{\theta^{*}}^{\top}Az]-\mathbb{E}_{z\sim P^{LD}(a)}[{\theta^{*}}^{\top}Az]\right|}_{e_{2}},

where

e2\displaystyle e_{2} =|α(t0)𝔼zPLD(a)[θAz]+h(t0)𝔼u𝖭(𝟢,𝖨𝖽)[θAu]𝔼zPLD(a)[θAz]|\displaystyle=\left|\alpha(t_{0})\mathbb{E}_{z\sim P^{LD}(a)}[{\theta^{*}}^{\top}Az]+h(t_{0})\mathbb{E}_{u\sim\sf N(0,I_{d})}[{\theta^{*}}^{\top}Au]-\mathbb{E}_{z\sim P^{LD}(a)}[{\theta^{*}}^{\top}Az]\right|
(1α(t0))|𝔼zPLD(a)[θAz]|\displaystyle\leq(1-\alpha(t_{0}))\left|\mathbb{E}_{z\sim P^{LD}(a)}[{\theta^{*}}^{\top}Az]\right|
t0𝔼zPLD(a)[z2].\displaystyle\leq t_{0}\cdot\mathbb{E}_{z\sim P^{LD}(a)}[\left\|z\right\|_{2}].

Then what is left to bound is

|𝔼zPt0LD(a)[θAz]𝔼z(UV)#P^a[θVUz]|\displaystyle\left|\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}[{\theta^{*}}^{\top}Az]-\mathbb{E}_{z\sim(U^{\top}V^{\top})_{\#}\widehat{P}_{a}}[{\theta^{*}}^{\top}VUz]\right|
\displaystyle\leq |𝔼zPt0LD(a)[θVUz]𝔼z(UV)#P^a[θVUz]|e3+|𝔼zPt0LD(a)[θVUz]𝔼zPt0LD(a)[θAz]|e4.\displaystyle\underbrace{\left|\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}[{\theta^{*}}^{\top}VUz]-\mathbb{E}_{z\sim(U^{\top}V^{\top})_{\#}\widehat{P}_{a}}[{\theta^{*}}^{\top}VUz]\right|}_{e_{3}}+\underbrace{\left|\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}[{\theta^{*}}^{\top}VUz]-\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}[{\theta^{*}}^{\top}Az]\right|}_{e_{4}}.

Then for term e3e_{3}, by Lemma C.2, we get

e3\displaystyle e_{3} |𝔼zPt0LD(a)[z2]𝔼z(UV)#P^a[z2]|\displaystyle\leq\left|\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}[\|z\|_{2}]-\mathbb{E}_{z\sim(U^{\top}V^{\top})_{\#}\widehat{P}_{a}}[\|z\|_{2}]\right|
=𝒪(TV(P^a)(𝔼zPt0LD(a)[z22]+𝔼xP^a[x22])),\displaystyle=\mathcal{O}\left(TV(\widehat{P}_{a})\cdot\left(\sqrt{\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}[\|z\|^{2}_{2}]}+\sqrt{\mathbb{E}_{x\sim\widehat{P}_{a}}[\|x\|^{2}_{2}]}\right)\right),

where we use 𝔼z(UV)#P^a[z2]𝔼xP^a[x2]\mathbb{E}_{z\sim(U^{\top}V^{\top})_{\#}\widehat{P}_{a}}[\|z\|_{2}]\leq\mathbb{E}_{x\sim\widehat{P}_{a}}[\|x\|_{2}]

For e4e_{4}, we have

e4\displaystyle e_{4} =|𝔼zPt0LD(a)[θ(VUA)z]|\displaystyle=\left|\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}[{\theta^{*}}^{\top}(VU-A)z]\right|
=α(t0)|𝔼zP(a)[θ(VUA)z]|\displaystyle=\alpha(t_{0})\left|\mathbb{E}_{z\sim P(a)}[{\theta^{*}}^{\top}(VU-A)z]\right|
VUAF𝔼zPLD(a)[z2].\displaystyle\leq\|VU-A\|_{F}\cdot\mathbb{E}_{z\sim P^{LD}(a)}[\left\|z\right\|_{2}].

Therefore, combining the bounds above, we have

2\displaystyle\mathcal{E}_{2}\leq e1+e2+e3+e4\displaystyle e_{1}+e_{2}+e_{3}+e_{4}
\displaystyle\leq VVAAF𝔼xP^a[x22]+(VUAF+t0)M(a)\displaystyle\|VV^{\top}-AA^{\top}\|_{F}\cdot\sqrt{\mathbb{E}_{x\sim\widehat{P}_{a}}[\|x\|^{2}_{2}]}+(\|VU-A\|_{F}+t_{0})\cdot\sqrt{M(a)}
+𝒪(TV(P^a)(M(a)+t0d+𝔼xP^a[x22])).\displaystyle+\mathcal{O}\left(TV(\widehat{P}_{a})\cdot\left(\sqrt{M(a)+t_{0}d}+\sqrt{\mathbb{E}_{x\sim\widehat{P}_{a}}[\|x\|^{2}_{2}]}\right)\right).

By Lemma B.4 and Lemma C.1, we have

TV(P^a)=𝒪~(𝒯(P(x,y^=a),Pxy^;𝒮¯)λminϵdiff),\displaystyle TV(\widehat{P}_{a})=\widetilde{\mathcal{O}}\left(\sqrt{\frac{{\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}})}{\lambda_{\min}}}\cdot\epsilon_{diff}\right),
VVAAF=𝒪~(t0λminϵdiff),\displaystyle\|VV^{\top}-AA^{\top}\|_{\rm F}=\widetilde{\mathcal{O}}\left(\frac{\sqrt{t_{0}}}{\sqrt{\lambda_{\min}}}\cdot\epsilon_{diff}\right),
VUAF=𝒪(d32VVAAF).\displaystyle\|VU-A\|_{\rm F}=\mathcal{O}(d^{\frac{3}{2}}\|VV^{\top}-AA^{\top}\|_{\rm F}).

And by Lemma C.3

\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|x\|^{2}_{2}\right]=\mathcal{O}\left(ct_{0}D+M(a)\cdot(1+TV(\widehat{P}_{a}))\right).

Therefore, the leading term in \mathcal{E}_{2} is

2=𝒪((TV(P^a)+t0)M(a)).\mathcal{E}_{2}=\mathcal{O}\left((TV(\widehat{P}_{a})+t_{0})\sqrt{M(a)}\right).

By plugging in score matching error ϵdiff2=𝒪~(1t0Dd2+D2dn1)\epsilon^{2}_{diff}=\widetilde{\mathcal{O}}\left(\frac{1}{t_{0}}\sqrt{\frac{Dd^{2}+D^{2}d}{n_{1}}}\right), we have

TV(P^a)=𝒪~(𝒯(P(x,y^=a),Pxy^;𝒮¯)λmin(Dd2+D2dn1)141t0).TV(\widehat{P}_{a})=\widetilde{\mathcal{O}}\left(\sqrt{\frac{{\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}})}{\lambda_{\min}}}\cdot\left(\frac{Dd^{2}+D^{2}d}{n_{1}}\right)^{\frac{1}{4}}\cdot\frac{1}{\sqrt{t_{0}}}\right).

When t_{0}=\left((Dd^{2}+D^{2}d)/n_{1}\right)^{1/6}, this choice admits the best trade-off, and \mathcal{E}_{2} is bounded by

2=𝒪~(𝒯(P(x,y^=a),Pxy^;𝒮¯)λmin(Dd2+D2dn1)16a).\mathcal{E}_{2}=\widetilde{\mathcal{O}}\left(\sqrt{\frac{{\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}})}{\lambda_{\min}}}\cdot\left(\frac{Dd^{2}+D^{2}d}{n_{1}}\right)^{\frac{1}{6}}\cdot a\right).

C.7 Proof of Lemma C.1

From Lemma 17 in Chen et al. [2023a], we have

UVAF=𝒪(VVAAF).\|U-V^{\top}A\|_{F}=\mathcal{O}(\|VV^{\top}-AA^{\top}\|_{F}).

Then it suffices to bound

|VUAF2UVAF2|,\left|\|VU-A\|^{2}_{F}-\|U-V^{\top}A\|^{2}_{F}\right|,

where

VUAF2=2dtrace(UVA+AVU)\displaystyle\|VU-A\|^{2}_{F}=2d-\operatorname{trace}\left(U^{\top}V^{\top}A+A^{\top}VU\right)
UVAF2=d+trace(AVVA)trace(UVA+AVU).\displaystyle\|U-V^{\top}A\|^{2}_{F}=d+\operatorname{trace}\left(A^{\top}VV^{\top}A\right)-\operatorname{trace}\left(U^{\top}V^{\top}A+A^{\top}VU\right).

Thus

|VUAF2UVAF2|=|dtrace(AVVA)|=|trace(AA(VVAA))|,\left|\|VU-A\|^{2}_{F}-\|U-V^{\top}A\|^{2}_{F}\right|=\left|d-\operatorname{trace}\left(A^{\top}VV^{\top}A\right)\right|=\left|\operatorname{trace}\left(AA^{\top}(VV^{\top}-AA^{\top})\right)\right|,

which is because \operatorname{trace}\left(A^{\top}VV^{\top}A\right) is calculated as

trace(AVVA)\displaystyle\operatorname{trace}\left(A^{\top}VV^{\top}A\right) =trace(AAVV)\displaystyle=\operatorname{trace}\left(AA^{\top}VV^{\top}\right)
=trace(AAAA)+trace(AA(VVAA))\displaystyle=\operatorname{trace}\left(AA^{\top}AA^{\top}\right)+\operatorname{trace}\left(AA^{\top}(VV^{\top}-AA^{\top})\right)
=d+trace(AA(VVAA)).\displaystyle=d+\operatorname{trace}\left(AA^{\top}(VV^{\top}-AA^{\top})\right).

Then we will bound |trace(AA(VVAA))|\left|\operatorname{trace}\left(AA^{\top}(VV^{\top}-AA^{\top})\right)\right| by VVAAF\|VV^{\top}-AA^{\top}\|_{F},

|trace(AA(VVAA))|\displaystyle\left|\operatorname{trace}\left(AA^{\top}(VV^{\top}-AA^{\top})\right)\right| trace(AA)trace(|VVAA|)\displaystyle\leq\operatorname{trace}\left(AA^{\top}\right)\operatorname{trace}\left(\left|VV^{\top}-AA^{\top}\right|\right)
dtrace(|VVAA|)\displaystyle\leq d\cdot\operatorname{trace}\left(\left|VV^{\top}-AA^{\top}\right|\right)
d2dVVAAF2.\displaystyle\leq d\cdot\sqrt{2d\left\|VV^{\top}-AA^{\top}\right\|^{2}_{F}}.

Thus, VUAF=𝒪(d32(V,A)).\|VU-A\|_{F}=\mathcal{O}\left(d^{\frac{3}{2}}\sqrt{\angle({V},{A})}\right).
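
To illustrate the alignment step numerically, one can take U to be the orthogonal Procrustes alignment of V to A (one valid choice of the orthogonal matrix in Lemma C.1) and compare \|VU-A\|_{F} with \|VV^{\top}-AA^{\top}\|_{F}; a minimal numpy sketch (the perturbation model and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
D, d, eps = 20, 4, 1e-2

A, _ = np.linalg.qr(rng.standard_normal((D, d)))
V, _ = np.linalg.qr(A + eps * rng.standard_normal((D, d)))   # small perturbation of A

# Orthogonal Procrustes alignment: U = argmin over orthogonal U of ||V U - A||_F,
# obtained from the SVD of V^T A
P, _, Qt = np.linalg.svd(V.T @ A)
U = P @ Qt

print(np.linalg.norm(V @ U - A, 'fro'))            # aligned basis error
print(np.linalg.norm(V @ V.T - A @ A.T, 'fro'))    # projection distance; same order
```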

C.8 Proof of Lemma C.2

When P1P_{1} and P2P_{2} are Gaussian, m(z)=z22m(z)=\|z\|^{2}_{2},

|𝔼zP1[m(z)]𝔼zP2[m(z)]|\displaystyle\quad\left|\mathbb{E}_{z\sim P_{1}}[m(z)]-\mathbb{E}_{z\sim P_{2}}[m(z)]\right|
=|m(z)(p1(z)p2(z))dz|\displaystyle=\left|\int m(z)\left(p_{1}(z)-p_{2}(z)\right)\mathrm{d}z\right|
|z2Rz22(p1(z)p2(z))dz|+z2>Rz22p1(z)dz+z2>Rz22p2(z)dz\displaystyle\leq\left|\int_{\|z\|_{2}\leq R}\|z\|^{2}_{2}\left(p_{1}(z)-p_{2}(z)\right)\mathrm{d}z\right|+\int_{\|z\|_{2}>R}\|z\|^{2}_{2}p_{1}(z)\mathrm{d}z+\int_{\|z\|_{2}>R}\|z\|^{2}_{2}p_{2}(z)\mathrm{d}z
R2dTV(P1,P2)+z2>Rz22p1(z)dz+z2>Rz22p2(z)dz.\displaystyle\leq R^{2}\texttt{d}_{\rm TV}(P_{1},P_{2})+\int_{\|z\|_{2}>R}\|z\|^{2}_{2}p_{1}(z)\mathrm{d}z+\int_{\|z\|_{2}>R}\|z\|^{2}_{2}p_{2}(z)\mathrm{d}z.

Since P_{1} and P_{2} are Gaussians, \int_{\|z\|_{2}>R}\|z\|^{2}_{2}p_{1}(z)\mathrm{d}z and \int_{\|z\|_{2}>R}\|z\|^{2}_{2}p_{2}(z)\mathrm{d}z are bounded by some constant C_{1} when R^{2}\geq C_{2}\max\{\mathbb{E}_{z\sim P_{1}}[\|z\|_{2}^{2}],\mathbb{E}_{z\sim P_{2}}[\|z\|_{2}^{2}]\}, as suggested by Lemma 16 in Chen et al. [2023a].

Therefore,

𝔼zP1[z22]\displaystyle\mathbb{E}_{z\sim P_{1}}[\|z\|^{2}_{2}] 𝔼zP2[z22]+C2max{𝔼P1[z22],𝔼P2[z22]}dTV(P1,P2)+2C1\displaystyle\leq\mathbb{E}_{z\sim P_{2}}[\|z\|^{2}_{2}]+C_{2}\max\{\mathbb{E}_{P_{1}}[\|z\|_{2}^{2}],\mathbb{E}_{P_{2}}[\|z\|_{2}^{2}]\}\cdot\texttt{d}_{\rm TV}(P_{1},P_{2})+2C_{1}
𝔼zP2[z22]+C2(𝔼zP1[z22]+𝔼zP2[z22])dTV(P1,P2)+2C1.\displaystyle\leq\mathbb{E}_{z\sim P_{2}}[\|z\|^{2}_{2}]+C_{2}(\mathbb{E}_{z\sim P_{1}}[\|z\|_{2}^{2}]+\mathbb{E}_{z\sim P_{2}}[\|z\|_{2}^{2}])\cdot\texttt{d}_{\rm TV}(P_{1},P_{2})+2C_{1}.

Then

𝔼zP1[z22]=𝒪(𝔼zP2[z22]+𝔼zP2[z22]dTV(P1,P2))\mathbb{E}_{z\sim P_{1}}[\|z\|^{2}_{2}]=\mathcal{O}\left(\mathbb{E}_{z\sim P_{2}}[\|z\|^{2}_{2}]+\mathbb{E}_{z\sim P_{2}}[\|z\|_{2}^{2}]\cdot\texttt{d}_{\rm TV}(P_{1},P_{2})\right)

since dTV(P1,P2)\texttt{d}_{\rm TV}(P_{1},P_{2}) decays with n1n_{1}.

Similarly, when m(z)=z2m(z)=\|z\|_{2}

|𝔼zP1[m(z)]𝔼zP2[m(z)]|\displaystyle\quad\left|\mathbb{E}_{z\sim P_{1}}[m(z)]-\mathbb{E}_{z\sim P_{2}}[m(z)]\right|
=|m(z)(p1(z)p2(z))dz|\displaystyle=\left|\int m(z)\left(p_{1}(z)-p_{2}(z)\right)\mathrm{d}z\right|
|z2Rz2(p1(z)p2(z))dz|+z2>Rz2p1(z)dz+z2>Rz2p2(z)dz\displaystyle\leq\left|\int_{\|z\|_{2}\leq R}\|z\|_{2}\left(p_{1}(z)-p_{2}(z)\right)\mathrm{d}z\right|+\int_{\|z\|_{2}>R}\|z\|_{2}p_{1}(z)\mathrm{d}z+\int_{\|z\|_{2}>R}\|z\|_{2}p_{2}(z)\mathrm{d}z
RdTV(P1,P2)+z2>Rz22p1(z)dz+z2>Rz22p2(z)dz,\displaystyle\leq R\texttt{d}_{\rm TV}(P_{1},P_{2})+\sqrt{\int_{\|z\|_{2}>R}\|z\|^{2}_{2}p_{1}(z)\mathrm{d}z}+\sqrt{\int_{\|z\|_{2}>R}\|z\|^{2}_{2}p_{2}(z)\mathrm{d}z},

where z2>Rz22p1(z)dz\int_{\|z\|_{2}>R}\|z\|^{2}_{2}p_{1}(z)\mathrm{d}z and z2>Rz22p2(z)dz\int_{\|z\|_{2}>R}\|z\|^{2}_{2}p_{2}(z)\mathrm{d}z are simultaneously bounded by a constant C1C_{1}, if the radius R2C2max{𝔼zP1[z22],𝔼zP2[z22]}R^{2}\geq C_{2}\max\{\mathbb{E}_{z\sim P_{1}}[\|z\|_{2}^{2}],\mathbb{E}_{z\sim P_{2}}[\|z\|_{2}^{2}]\} as suggested by Chen et al. [2023a, Lemma 16].

Therefore,

|𝔼zP1[z2]𝔼zP2[z2]|\displaystyle\left|\mathbb{E}_{z\sim P_{1}}[\|z\|_{2}]-\mathbb{E}_{z\sim P_{2}}[\|z\|_{2}]\right| C2max{𝔼P1[z22],𝔼P2[z22]}dTV(P1,P2)+2C1\displaystyle\leq\sqrt{C_{2}\max\{\mathbb{E}_{P_{1}}[\|z\|_{2}^{2}],\mathbb{E}_{P_{2}}[\|z\|_{2}^{2}]\}}\cdot\texttt{d}_{\rm TV}(P_{1},P_{2})+2C_{1}
(C2𝔼zP1[z22]+C2𝔼zP2[z22])dTV(P1,P2)+2C1\displaystyle\leq\left(\sqrt{C_{2}\mathbb{E}_{z\sim P_{1}}[\|z\|_{2}^{2}]}+\sqrt{C_{2}\mathbb{E}_{z\sim P_{2}}[\|z\|_{2}^{2}]}\right)\cdot\texttt{d}_{\rm TV}(P_{1},P_{2})+2C_{1}
=𝒪((𝔼zP1[z22]+𝔼zP2[z22])dTV(P1,P2)).\displaystyle=\mathcal{O}\left(\left(\sqrt{\mathbb{E}_{z\sim P_{1}}[\|z\|_{2}^{2}]}+\sqrt{\mathbb{E}_{z\sim P_{2}}[\|z\|_{2}^{2}]}\right)\cdot\texttt{d}_{\rm TV}(P_{1},P_{2})\right).

C.9 Proof of Lemma C.3

Recall from (C.12) that

PLD(a)=Pz(zf^(Az)=a)=𝖭(μ(𝖺),Γ)P^{LD}(a)=P_{z}(z\mid\widehat{f}(Az)=a)=\sf N\left(\mu(a),\Gamma\right)

with μ(a):=Σβ^(β^Σβ^+ν2)1a\mu(a):=\Sigma\widehat{\beta}\left(\widehat{\beta}^{\top}\Sigma\widehat{\beta}+\nu^{2}\right)^{-1}a, Γ:=ΣΣβ^(β^Σβ^+ν2)1β^Σ\Gamma:=\Sigma-\Sigma\widehat{\beta}\left(\widehat{\beta}^{\top}\Sigma\widehat{\beta}+\nu^{2}\right)^{-1}\widehat{\beta}^{\top}\Sigma.

𝔼zPLD(a)[z22]\displaystyle\mathbb{E}_{z\sim P^{LD}(a)}\left[\|z\|^{2}_{2}\right] =μ(a)μ(a)+trace(Γ)\displaystyle=\mu(a)^{\top}\mu(a)+\operatorname{trace}(\Gamma)
=β^Σ2β^(β^Σ2+ν2)2a2+trace(ΣΣβ^(β^Σβ^+ν2)1β^Σ)\displaystyle=\frac{\widehat{\beta}^{\top}\Sigma^{2}\widehat{\beta}}{\left(\|\widehat{\beta}\|_{\Sigma}^{2}+\nu^{2}\right)^{2}}a^{2}+\operatorname{trace}(\Sigma-\Sigma\widehat{\beta}\left(\widehat{\beta}^{\top}\Sigma\widehat{\beta}+\nu^{2}\right)^{-1}\widehat{\beta}^{\top}\Sigma)
=:M(a).\displaystyle=:M(a).
M(a)\displaystyle M(a) =𝒪(β^Σ2β^(β^Σ2)2a2+trace(Σ))\displaystyle=\mathcal{O}\left(\frac{\widehat{\beta}^{\top}\Sigma^{2}\widehat{\beta}}{\left(\|\widehat{\beta}\|_{\Sigma}^{2}\right)^{2}}a^{2}+\operatorname{trace}(\Sigma)\right)
=𝒪(a2β^Σ+d),\displaystyle=\mathcal{O}\left(\frac{a^{2}}{\|\widehat{\beta}\|_{\Sigma}}+d\right),

and by Lemma B.6

\|\widehat{\beta}\|_{\Sigma}\geq\frac{1}{2}\|{\beta}^{*}\|_{\Sigma}.

Thus 𝔼zPLD(a)[z22]=M(a),M(a)=𝒪(a2βΣ+d)\mathbb{E}_{z\sim P^{LD}(a)}\left[\|z\|^{2}_{2}\right]=M(a),M(a)=\mathcal{O}\left(\frac{a^{2}}{\|{\beta}^{*}\|_{\Sigma}}+d\right).

Thus after adding diffusion noise at t0t_{0}, we have for α(t)=et/2\alpha(t)=e^{-t/2} and h(t)=h(t)= 1et1-e^{-t}:

𝔼zPt0LD(a)[z22]\displaystyle\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}\left[\|z\|^{2}_{2}\right] =𝔼z0PLD(a)𝔼z𝖭(α(𝗍𝟢)𝗓𝟢,𝗁(𝗍𝟢)𝖨𝖽)[z22]\displaystyle=\mathbb{E}_{z_{0}\sim P^{LD}(a)}\mathbb{E}_{z\sim\sf N\left(\alpha(t_{0})\cdot z_{0},h(t_{0})\cdot I_{d}\right)}\left[\|z\|^{2}_{2}\right]
=𝔼z0PLD(a)[α2(t0)z022+dh(t0)]\displaystyle=\mathbb{E}_{z_{0}\sim P^{LD}(a)}\left[\alpha^{2}(t_{0})\|z_{0}\|^{2}_{2}+d\cdot h(t_{0})\right]
=α2(t0)𝔼z0PLD(a)[z022]+dh(t0)\displaystyle=\alpha^{2}(t_{0})\cdot\mathbb{E}_{z_{0}\sim P^{LD}(a)}\left[\|z_{0}\|^{2}_{2}\right]+d\cdot h(t_{0})
=et0𝔼z0PLD(a)[z022]+(1et0)d.\displaystyle=e^{-t_{0}}\cdot\mathbb{E}_{z_{0}\sim P^{LD}(a)}\left[\|z_{0}\|^{2}_{2}\right]+(1-e^{-t_{0}})\cdot d.

Thus 𝔼zPt0LD(a)[z22]M(a)+t0d\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}\left[\|z\|^{2}_{2}\right]\leq M(a)+t_{0}d.
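
Both closed-form moments above can be checked by simulation; a small Monte Carlo sketch (the dimension, \widehat{\beta}, \nu, a, and t_{0} below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
d, nu, a, t0, n_mc = 4, 0.3, 2.0, 0.05, 200_000

B = rng.standard_normal((d, d)); Sigma = B @ B.T + np.eye(d)   # latent covariance
beta = rng.standard_normal(d)                                   # plays the role of beta_hat

# Conditional law z | f_hat(Az) = a from (C.12) with mu = 0
denom = beta @ Sigma @ beta + nu ** 2
mu_a = Sigma @ beta * a / denom
Gamma = Sigma - np.outer(Sigma @ beta, beta @ Sigma) / denom

M_a = mu_a @ mu_a + np.trace(Gamma)                     # closed form (C.4)
z = rng.multivariate_normal(mu_a, Gamma, size=n_mc)
print(M_a, np.mean(np.sum(z ** 2, axis=1)))             # closed form vs Monte Carlo

# Forward noise at t0: z_t = alpha(t0) z_0 + sqrt(h(t0)) * N(0, I_d)
alpha, h = np.exp(-t0 / 2), 1 - np.exp(-t0)
zt = alpha * z + np.sqrt(h) * rng.standard_normal(z.shape)
print(alpha ** 2 * M_a + d * h, np.mean(np.sum(zt ** 2, axis=1)))
```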

By orthogonal decomposition we have

𝔼xP^a[x22]\displaystyle\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|x\|^{2}_{2}\right] 𝔼xP^a[(IDVV)x22]+𝔼xP^a[VVx22]\displaystyle\leq\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|(I_{D}-VV^{\top})x\|^{2}_{2}\right]+\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|VV^{\top}x\|^{2}_{2}\right]
=𝔼xP^a[(IDVV)x22]+𝔼xP^a[UVx22],\displaystyle=\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|(I_{D}-VV^{\top})x\|^{2}_{2}\right]+\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|U^{\top}V^{\top}x\|^{2}_{2}\right],

where 𝔼xP^a[(IDVV)x22]\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|(I_{D}-VV^{\top})x\|^{2}_{2}\right] is bounded by (B.5) and the distribution of UVxU^{\top}V^{\top}x, which is (UV)#P^a(U^{\top}V^{\top})_{\#}\widehat{P}_{a}, is close to LDt0(a)\mathbb{P}^{LD}_{t_{0}}(a) up to TV(P^a)TV(\widehat{P}_{a}), which is defined in Definition B.2. Then by Lemma C.2, we have

\displaystyle\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|U^{\top}V^{\top}x\|^{2}_{2}\right]=\mathcal{O}\left(\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}\left[\|z\|^{2}_{2}\right](1+TV(\widehat{P}_{a}))\right).

Thus \mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|x\|^{2}_{2}\right]=\mathcal{O}\left(ct_{0}D+(M(a)+t_{0}d)\cdot(1+TV(\widehat{P}_{a}))\right).

C.10 Proof of Lemma C.4

Firstly, one can verify the following two equations by direct calculation:

(λID+AΣ1A)1\displaystyle(\lambda I_{D}+A\Sigma_{1}A^{\top})^{-1} =1λ(IDA(λId+Σ1)1Σ1A),\displaystyle=\frac{1}{\lambda}\left(I_{D}-A(\lambda I_{d}+\Sigma_{1})^{-1}\Sigma_{1}A^{\top}\right),
(λId+Σ1)1\displaystyle(\lambda I_{d}+\Sigma_{1})^{-1} =1λ(Id(λId+Σ1)1Σ1).\displaystyle=\frac{1}{\lambda}\left(I_{d}-(\lambda I_{d}+\Sigma_{1})^{-1}\Sigma_{1}\right).

Then we have

(λID+AΣ1A)1AΣ2A=\displaystyle(\lambda I_{D}+A\Sigma_{1}A^{\top})^{-1}A\Sigma_{2}A^{\top}= 1λ(IDA(λId+Σ1)1Σ1A)AΣ2A\displaystyle\frac{1}{\lambda}\left(I_{D}-A(\lambda I_{d}+\Sigma_{1})^{-1}\Sigma_{1}A^{\top}\right)A\Sigma_{2}A^{\top}
=\displaystyle= 1λ(AΣ2AA(λId+Σ1)1Σ1Σ2A).\displaystyle\frac{1}{\lambda}\left(A\Sigma_{2}A^{\top}-A(\lambda I_{d}+\Sigma_{1})^{-1}\Sigma_{1}\Sigma_{2}A^{\top}\right).

Therefore,

Tr((λID+AΣ1A)1AΣ2A)=\displaystyle\mathrm{Tr}\left((\lambda I_{D}+A\Sigma_{1}A^{\top})^{-1}A\Sigma_{2}A^{\top}\right)= Tr(1λ(AΣ2AA(λId+Σ1)1Σ1Σ2A))\displaystyle\mathrm{Tr}\left(\frac{1}{\lambda}\left(A\Sigma_{2}A^{\top}-A(\lambda I_{d}+\Sigma_{1})^{-1}\Sigma_{1}\Sigma_{2}A^{\top}\right)\right)
=\displaystyle= Tr(1λ(Σ2(λId+Σ1)1Σ1Σ2))\displaystyle\mathrm{Tr}\left(\frac{1}{\lambda}\left(\Sigma_{2}-(\lambda I_{d}+\Sigma_{1})^{-1}\Sigma_{1}\Sigma_{2}\right)\right)
=\displaystyle= Tr(1λ(Id(λId+Σ1)1Σ1)Σ2)\displaystyle\mathrm{Tr}\left(\frac{1}{\lambda}\left(I_{d}-(\lambda I_{d}+\Sigma_{1})^{-1}\Sigma_{1}\right)\Sigma_{2}\right)
=\displaystyle= Tr((λId+Σ1)1Σ2),\displaystyle\mathrm{Tr}\left(\left(\lambda I_{d}+\Sigma_{1}\right)^{-1}\Sigma_{2}\right),

which finishes the proof.

Appendix D Omitted Proofs in Section 6.2

D.1 Conditional Score Decomposition and Score Matching Error

Lemma D.1.

Under Assumptions 3.1, 6.5 and 6.6, with high probability

1Tt0t0Ts^(,t)logpt()L2(Pt)2dtϵdiff2(n1),\frac{1}{T-t_{0}}\int_{t_{0}}^{T}\|\widehat{s}(\cdot,t)-\nabla\log p_{t}(\cdot)\|_{L^{2}(P_{t})}^{2}\mathrm{d}t\leq\epsilon_{diff}^{2}(n_{1}),

with ϵdiff2(n1)=𝒪~(1t0(n122δ(n1)d+6+Dn1d+4d+6))\epsilon_{diff}^{2}(n_{1})=\widetilde{\mathcal{O}}\left(\frac{1}{t_{0}}\left(n_{1}^{-\frac{2-2\delta(n_{1})}{d+6}}+Dn_{1}^{-\frac{d+4}{d+6}}\right)\right) for δ(n1)=dloglogn1logn1\delta(n_{1})=\frac{d\log\log n_{1}}{\log n_{1}}.

Proof.

Chen et al. [2023a, Theorem 1] is easily adapted here to prove Lemma D.1, with input dimension d+1 and the Lipschitzness in Assumption 6.6. The network size of {\mathcal{S}} is implied by Chen et al. [2023a, Theorem 1] with \epsilon=n_{1}^{-\frac{1}{d+6}}, accounting for the additional dimension of the reward \widehat{y}, and then the score matching error follows. ∎

D.2 Proof of Theorem 6.7

Additional notations: As before, we use P_{t}^{LD}(z) to denote the low-dimensional distribution on z corrupted by diffusion noise. Formally, p_{t}^{LD}(z^{\prime})=\int\phi_{t}(z^{\prime}|z)p_{z}(z)\mathrm{d}z with \phi_{t}(\cdot|z) being the density of {\sf N}(\alpha(t)z,h(t)I_{d}). We write P^{LD}_{t_{0}}(z\mid\widehat{f}(Az)=a) for the corresponding conditional distribution given \widehat{f}(Az)=a at time t_{0}, with shorthand P^{LD}_{t_{0}}(a), and abbreviate P_{z}(z\mid\widehat{f}(Az)=a) as P^{LD}(a).

Now we are ready to begin our proof. By the same argument as in Appendix B.3.1, we have

SubOpt(P^a;y=a)\displaystyle\texttt{SubOpt}(\widehat{P}_{a};y^{*}=a) 𝔼xPa[|f(x)f^(x)|]1+|𝔼xPa[g(x)]𝔼xP^a[g(x)]|2+𝔼xP^a[h(x)]3.\displaystyle\leq\underbrace{\mathbb{E}_{x\sim P_{a}}\left[\left|f^{*}(x)-\widehat{f}(x)\right|\right]}_{\mathcal{E}_{1}}+\underbrace{\left|\mathbb{E}_{x\sim P_{a}}[g^{*}(x_{\parallel})]-\mathbb{E}_{x\sim\widehat{P}_{a}}[g^{*}(x_{\parallel})]\right|}_{\mathcal{E}_{2}}+\underbrace{\mathbb{E}_{x\sim\widehat{P}_{a}}[h^{*}(x_{\perp})]}_{\mathcal{E}_{3}}.

In the following, we will bound terms 1,2\mathcal{E}_{1},\mathcal{E}_{2} respectively. Since 3\mathcal{E}_{3} is essentially the off-support error, its magnitude is bounded by Theorem 5.4.

D.2.1 \mathcal{E}_{1}: Nonparametric Regression Induced Error.

Since PzP_{z} has a light tail due to Assumption 6.5, by union bound and Chen et al. [2023a, Lemma 16], we have

(xiwithxi2>Rfori=1,,n2)n2C1d2d/2+1C2Γ(d/2+1)Rd2exp(C2R2/2),\displaystyle\mathbb{P}(\exists~{}x_{i}~{}\text{with}~{}\|x_{i}\|_{2}>R~{}\text{for}~{}i=1,\dots,n_{2})\leq n_{2}\frac{C_{1}d2^{-d/2+1}}{C_{2}\Gamma(d/2+1)}R^{d-2}\exp(-C_{2}R^{2}/2),

where C1,C2C_{1},C_{2} are constants and Γ()\Gamma(\cdot) is the Gamma function. Choosing R=𝒪(dlogd+lognδ)R=\mathcal{O}(\sqrt{d\log d+\log\frac{n}{\delta}}) ensures (xiwithxi2>Rfori=1,,n2)<δ\mathbb{P}(\exists~{}x_{i}~{}\text{with}~{}\|x_{i}\|_{2}>R~{}\text{for}~{}i=1,\dots,n_{2})<\delta. On the event ={xi2Rfor alli=1,,n2}\mathcal{E}=\{\|x_{i}\|_{2}\leq R~{}\text{for all}~{}i=1,\dots,n_{2}\}, denoting δ(n2)=dloglogn2logn2\delta(n_{2})=\frac{d\log\log n_{2}}{\log n_{2}}, we have

ff^2L2=𝒪~(n22(αδ(n2))d+2α)\displaystyle\|f^{*}-\widehat{f}\|^{2}_{L^{2}}=\widetilde{\mathcal{O}}\left(n_{2}^{-\frac{2(\alpha-\delta(n_{2}))}{d+2\alpha}}\right)

by Nakada and Imaizumi [2020, Theorem 7] with a new covering number of 𝒮{\mathcal{S}}, when n2n_{2} is sufficiently large. The corresponding network architecture follows Chen et al. [2022a, Theorem 2].

We remark that a linear subspace is a special case of low Minkowski dimension. Moreover, \delta(n_{2}) is asymptotically negligible and accounts for the truncation radius R of the x_{i}’s (see also Chen et al. [2023a, Theorem 2 and 3]). The covering number of {\mathcal{S}} is \widetilde{\mathcal{O}}\left(d^{d/2}n^{-\frac{d}{\alpha}}(\log n_{2})^{d/2}+Dd\right), as appears in Chen et al. [2023a, Proof of Theorem 2]. Therefore

𝔼xPa[|f(x)f^(x)|]\displaystyle\mathbb{E}_{x\sim P_{a}}\left[\left|f^{*}(x)-\widehat{f}(x)\right|\right] 𝔼xPa[|f(x)f^(x)|2]\displaystyle\leq\sqrt{\mathbb{E}_{x\sim P_{a}}\left[\left|f^{*}(x)-\widehat{f}(x)\right|^{2}\right]}
𝒯(P(x|y^=a),Px;¯)ff^2L2\displaystyle\leq\sqrt{{\mathcal{T}}(P(x|\widehat{y}=a),P_{x};\bar{\mathcal{F}})\cdot\|f^{*}-\widehat{f}\|^{2}_{L^{2}}}
=𝒯(P(x|y^=a),Px;¯)𝒪~(n2αδ(n2)2α+d+D/n2).\displaystyle=\sqrt{{\mathcal{T}}(P(x|\widehat{y}=a),P_{x};\bar{\mathcal{F}})}\cdot\widetilde{\mathcal{O}}\left(n_{2}^{-\frac{\alpha-\delta(n_{2})}{2\alpha+d}}+D/n_{2}\right).

D.2.2 2\mathcal{E}_{2}: Diffusion Induced On-support Error

Suppose the L^{2} score matching error is \epsilon^{2}_{diff}(n_{1}), i.e.,

1Tt0t0T𝔼x,f^xlogpt(x,f^)sw^(x,f^,t)22dtϵ2diff(n1),\displaystyle\frac{1}{T-t_{0}}\int_{t_{0}}^{T}\mathbb{E}_{x,\widehat{f}}\|\nabla_{x}\log p_{t}(x,\widehat{f})-s_{\widehat{w}}(x,\widehat{f},t)\|_{2}^{2}\mathrm{d}t\leq\epsilon^{2}_{diff}(n_{1}),

We recall Definition B.2, measuring the distance between \widehat{P}_{a} and P_{a}:

TV(P^a):=dTV(PLDt0(zf^(Az)=a),(UV)#P^a).TV(\widehat{P}_{a}):=\texttt{d}_{\rm TV}\left(P^{LD}_{t_{0}}(z\mid\widehat{f}(Az)=a),(U^{\top}V^{\top})_{\#}\widehat{P}_{a}\right).

Lemma B.4 applies to the nonparametric setting, so we have

(IDVV)x𝖭(0,Λ),Λct0ID,\displaystyle(I_{D}-VV^{\top})x\sim{\sf N}(0,\Lambda),\quad\Lambda\prec ct_{0}I_{D}, (D.1)
(V,A)=𝒪~(t0c0ϵ2diff(n1)).\displaystyle\angle({V},{A})=\widetilde{\mathcal{O}}\left(\frac{t_{0}}{c_{0}}\cdot\epsilon^{2}_{diff}(n_{1})\right). (D.2)

In addition,

TV(P^a)=𝒪~(𝒯(P(x,y^=a),Pxy^;𝒮¯)c0ϵdiff(n1)).TV(\widehat{P}_{a})=\widetilde{\mathcal{O}}\left(\sqrt{\frac{{\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}})}{c_{0}}}\cdot\epsilon_{diff}(n_{1})\right). (D.3)

2\mathcal{E}_{2} will be bounded by

2\displaystyle\mathcal{E}_{2} =|𝔼xPa[g(x)]𝔼xP^a[g(x)]|\displaystyle=\left|\mathbb{E}_{x\sim P_{a}}[g^{*}(x)]-\mathbb{E}_{x\sim\widehat{P}_{a}}[g^{*}(x)]\right|
|𝔼xPa[g(AAx)]𝔼xP^a[g(VVx)]|+|𝔼xP^a[g(VVx)g(AAx)]|,\displaystyle\leq\left|\mathbb{E}_{x\sim P_{a}}[g^{*}(AA^{\top}x)]-\mathbb{E}_{x\sim\widehat{P}_{a}}[g^{*}(VV^{\top}x)]\right|+\left|\mathbb{E}_{x\sim\widehat{P}_{a}}[g^{*}(VV^{\top}x)-g^{*}(AA^{\top}x)]\right|,

where for |𝔼xP^a[g(VVx)g(AAx)]|\left|\mathbb{E}_{x\sim\widehat{P}_{a}}[g^{*}(VV^{\top}x)-g^{*}(AA^{\top}x)]\right|, we have

|𝔼xP^a[g(VVx)g(AAx)]|𝔼xP^a[VVxAAx2]VVAAF𝔼xP^a[x2].\left|\mathbb{E}_{x\sim\widehat{P}_{a}}[g^{*}(VV^{\top}x)-g^{*}(AA^{\top}x)]\right|\leq\mathbb{E}_{x\sim\widehat{P}_{a}}[\|VV^{\top}x-AA^{\top}x\|_{2}]\leq\|VV^{\top}-AA^{\top}\|_{F}\cdot\mathbb{E}_{x\sim\widehat{P}_{a}}[\|x\|_{2}]. (D.4)

For the other term |𝔼xPa[g(AAx)]𝔼xP^a[g(VVx)]|\left|\mathbb{E}_{x\sim P_{a}}[g^{*}(AA^{\top}x)]-\mathbb{E}_{x\sim\widehat{P}_{a}}[g^{*}(VV^{\top}x)]\right|, we will bound it with TV(P^a)TV(\widehat{P}_{a}).

|𝔼xPa[g(AAx)]𝔼xP^a[g(VVx)]|\displaystyle\left|\mathbb{E}_{x\sim P_{a}}[g^{*}(AA^{\top}x)]-\mathbb{E}_{x\sim\widehat{P}_{a}}[g^{*}(VV^{\top}x)]\right|
\displaystyle\leq |𝔼zt0(a)[g(Az)]𝔼z(V)#P^a[g(Vz)]|+|𝔼z(a)[g(Az)]𝔼zt0(a)[g(Az)]|\displaystyle\left|\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[g^{*}(Az)]-\mathbb{E}_{z\sim(V^{\top})_{\#}\widehat{P}_{a}}[g^{*}(Vz)]\right|+\left|\mathbb{E}_{z\sim\mathbb{P}(a)}[g^{*}(Az)]-\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[g^{*}(Az)]\right|

Since any z\sim\mathbb{P}_{t_{0}}(a) can be represented as \alpha(t_{0})z+\sqrt{h(t_{0})}u, where z\sim\mathbb{P}(a) and u\sim{\sf N}(0,I_{d}), we have

𝔼zt0(a)[g(Az)]\displaystyle\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[g^{*}(Az)]
=𝔼z(a),u𝖭(0,Id)[g(α(t)Az+h(t)Au))]\displaystyle=\mathbb{E}_{z\sim\mathbb{P}(a),u\sim{\sf N}(0,I_{d})}[g^{*}(\alpha(t)Az+\sqrt{h(t)}Au))]
𝔼z(a)[g(α(t0)Az))]+h(t0)𝔼u𝖭(0,Id)[Au2]\displaystyle\leq\mathbb{E}_{z\sim\mathbb{P}(a)}[g^{*}(\alpha(t_{0})Az))]+\sqrt{h(t_{0})}\mathbb{E}_{u\sim{\sf N}(0,I_{d})}[\|Au\|_{2}]
𝔼z(a)[g(Az))]+(1α(t0))𝔼z(a)[Az2]+h(t0)𝔼u𝖭(𝟢,𝖨𝖽)[Au2],\displaystyle\leq\mathbb{E}_{z\sim\mathbb{P}(a)}[g^{*}(Az))]+(1-\alpha(t_{0}))\mathbb{E}_{z\sim\mathbb{P}(a)}[\|Az\|_{2}]+\sqrt{h(t_{0})}\mathbb{E}_{u\sim\sf N(0,I_{d})}[\|Au\|_{2}],

thus

|𝔼z(a)[g(Az)]𝔼zt0(a)[g(Az)]|t0𝔼z(a)[z2]+d,\left|\mathbb{E}_{z\sim\mathbb{P}(a)}[g^{*}(Az)]-\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[g^{*}(Az)]\right|\leq t_{0}\cdot\mathbb{E}_{z\sim\mathbb{P}(a)}[\|z\|_{2}]+d,

where we further use $1-\alpha(t_{0})=1-e^{-t_{0}/2}\leq t_{0}/2$ and $h(t_{0})\leq 1$.

As for $\left|\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[g^{*}(Az)]-\mathbb{E}_{z\sim(V^{\top})_{\#}\widehat{P}_{a}}[g^{*}(Vz)]\right|$, we have

\displaystyle\left|\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[g^{*}(Az)]-\mathbb{E}_{z\sim(U^{\top}V^{\top})_{\#}\widehat{P}_{a}}[g^{*}(VUz)]\right|
\displaystyle\leq\left|\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[g^{*}(VUz)]-\mathbb{E}_{z\sim(U^{\top}V^{\top})_{\#}\widehat{P}_{a}}[g^{*}(VUz)]\right|+\left|\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[g^{*}(Az)]-\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[g^{*}(VUz)]\right|,

where

\left|\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[g^{*}(Az)]-\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[g^{*}(VUz)]\right|\leq\|A-VU\|_{F}\cdot\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[\|z\|_{2}],

and

\left|\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[g^{*}(VUz)]-\mathbb{E}_{z\sim(U^{\top}V^{\top})_{\#}\widehat{P}_{a}}[g^{*}(VUz)]\right|\leq TV(\widehat{P}_{a})\cdot\|g^{*}\|_{\infty}.

Combining the bounds above, we have

\displaystyle\mathcal{E}_{2}\leq\|VV^{\top}-AA^{\top}\|_{F}\cdot\mathbb{E}_{x\sim\widehat{P}_{a}}[\|x\|_{2}]+\|A-VU\|_{F}\cdot\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[\|z\|_{2}]
\displaystyle\quad+t_{0}\cdot\mathbb{E}_{z\sim\mathbb{P}(a)}[\|z\|_{2}]+d+TV(\widehat{P}_{a})\cdot\|g^{*}\|_{\infty}.

Similar to the parametric case, let $M(a):=\mathbb{E}_{z\sim\mathbb{P}(a)}[\|z\|^{2}_{2}]$; then

\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[\|z\|^{2}_{2}]\leq M(a)+t_{0}d,

except that in the nonparametric case we cannot compute $M(a)$ in closed form, as the distribution is no longer Gaussian. Still, with higher-order terms in $n_{1}^{-1}$ hidden, we have

\displaystyle\mathcal{E}_{2}=\mathcal{O}\left(TV(\widehat{P}_{a})\cdot\|g^{*}\|_{\infty}+t_{0}M(a)\right)
\displaystyle=\widetilde{\mathcal{O}}\left(\sqrt{\frac{{\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}})}{c_{0}}}\cdot\epsilon_{diff}(n_{1})\cdot\|g^{*}\|_{\infty}+t_{0}M(a)\right).

Appendix E Parametric Conditional Score Estimation: Proof of Lemma B.1

We first derive a decomposition of the conditional score function similar to Chen et al. [2023a]. We have

\displaystyle p_{t}(x,y)=\int p_{t}(x,y|z)p_{z}(z)\mathrm{d}z
\displaystyle=\int p_{t}(x|z)p(y|z)p_{z}(z)\mathrm{d}z
\displaystyle=C\int\exp\left(-\frac{1}{2h(t)}\|x-\alpha(t)Az\|_{2}^{2}\right)\exp\left(-\frac{1}{\sigma^{2}_{y}}\left(\theta^{\top}z-y\right)^{2}\right)p_{z}(z)\mathrm{d}z
\displaystyle\overset{(i)}{=}C\exp\left(-\frac{1}{2h(t)}\|(I_{D}-AA^{\top})x\|_{2}^{2}\right)\cdot\int\exp\left(-\frac{1}{2h(t)}\|A^{\top}x-\alpha(t)z\|_{2}^{2}\right)\exp\left(-\frac{1}{\sigma^{2}_{y}}\left(\theta^{\top}z-y\right)^{2}\right)p_{z}(z)\mathrm{d}z,

where equality $(i)$ follows from the fact that $AA^{\top}x\perp(I_{D}-AA^{\top})x$, and $C$ is the normalizing constant of the Gaussian densities. Taking the logarithm and then the derivative with respect to $x$ of $p_{t}(x,y)$, we obtain

\displaystyle\nabla_{x}\log p_{t}(x,y)=\frac{\alpha(t)}{h(t)}\frac{A\int z\exp\left(-\frac{1}{2h(t)}\|A^{\top}x-\alpha(t)z\|_{2}^{2}\right)\exp\left(-\frac{1}{\sigma^{2}_{y}}\left(\theta^{\top}z-y\right)^{2}\right)p_{z}(z)\mathrm{d}z}{\int\exp\left(-\frac{1}{2h(t)}\|A^{\top}x-\alpha(t)z\|_{2}^{2}\right)\exp\left(-\frac{1}{\sigma^{2}_{y}}\left(\theta^{\top}z-y\right)^{2}\right)p_{z}(z)\mathrm{d}z}-\frac{1}{h(t)}x.

Note that the first term on the right-hand side above depends only on $A^{\top}x$ and $y$. Therefore, we can write $\nabla_{x}\log p_{t}(x,y)$ compactly as

\displaystyle\nabla_{x}\log p_{t}(x,y)=\frac{1}{h(t)}Au(A^{\top}x,y,t)-\frac{1}{h(t)}x, (E.1)

where the mapping $u$ is given by

u(A^{\top}x,y,t)=\frac{\alpha(t)\int z\exp\left(-\frac{1}{2h(t)}\|A^{\top}x-\alpha(t)z\|_{2}^{2}\right)\exp\left(-\frac{1}{\sigma^{2}_{y}}\left(\theta^{\top}z-y\right)^{2}\right)p_{z}(z)\mathrm{d}z}{\int\exp\left(-\frac{1}{2h(t)}\|A^{\top}x-\alpha(t)z\|_{2}^{2}\right)\exp\left(-\frac{1}{\sigma^{2}_{y}}\left(\theta^{\top}z-y\right)^{2}\right)p_{z}(z)\mathrm{d}z}.

We observe that (E.1) motivates our choice of the neural network architecture ${\mathcal{S}}$ in (4.8). In particular, $\psi$ attempts to estimate $u$ and the matrix $V$ attempts to estimate $A$.
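For concreteness, the following is a minimal PyTorch sketch of a score network of this form, $s_{V,\psi}(x,y,t)=\frac{1}{h(t)}\left(V\psi(V^{\top}x,y,t)-x\right)$, where a learnable matrix $V$ plays the role of $A$ and a small MLP $\psi$ plays the role of $u$. The layer widths, the schedule $h(t)=1-e^{-t}$, and the module names are illustrative assumptions, not the exact architecture used in our experiments.

# Minimal sketch of the score-network family motivated by (E.1):
# s(x, y, t) = (1/h(t)) * (V psi(V^T x, y, t) - x).
# Hidden width and h(t) = 1 - exp(-t) are illustrative assumptions.
import torch
import torch.nn as nn

class ConditionalScoreNet(nn.Module):
    def __init__(self, D: int, d: int, hidden: int = 128):
        super().__init__()
        self.V = nn.Parameter(torch.randn(D, d) / D ** 0.5)   # estimates A
        self.psi = nn.Sequential(                              # estimates u
            nn.Linear(d + 2, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, d),
        )

    def forward(self, x, y, t):
        # x: (B, D), y: (B, 1), t: (B, 1) with t >= t0 > 0
        h = 1.0 - torch.exp(-t)                      # noise variance h(t)
        z = x @ self.V                               # V^T x, shape (B, d)
        u_hat = self.psi(torch.cat([z, y, t], dim=-1))
        return (u_hat @ self.V.T - x) / h            # (1/h(t)) (V psi - x)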

In the Gaussian design case (Assumption 5.3), we instantiate $p_{z}(z)$ as the Gaussian density $(2\pi)^{-d/2}|\Sigma|^{-1/2}\exp\left(-\frac{1}{2}z^{\top}\Sigma^{-1}z\right)$. Some algebra on the Gaussian integral gives rise to

\displaystyle\nabla_{x}\log p_{t}(x,y)=\frac{\alpha(t)}{h(t)}AB_{t}\mu_{t}(x,y)-\frac{1}{h(t)}(I_{D}-AA^{\top})x-\frac{1}{h(t)}AA^{\top}x
\displaystyle=\frac{\alpha(t)}{h(t)}AB_{t}\left(\alpha(t)A^{\top}x+\frac{h(t)}{\nu^{2}}y\theta\right)-\frac{1}{h(t)}x, (E.2)

where we have denoted

\mu_{t}(x,y)=\alpha(t)A^{\top}x+\frac{h(t)}{\nu^{2}}y\theta\quad\text{and}\quad B_{t}=\left(\alpha^{2}(t)I_{d}+\frac{h(t)}{\nu^{2}}\theta\theta^{\top}+h(t)\Sigma^{-1}\right)^{-1}.
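The closed-form score (E.2) can be evaluated directly. Below is a small NumPy sketch, assuming the schedule $\alpha(t)=e^{-t/2}$, $h(t)=1-e^{-t}$; the function and variable names are ours for illustration only.

# Sketch evaluating the closed-form conditional score (E.2) in the
# Gaussian design. Schedule alpha(t)=exp(-t/2), h(t)=1-exp(-t) assumed.
import numpy as np

def gaussian_conditional_score(x, y, t, A, theta, Sigma_inv, nu2):
    """Compute grad_x log p_t(x, y) via (E.2)."""
    d = A.shape[1]
    alpha, h = np.exp(-t / 2.0), 1.0 - np.exp(-t)
    B_t = np.linalg.inv(alpha**2 * np.eye(d)
                        + (h / nu2) * np.outer(theta, theta)
                        + h * Sigma_inv)
    mu_t = alpha * (A.T @ x) + (h / nu2) * y * theta
    return (alpha / h) * (A @ (B_t @ mu_t)) - x / h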

E.1 Score Estimation Error

Recall that we estimate the conditional score function via minimizing the denoising score matching loss in Proposition 4.1. To ease the presentation, we denote

\displaystyle\ell(x,y;s)=\frac{1}{T-t_{0}}\int_{t_{0}}^{T}\mathbb{E}_{x^{\prime}|x}\|\nabla_{x^{\prime}}\log\phi_{t}(x^{\prime}|x)-s(x^{\prime},y,t)\|_{2}^{2}\mathrm{d}t

as the loss function for a pair of clean data $(x,y)$ and a conditional score function $s$. Further, we denote the population loss as

\displaystyle\mathcal{L}(s)=\mathbb{E}_{x,y}[\ell(x,y;s)],

whose empirical counterpart is denoted by $\widehat{\mathcal{L}}(s)=\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\ell(x_{i},y_{i};s)$.
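For intuition, the following is a minimal PyTorch sketch of this empirical denoising score matching objective, where the time integral and the expectation over $x'\mid x$ are replaced by a single Monte Carlo draw per sample. The schedule $\alpha(t)=e^{-t/2}$, $h(t)=1-e^{-t}$ and the function names are illustrative assumptions.

# Monte Carlo sketch of the denoising score matching loss hat-L(s):
# draw t ~ Unif[t0, T] and x'|x ~ N(alpha(t) x, h(t) I_D), then regress
# s(x', y, t) onto grad_{x'} log phi_t(x'|x) = -(x' - alpha(t) x)/h(t).
import torch

def dsm_loss(score_net, x, y, t0=1e-3, T=1.0):
    # x: (B, D), y: (B, 1); score_net(x_noisy, y, t) -> (B, D)
    B = x.shape[0]
    t = t0 + (T - t0) * torch.rand(B, 1)
    alpha, h = torch.exp(-t / 2), 1.0 - torch.exp(-t)
    eps = torch.randn_like(x)
    x_noisy = alpha * x + h.sqrt() * eps
    target = -(x_noisy - alpha * x) / h          # conditional score of the kernel
    return ((score_net(x_noisy, y, t) - target) ** 2).sum(dim=-1).mean()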

To bound the score estimation error, we begin with an oracle inequality. Let $\mathcal{L}^{\rm trunc}(s)$ denote the truncated loss function defined as

\displaystyle\mathcal{L}^{\rm trunc}(s)=\mathbb{E}[\ell(x,y;s)\mathds{1}\{\|x\|_{2}\leq R,|y|\leq R\}],

where $R>0$ is a truncation radius chosen as $\mathcal{O}(\sqrt{d\log d+\log K+\log\frac{n_{1}}{\delta}})$. Here $K$ is a uniform upper bound on $\|s(x,y,t)\mathds{1}\{\|x\|_{2}\leq R,|y|\leq R\}\|_{2}$ over $s\in{\mathcal{S}}$, i.e., $\sup_{s\in{\mathcal{S}}}\|s(x,y,t)\mathds{1}\{\|x\|_{2}\leq R,|y|\leq R\}\|_{2}\leq K$. We then have

\displaystyle\mathcal{L}(\widehat{s})=\mathcal{L}(\widehat{s})-\widehat{\mathcal{L}}(\widehat{s})+\widehat{\mathcal{L}}(\widehat{s})
\displaystyle=\mathcal{L}(\widehat{s})-\widehat{\mathcal{L}}(\widehat{s})+\inf_{s\in{\mathcal{S}}}\widehat{\mathcal{L}}(s)
\displaystyle\overset{(i)}{=}\mathcal{L}(\widehat{s})-\widehat{\mathcal{L}}(\widehat{s})
\displaystyle\leq\mathcal{L}(\widehat{s})-\mathcal{L}^{\rm trunc}(\widehat{s})+\mathcal{L}^{\rm trunc}(\widehat{s})-\widehat{\mathcal{L}}^{\rm trunc}(\widehat{s})
\displaystyle\leq\underbrace{\sup_{s}\left(\mathcal{L}^{\rm trunc}(s)-\widehat{\mathcal{L}}^{\rm trunc}(s)\right)}_{(A)}+\underbrace{\sup_{s}\left(\mathcal{L}(s)-\mathcal{L}^{\rm trunc}(s)\right)}_{(B)},

where equality $(i)$ holds since ${\mathcal{S}}$ contains the ground truth score function. We bound term $(A)$ by a PAC-learning concentration argument. Using the same argument as in Chen et al. [2023a, Theorem 2, term $(A)$], we have

\displaystyle\sup_{s\in{\mathcal{S}}}\ell^{\rm trunc}(x,y;s)=\mathcal{O}\left(\frac{1}{t_{0}(T-t_{0})}(K^{2}+R^{2})\right).

Applying the standard metric entropy and symmetrization technique, we can show

\displaystyle(A)=\mathcal{O}\left(\widehat{\mathfrak{R}}({\mathcal{S}})+\left(\frac{K^{2}+R^{2}}{t_{0}(T-t_{0})}\right)\sqrt{\frac{\log\frac{2}{\delta}}{2n_{1}}}\right),

where $\widehat{\mathfrak{R}}$ is the empirical Rademacher complexity of ${\mathcal{S}}$. Readers unfamiliar with these tools may refer to Theorem 3.3 of Foundations of Machine Learning (second edition, Mohri et al.) for details. The remaining step is to bound the Rademacher complexity by Dudley's entropy integral. Indeed, we have

\displaystyle\widehat{\mathfrak{R}}({\mathcal{S}})\leq\inf_{\epsilon}\left\{\frac{4\epsilon}{\sqrt{n_{1}}}+\frac{12}{n_{1}}\int_{\epsilon}^{K^{2}\sqrt{n_{1}}}\sqrt{\mathcal{N}({\mathcal{S}},\tau,\|\cdot\|_{2})}\mathrm{d}\tau\right\}.

We emphasize that the log covering number is taken with $x,y$ restricted to the truncated region. Taking $\epsilon=\frac{1}{n_{1}}$ gives rise to

\displaystyle(A)=\mathcal{O}\left(\left(\frac{K^{2}+R^{2}}{t_{0}(T-t_{0})}\right)\sqrt{\frac{\mathcal{N}({\mathcal{S}},1/n_{1})\log\frac{1}{\delta}}{n_{1}}}\right).

Here $K$ is instance-dependent and depends mainly on $d$. In the Gaussian design case, we can verify that $K$ is $\mathcal{O}(\sqrt{d})$. We thus deduce $(A)=\widetilde{\mathcal{O}}\left(\frac{1}{t_{0}}\sqrt{d^{2}\frac{\mathcal{N}({\mathcal{S}},1/n_{1})\log\frac{1}{\delta}}{n_{1}}}\right)$. In practice, $d$ is often much smaller than $D$ (see, for example, Pope et al. [2021], where ImageNet has intrinsic dimension no more than $43$, in contrast to an image resolution of $224\times 224\times 3$). In this way, $d^{2}$ can be upper bounded by $D$, although $d^{2}$ is often the tighter of the two.

For term $(B)$, we invoke the same upper bound as in Chen et al. [2023a, Theorem 2, term $(B)$] to obtain

\displaystyle(B)=\mathcal{O}\left(\frac{1}{n_{1}t_{0}(T-t_{0})}\right),

which is negligible compared to $(A)$. Therefore, summing up $(A)$ and $(B)$, we deduce

\displaystyle\epsilon_{diff}^{2}=\mathcal{O}\left(\frac{1}{t_{0}}\sqrt{\frac{\mathcal{N}({\mathcal{S}},1/n_{1})(d^{2}\vee D)\log\frac{1}{\delta}}{n_{1}}}\right).

E.2 Gaussian Design

We only need to find the covering number in the Gaussian design case. Using (E.2), we can construct a covering of ${\mathcal{S}}$ from coverings of the matrices $V$ and $\Sigma^{-1}$. Suppose $V_{1},V_{2}$ are two matrices with $\|V_{1}-V_{2}\|_{2}\leq\eta_{V}$ for some $\eta_{V}>0$. Meanwhile, let $\Sigma^{-1}_{1},\Sigma^{-1}_{2}$ be two inverse covariance matrices with $\|\Sigma^{-1}_{1}-\Sigma^{-1}_{2}\|_{2}\leq\eta_{\Sigma}$. Then we bound

\displaystyle\quad\sup_{\|x\|_{2}\leq R,|y|\leq R}\|s_{V_{1},\Sigma^{-1}_{1}}(x,y,t)-s_{V_{2},\Sigma^{-1}_{2}}(x,y,t)\|_{2}
\displaystyle\leq\frac{1}{h(t)}\sup_{\|x\|_{2}\leq R,|y|\leq R}\Big{[}\big{\lVert}V_{1}\psi_{\Sigma^{-1}_{1}}(V_{1}^{\top}x,y,t)-V_{1}\psi_{\Sigma^{-1}_{1}}(V_{2}^{\top}x,y,t)\big{\rVert}_{2}
\displaystyle\quad+\underbrace{\big{\lVert}V_{1}\psi_{\Sigma^{-1}_{1}}(V_{2}^{\top}x,y,t)-V_{1}\psi_{\Sigma^{-1}_{2}}(V_{2}^{\top}x,y,t)\big{\rVert}_{2}}_{(\spadesuit)}+\big{\lVert}V_{1}\psi_{\Sigma^{-1}_{2}}(V_{2}^{\top}x,y,t)-V_{2}\psi_{\Sigma^{-1}_{2}}(V_{2}^{\top}x,y,t)\big{\rVert}_{2}\Big{]}
\displaystyle\leq\frac{1}{h(t)}\left(2R\eta_{V}+2\nu^{-2}R\eta_{\Sigma}\right),

where to bound $(\spadesuit)$, we invoke the matrix inequality $\|(I+A)^{-1}-(I+B)^{-1}\|_{2}\leq\|B-A\|_{2}$, valid for positive semidefinite $A,B$. Further taking the supremum over $t\in[t_{0},T]$ leads to

\displaystyle\sup_{\|x\|_{2}\leq R,|y|\leq R}\|s_{V_{1},\Sigma^{-1}_{1}}(x,y,t)-s_{V_{2},\Sigma^{-1}_{2}}(x,y,t)\|_{2}\leq\frac{1}{t_{0}}\left(2R\eta_{V}+2\nu^{-2}R\eta_{\Sigma}\right)

for any $t\in[t_{0},T]$. Therefore, the inequality above shows that coverings of $V$ and $\Sigma^{-1}$ induce a covering of ${\mathcal{S}}$. The log covering numbers of $V$ and $\Sigma^{-1}$ can be obtained directly by a volume ratio argument; we have

\displaystyle\mathcal{N}(V,\eta_{V},\|\cdot\|_{2})\leq Dd\log\left(1+\frac{2\sqrt{d}}{\eta_{V}}\right)\quad\text{and}\quad\mathcal{N}(\Sigma^{-1},\eta_{\Sigma},\|\cdot\|_{2})\leq d^{2}\log\left(1+\frac{2\sqrt{d}}{\lambda_{\min}\eta_{\Sigma}}\right).

Thus, the log covering number of ${\mathcal{S}}$ is

\displaystyle\mathcal{N}({\mathcal{S}},\eta,\|\cdot\|_{2})\leq\mathcal{N}(V,t_{0}\eta/2R,\|\cdot\|_{2})+\mathcal{N}(\Sigma^{-1},t_{0}\nu^{2}\eta/2R,\|\cdot\|_{2})
\displaystyle\leq(Dd+d^{2})\log\left(1+\frac{dD}{t_{0}\lambda_{\min}\eta}\right),

where we have plugged $\nu^{2}=1/D$ into the last inequality. Setting $\eta=1/n_{1}$ and substituting into $\epsilon_{diff}^{2}$ yields the desired result.

We remark that the analysis here does not attempt to optimize the error bounds, but aims to provide a provable guarantee for conditional score estimation using finite samples. We expect that a sharper analysis via Bernstein-type concentration may yield a better dependence on $n_{1}$. Nonetheless, the optimal dependence should not beat a $1/n_{1}$ rate.

Appendix F Additional Experimental Results

F.1 Simulation

We generate the latent sample $z$ from the standard normal distribution $z\sim{\sf N}(0,I_{d})$ and set $x=Az$ for a randomly generated orthonormal matrix $A\in\mathbb{R}^{D\times d}$. The dimensions are set to $d=16$ and $D=64$. The reward function is $f(x)=(\theta^{*})^{\top}x_{\parallel}-5\|x_{\perp}\|^{2}_{2}$, where $\theta^{*}=A\beta^{*}$ and $\beta^{*}$ is sampled uniformly from the unit sphere. A minimal data-generation sketch is given below.
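# Sketch of the simulation data-generating process described above
# (d = 16, D = 64, x = A z with orthonormal A, f(x) = theta*^T x_par - 5 ||x_perp||^2).
# The sample size and random seed are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, D, n = 16, 64, 8192

A, _ = np.linalg.qr(rng.standard_normal((D, d)))   # orthonormal columns
beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)                       # beta* uniform on the unit sphere
theta = A @ beta                                   # theta* = A beta*

z = rng.standard_normal((n, d))                    # latent z ~ N(0, I_d)
x = z @ A.T                                        # x = A z, shape (n, D)

x_par = x @ A @ A.T                                # projection onto span(A)
x_perp = x - x_par
reward = x_par @ theta - 5.0 * np.sum(x_perp**2, axis=1)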

When estimating $\widehat{\theta}$, we set $\lambda=1.0$ (see the sketch below). The score-matching network is based on the UNet implementation from https://github.com/lucidrains/denoising-diffusion-pytorch, where we modified the class embedding so that it accepts continuous input. The predictor is trained using $8192$ samples and the score function is trained using $65536$ samples. When training the score function, we use the Adam optimizer with learning rate $8\times 10^{-5}$. We train the score function for $10$ epochs, where each epoch performs a full pass over the training dataset with batch size $32$.
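Assuming $\widehat{\theta}$ is a ridge (regularized least-squares) estimate with regularization parameter $\lambda$, consistent with the $\lambda=1.0$ above, a minimal sketch is as follows; the exact estimator is defined in the main text, so treat this as illustrative.

# Ridge-regression sketch for the reward predictor theta-hat with lambda = 1.0,
# given designs X (n x D) and noisy rewards y (n,). Illustrative implementation.
import numpy as np

def ridge_theta_hat(X, y, lam=1.0):
    """Solve (X^T X + lam I)^{-1} X^T y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)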

For evaluation, the statistics are computed using $2048$ samples generated from the diffusion model. The curves in the figures are averaged over $5$ runs.

F.2 Directed Text-to-Image Generation

Samples with high and low rewards from the ground-truth reward model. In Section 7.2, the ground-truth reward model is built by replacing the final prediction layer of an ImageNet pre-trained ResNet-18 with a randomly initialized linear layer with scalar output. To inspect what this randomly generated reward model measures, we generate images using Stable Diffusion, keep those with rewards $\geq 0.4$ (positive samples) and rewards $\leq-0.4$ (negative samples), and pick two typical images of each kind; see Figure 9, followed by a sketch of the reward-model construction. We note that in real-world use cases, ground-truth rewards are often measured and annotated by human labelers according to the task requirements.

Figure 9: Random samples with high rewards and low rewards. Panels (a) and (b) are positive samples; panels (c) and (d) are negative samples.
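The following PyTorch sketch illustrates this reward-model construction; the torchvision weights enum shown is the API of recent versions (older versions use pretrained=True), and freezing the parameters reflects that the model serves as a fixed ground-truth reward.

# Sketch of the ground-truth reward model: ImageNet pre-trained ResNet-18
# with its classification head replaced by a random scalar linear layer.
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

def build_reward_model():
    model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, 1)   # randomly initialized scalar head
    for p in model.parameters():
        p.requires_grad_(False)                     # fixed "ground-truth" reward
    return model.eval()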

Training Details. In our implementation, since the Stable Diffusion model operates on the latent space of its VAE, we build a 3-layer ConvNet with residual connections and batch normalization on top of the VAE latent space. We train the network using the Adam optimizer with learning rate $0.001$ for $100$ epochs. A sketch of this latent reward network is given below.
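A minimal PyTorch sketch of such a latent-space reward network follows; the channel widths and the assumption of a 4-channel VAE latent are illustrative choices, not the exact configuration used in our experiments.

# Sketch of a 3-layer ConvNet with residual connections and batch
# normalization on the Stable Diffusion VAE latent (assumed 4 channels).
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.bn = nn.BatchNorm2d(ch)

    def forward(self, x):
        return torch.relu(x + self.bn(self.conv(x)))    # residual connection

class LatentRewardNet(nn.Module):
    def __init__(self, in_ch: int = 4, width: int = 64):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(in_ch, width, 3, padding=1),
                                  nn.BatchNorm2d(width), nn.ReLU())
        self.blocks = nn.Sequential(ResBlock(width), ResBlock(width))
        self.head = nn.Linear(width, 1)                 # scalar reward

    def forward(self, z):                               # z: (B, 4, H, W)
        h = self.blocks(self.stem(z))
        return self.head(h.mean(dim=(2, 3)))            # global average pool

Training this network with Adam at learning rate 0.001 for 100 epochs then matches the recipe above.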