
Diffusion Model for Data-Driven Black-Box Optimization

⋄ Equal contribution, listed in alphabetical order. ‡ Co-correspondence. This paper is a journal version of Yuan et al. (2023).

Zihao Li1,⋄  Hui Yuan1,⋄  Kaixuan Huang1  Chengzhuo Ni1
Yinyu Ye2  Minshuo Chen1,‡  Mengdi Wang1,‡
1Princeton University  2Stanford University
Abstract

Generative AI has redefined artificial intelligence, enabling the creation of innovative content and customized solutions that drive business practices into a new era of efficiency and creativity. In this paper, we focus on diffusion models, a powerful generative AI technology, and investigate their potential for black-box optimization over complex structured variables. Consider the practical scenario where one wants to optimize some structured design in a high-dimensional space, based on massive unlabeled data (representing design variables) and a small labeled dataset. We study two practical types of labels: 1) noisy measurements of a real-valued reward function and 2) human preference based on pairwise comparisons. The goal is to generate new designs that are near-optimal and preserve the designed latent structures. Our proposed method reformulates the design optimization problem into a conditional sampling problem, which allows us to leverage the power of diffusion models for modeling complex distributions. In particular, we propose a reward-directed conditional diffusion model, to be trained on the mixed data, for sampling a near-optimal solution conditioned on high predicted rewards. Theoretically, we establish sub-optimality error bounds for the generated designs. The sub-optimality gap nearly matches the optimal guarantee in off-policy bandits, demonstrating the efficiency of reward-directed diffusion models for black-box optimization. Moreover, when the data admits a low-dimensional latent subspace structure, our model efficiently generates high-fidelity designs that closely respect the latent structure. We provide empirical experiments validating our model in decision-making and content-creation tasks.

1 Introduction

Generative AI has revolutionized the field of artificial intelligence by introducing capabilities that allow for the creation of new, original content — from realistic images and music to sophisticated text and code — paving the way for unprecedented innovations across industries such as entertainment, marketing, and business management (Kshetri et al., 2023; Ooi et al., 2023). In particular, diffusion models stand out as a powerful and universal generative AI technology that serves as the backbone for Stable Diffusion, DALLE 3, Sora, etc. (Ho et al., 2020; Song et al., 2021; Rombach et al., 2022; Liu et al., 2024; Esser et al., 2024). Diffusion models, known as a family of score-matching generative models, have demonstrated state-of-the-art performance in various domains, such as image generation (Rombach et al., 2022; Ramesh et al., 2022; Balaji et al., 2022) and audio generation, with fascinating potential in broader domains, including text modeling (Austin et al., 2021; Li et al., 2022), reinforcement learning (Janner et al., 2022; Ajay et al., 2023; Pearce et al., 2023; Liang et al., 2023) and protein structure modeling (Lee et al., 2023b). Diffusion models are trained to sample from the training data's distribution and generate a new sample through a sequential denoising process. The denoising network (a.k.a. score network) s(x,t) approximates the score function \nabla\log p_{t}(x) (Song et al., 2020a, b), and controls the behavior of diffusion models. Any control information c can be incorporated as an additional input to the score network s(x,c,t) during training and inference (Ramesh et al., 2022; Zhang and Agrawala, 2023).

In this paper, we focus on capitalizing on diffusion models for data-driven black-box optimization, where the goal is to generate new solutions that optimize an unknown objective function. Black-box optimization problems, also known as model-based optimization in machine learning, cover various application domains including computational biology (Watson et al., 2023; Guo et al., 2023), reinforcement learning (Ajay et al., 2022; Levine et al., 2020; Jin et al., 2021), and management science, e.g., pricing (Bu et al., 2023; Homburg et al., 2019; Wang et al., 2021) and recommendation systems (Pan and Wu, 2020; Pan et al., 2019). In these applications, the data is high-dimensional and the unknown objective function is nonconvex, posing challenges to searching for good solutions. To further complicate matters, plausible solutions, e.g., molecule structures in drug discovery (Barrera et al., 2016), are often high-dimensional with latent geometric structures – they lie on a low-dimensional manifold (Kumar and Levine, 2020). This additionally poses a critical requirement for black-box optimization: we need to capture the data's geometric structure to avoid suggesting unrealistic solutions that deviate severely from the original data domain.

Data-driven black-box optimization differs from conventional optimization problems, as it exposes knowledge of the objective function only via a pre-collected dataset (Kumar and Levine, 2020; Brookes et al., 2019; Fu and Levine, 2021), while additional interactions with the objective function are prohibited. This excludes the possibility of sequentially searching for good solutions, e.g., by gradient ascent, aided by feedback from the objective function at each iteration. Instead, we reformulate data-driven black-box optimization as sampling from a conditional distribution. This novel perspective is illustrated in Figure 1, where the objective function value serves as the conditioning variable. We learn such a conditional distribution from the pre-collected dataset using diffusion models, and near-optimal solutions are then generated by conditioning on large objective values. Meanwhile, the generated solutions are expected to respect the data's latent structure, since the data distribution encodes the corresponding data geometry.

Figure 1: Generative AI for black-box optimization via conditional sampling. We convert the problem of black-box optimization into the problem of sampling from a conditional distribution learned from a pre-collected dataset.

The subtlety of this method, however, lies in that generating new solutions potentially conflicts with the training process: diffusion models are trained to generate solutions similar to the training distribution, whereas optimizing the objective function drives the model to deviate from the training distribution. In other words, the model needs to both “interpolate” and “extrapolate”. A higher value of the conditioning provides a stronger signal that guides the diffusion model towards higher objective function values, while the increasing distribution shift (i.e., the mismatch between the training data and the targeted generated data; see Figure 2(a)(b)) may hurt the quality of the generated data.

In this paper, we term various objective functions as reward functions and we introduce reward-directed conditional diffusion models for data-driven black-box optimization, with both the real-valued rewards and human preference. We show that reward-directed diffusion models enjoy strong theoretical guarantees and appealing empirical implementations. The theoretical crux to the success of our method centers around the following questions:

How to provably estimate the reward-conditioned distribution in black-box optimization via diffusion?
How to properly choose high reward guidance in diffusion models for data generation, so as to ensure reward improvement among generated data?

Our Approach.

To answer the questions above, we consider a semi-supervised learning setting, where we are given a small annotated dataset \mathcal{D}_{\rm label} and a massive unlabeled dataset \mathcal{D}_{\rm unlabel}. We study two forms of rewards in \mathcal{D}_{\rm label}:

  • (Real-valued reward) The dataset consists of data and reward pairs, i.e., the reward is a real-valued noise-perturbed version of the underlying ground-truth reward;

  • (Human preference) The dataset consists of triples, each containing two comparable data points and a binary preference label. The preference label indicates which of the two data points is likely to have a higher underlying reward.

Our semi-supervised formulation spans a wide range of applications. For example, in protein design problems, there is a large collection of unlabeled structures, while only a small fraction underwent expensive wet-lab tests. We remark that our method can also be applied to learning scenarios, where only labeled data is available.

Our proposed method begins with estimating the reward function using \mathcal{D}_{\rm label} and leveraging the estimator for pseudo-labeling on \mathcal{D}_{\rm unlabel}. Then we train a reward-conditioned diffusion model using the pseudo-labeled data. In real-world applications, there are other ways to incorporate the knowledge from the massive dataset \mathcal{D}_{\rm unlabel}, e.g., fine-tuning a pre-trained model (Ouyang et al., 2022; Zhang and Agrawala, 2023). We focus on the pseudo-labeling approach, as it provides a cleaner formulation and exposes the error dependency on data size and distribution shift. The underlying intuition and message are applicable to other semi-supervised approaches; see the experiments in Section 7.2.

To model the data's intrinsic low-dimensional structure from a theoretical standpoint, we consider data x to have a latent linear representation. Specifically, we assume x=Az for some unknown matrix A with orthonormal columns and a latent variable z. The latent variable often has a smaller dimension, reflecting the fact that practical datasets are often low-dimensional (Gong et al., 2019; Tenenbaum et al., 2000; Pope et al., 2021). The representation matrix A should be learned to promote sample efficiency and generation quality (Chen et al., 2023a). Our theoretical analysis reveals an intricate interplay between reward guidance, distribution shift, and implicit representation learning.

Contributions.

Our results are summarized as follows.

  • (1)

    We show that the reward-conditioned diffusion model implicitly learns the latent subspace representation of x. Consequently, the model provably generates high-fidelity data that stays close to the subspace (Theorem 5.4).

  • (2)

    Given a target reward value, we analyze the sub-optimality of reward-directed generation, measured by the difference between the target value and the average reward of the generated data. In the case of a linear reward model, we show that the regret mimics the off-policy regret of linear bandits with full knowledge of the subspace feature. In other words, the reward-conditioned generation can be viewed as a form of off-policy bandit learning in the latent feature space. We further instantiate the real-valued reward and human preference setups, and prove that our algorithm achieves a converging sub-optimality in both cases (Theorems 6.3 and 6.9).

  • (3)

    We further extend our theory to nonparametric settings where reward prediction and the training of diffusion models are parameterized by general function classes, which covers the widely adopted ReLU neural networks used in real-world implementations (Section 6.2).

  • (4)

    We provide numerical experiments on synthetic data, text-to-image generation, and offline reinforcement learning to support our theory (Section 7).

To the best of our knowledge, our results present the first statistical theory for conditional diffusion models and provable reward improvement guarantees for reward-directed generation.

Figure 2: Illustration of distribution shifts in samples and reward, as well as encoder-decoder score networks. When performing reward-directed conditional generation, (a) the distribution of the generated data shifts, but still stays close to the feasible data support; (b) the distribution of the rewards for the generated data shifts and the average reward improves; (c) the score network for reward-directed conditional diffusion adopts an encoder-decoder structure.

2 Related Work

Our work lies at the intersection of three streams of research: guided diffusion, the theory of diffusion models, and black-box optimization.

2.1 Guided Diffusion Models

Diffusion models successfully incorporate diverse guidance in practical applications. For image generation, guiding the backward diffusion process towards higher log probabilities predicted by a classifier (which can be viewed as the reward signal) leads to improved sample quality, where the classifier can either be externally trained, i.e., classifier guidance (Dhariwal and Nichol, 2021) or implicitly specified by a conditioned diffusion model, i.e., classifier-free guidance (Ho and Salimans, 2022). Classifier-free guidance has become a standard technique in the state-of-the-art text-to-image diffusion models (Rombach et al., 2022; Ramesh et al., 2022; Balaji et al., 2022). Other types of guidance are also explored in Nichol et al. (2021); Graikos et al. (2022); Bansal et al. (2023).

Similar ideas have been explored in sequence modeling problems. In offline reinforcement learning, Decision Diffuser (Ajay et al., 2023) is a diffusion model trained on offline trajectories that can be conditioned to generate new trajectories with high returns, satisfying certain safety constraints, or composing skills. For discrete generation, Diffusion LM (Li et al., 2022) manages to train diffusion models on discrete text space with an additional embedding layer and a rounding step. The authors further show that gradients of any classifier can be incorporated to control and guide the text generation. This appealing empirical performance raises strong curiosity about the theoretical underpinnings.

2.2 Theory of Diffusion Models

The theoretical foundations of diffusion models have been developing rapidly over the past two years. A line of work studies diffusion models from a sampling perspective (De Bortoli et al., 2021; Albergo et al., 2023; Block et al., 2020; Lee et al., 2022a; Chen et al., 2022b; Lee et al., 2022b; Chen et al., 2023c, b; Benton et al., 2023; Montanari and Wu, 2023). When assuming access to a score estimator that accurately approximates the ground-truth score function in L^{\infty} or L^{2} norm, Chen et al. (2022b); Lee et al. (2023a) provide polynomial convergence guarantees for score-based diffusion models. De Bortoli (2022) further studies diffusion models under the manifold hypothesis.

Recently, Chen et al. (2023a); Oko et al. (2023); Mei and Wu (2023); Wibisono et al. (2024) provide an end-to-end analysis of diffusion models, leveraging both sampling and statistical tools. In particular, they develop score estimation and distribution estimation guarantees for diffusion models. Statistical sample complexities are established and explain the success of diffusion models by their minimax optimality and adaptivity to intrinsic data structures. These results largely motivate our theory, whereas we are the first to consider conditional score matching and a statistical analysis of conditional diffusion models.

2.3 Black-Box Optimization

In this paper, we concentrate on black-box optimization using a pre-collected dataset, while additional first/second-order information cannot be accessed beyond the given dataset. This is sometimes termed offline black-box or zeroth-order optimization (Fu and Levine, 2021; Shahriari et al., 2015; Luenberger et al., 2021), and is different from the widely studied setting where active interactions with the objective function are allowed (Peters and Schaal, 2007; Snoek et al., 2014; Hebbal et al., 2019; Snoek et al., 2012). To solve black-box optimization without active function interactions, Kumar and Levine (2020) propose a model inversion network that learns a stochastic inverse mapping from a reward value to the data instances. Then they generate new instances corresponding to the best possible reward. Such a method demonstrates empirical success, leaving theoretical understanding unexplored. Notably, Krishnamoorthy et al. (2023) also empirically study offline black-box optimization with diffusion models and classifier-free guidance.

Our proposed method and analysis are related to offline bandit/RL theory (Munos and Szepesvári, 2008; Liu et al., 2018; Chen and Jiang, 2019; Fan et al., 2020; Jin et al., 2021; Nguyen-Tang et al., 2021; Brandfonbrener et al., 2021). In particular, our theory extensively deals with distribution shift in the pre-collected dataset via class-restricted divergence measures, which are commonly adopted in offline RL (Duan et al., 2020; Xie et al., 2021; Ji et al., 2022). Moreover, our established sub-optimality bound of the generated data contains an error term that coincides with off-policy linear bandits. However, our analysis goes far beyond the scope of bandit/RL.

3 Problem Setup

Recall that we are presented with a pre-collected dataset for solving data-driven black-box optimization. The massive unlabeled dataset \mathcal{D}_{\rm unlabel} contains only feature vectors \{x_{i}\}_{i=1}^{n_{1}}, where n_{1} denotes the sample size of \mathcal{D}_{\rm unlabel}. For the labeled dataset \mathcal{D}_{\rm label}, we study real-valued rewards and human preferences. In both cases, there exists a ground-truth reward function f^{*} evaluating the quality of any data point x.

\bullet Real-valued reward. In this setting, we assume access to a label y_{i} for each x_{i}. In such a case,

\mathcal{D}_{\rm label}=\{(x_{i},y_{i})\}_{i=1}^{n_{2}},

where n_{2} is the sample size, x_{i} is randomly sampled, and y_{i} is a noisy observation of the reward:

y_{i}=f^{*}(x_{i})+\xi_{i}\quad\text{for}\quad\xi_{i}\sim{\sf N}(0,\sigma^{2}). (3.1)

Here, \sigma^{2}>0 is the variance of the observation noise. Such a setting corresponds to the scenario where we have a noisy labeler, who can “inform” us how good the generated data point is.

\bullet Human preference. In many real-life applications, a direct evaluation of the reward can be impractical and highly biased, e.g., \mathbb{E}[\xi_{i}]\neq 0 in (3.1). Existing methods such as Schlett et al. (2022) and Žiga Babnik et al. (2023) resort to biological or visual expressions to assess data quality. Nonetheless, these methods are limited when it comes to aligning the generation process with human-desired properties such as aesthetic quality and stylistic requirements (Wang et al., 2023). Thus, in many of these applications, extensive reward engineering or case-by-case training is required, which tends to be costly and time-consuming. Fortunately, recent revolutionary success in language models suggests that human preference is a sensible resource for handling hard-to-perceive abstract rewards (Ouyang et al., 2022; Lee et al., 2021; Wang et al., 2022; Wu et al., 2021; Christiano et al., 2017). In such scenarios, a human expert expresses his/her preference between a pair of randomly sampled data points. The human preference reflects the ground-truth reward value, yet circumvents direct queries on absolute reward values. Formally, our labeled dataset becomes

\mathcal{D}_{\rm label}=\{(x_{i}^{(1)},x_{i}^{(2)},u_{i})\}_{i=1}^{n_{2}},

where x^{(1)}_{i} and x^{(2)}_{i} are i.i.d. randomly sampled and u_{i}\in\{x_{i}^{(1)},x_{i}^{(2)}\} is the preferred instance.

We model the preference as a probabilistic outcome depending on the ground-truth reward of the data x. In particular, we consider the widely adopted Bradley-Terry model (Zhu et al., 2023; Ouyang et al., 2022), where we have

{\mathbb{P}}(u\mid x^{(1)},x^{(2)})=\frac{\exp(f^{*}(u))}{\exp(f^{*}(x^{(1)}))+\exp(f^{*}(x^{(2)}))}. (3.2)

Based on the preference information, we can estimate the underlying reward function and further utilize the estimator for inference. Since there is no need to perform real-valued reward engineering, using preference data reduces the reward tuning efforts and leads to outstanding empirical performance (Ouyang et al., 2022; Zhu et al., 2023; Li et al., 2023; Stiennon et al., 2020; Glaese et al., 2022; Zhang et al., 2023).
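To make the preference model concrete, the following minimal sketch simulates pairwise comparison labels under the Bradley-Terry model (3.2) for a hypothetical linear reward f^{*}(x)=\langle\theta^{*},x\rangle; all variable names are illustrative rather than part of our method.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n2 = 16, 1000
theta_star = rng.normal(size=D)
theta_star /= np.linalg.norm(theta_star)   # hypothetical ground-truth linear reward f*(x) = <theta*, x>

x1 = rng.normal(size=(n2, D))              # first item of each comparison pair
x2 = rng.normal(size=(n2, D))              # second item of each comparison pair
f1, f2 = x1 @ theta_star, x2 @ theta_star

# Bradley-Terry: P(u = x1 | x1, x2) = exp(f*(x1)) / (exp(f*(x1)) + exp(f*(x2)))
p_first = 1.0 / (1.0 + np.exp(-(f1 - f2)))
u_is_first = rng.random(n2) < p_first      # u_i = x1_i with probability p_first, else x2_i
```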

We assume without loss of generality that \mathcal{D}_{\rm label} and \mathcal{D}_{\rm unlabel} are independent. In both datasets, x is i.i.d. sampled from an unknown population distribution P_{x}. In our subsequent analysis, we focus on the case where P_{x} is supported on a latent subspace, meaning that the raw data x admits a low-dimensional representation, as in the following assumption.

Assumption 3.1 .

The data sampling distribution P_{x} is supported on a low-dimensional linear subspace, i.e., x=Az for an unknown A\in\mathbb{R}^{D\times d} with orthonormal columns, where z\in\mathbb{R}^{d} is a latent variable.

Note that our setup covers the full-dimensional setting as a special case when d=D. Yet the case of d<D is much more interesting, as practical datasets are rich in intrinsic geometric structures (Gong et al., 2019; Pope et al., 2021; Tenenbaum et al., 2000; Kumar and Levine, 2020). Furthermore, the representation matrix A may encode critical constraints on the generated data. For example, in protein design, the generated samples need to be similar to natural proteins and abide by the rules of biology; otherwise they easily fail to stay stable, leading to substantial reward decay (Lee et al., 2023b). In those applications, off-support data may be risky and suffer from a large degradation of reward. Based on this motivation, we further assume that the ground-truth reward has the following decomposition.

Assumption 3.2 .

We assume that f^{*}(x)=g^{*}(x_{\parallel})+h^{*}(\|x_{\perp}\|_{2}). Here h^{*} is always nonpositive and x=x_{\parallel}+x_{\perp}, where x_{\parallel} is the projection of x onto the subspace spanned by the columns of A, and x_{\perp} is the orthogonal complement.

Assumption 3.2 states that the reward can be decomposed into two terms: 1) a function that depends only on the projection of x onto the subspace represented by A, and 2) a nonpositive penalty term that measures how far the data deviates from the subspace spanned by A.
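As a small illustration of Assumptions 3.1 and 3.2, the sketch below projects a point onto the column span of A and evaluates a reward of the assumed form g^{*}(x_{\parallel})+h^{*}(\|x_{\perp}\|_{2}); the linear g^{*} and the quadratic penalty h^{*} are illustrative choices, not prescribed by the assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
D, d = 16, 4
A, _ = np.linalg.qr(rng.normal(size=(D, d)))   # orthonormal columns spanning the latent subspace
beta_star = rng.normal(size=d)
beta_star /= np.linalg.norm(beta_star)         # so that theta* = A beta* has unit norm

def reward(x, c=1.0):
    x_par = A @ (A.T @ x)                      # x_parallel: projection of x onto span(A)
    x_perp = x - x_par                         # x_perp: orthogonal residual
    g = beta_star @ (A.T @ x)                  # g*(x_par) = <beta*, A^T x>, an illustrative linear choice
    h = -c * np.linalg.norm(x_perp) ** 2       # h*: nonpositive and decreasing in ||x_perp||_2 (illustrative)
    return g + h
```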

4 Reward-Directed Generation via Conditional Diffusion Models

In this section, we develop a conditional diffusion model-based method to generate high-fidelity samples with desired properties. In real-world applications such as image/text generation and protein design, one often has access to abundant unlabeled data but only a relatively small amount of labeled data. This motivates us to consider a semi-supervised learning setting.

Notation: P_{xy} denotes the ground-truth joint distribution of x and its label y; P_{x} is the marginal distribution of x. Any pair of data in \mathcal{D}_{\rm label} follows P_{xy} and any data point in \mathcal{D}_{\rm unlabel} follows P_{x}. P is used to denote a distribution and p denotes its corresponding density. P(x\mid y=a) and P(x,y=a) are the conditional and joint distributions derived from P_{xy}. Similarly, we use the notation P_{x\widehat{y}}, P(x\mid\widehat{y}=a) for the joint and conditional distributions of (x,\widehat{y}), where \widehat{y} is predicted by the learned reward model. Also, we denote a distribution generated by diffusion by \widehat{P} (density \widehat{p}), followed by the same argument in parentheses as the true distribution it approximates; e.g., \widehat{P}(x\mid y=a) is generated as an approximation of P(x\mid y=a).

4.1 Meta Algorithm

Algorithm 1 Reward-Conditioned Generation via Diffusion Model (RCGDM)
1:  Input: Datasets \mathcal{D}_{\rm unlabel}, \mathcal{D}_{\rm label}, target reward value a, early-stopping time t_{0}, noise level \nu. (Note: in the following pseudo-code, \phi_{t}(x) is the Gaussian density and \eta is the step size of the discretized backward SDE; see Section 4.2 for elaborations on conditional diffusion.)
2:  Reward Learning: Estimate the reward function by
\widehat{f}\in\mathop{\mathrm{argmin}}_{f\in\mathcal{F}}{L(f,\mathcal{D}_{\rm label})}, (4.1)
where L is a loss that depends on the dataset setup (see Section 3) and \mathcal{F} is a function class.
3:  Pseudo labeling: Use the learned function \widehat{f} to evaluate the unlabeled data \mathcal{D}_{\rm unlabel} and augment it with pseudo labels: \widetilde{\mathcal{D}}=\{(x_{j},\widehat{y}_{j}=\widehat{f}(x_{j})+\xi_{j})\}_{j=1}^{n_{1}} for \xi_{j}\overset{\text{i.i.d.}}{\sim}{\sf N}(0,\nu^{2}).
4:  Conditional score matching: Minimize over s\in{\mathcal{S}} ({\mathcal{S}} constructed as in (4.8)) on the dataset \widetilde{\mathcal{D}} via
\widehat{s}\in\mathop{\mathrm{argmin}}_{s\in{\mathcal{S}}}\int_{t_{0}}^{T}\widehat{\mathbb{E}}_{(x,\widehat{y})\in\widetilde{\mathcal{D}}}\mathbb{E}_{x^{\prime}\sim{\sf N}(\alpha(t)x,h(t)I_{D})}\left[\|\nabla_{x^{\prime}}\log\phi_{t}(x^{\prime}|x)-s(x^{\prime},\widehat{y},t)\|_{2}^{2}\right]\mathrm{d}t. (4.2)
5:  Conditioned generation: Use the estimated score \widehat{s}(\cdot,a,\cdot) to sample from the backward SDE:
\mathrm{d}\widetilde{X}_{t}^{y,\Leftarrow}=\left[\frac{1}{2}\widetilde{X}_{t}^{y,\Leftarrow}+\widehat{s}(\widetilde{X}_{k\eta}^{y,\Leftarrow},a,T-k\eta)\right]\mathrm{d}t+\mathrm{d}\overline{W}_{t}\quad\text{for}\quad t\in[k\eta,(k+1)\eta]. (4.3)
6:  Return: Generated population \widehat{P}(\cdot|\widehat{y}=a), learned subspace representation V contained in \widehat{s}.

We present our main algorithm in Algorithm 1; see also the graphical illustration in Figure 3. In order to generate novel samples with both high fidelity and high rewards, we propose Reward-Conditioned Generation via Diffusion Models (RCGDM). Specifically, we learn the reward function \widehat{f} from the dataset \mathcal{D}_{\rm label} via regression, where L(f,\mathcal{D}_{\rm label}) is the corresponding loss function. In the real-valued reward setting, we specify

L(f,\mathcal{D}_{\rm label})=\sum_{i=1}^{n_{2}}(y_{i}-f(x_{i}))^{2}+\lambda\rho(f),

where \rho(f) is a regularization term. When f is a linear function of x, this amounts to ridge regression. In the human preference setting, we specify

L(f,\mathcal{D}_{\rm label})=-\sum_{i=1}^{n_{2}}\big{\{}f(u_{i})-\log\big{(}\exp(f(x_{i}^{(1)}))+\exp\big{(}f(x_{i}^{(2)})\big{)}\big{)}\big{\}},

which is essentially a maximum likelihood objective (written as a negative log-likelihood to be minimized). We then use \widehat{f} to augment the unlabeled data \mathcal{D}_{\rm unlabel} with “pseudo labeling” and additive noise, i.e., \widetilde{\mathcal{D}}=\{(x_{j},\widehat{y}_{j}=\widehat{f}(x_{j})+\xi_{j})\}_{j=1}^{n_{1}} with \xi_{j}\sim{\sf N}(0,\nu^{2}) of a small variance \nu^{2}. Here, we add the noise \xi_{j} merely for technical reasons in the proof. We denote the joint distribution of (x,\widehat{y}) as P_{x\widehat{y}}. Next, we train a conditional diffusion model using the augmented dataset \widetilde{\mathcal{D}}. If we specify a target value of the reward, for example letting \widehat{y}=a, we can generate conditioned samples from the distribution \widehat{P}(x|\widehat{y}=a) by backward diffusion.
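For concreteness, here is a minimal sketch of Steps 2-3 of Algorithm 1 in the real-valued reward, linear setting: ridge regression for reward learning, followed by pseudo-labeling of the unlabeled data. The function and variable names are hypothetical; the resulting pseudo-labeled pairs would then feed into the conditional score matching step.

```python
import numpy as np

def ridge_reward(X_label, y_label, lam=1.0):
    """Step 2 (real-valued reward, linear f): ridge regression estimate of theta."""
    D = X_label.shape[1]
    return np.linalg.solve(X_label.T @ X_label + lam * np.eye(D), X_label.T @ y_label)

def pseudo_label(X_unlabel, theta_hat, nu=0.1):
    """Step 3: augment unlabeled data with noisy pseudo labels y_hat = f_hat(x) + xi, xi ~ N(0, nu^2)."""
    xi = np.random.default_rng(2).normal(scale=nu, size=X_unlabel.shape[0])
    return X_unlabel @ theta_hat + xi
```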

Figure 3: Overview of solving black-box optimization using reward-directed conditional diffusion models. We estimate the reward function from the labeled dataset. Then we compute the estimated reward for each instance in the unlabeled dataset. Next, we train a reward-conditioned diffusion model using the pseudo-labeled data, and generate high reward new instances under proper target reward values.

4.2 Training of Conditional Diffusion Model

We provide details about the training and sampling of conditional diffusion in Algorithm 1 (Line 4: conditional score matching, and Line 5: conditional generation). In Algorithm 1, the conditional diffusion model is learned from \widetilde{\mathcal{D}}=\{(x_{j},\widehat{y}_{j}=\widehat{f}(x_{j})+\xi_{j})\}_{j=1}^{n_{1}}, where (x,\widehat{y})\sim P_{x\widehat{y}}. For simplicity, until the end of this section we use y instead of \widehat{y} to denote the conditioning variable. The diffusion model aims to approximate the conditional distribution P(x\mid\widehat{y}).

4.3 Conditional Score Matching

The workflow of conditional diffusion models is nearly identical to that of unconditional diffusion models reviewed in Chen et al. (2023a). A major difference is that we learn a conditional score \nabla\log p_{t}(x|y) instead of the unconditional one. Here p_{t} denotes the marginal density function at time t of the following forward O-U process,

\mathrm{d}X_{t}^{y}=-\frac{1}{2}g(t)X_{t}^{y}\mathrm{d}t+\sqrt{g(t)}\mathrm{d}W_{t}\quad\text{with}\quad X_{0}^{y}\sim P_{0}(x|y)~{}\text{and}~{}t\in(0,T], (4.4)

where similarly T is a terminal time, (W_{t})_{t\geq 0} is a Wiener process, and the initial distribution P_{0}(x|y) is induced by the (x,\widehat{y})-pair distribution P_{x\widehat{y}}. Note that the noise is only added on x but not on y. Throughout the paper, we consider g(t)=1 for simplicity. We denote by P_{t}(x_{t}|y) the distribution of X_{t}^{y}, let p_{t}(x_{t}|y) be its density, and let P_{t}(x_{t},y) be the corresponding joint distribution, shorthanded as P_{t}. A key step is to estimate the unknown \nabla\log p_{t}(x_{t}|y) through denoising score matching (Song et al., 2020b). A conceptual way is to minimize the following quadratic loss over a concept class {\mathcal{S}}:

\mathop{\mathrm{argmin}}_{s\in{\mathcal{S}}}\int_{0}^{T}\mathbb{E}_{(x_{t},y)\sim P_{t}}\left[\|\nabla\log p_{t}(x_{t}|y)-s(x_{t},y,t)\|_{2}^{2}\right]\mathrm{d}t. (4.5)

Unfortunately, the loss in (4.5) is intractable since \nabla\log p_{t}(x_{t}|y) is unknown. Inspired by Hyvärinen and Dayan (2005) and Vincent (2011), we choose the new objective (4.2) and show their equivalence in the following proposition.

Proposition 4.1 (Score Matching Objective for Implementation).

For any t>0 and score estimator s, there exists a constant C_{t} independent of s such that

\mathbb{E}_{(x_{t},y)\sim P_{t}}\left[\|\nabla\log p_{t}(x_{t}|y)-s(x_{t},y,t)\|_{2}^{2}\right]
=\mathbb{E}_{(x,y)\sim P_{x\widehat{y}}}\mathbb{E}_{x^{\prime}\sim{\sf N}(\alpha(t)x,h(t)I_{D})}\left[\|\nabla_{x^{\prime}}\log\phi_{t}(x^{\prime}|x)-s(x^{\prime},y,t)\|_{2}^{2}\right]+C_{t}, (4.6)

where \nabla_{x^{\prime}}\log\phi_{t}(x^{\prime}|x)=-\frac{x^{\prime}-\alpha(t)x}{h(t)} and \phi_{t}(x^{\prime}|x) is the density of {\sf N}(\alpha(t)x,h(t)I_{D}) with \alpha(t)=\exp(-t/2) and h(t)=1-\exp(-t).

The proof of Proposition 4.1 can be found in Appendix A.1. Equation (4.6) allows an efficient implementation, since P_{x\widehat{y}} can be approximated by the empirical data distribution in \widetilde{\mathcal{D}} and x^{\prime} is easy to sample. Integrating (4.6) over time t leads to the practical conditional score matching objective

\mathop{\mathrm{argmin}}_{s\in{\mathcal{S}}}\int_{t_{0}}^{T}\widehat{\mathbb{E}}_{(x,y)\sim P_{x\widehat{y}}}\mathbb{E}_{x^{\prime}\sim{\sf N}(\alpha(t)x,h(t)I_{D})}\left[\|\nabla_{x^{\prime}}\log\phi_{t}(x^{\prime}|x)-s(x^{\prime},y,t)\|_{2}^{2}\right]\mathrm{d}t, (4.7)

where t_{0}>0 is an early-stopping time to stabilize the training (Song and Ermon, 2020; Vahdat et al., 2021) and \widehat{\mathbb{E}} denotes the empirical expectation.
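As an illustration of how (4.7) is evaluated in practice, the following PyTorch-style sketch computes one stochastic estimate of the conditional score matching loss, using \alpha(t)=\exp(-t/2) and h(t)=1-\exp(-t) as above. Here score_net is a hypothetical score network taking (x', y, t), with y and t supplied as column vectors; this is a sketch under those assumptions, not the paper's implementation.

```python
import torch

def conditional_score_matching_loss(score_net, x, y, t0=1e-3, T=1.0):
    """One Monte Carlo estimate of the conditional score matching objective (4.7)."""
    t = t0 + (T - t0) * torch.rand(x.shape[0], 1)   # diffusion times sampled from [t0, T]
    alpha, h = torch.exp(-t / 2), 1.0 - torch.exp(-t)
    eps = torch.randn_like(x)
    x_noisy = alpha * x + h.sqrt() * eps            # x' ~ N(alpha(t) x, h(t) I_D)
    target = -eps / h.sqrt()                        # equals grad_{x'} log phi_t(x' | x)
    pred = score_net(x_noisy, y, t)
    return ((pred - target) ** 2).sum(dim=1).mean()
```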

Constructing a function class {\mathcal{S}} adaptive to the data structure is beneficial for learning the conditional score. In the same spirit as Chen et al. (2023a), we propose the following score network architecture (see Figure 2(c) for an illustration):

{\mathcal{S}}=\bigg{\{}\mathbf{s}_{V,\psi}(x,y,t)=\frac{1}{h(t)}(V\cdot\psi(V^{\top}x,y,t)-x):~{}V\in\mathbb{R}^{D\times d},~{}\psi\in\Psi:\mathbb{R}^{d+1}\times[t_{0},T]\to\mathbb{R}^{d}~{}\bigg{\}}, (4.8)

with V being any D\times d matrix with orthonormal columns and \Psi a customizable function class. This design has a linear encoder-decoder structure, catering to the latent subspace structure in the data. The term -\frac{1}{h(t)}x is included as a shortcut connection.
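A possible instantiation of the class (4.8) is sketched below in PyTorch, with \psi an MLP and V a trainable D\times d matrix (initialized with orthonormal columns; the sketch does not enforce orthonormality during training). Shapes and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EncoderDecoderScore(nn.Module):
    """Sketch of the score class (4.8): s_{V,psi}(x, y, t) = (V psi(V^T x, y, t) - x) / h(t)."""
    def __init__(self, D, d, hidden=128):
        super().__init__()
        self.V = nn.Parameter(torch.linalg.qr(torch.randn(D, d))[0])  # D x d, orthonormal at init only
        self.psi = nn.Sequential(
            nn.Linear(d + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, d),
        )

    def forward(self, x, y, t):                                # x: (B, D); y, t: (B, 1)
        h = 1.0 - torch.exp(-t)                                # h(t) = 1 - e^{-t}
        z = self.psi(torch.cat([x @ self.V, y, t], dim=1))     # psi(V^T x, y, t)
        return (z @ self.V.T - x) / h                          # linear decoder plus the -x/h(t) shortcut
```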

4.4 Conditioned Generation

Sampling from the model is realized by running a discretized backward process with step size \eta>0, described as follows:

\mathrm{d}\widetilde{X}_{t}^{y,\Leftarrow}=\left[\frac{1}{2}\widetilde{X}_{t}^{y,\Leftarrow}+\widehat{s}(\widetilde{X}_{k\eta}^{y,\Leftarrow},y,T-k\eta)\right]\mathrm{d}t+\mathrm{d}\overline{W}_{t}\quad\text{for}\quad t\in[k\eta,(k+1)\eta], (Eqn. (4.3) revisited)

initialized with \widetilde{X}_{0}^{y,\Leftarrow}\sim{\sf N}(0,I_{D}), where \overline{W}_{t} is a reversed Wiener process. Note that in (4.3), the unknown conditional score \nabla\log p_{t}(x|y) is substituted by \widehat{s}(x,y,t).
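For illustration, a simple Euler-Maruyama discretization of the backward SDE (4.3) with the conditioning fixed at y=a might look as follows; score_net is the hypothetical conditional score network from the sketches above, and all step sizes are illustrative.

```python
import torch

@torch.no_grad()
def conditional_sample(score_net, a, D, n_samples=64, T=1.0, t0=1e-3, eta=1e-2):
    """Discretized backward SDE (4.3), conditioned on y = a, early-stopped at t0."""
    x = torch.randn(n_samples, D)                             # initialize from N(0, I_D)
    y = torch.full((n_samples, 1), float(a))                  # target reward value as the condition
    t = T
    while t > t0:
        s = score_net(x, y, torch.full((n_samples, 1), t))    # estimated conditional score at noise level t
        x = x + eta * (0.5 * x + s) + eta ** 0.5 * torch.randn_like(x)
        t -= eta
    return x
```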

Remark 4.2 (Alternative methods).

In Algorithm 1 we train the conditional diffusion model via conditional score matching (Line 4). This approach is suitable when we have access to the unlabeled dataset and need to train a brand-new diffusion model from scratch. Practically, we can utilize a pre-trained diffusion model of the unlabeled data directly and incorporate the reward information to the pre-trained model. Popular methods in this direction include classifier-based guidance (Dhariwal and Nichol, 2021), fine-tuning (Zhang and Agrawala, 2023), and self-distillation (Song et al., 2023); all sharing a similar spirit with Algorithm 1.

5 Statistical Analysis of Reward-Directed Conditional Diffusion

In this section, we analyze the reward-conditioned diffusion model given by Algorithm 1 for data-driven optimization. In particular, we are interested in two properties of the generated samples: 1) the reward levels of these new samples and 2) their level of fidelity, i.e., how much the new samples deviate from the latent subspace. To measure these properties, we adopt two sets of metrics. Firstly, the sub-optimality gap quantifies the gap between the average reward of the generated samples and the target reward value. Secondly, the subspace angle and an expected l_{2} deviation quantify the sample fidelity. In the sequel, we analyze these performance metrics, relating them to the reward learning and diffusion model training steps in Algorithm 1.

5.1 Sub-Optimality Decomposition

We introduce the performance metric for a generated distribution with respect to a certain target reward value. Let y^{*} be a target reward value and P be a distribution learned by Algorithm 1 (we omit the hat notation on P for simplicity). We define the sub-optimality of P as

\texttt{SubOpt}(P;y^{*})=y^{*}-\mathbb{E}_{x\sim P}[f^{*}(x)],

which measures the gap between the expected reward of x\sim P and the target value y^{*}. In the language of bandit learning, this gap can be viewed as a form of off-policy sub-optimality.
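Given generated samples and access to the ground-truth reward (e.g., in simulation), this quantity can be estimated by a simple Monte Carlo average; the sketch below uses hypothetical placeholders f_star and X_gen.

```python
import numpy as np

def suboptimality(X_gen, y_star, f_star):
    """Monte Carlo estimate of SubOpt(P; y*) = y* - E_{x ~ P}[f*(x)] from generated samples."""
    return y_star - np.mean([f_star(x) for x in X_gen])
```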

We introduce the following proposition, which decomposes the sub-optimality \texttt{SubOpt}(P;y^{*}). Specifically, we show that \texttt{SubOpt}(P;y^{*}=a) consists of three components: an off-policy bandit regret that comes from the estimation error of the reward function, and on-support and off-support errors that come from estimating conditional distributions with diffusion.

Proposition 5.1.

Set the target reward value y^{*}=a. For a generated distribution P of Algorithm 1, it holds that

\texttt{SubOpt}(P;y^{*}=a)\leq\underbrace{\mathbb{E}_{x\sim P_{a}}\left[\left|\widehat{f}(x)-g^{*}(x)\right|\right]}_{\mathcal{E}_{1}:\textrm{off-policy bandit regret}}+\underbrace{\left|\mathbb{E}_{x\sim P_{a}}[g^{*}(x_{\parallel})]-\mathbb{E}_{x\sim P}[g^{*}(x_{\parallel})]\right|}_{\mathcal{E}_{2}:\textrm{on-support diffusion error}}+\underbrace{\mathbb{E}_{x\sim P}[h^{*}(\|x_{\perp}\|_{2})]}_{\mathcal{E}_{3}:\textrm{off-support diffusion error}}.

The proof is provided in Appendix B.3.1. Error \mathcal{E}_{1} corresponds to the reward function estimation error, while errors \mathcal{E}_{2} and \mathcal{E}_{3} are induced by the training of the diffusion model.

5.2 Sample Fidelity

In this section, we analyze the fidelity of the conditional generation process specified by Algorithm 1. Specifically, we assess its ability to capture the underlying representation, and how much the data generated by the conditional diffusion model deviates from the ground-truth representation. Recall that Algorithm 1 has two outputs: the generated distribution \widehat{P}(\cdot|\widehat{y}=a) and the learned representation matrix V. We use the notations \widehat{P}_{a}:=\widehat{P}(\cdot|\widehat{y}=a) (generated distribution) and P_{a}:=P(\cdot|\widehat{y}=a) (target distribution) for better clarity in the rest of the presentation. We define two metrics

\angle({V},{A})=\|VV^{\top}-AA^{\top}\|_{\rm F}^{2}\text{\quad and \quad}\mathbb{E}_{x\sim\widehat{P}_{a}}[\|x_{\perp}\|_{2}]. (5.1)

Here, \angle({V},{A}) is defined for matrices V and A, where A is the matrix encoding the ground-truth subspace. Clearly, \angle({V},{A}) measures the difference between the column spans of V and A, which is also known as the subspace angle. The term \mathbb{E}_{x\sim\widehat{P}_{a}}[\|x_{\perp}\|_{2}] is the expected l_{2} distance between x and the subspace.
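Both fidelity metrics in (5.1) are straightforward to compute from the learned V and generated samples; a minimal sketch with illustrative variable names is given below.

```python
import numpy as np

def subspace_angle(V, A):
    """The subspace angle ||V V^T - A A^T||_F^2 from (5.1)."""
    return np.linalg.norm(V @ V.T - A @ A.T, ord="fro") ** 2

def mean_off_subspace_norm(X_gen, A):
    """Monte Carlo estimate of E[||x_perp||_2] over generated samples (rows of X_gen)."""
    X_perp = X_gen - X_gen @ A @ A.T
    return np.linalg.norm(X_perp, axis=1).mean()
```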

We impose the following commonly adopted light-tail assumption (Wainwright, 2019) on the latent variable z for quantifying sample fidelity.

Assumption 5.2 .

The latent variable z follows a distribution P_{z} with density p_{z}, such that there exist constants B,C_{1},C_{2} verifying p_{z}(z)\leq(2\pi)^{-d/2}C_{1}\exp\left(-C_{2}\|z\|_{2}^{2}/2\right) whenever \|z\|_{2}>B. Moreover, the ground-truth score is realizable, i.e., \nabla\log p_{t}(x\mid\widehat{y})\in{\mathcal{S}}.

We remark that the realizability condition is for technical convenience, which can be removed in our nonparametric extension in Section 6.2. Assumption 5.2 is also mild and covers arbitrary sub-Gaussian latent data distributions, which include the following Gaussian assumption as a special case.

Assumption 5.3 .

The latent variable z\sim{\sf N}(0,\Sigma) with a covariance matrix \Sigma satisfying \lambda_{\min}I_{d}\preceq\Sigma\preceq\lambda_{\max}I_{d} for \lambda_{\min}\leq\lambda_{\max}\leq 1 and \lambda_{\min} being positive.

Now we present sample fidelity guarantees.

Theorem 5.4 (Subspace Fidelity of Generated Data).

Under Assumption 3.1, if Assumption 5.2 holds with c_{0}I_{d}\preceq\mathbb{E}_{z\sim P_{z}}\left[zz^{\top}\right] for some constant c_{0}>0, then with high probability on the data, it holds that

\angle({V},{A})=\widetilde{\mathcal{O}}\left(\frac{1}{c_{0}}\sqrt{\frac{\mathcal{N}({\mathcal{S}},1/n_{1})D}{n_{1}}}\right) (5.2)

with \mathcal{N}({\mathcal{S}},1/n_{1}) being the log covering number of the function class {\mathcal{S}} in Equation (4.8). Moreover, with Assumption 5.3 and D\geq d^{2}, we have \mathcal{N}({\mathcal{S}},1/n_{1})=\mathcal{O}((d^{2}+Dd)\log(Ddn_{1})) and thus \angle({V},{A})=\widetilde{\mathcal{O}}(\frac{1}{\lambda_{\min}}\sqrt{\frac{Dd^{2}+D^{2}d}{n_{1}}}).

The proof is provided in Appendix B.2. In Theorem 5.4, we provide a fidelity guarantee for the learned subspace V. Specifically, when the covariance of the underlying representation z is invertible, we prove that the error in subspace learning converges at a parametric rate.

Remark 5.5.

We make the following remarks for Theorem 5.4.

  1. Guarantee (5.2) applies to general distributions with light tails, as assumed in Assumption 5.2.

  2. Guarantee (6.1) ensures high fidelity of the generated data, in the sense of staying close to the subspace, when we have access to a large unlabeled dataset.

  3. Guarantee (6.1) shows that \mathbb{E}_{x\sim\widehat{P}_{a}}[\|x_{\perp}\|_{2}] scales up as t_{0} increases, which aligns with the dynamics of the backward process: samples concentrate around the learned subspace as t_{0} goes to 0. Taking t_{0}\to 0, \mathbb{E}_{x\sim\widehat{P}_{a}}[\|x_{\perp}\|_{2}] decays at rate O(n_{1}^{-\frac{1}{4}}). However, taking t_{0}\to 0 is not ideal for achieving a high reward of x; we choose the best trade-off for t_{0} later in Theorem 6.3.

6 Optimization Theory

In this section, we provide optimization guarantees by instantiating our performance metrics. We consider two theoretical settings: the parametric and nonparametric settings. In the parametric setting, we focus on a linear reward model with off-support penalties. The reward observation in 𝒟label\mathcal{D}_{\rm label} covers both the real-valued and human preference cases. Through detailed analysis, we establish handy guidance on how to choose a proper target reward value for ensuring quality improvement. We then extend our results to nonparametric settings, where a ReLU neural network is used for reward estimation. We then present the sub-optimality guarantee in the corresponding context.

6.1 Results under Real-Valued Reward

In this subsection, we consider the case when we have access to a noisy reward label for every x_{i} in \mathcal{D}_{\rm label}, i.e., \mathcal{D}_{\rm label}=\{(x_{i},y_{i})\}_{i\in[n_{2}]}, where y_{i} is sampled according to (3.1). Recall that in Assumption 3.2, the ground-truth reward has the decomposition f^{*}(x)=g^{*}(x_{\parallel})+h^{*}(x_{\perp}), where x_{\parallel} is the projection of x onto the subspace A, and x_{\perp} is the part orthogonal to A. We consider two setups for g^{*}(x_{\parallel}): 1) the parametric setting, where g^{*}(x_{\parallel}) is a linear function of x_{\parallel}; 2) the nonparametric setting, where g^{*}(x_{\parallel}) is a general Hölder continuous function, and we use a deep ReLU neural network for reward estimation.

Parametric Setting.

For the parametric setup, we make the following assumption, which states that the reward function consists of two parts: a linear function of x_{\parallel} and a penalty term whose magnitude increases as x deviates from the representation subspace.

Assumption 6.1 .

We assume that g^{*}(x_{\parallel})=(\theta^{*})^{\top}x_{\parallel}, where \theta^{*}=A\beta^{*} for some \beta^{*}\in\mathbb{R}^{d} and \|\theta^{*}\|_{2}=\|\beta^{*}\|_{2}=1. The penalty h^{*}(x_{\perp}) is always nonpositive and decreases with respect to \|x_{\perp}\|_{2} with h^{*}(0)=0.

Assumption 6.1 adopts a simple linear reward model, which will be generalized in Section 6.2. We now present Theorem 6.2, which is a corollary of Theorem 5.4 and characterizes the deviation of the generated samples from the ground-truth representation subspace.

Theorem 6.2.

Assume that the conditions in Theorem 5.4 hold. Further under Assumption 6.1, it holds that

\mathbb{E}_{x\sim\widehat{P}_{a}}[\|x_{\perp}\|_{2}]=\mathcal{O}\left(\sqrt{t_{0}D}+\sqrt{\angle({V},{A})}\cdot\sqrt{\frac{a^{2}}{\|\beta^{*}\|_{\Sigma}}+d}\right), (6.1)

where \beta^{*} is the ground-truth parameter of the linear model and t_{0} is the early-stopping time of the diffusion model.

The proof is provided in Appendix B.2. We observe that a larger target reward a leads to more severe subspace deviation, since more aggressive extrapolation is needed to match the target reward. Therefore, maintaining high sample fidelity prevents us from setting a too large.

On the other hand, we examine the sub-optimality bound. We begin with real-valued rewards, where a labeled dataset \mathcal{D}_{\rm label}=\{(x_{i},y_{i})\}_{i=1}^{n_{2}} is given. In this case, we perform ridge regression to estimate the linear reward function in Step 2 of Algorithm 1, i.e.,

\mathop{\mathrm{argmin}}_{\theta}~{}\sum_{i=1}^{n_{2}}(\theta^{\top}x_{i}-y_{i})^{2}+\lambda\|\theta\|_{2}^{2}

for a positive coefficient \lambda. We now characterize the sub-optimality bound in the following theorem.

Theorem 6.3 (Off-policy Regret of Generated Samples).

Suppose Assumptions 3.1, 5.3 and 6.1 hold. We choose \lambda=1, t_{0}=\left((Dd^{2}+D^{2}d)/n_{1}\right)^{1/6} and \nu=1/\sqrt{D}. With high probability, running Algorithm 1 with a target reward value a gives rise to

\texttt{SubOpt}(\widehat{P}_{a};y^{*}=a)
\leq\underbrace{\sqrt{\operatorname{Tr}(\widehat{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}})}\cdot\mathcal{O}\left(\sqrt{\frac{d\log n_{2}}{n_{2}}}\right)}_{\mathcal{E}_{1}:\textrm{off-policy bandit regret}}+\underbrace{{\tt DistroShift}(a)\cdot\left(d^{2}D+D^{2}d\right)^{1/6}{n_{1}}^{-1/6}\cdot a}_{\mathcal{E}_{2}:\textrm{on-support diffusion error}}
+\underbrace{\mathbb{E}_{\widehat{P}_{a}}[h^{*}(x_{\perp})]}_{\mathcal{E}_{3}:\textrm{off-support diffusion error}}, (6.2)

where \widehat{\Sigma}_{\lambda}:=\frac{1}{n_{2}}(X^{\top}X+\lambda I_{D}), with X the stacked data matrix of \mathcal{D}_{\rm label}, and \Sigma_{P_{a}}=\mathbb{E}_{P_{a}}[xx^{\top}].

The proof is provided in Appendix B.3.

Implications and Discussions:

1). The bounds in Theorems 6.3 and 6.9 decompose the suboptimality gap into two separate sources of error: error from reward learning (\mathcal{E}_{1}) and error from the diffusion model (\mathcal{E}_{2} and \mathcal{E}_{3}).

2). The dependence of \mathcal{E}_{1} on d shows that the diffusion model learns a low-dimensional representation of x, reducing the ambient dimension D to the smaller latent dimension d. This can be seen from \operatorname{Tr}(\widehat{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}})\leq\mathcal{O}\left(\frac{a^{2}}{\|\beta^{*}\|_{\Sigma}}+d\right) when n_{2}=\Omega(\frac{1}{\lambda_{\min}}). Similarly, we can prove that \operatorname{Tr}(\widetilde{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}})\leq\mathcal{O}\left(\frac{a^{2}}{\|\beta^{*}\|_{\Sigma}}+d\right) when n_{2}=\Omega(\frac{1}{\lambda_{\min}}). Such a result shows that even without an observable reward, we can still obtain an off-policy bandit regret that matches the case with an observable reward dataset.

3). If we ignore the diffusion errors, the suboptimality gap resembles the standard regret of off-policy bandit learning in a d-dimensional feature subspace (Jin et al., 2021, Section 3.2; Nguyen-Tang et al., 2021; Brandfonbrener et al., 2021).

4). It is also worth mentioning that \mathcal{E}_{2} and \mathcal{E}_{3} depend on t_{0}, and that by taking t_{0}=\left((Dd^{2}+D^{2}d)/n_{1}\right)^{1/6} one gets a good trade-off in \mathcal{E}_{2}.

5). On-support diffusion error entangles with distribution shift in complicated ways. We show

\mathcal{E}_{2}=\left({\tt DistroShift}(a)\cdot\left(d^{2}D+D^{2}d\right)^{1/6}{n_{1}}^{-1/6}\cdot a\right),

where {\tt DistroShift}(a) quantifies the distribution shift depending on different reward values. In the special case where the latent covariance matrix \Sigma is known, we can quantify the distribution shift as {\tt DistroShift}(a)=\mathcal{O}(a\vee d). We observe an interesting phase shift. When a<d, the training data have sufficient coverage with respect to the generated distribution \widehat{P}_{a}. Therefore, the on-support diffusion error has a lenient linear dependence on a. However, when a>d, the data coverage is poor and \mathcal{E}_{2} becomes quadratic in a, which quickly amplifies.

6). When generated samples deviate away from the latent subspace, the reward may substantially degrade, as determined by the nature of h^{*}.

6.2 Extension to Nonparametric Setting

Our theoretical analysis, in its full generality, extends to using general nonparametric function approximation for both the reward and score functions. Built upon the insights from the linear setting, we provide an analysis of the nonparametric reward and general data distribution setting. We generalize Assumption 6.1 to the following.

Assumption 6.4 .

We assume that g^{*}(x_{\parallel}) is \alpha-Hölder continuous for \alpha\geq 1. Moreover, g^{*} has a bounded Hölder norm, i.e., \|g^{*}\|_{\mathcal{H}^{\alpha}}\leq 1.

Hölder continuity is widely studied in the nonparametric statistics literature (Györfi et al., 2002; Tsybakov, 2008). Under Assumption 6.4, we use nonparametric regression for estimating f^{*}. Specifically, we specialize (4.1) in Algorithm 1 to

\widehat{f}_{\theta}\in\mathop{\mathrm{argmin}}_{f_{\theta}\in\mathcal{F}}\frac{1}{2n_{2}}\sum_{i=1}^{n_{2}}(f_{\theta}(x_{i})-y_{i})^{2},

where \mathcal{F} is chosen to be a class of neural networks. Hyperparameters of \mathcal{F} will be chosen properly in Theorem 6.7.

Our theory also considers generic sampling distributions of x. Since x lies in a low-dimensional subspace, this translates to a sampling distribution assumption on the latent variable z.

Assumption 6.5 .

The latent variable z follows a distribution P_{z} with density p_{z}, such that there exist constants B,C_{1},C_{2} verifying p_{z}(z)\leq(2\pi)^{-d/2}C_{1}\exp\left(-C_{2}\|z\|_{2}^{2}/2\right) whenever \|z\|_{2}>B. Moreover, there exists a positive constant c_{0} such that c_{0}I_{d}\preceq\mathbb{E}_{z\sim P_{z}}\left[zz^{\top}\right].

Assumption 6.5 says that P_{z} has a light tail; it also covers distributions with compact support. Furthermore, we assume that the curated data (x,\widehat{y}) induces Lipschitz conditional scores. Motivated by Chen et al. (2023a), we show that the linear subspace structure in x leads to a conditional score decomposition \nabla\log p_{t}(x|\widehat{y})=s_{\parallel}(x,\widehat{y},t)+s_{\perp}(x,\widehat{y},t), where s_{\parallel} is the on-support score and s_{\perp} is the orthogonal score. The formal statement and proof of this decomposition are deferred to Appendix E. We make the following assumption on s_{\parallel}.

Assumption 6.6 .

The on-support conditional score function s_{\parallel}(x,\widehat{y},t) is Lipschitz with respect to (x,\widehat{y}) for any t\in(0,T], i.e., there exists a constant C_{\rm lip} such that for any x,\widehat{y} and x^{\prime},\widehat{y}^{\prime}, it holds that

\|s_{\parallel}(x,\widehat{y},t)-s_{\parallel}(x^{\prime},\widehat{y}^{\prime},t)\|_{2}\leq C_{\rm lip}\|x-x^{\prime}\|_{2}+C_{\rm lip}|\widehat{y}-\widehat{y}^{\prime}|.

A Lipschitz score is commonly adopted in existing works (Chen et al., 2022b; Lee et al., 2023a). Yet Assumption 6.6 only requires Lipschitz continuity of the on-support score, which matches the weak regularity conditions in Lee et al. (2023a); Chen et al. (2023a). We then choose a score network architecture similar to that in the linear reward setting, except that the inner mapping is replaced by a nonlinear network. Recall that the linear encoder and decoder estimate the representation matrix A.

We consider feedforward networks with ReLU activation functions as the concept classes \mathcal{F} and {\mathcal{S}} for nonparametric regression and conditional score matching. Given an input x, such a network computes

f_{\rm NN}(x)=W_{L}\sigma(\dots\sigma(W_{1}x+b_{1})\dots)+b_{L}, (6.3)

where W_{i} and b_{i} are weight matrices and intercepts, respectively. We then define a class of neural networks as

{\rm NN}(L,M,J,K,\kappa)=\Big{\{}f:f~{}\text{in the form of (6.3) with $L$ layers and width bounded by}~{}M,
\sup_{x}\|f(x)\|_{2}\leq K,\ \max\{\|b_{i}\|_{\infty},\|W_{i}\|_{\infty}\}\leq\kappa~{}\text{for}~{}i=1,\dots,L,~{}\text{and}~{}\sum_{i=1}^{L}\big{(}\|W_{i}\|_{0}+\|b_{i}\|_{0}\big{)}\leq J\Big{\}}.
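For intuition, a member of NN(L, M, J, K, \kappa) can be instantiated as a plain ReLU MLP; the sketch below builds such a network in PyTorch but does not enforce the norm and sparsity constraints (K, \kappa, J), which are devices for the analysis.

```python
import torch.nn as nn

def relu_mlp(in_dim, out_dim, width, depth):
    """A depth-`depth`, width-`width` ReLU network of the form (6.3)."""
    layers, dim = [], in_dim
    for _ in range(depth - 1):
        layers += [nn.Linear(dim, width), nn.ReLU()]
        dim = width
    layers.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*layers)
```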

For the conditional score network, we additionally impose a Lipschitz continuity requirement, i.e., \|f(x)-f(y)\|_{2}\leq c_{\rm lip}\|x-y\|_{2} for some Lipschitz coefficient c_{\rm lip}. Now we are ready to bound \texttt{SubOpt}(\widehat{P}_{a};y^{*}=a) in Theorem 6.7 in terms of the nonparametric regression error, the score matching error, and the distribution shifts in both regression and score matching.

Theorem 6.7.

Suppose Assumptions 3.1, 6.4, 6.5 and 6.6 hold. Let \delta(n)=\frac{d\log\log n}{\log n}. With properly chosen \mathcal{F} and {\mathcal{S}}, with high probability, running Algorithm 1 with a target reward value a and early-stopping time t_{0}=\left(n_{1}^{-\frac{2-2\delta(n_{1})}{d+6}}+Dn_{1}^{-\frac{d+4}{d+6}}\right)^{\frac{1}{3}} gives rise to \angle({V},{A})\leq\widetilde{\mathcal{O}}\left(\frac{1}{c_{0}}\left(n_{1}^{-\frac{2-2\delta(n_{1})}{d+6}}+Dn_{1}^{-\frac{d+4}{d+6}}\right)\right) and

\texttt{SubOpt}(\widehat{P}_{a};y^{*}=a)
\leq\underbrace{\sqrt{{\mathcal{T}}(P(x|\widehat{y}=a),P_{x};\bar{\mathcal{F}})}\cdot\widetilde{\mathcal{O}}\left(n_{2}^{-\frac{\alpha-\delta(n_{2})}{2\alpha+d}}+D/n_{2}\right)}_{\mathcal{E}_{1}}
+\underbrace{\left(\sqrt{\frac{{\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}})}{c_{0}}}\cdot\|g^{*}\|_{\infty}+\sqrt{M(a)}\right)\cdot\widetilde{\mathcal{O}}\left(\left(n_{1}^{-\frac{2-2\delta(n_{1})}{d+6}}+Dn_{1}^{-\frac{d+4}{d+6}}\right)^{\frac{1}{3}}\right)}_{\mathcal{E}_{2}}
+\underbrace{\mathbb{E}_{x\sim\widehat{P}_{a}}[h^{*}(x_{\perp})]}_{\mathcal{E}_{3}},

where M(a):=\mathbb{E}_{z\sim\mathbb{P}(a)}[\|z\|^{2}_{2}] and

\bar{\mathcal{F}}:=\{|f^{*}(x)-f(x)|^{2}:f\in\mathcal{F}\},\quad\bar{{\mathcal{S}}}=\left\{\frac{1}{T-t_{0}}\int_{t_{0}}^{T}\mathbb{E}_{x_{t}\mid x}\|\nabla\log p_{t}(x_{t}\mid y)-s(x_{t},y,t)\|_{2}^{2}\mathrm{d}t:s\in{\mathcal{S}}\right\},

and \mathcal{E}_{3} penalizes the component of \widehat{P}_{a} that is off the subspace. The function classes \mathcal{F} and {\mathcal{S}} are chosen as \mathcal{F}={\rm NN}(L_{f},M_{f},J_{f},K_{f},\kappa_{f}) with

L_{f}=\mathcal{O}(\log n_{2}),\ M_{f}=\mathcal{O}\left(n_{2}^{-\frac{d}{d+2\alpha}}(\log n_{2})^{d/2}\right),\ J_{f}=\mathcal{O}\left(n_{2}^{-\frac{d}{d+2\alpha}}(\log n_{2})^{d/2+1}\right),
K_{f}=1,\ \kappa_{f}=\mathcal{O}\left(\sqrt{\log n_{2}}\right),

and 𝒮=NN(Ls,Ms,Js,Ks,κs){\mathcal{S}}={\rm NN}(L_{s},M_{s},J_{s},K_{s},\kappa_{s}) with

Ls=𝒪(logn1+d),Ms=𝒪(dd/2n1d+2d+6(logn1)d/2),Js=𝒪(dd/2n1d+2d+6(logn1)d/2+1)\displaystyle L_{s}=\mathcal{O}(\log n_{1}+d),\ M_{s}=\mathcal{O}\left(d^{d/2}n_{1}^{-\frac{d+2}{d+6}}(\log n_{1})^{d/2}\right),\ J_{s}=\mathcal{O}\left(d^{d/2}n_{1}^{-\frac{d+2}{d+6}}(\log n_{1})^{d/2+1}\right)
Ks=𝒪(dlog(dn1)),κs=𝒪(dlog(n1d)).\displaystyle\hskip 108.405ptK_{s}=\mathcal{O}\left(d\log(dn_{1})\right),\ \kappa_{s}=\mathcal{O}\left(\sqrt{d\log(n_{1}d)}\right).

Moreover, 𝒮{\mathcal{S}} is also Lipschitz with respect to (x,y)(x,y) and the Lipschitz coefficient is clip=𝒪(10dClip)c_{\rm lip}=\mathcal{O}\left(10dC_{\rm lip}\right).

The proof is provided in Appendix D.2. Here the \delta(n) term accounts for the unbounded domain of x and is negligible when n is large. As a consequence, we prove that under a nonparametric setting, Algorithm 1 enjoys an \widetilde{\mathcal{O}}\big{(}\texttt{DistroShift}(a)\cdot\big{(}n_{2}^{-\frac{\alpha}{2\alpha+d}}+n_{1}^{-\frac{2}{3(d+6)}}\big{)}\big{)} suboptimality bound when the neural networks are properly chosen.

6.3 Results under Human Preference

In this section, we study the setup where real-valued reward yy is not observable, and the labeled dataset consists of pairwise comparisons, i.e., 𝒟label={xi(1),xi(2),ui}i=1n2\mathcal{D}_{\rm label}=\{x_{i}^{(1)},x_{i}^{(2)},u_{i}\}_{i=1}^{n_{2}}, where uiu_{i} is the human preference between {xi(1),xi(2)}\{x_{i}^{(1)},x_{i}^{(2)}\} for every i[n2]i\in[n_{2}]. Recall that we model human preference by the Bradley-Terry model, and the choice probability is explicitly given in (3.2):

(ux(1),x(2))=exp(f(u))exp(f(x(1)))+exp(f(x(2))).\displaystyle{\mathbb{P}}(u\mid x^{(1)},x^{(2)})=\frac{\exp(f^{*}(u))}{\exp(f^{*}(x^{(1)}))+\exp(f^{*}(x^{(2)}))}.

Under such a model, changing the ground truth ff^{*} by a constant does not change the human choice probability. Thus the value of ff^{*} is in general unidentifiable. For identifiability, we make the following assumption:

Assumption 6.8 .

Suppose the ground truth reward ff^{*} is linear as in Assumption 6.1. To identify θ\theta^{*} with human preference, we assume that i=1Dθi=0\sum_{i=1}^{D}\theta_{i}^{*}=0.

In the rest of this section, we denote the space of parameter θ\theta as Θ={θDθ21,i=1Dθi=0}\Theta=\{\theta\in\mathbb{R}^{D}\mid\|\theta\|_{2}\leq 1,\sum_{i=1}^{D}\theta_{i}=0\}. We remark that Assumption 6.8 is imposed only for technical reasons and does not compromise the generality of our analysis. Such identification assumptions are widely made in previous works on choice models (Zhu et al., 2023; Aguirregabiria and Mira, 2010; Bajari et al., 2015; Chernozhukov et al., 2022; Fan et al., 2023).

We learn the underlying ground truth reward function by maximum likelihood estimation under linear function approximation. Specifically, by the human choice model (3.2), we apply the following minimization in Step 2 of Algorithm 1:

argminθΘ1n2i=1n2{θuilog(exp(θxi(1))+exp(θxi(2)))}.\displaystyle\mathop{\mathrm{argmin}}_{\theta\in\Theta}~{}-\frac{1}{n_{2}}\sum_{i=1}^{n_{2}}\bigg{\{}\theta^{\top}u_{i}-\log\bigg{(}\exp(\theta^{\top}x_{i}^{(1)})+\exp(\theta^{\top}x_{i}^{(2)})\bigg{)}\bigg{\}}.
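For illustration, the minimization above can be carried out as follows (a minimal SciPy sketch of our own; the function names and the use of the SLSQP solver are our choices rather than part of Algorithm 1). The constraints encode \Theta=\{\theta:\|\theta\|_{2}\leq 1,\sum_{i}\theta_{i}=0\}.

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(theta, x1, x2, u):
    """Negative Bradley-Terry log-likelihood; row i of x1/x2 is the pair, u[i] is the preferred item."""
    s1, s2 = x1 @ theta, x2 @ theta                      # linear rewards theta^T x
    return -np.mean(u @ theta - np.logaddexp(s1, s2))    # log(exp(s1) + exp(s2)) via logaddexp

def fit_preference_mle(x1, x2, u):
    D = x1.shape[1]
    cons = [{"type": "eq",   "fun": lambda th: th.sum()},        # sum_i theta_i = 0
            {"type": "ineq", "fun": lambda th: 1.0 - th @ th}]   # ||theta||_2 <= 1
    res = minimize(neg_log_likelihood, np.zeros(D), args=(x1, x2, u),
                   method="SLSQP", constraints=cons)
    return res.x                                          # theta_hat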

Then we establish the following sub-optimality guarantee with human preference.

Theorem 6.9.

Suppose Assumptions 3.1, 5.3, 6.1 and 6.8 hold. We set t0=((Dd2+D2d)/n1)1/6t_{0}=\big{(}(Dd^{2}+D^{2}d)/n_{1}\big{)}^{1/6} and ν=1/D\nu=1/\sqrt{D}. Then for an arbitrary λ>0\lambda>0, with high probability, running Algorithm 1 with a target reward value aa, we have

SubOpt(P^a;y=a)\displaystyle\texttt{SubOpt}(\widehat{P}_{a};y^{*}=a) Tr(Σ~λ1ΣPa)𝒪(d+logn2+λn2)1:off-policy bandit regret\displaystyle\leq\underbrace{\sqrt{\operatorname{Tr}(\widetilde{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}})}\cdot\mathcal{O}\left(\sqrt{\frac{d+\log n_{2}+\lambda}{n_{2}}}\right)}_{\mathcal{E}_{1}:\textrm{off-policy bandit regret}}
+𝙳𝚒𝚜𝚝𝚛𝚘𝚂𝚑𝚒𝚏𝚝(a)(d2D+D2d)1/6n11/6a2:on-support diffusion error+𝔼P^a[h(x)]3:off-support diffusion error,\displaystyle\quad+\underbrace{{\tt DistroShift}(a)\cdot\left(d^{2}D+D^{2}d\right)^{1/6}{n_{1}}^{-1/6}\cdot a}_{\mathcal{E}_{2}:\textrm{on-support diffusion error}}+\underbrace{\mathbb{E}_{\widehat{P}_{a}}[h^{*}(\|x_{\perp}\|)]}_{\mathcal{E}_{3}:\textrm{off-support diffusion error}}, (6.4)

where ΣPa:=𝔼Pa[xx]\Sigma_{P_{a}}:=\mathbb{E}_{P_{a}}[xx^{\top}] and Σ~λ:=1n2(i=1n2(xi(1)xi(2))(xi(1)xi(2))+λID)\widetilde{\Sigma}_{\lambda}:=\frac{1}{n_{2}}\big{(}\sum_{i=1}^{n_{2}}(x_{i}^{(1)}-x_{i}^{(2)})(x_{i}^{(1)}-x_{i}^{(2)})^{\top}+\lambda I_{D}\big{)}.

The proof is provided in Appendix B.3.

Implications and Discussions:

1). Similar to the real-valued label setup, the dependence of \mathcal{E}_{1} on d shows that the diffusion model learns a low-dimensional representation of x, reducing the ambient dimension D to the smaller latent dimension d. We can prove that \operatorname{Tr}(\widetilde{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}})\leq\mathcal{O}\left(\frac{a^{2}}{\|{\beta}^{*}\|_{\Sigma}}+d\right) when n_{2}=\Omega(\frac{1}{\lambda_{\min}}). Such a result shows that even without an observable real-valued reward, we can still obtain an off-policy bandit regret that matches the case with an explicitly labeled dataset.
2). By setting t0=((Dd2+D2d)/n1)1/6t_{0}=\left((Dd^{2}+D^{2}d)/n_{1}\right)^{1/6}, we still have

2=(𝙳𝚒𝚜𝚝𝚛𝚘𝚂𝚑𝚒𝚏𝚝(a)(d2D+D2d)1/6n11/6a),\displaystyle\mathcal{E}_{2}=\left({\tt DistroShift}(a)\cdot\left(d^{2}D+D^{2}d\right)^{1/6}{n_{1}}^{-1/6}\cdot a\right),

where {\tt DistroShift}(a) quantifies the distribution shift, which depends on the target reward value. In the special case where the latent covariance matrix \Sigma is known, we can quantify the distribution shift as {\tt DistroShift}(a)=\mathcal{O}(a\vee d). 3). Learning from human preferences is almost as good as learning from an explicitly labeled dataset. Specifically, our results in the human preference setup match those in the setup with observable real-valued labels. We also claim that Algorithm 1 can provably improve the reward relative to the dataset samples. If n_{2}=o(n_{1}^{1/3}) and the off-support diffusion error is negligible, then, denoting the average reward in \mathcal{D}_{\rm label} by s and selecting a=O(s+\frac{d}{n_{2}}\sqrt{\log n_{2}}), we have \mathbb{E}_{\widehat{P}_{a}}[f^{*}(x)]\geq s with high probability. To the best of our knowledge, this is the first theoretical attempt to understand the reward improvement of conditional diffusion. These results suggest a connection between diffusion model theory and off-policy bandit learning, which is an interesting direction for future research. See proofs in Appendix B.3.

7 Numerical Experiments

We now present numerical results in both simulated and real data environments to showcase the performance and validate our established theories.

7.1 Synthetic Data

We consider a data-driven product design optimization problem. The goal is to generate new designs that attain high rewards from customers, based on pre-collected customer feedback. Let x\in\mathbb{R}^{D} denote the product design. We consider x to have a linear subspace structure, i.e., x=Az for some unknown matrix A and latent variable z\in\mathbb{R}^{d}. We set D=64 and d=16. The representation matrix A is randomly sampled with orthonormal columns. The latent variable z follows the standard Gaussian distribution {\sf N}(0,I_{d}). For each product design, we associate a reward function of the form in Assumption 6.1. In particular, for the linear part, we generate \beta^{*} by uniformly sampling from the unit sphere and then fixing it. The penalty term is chosen as h^{*}(x)=-5\|x\|_{2}^{2}.
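A minimal numpy sketch of this data-generating process is given below (our own illustration; we assume the linear part of the reward acts on the latent variable through \beta^{*}, the sample size n is our own choice, and the training designs lie exactly on the subspace so the penalty term vanishes on the data).

import numpy as np

rng = np.random.default_rng(0)
D, d, n = 64, 16, 100_000

# representation matrix A with orthonormal columns, sampled once and then fixed
A, _ = np.linalg.qr(rng.standard_normal((D, d)))        # A^T A = I_d

# latent z ~ N(0, I_d) and design x = A z
z = rng.standard_normal((n, d))
x = z @ A.T

# reward: linear part with beta* drawn uniformly from the unit sphere, plus penalty -5 ||x_perp||^2
beta_star = rng.standard_normal(d)
beta_star /= np.linalg.norm(beta_star)
x_perp = x - x @ (A @ A.T)                              # off-support component (zero for x = A z)
reward = z @ beta_star - 5.0 * np.sum(x_perp ** 2, axis=1)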

To implement Algorithm 1, we use ridge regression to estimate the reward function and train a diffusion model using a 1-dimensional version of the UNet (Ronneberger et al., 2015) to learn the score function. More training configurations and code are deferred to Appendix F.
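The reward-estimation step of Algorithm 1 can be sketched as follows (our own illustration; lam and nu are placeholder hyperparameters). The estimator has the ridge form used in the analysis (cf. (B.6) in Appendix B.3), and the pseudo-labels \widehat{y}=\widehat{f}(x)+\xi with \xi\sim{\sf N}(0,\nu^{2}) are then used to train the conditional score network.

import numpy as np

def ridge_reward_estimate(X_label, y_label, lam=1.0):
    """theta_hat = (X^T X + lam I)^{-1} X^T y, cf. the ridge estimator (B.6)."""
    D = X_label.shape[1]
    return np.linalg.solve(X_label.T @ X_label + lam * np.eye(D), X_label.T @ y_label)

def pseudo_label(theta_hat, X_unlabel, nu, rng):
    """Pseudo-label the unlabeled designs: y_hat = f_hat(x) + xi, xi ~ N(0, nu^2)."""
    return X_unlabel @ theta_hat + nu * rng.standard_normal(len(X_unlabel))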

Figure 4 shows the average reward of the generated samples under different target reward values. We also plot the distribution shift and off-support deviation in terms of the l2l_{2}-norm distance from the support. For small target reward values, the generated average reward almost scales linearly with the target value, which is consistent with the theory as the distribution shift remains small for these target values. The generated reward begins to decrease as we further increase the target reward value, and the reason is twofold. Firstly, the off-support deviation of the generated samples becomes large in this case, which prevents the generated reward from further going up. Secondly, the distribution shift increases rapidly as we further increase the target value. In Figure 5, we show the distribution of the rewards in the generated samples. As we increase the target reward value, the generated rewards become less concentrated and are shifted to the left of the target value, which is also due to the distribution shift and off-support deviation.

Figure 4: Quality of generated samples as the target reward value increases. Left: average reward of the generation; Middle: distribution shift; Right: off-support deviation. The error bar is computed as 2 times the standard deviation over 5 runs.
Figure 5: Shifting reward distribution of the generated population.

7.2 Directed Text-to-Image Generation

Next, we empirically verify our theory through directed text-to-image generation. Instead of training a diffusion model from scratch, we use Stable Diffusion v1.5 (Rombach et al., 2022), pre-trained on the LAION dataset (Schuhmann et al., 2022). Stable Diffusion operates on the latent space of its Variational Auto-Encoder and can incorporate text conditions. We show that by training a reward model we can further guide the Stable Diffusion model to generate images with desired properties.

Ground-truth Reward Model. We start from an ImageNet (Deng et al., 2009) pre-trained ResNet-18 (He et al., 2016) model and replace the final prediction layer with a randomly initialized linear layer with scalar output. We then use this model as the ground-truth reward model. To investigate what this randomly generated reward model encodes, we draw random samples and manually inspect the images with high and low rewards. The ground-truth reward model appears to favor colorful and vivid natural scenes over monochrome and dull images; see Appendix F for sample images.

Labeled Dataset. We use the ground-truth reward model to compute a scalar output for each instance in the CIFAR-10 (Krizhevsky et al., 2009) training dataset and perturb the output by adding Gaussian noise drawn from \mathcal{N}(0,0.01). We use the images and the corresponding outputs as the training dataset.

Reward-network Training. To avoid adding an additional input to the diffusion model and tuning new parameters, we introduce a new network \mu_{\theta} and approximate p_{t}(y|x_{t}) by {\sf N}(\mu_{\theta}(x_{t}),\sigma^{2}). For simplicity, we treat \sigma^{2} as a tunable hyperparameter. We share network parameters across different noise levels t, so \mu_{\theta} takes no additional input t. We train \mu_{\theta} by minimizing the expected KL divergence between p_{t}(y|x_{t}) and {\sf N}(\mu_{\theta}(x_{t}),\sigma^{2}):

\mathbb{E}_{t}\mathbb{E}_{x_{t}}\Big{[}\mathrm{KL}(p_{t}(y|x_{t})\,\|\,{\sf N}(\mu_{\theta}(x_{t}),\sigma^{2}))\Big{]}=\mathbb{E}_{t}\mathbb{E}_{(x_{t},y)\sim p_{t}}\frac{\|y-\mu_{\theta}(x_{t})\|_{2}^{2}}{2\sigma^{2}}+\mathrm{Constant}.

Equivalently, we train the reward model μθ\mu_{\theta} to predict the noisy reward yy from the noisy inputs xtx_{t}. Also, notice that the minimizers of the objective do not depend on the choice of σ2\sigma^{2}.
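In PyTorch-style pseudocode, one training step of the reward network amounts to the following (our own sketch; mu_theta denotes the reward network described above and alphas_cumprod the standard DDPM noise schedule, both hypothetical names):

import torch

def reward_net_loss(mu_theta, x0, y, alphas_cumprod):
    """Predict the noisy reward y from the noised input x_t; equivalent to the KL objective
    above up to the constant factor 1/(2 sigma^2)."""
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * torch.randn_like(x0)  # forward corruption
    # mu_theta takes no time input: parameters are shared across noise levels t
    return ((mu_theta(x_t).view(-1) - y.view(-1)) ** 2).mean()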

Reward-network-based Directed Diffusion. To perform reward-directed conditional diffusion, observe that xlogpt(x|y)=xlogpt(x)+xlogpt(y|x)\nabla_{x}\log p_{t}(x|y)=\nabla_{x}\log p_{t}(x)+\nabla_{x}\log p_{t}(y|x), and pt(y|x)exp(yμθ(x)222σ2)p_{t}(y|x)\propto\exp\Big{(}-\frac{\|y-\mu_{\theta}(x)\|_{2}^{2}}{2\sigma^{2}}\Big{)}. Therefore,

xlogpt(y|x)=1/σ2x[12yμθ(x)22].\nabla_{x}\log p_{t}(y|x)=-1/\sigma^{2}\cdot\nabla_{x}\Big{[}\frac{1}{2}\|y-\mu_{\theta}(x)\|_{2}^{2}\Big{]}.

In our implementation, we compute the gradient by back-propagation through \mu_{\theta} and incorporate this gradient guidance into each denoising step of the DDIM sampler (Song et al., 2021), following equation (14) of Dhariwal and Nichol (2021). We see that 1/\sigma^{2} corresponds to the weight of the gradient guidance relative to the unconditional score. In the sequel, we refer to 1/\sigma^{2} as the “guidance level” and y as the “target value”.
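A simplified sketch of how the guidance enters one denoising step is given below (our own code following the classifier-guidance form of equation (14) in Dhariwal and Nichol (2021); eps_model and mu_theta are placeholder names for the Stable Diffusion noise predictor and the reward network).

import torch

def guided_noise_prediction(eps_model, mu_theta, x_t, t, y_target, guidance, sqrt_one_minus_abar_t):
    """Add the reward guidance to the unconditional noise prediction; `guidance` plays the role of 1/sigma^2."""
    x = x_t.detach().requires_grad_(True)
    log_py_unscaled = -0.5 * ((y_target - mu_theta(x)) ** 2).sum()
    grad = torch.autograd.grad(log_py_unscaled, x)[0]     # grad_x log p_t(y|x) = guidance * grad
    eps = eps_model(x_t, t)                               # unconditional noise prediction
    # classifier-guidance style update: eps_hat = eps - sqrt(1 - abar_t) * grad_x log p_t(y|x)
    return eps - sqrt_one_minus_abar_t * guidance * grad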

Quantitative Results. We vary 1/\sigma^{2} in \{25,50,100,200,400\} and y in \{1,2,4,8,16\}. For each combination, we generate 100 images with the text prompt “A nice photo” and calculate the mean and the standard deviation of the predicted rewards and the ground-truth rewards. The results are plotted in Figure 6. From the plot, we see similar effects of increasing the target value y at different guidance levels 1/\sigma^{2}. A larger target value puts more weight on the guidance signal \nabla_{x}\mu_{\theta}(x), which successfully drives the generated images towards higher predicted rewards, but suffers more from the distribution shift between the training distribution and the reward-conditioned distribution, which renders larger gaps between the predicted rewards and the ground-truth rewards. To optimally choose a target value, we must trade off these two counteracting effects.

Figure 6: The predicted rewards and the ground-truth rewards of the generated images. At each guidance level, increasing the target y successfully directs the generation towards higher predicted rewards, but also increases the error induced by the distribution shift. The reported baseline is the expected ground-truth reward for undirected generations.

Qualitative Results. To qualitatively test the effects of the reward conditioning, we generate a set of images with increasing target values y under different text prompts and investigate the visual properties of the produced images. We isolate the effect of reward conditioning by fixing all the randomness during the generation process, so the generated images have similar semantic layouts. After hyper-parameter tuning, we find that setting 1/\sigma^{2}=100 and y\in\{2,4,6,8,10\} achieves good results across different text prompts and random seeds. We pick out typical examples and summarize the results in Figure 7, which demonstrates that as we increase the target value, the generated images become more colorful at the expense of some degradation in image quality.

Figure 7: The effects of reward-directed diffusion. Increasing the target value directs the images to be more colorful and vivid at the cost of some degradation in image quality. Leftmost: without reward conditioning. Second to last: target value y=2,4,6,8,10. The guidance level 1/\sigma^{2} is fixed to 100. The text prompts are “A cat with a glass of water” and “An astronaut on the horseback”.

7.3 Offline High-Reward Trajectory Generation in Reinforcement Learning

Lastly, we demonstrate using our proposed reward-directed conditional diffusion model for direct high-reward trajectory generation in reinforcement learning problems. We replicate the implementation of Decision Diffuser (Ajay et al., 2023) under the (Med-Expert, Hopper) setting. In Decision Diffuser, the RL trajectory and the final reward are jointly modeled with a conditional diffusion model. The policy is given by first performing reward-directed conditional diffusion to sample a trajectory of high reward and then extracting action sequences from the trajectory using a trained inverse dynamics model. We plot the mean and standard deviation of the total returns (averaged across 10 independent episodes) vs. the target rewards in Figure 8. The theoretical maximum of the reward is 400; therefore, trajectories with rewards greater than 400 are never seen during training. We observe that when we increase the target reward beyond 320, the actual total reward decreases. This is consistent with our theory: as we increase the reward guidance signal, the conditioning effect becomes stronger, but so does the distribution-shift effect. We also observe that reward-directed conditional generation yields comparable, and even marginally better, performance than state-of-the-art methods such as Trajectory Transformer (Janner et al., 2021) and MoReL (Kidambi et al., 2020).

Figure 8: Total reward vs. target reward for Decision Diffuser in the (Med-Expert, Hopper) setting.

References

  • Abbasi-Yadkori et al. (2011) Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011.
  • Aguirregabiria and Mira (2010) Victor Aguirregabiria and Pedro Mira. Dynamic discrete choice structural models: A survey. Journal of Econometrics, 156(1):38–67, 2010.
  • Ajay et al. (2022) Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657, 2022.
  • Ajay et al. (2023) Anurag Ajay, Yilun Du, Abhi Gupta, Joshua B. Tenenbaum, Tommi S. Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision making? In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=sP1fo2K9DFG.
  • Albergo et al. (2023) Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023.
  • Austin et al. (2021) Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021.
  • Bajari et al. (2015) Patrick Bajari, Victor Chernozhukov, Han Hong, and Denis Nekipelov. Identification and efficient semiparametric estimation of a dynamic discrete game. Technical report, National Bureau of Economic Research, 2015.
  • Balaji et al. (2022) Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
  • Bansal et al. (2023) Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Universal guidance for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 843–852, 2023.
  • Barrera et al. (2016) Luis A Barrera, Anastasia Vedenko, Jesse V Kurland, Julia M Rogers, Stephen S Gisselbrecht, Elizabeth J Rossin, Jaie Woodard, Luca Mariani, Kian Hong Kock, Sachi Inukai, et al. Survey of variation in human transcription factors reveals prevalent dna binding changes. Science, 351(6280):1450–1454, 2016.
  • Benton et al. (2023) Joe Benton, Valentin De Bortoli, Arnaud Doucet, and George Deligiannidis. Linear convergence bounds for diffusion models via stochastic localization. arXiv preprint arXiv:2308.03686, 2023.
  • Block et al. (2020) Adam Block, Youssef Mroueh, and Alexander Rakhlin. Generative modeling with denoising auto-encoders and langevin sampling. arXiv preprint arXiv:2002.00107, 2020.
  • Brandfonbrener et al. (2021) David Brandfonbrener, William Whitney, Rajesh Ranganath, and Joan Bruna. Offline contextual bandits with overparameterized models. In International Conference on Machine Learning, pages 1049–1058. PMLR, 2021.
  • Brookes et al. (2019) David Brookes, Hahnbeom Park, and Jennifer Listgarten. Conditioning by adaptive sampling for robust design. In International conference on machine learning, pages 773–782. PMLR, 2019.
  • Bu et al. (2023) Jinzhi Bu, David Simchi-Levi, and Li Wang. Offline pricing and demand learning with censored data. Management Science, 69(2):885–903, 2023.
  • Chen and Jiang (2019) Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pages 1042–1051. PMLR, 2019.
  • Chen et al. (2020) Minshuo Chen, Yu Bai, Jason D Lee, Tuo Zhao, Huan Wang, Caiming Xiong, and Richard Socher. Towards understanding hierarchical learning: Benefits of neural representations. Advances in Neural Information Processing Systems, 33:22134–22145, 2020.
  • Chen et al. (2022a) Minshuo Chen, Haoming Jiang, Wenjing Liao, and Tuo Zhao. Nonparametric regression on low-dimensional manifolds using deep relu networks: Function approximation and statistical recovery. Information and Inference: A Journal of the IMA, 11(4):1203–1253, 2022a.
  • Chen et al. (2023a) Minshuo Chen, Kaixuan Huang, Tuo Zhao, and Mengdi Wang. Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data. arXiv preprint arXiv:2302.07194, 2023a.
  • Chen et al. (2022b) Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. arXiv preprint arXiv:2209.11215, 2022b.
  • Chen et al. (2023b) Sitan Chen, Sinho Chewi, Holden Lee, Yuanzhi Li, Jianfeng Lu, and Adil Salim. The probability flow ode is provably fast. arXiv preprint arXiv:2305.11798, 2023b.
  • Chen et al. (2023c) Sitan Chen, Giannis Daras, and Alex Dimakis. Restoration-degradation beyond linear diffusions: A non-asymptotic analysis for ddim-type samplers. In International Conference on Machine Learning, pages 4462–4484. PMLR, 2023c.
  • Chernozhukov et al. (2022) Victor Chernozhukov, Juan Carlos Escanciano, Hidehiko Ichimura, Whitney K Newey, and James M Robins. Locally robust semiparametric estimation. Econometrica, 90(4):1501–1535, 2022.
  • Christiano et al. (2017) Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
  • De Bortoli (2022) Valentin De Bortoli. Convergence of denoising diffusion models under the manifold hypothesis. arXiv preprint arXiv:2208.05314, 2022.
  • De Bortoli et al. (2021) Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34:17695–17709, 2021.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • Dhariwal and Nichol (2021) Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
  • Duan et al. (2020) Yaqi Duan, Zeyu Jia, and Mengdi Wang. Minimax-optimal off-policy evaluation with linear function approximation. In International Conference on Machine Learning, pages 2701–2709. PMLR, 2020.
  • Esser et al. (2024) Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024.
  • Fan et al. (2020) Jianqing Fan, Zhaoran Wang, Yuchen Xie, and Zhuoran Yang. A theoretical analysis of deep q-learning. In Learning for Dynamics and Control, pages 486–489. PMLR, 2020.
  • Fan et al. (2023) Jianqing Fan, Zhipeng Lou, Weichen Wang, and Mengxin Yu. Spectral ranking inferences based on general multiway comparisons. arXiv preprint arXiv:2308.02918, 2023.
  • Fu and Levine (2021) Justin Fu and Sergey Levine. Offline model-based optimization via normalized maximum likelihood estimation. arXiv preprint arXiv:2102.07970, 2021.
  • Glaese et al. (2022) Amelia Glaese, Nat McAleese, Maja Trebacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375, 2022.
  • Gong et al. (2019) Sixue Gong, Vishnu Naresh Boddeti, and Anil K Jain. On the intrinsic dimensionality of image representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3987–3996, 2019.
  • Graikos et al. (2022) Alexandros Graikos, Nikolay Malkin, Nebojsa Jojic, and Dimitris Samaras. Diffusion models as plug-and-play priors. arXiv preprint arXiv:2206.09012, 2022.
  • Guo et al. (2023) Zhiye Guo, Jian Liu, Yanli Wang, Mengrui Chen, Duolin Wang, Dong Xu, and Jianlin Cheng. Diffusion models in bioinformatics and computational biology. Nature Reviews Bioengineering, pages 1–19, 2023.
  • Györfi et al. (2002) László Györfi, Michael Köhler, Adam Krzyżak, and Harro Walk. A distribution-free theory of nonparametric regression, volume 1. Springer, 2002.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Hebbal et al. (2019) Ali Hebbal, Loic Brevault, Mathieu Balesdent, El-Ghazali Talbi, and Nouredine Melab. Bayesian optimization using deep gaussian processes. arXiv preprint arXiv:1905.03350, 2019.
  • Ho and Salimans (2022) Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • Homburg et al. (2019) Christian Homburg, Karin Lauer, and Arnd Vomberg. The multichannel pricing dilemma: Do consumers accept higher offline than online prices? International Journal of Research in Marketing, 36(4):597–612, 2019.
  • Hsu et al. (2011) Daniel Hsu, Sham M. Kakade, and Tong Zhang. A tail inequality for quadratic forms of subgaussian random vectors, 2011.
  • Hyvärinen and Dayan (2005) Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.
  • Janner et al. (2021) Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. Advances in neural information processing systems, 34:1273–1286, 2021.
  • Janner et al. (2022) Michael Janner, Yilun Du, Joshua Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. In International Conference on Machine Learning, 2022.
  • Ji et al. (2022) Xiang Ji, Minshuo Chen, Mengdi Wang, and Tuo Zhao. Sample complexity of nonparametric off-policy evaluation on low-dimensional manifolds using deep networks. arXiv preprint arXiv:2206.02887, 2022.
  • Jin et al. (2021) Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline rl? In International Conference on Machine Learning, pages 5084–5096. PMLR, 2021.
  • Kidambi et al. (2020) Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. Morel: Model-based offline reinforcement learning. Advances in neural information processing systems, 33:21810–21823, 2020.
  • Krishnamoorthy et al. (2023) Siddarth Krishnamoorthy, Satvik Mehul Mashkaria, and Aditya Grover. Diffusion models for black-box optimization. In International Conference on Machine Learning, pages 17842–17857. PMLR, 2023.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Kshetri et al. (2023) Nir Kshetri, Yogesh K Dwivedi, Thomas H Davenport, and Niki Panteli. Generative artificial intelligence in marketing: Applications, opportunities, challenges, and research agenda, 2023.
  • Kumar and Levine (2020) Aviral Kumar and Sergey Levine. Model inversion networks for model-based optimization. Advances in Neural Information Processing Systems, 33:5126–5137, 2020.
  • Lee et al. (2022a) Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence for score-based generative modeling with polynomial complexity. arXiv preprint arXiv:2206.06227, 2022a.
  • Lee et al. (2022b) Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence of score-based generative modeling for general data distributions. arXiv preprint arXiv:2209.12381, 2022b.
  • Lee et al. (2023a) Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence of score-based generative modeling for general data distributions. In International Conference on Algorithmic Learning Theory, pages 946–985. PMLR, 2023a.
  • Lee et al. (2023b) Jin Sub Lee, Jisun Kim, and Philip M. Kim. Proteinsgm: Score-based generative modeling for de novo protein design. bioRxiv, 2023b. doi: 10.1101/2022.07.13.499967. URL https://www.biorxiv.org/content/early/2023/02/04/2022.07.13.499967.
  • Lee et al. (2021) Kimin Lee, Laura Smith, and Pieter Abbeel. Pebble: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training, 2021.
  • Levine et al. (2020) Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Li et al. (2022) Xiang Li, John Thickstun, Ishaan Gulrajani, Percy S Liang, and Tatsunori B Hashimoto. Diffusion-lm improves controllable text generation. Advances in Neural Information Processing Systems, 35:4328–4343, 2022.
  • Li et al. (2023) Zihao Li, Zhuoran Yang, and Mengdi Wang. Reinforcement learning with human feedback: Learning dynamic choices via pessimism. arXiv preprint arXiv:2305.18438, 2023.
  • Liang et al. (2023) Zhixuan Liang, Yao Mu, Mingyu Ding, Fei Ni, Masayoshi Tomizuka, and Ping Luo. Adaptdiffuser: Diffusion models as adaptive self-evolving planners. arXiv preprint arXiv:2302.01877, 2023.
  • Liu et al. (2018) Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. Advances in Neural Information Processing Systems, 31, 2018.
  • Liu et al. (2024) Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177, 2024.
  • Luenberger et al. (2021) David G Luenberger, Yinyu Ye, et al. Linear and nonlinear programming, volume 2. Springer, 2021.
  • Mei and Wu (2023) Song Mei and Yuchen Wu. Deep networks as denoising algorithms: Sample-efficient learning of diffusion models in high-dimensional graphical models. arXiv preprint arXiv:2309.11420, 2023.
  • Montanari and Wu (2023) Andrea Montanari and Yuchen Wu. Posterior sampling from the spiked models via diffusion processes. arXiv preprint arXiv:2304.11449, 2023.
  • Munos and Szepesvári (2008) Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(5), 2008.
  • Nakada and Imaizumi (2020) Ryumei Nakada and Masaaki Imaizumi. Adaptive approximation and generalization of deep neural network with intrinsic dimensionality. The Journal of Machine Learning Research, 21(1):7018–7055, 2020.
  • Nguyen-Tang et al. (2021) Thanh Nguyen-Tang, Sunil Gupta, A Tuan Nguyen, and Svetha Venkatesh. Offline neural contextual bandits: Pessimism, optimization and generalization. arXiv preprint arXiv:2111.13807, 2021.
  • Nichol et al. (2021) Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  • Oko et al. (2023) Kazusato Oko, Shunta Akiyama, and Taiji Suzuki. Diffusion models are minimax optimal distribution estimators. In ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2023. URL https://openreview.net/forum?id=6961CeTSFA.
  • Ooi et al. (2023) Keng-Boon Ooi, Garry Wei-Han Tan, Mostafa Al-Emran, Mohammed A Al-Sharafi, Alexandru Capatina, Amrita Chakraborty, Yogesh K Dwivedi, Tzu-Ling Huang, Arpan Kumar Kar, Voon-Hsien Lee, et al. The potential of generative artificial intelligence across disciplines: Perspectives and future directions. Journal of Computer Information Systems, pages 1–32, 2023.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  • Pan and Wu (2020) Yuchen Pan and Desheng Wu. A novel recommendation model for online-to-offline service based on the customer network and service location. Journal of Management Information Systems, 37(2):563–593, 2020.
  • Pan et al. (2019) Yuchen Pan, Desheng Wu, Cuicui Luo, and Alexandre Dolgui. User activity measurement in rating-based online-to-offline (o2o) service recommendation. Information Sciences, 479:180–196, 2019.
  • Pearce et al. (2023) Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Sergio Valcarcel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann, and Sam Devlin. Imitating human behaviour with diffusion models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Pv1GPQzRrC8.
  • Peters and Schaal (2007) Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th international conference on Machine learning, pages 745–750, 2007.
  • Pope et al. (2021) Phillip Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The intrinsic dimension of images and its impact on learning. arXiv preprint arXiv:2104.08894, 2021.
  • Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
  • Schlett et al. (2022) Torsten Schlett, Christian Rathgeb, Olaf Henniger, Javier Galbally, Julian Fierrez, and Christoph Busch. Face image quality assessment: A literature survey. ACM Computing Surveys, 54(10s):1–49, January 2022. ISSN 1557-7341. doi: 10.1145/3507901. URL http://dx.doi.org/10.1145/3507901.
  • Schuhmann et al. (2022) Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
  • Shahriari et al. (2015) Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015.
  • Snoek et al. (2012) Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. Advances in neural information processing systems, 25, 2012.
  • Snoek et al. (2014) Jasper Snoek, Kevin Swersky, Rich Zemel, and Ryan Adams. Input warping for bayesian optimization of non-stationary functions. In International Conference on Machine Learning, pages 1674–1682. PMLR, 2014.
  • Song et al. (2021) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=St1giarCHLP.
  • Song and Ermon (2020) Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. Advances in neural information processing systems, 33:12438–12448, 2020.
  • Song et al. (2020a) Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence, pages 574–584. PMLR, 2020a.
  • Song et al. (2020b) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b.
  • Song et al. (2023) Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023.
  • Stiennon et al. (2020) Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
  • Tenenbaum et al. (2000) Joshua B Tenenbaum, Vin de Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. science, 290(5500):2319–2323, 2000.
  • Tsybakov (2008) Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 1st edition, 2008. ISBN 0387790519.
  • Vahdat et al. (2021) Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. Advances in Neural Information Processing Systems, 34:11287–11302, 2021.
  • Vincent (2011) Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
  • Wainwright (2019) Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019.
  • Wang et al. (2021) Ningning Wang, Ting Zhang, Xiaowei Zhu, and Peimiao Li. Online-offline competitive pricing with reference price effect. Journal of the Operational Research Society, 72(3):642–653, 2021.
  • Wang et al. (2022) Xiaofei Wang, Kimin Lee, Kourosh Hakhamaneshi, Pieter Abbeel, and Michael Laskin. Skill preferences: Learning to extract and execute robotic skills from human feedback. In Conference on Robot Learning, pages 1259–1268. PMLR, 2022.
  • Wang et al. (2023) Zhizhong Wang, Lei Zhao, and Wei Xing. Stylediffusion: Controllable disentangled style transfer via diffusion models, 2023.
  • Watson et al. (2023) Joseph L Watson, David Juergens, Nathaniel R Bennett, Brian L Trippe, Jason Yim, Helen E Eisenach, Woody Ahern, Andrew J Borst, Robert J Ragotte, Lukas F Milles, et al. De novo design of protein structure and function with rfdiffusion. Nature, 620(7976):1089–1100, 2023.
  • Wibisono et al. (2024) Andre Wibisono, Yihong Wu, and Kaylee Yingxi Yang. Optimal score estimation via empirical bayes smoothing. arXiv preprint arXiv:2402.07747, 2024.
  • Wu et al. (2021) Jeff Wu, Long Ouyang, Daniel M Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862, 2021.
  • Xie et al. (2021) Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, and Alekh Agarwal. Bellman-consistent pessimism for offline reinforcement learning. Advances in neural information processing systems, 34:6683–6694, 2021.
  • Yuan et al. (2023) Hui Yuan, Kaixuan Huang, Chengzhuo Ni, Minshuo Chen, and Mengdi Wang. Reward-directed conditional diffusion: Provable distribution estimation and reward improvement. arXiv preprint arXiv:2307.07055, 2023.
  • Zhang and Agrawala (2023) Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
  • Zhang et al. (2023) Yangwenhui Zhang, Hong Qian, Xiang Shu, and Aimin Zhou. High-dimensional dueling optimization with preference embedding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 11280–11288, 2023.
  • Zhu et al. (2023) Banghua Zhu, Jiantao Jiao, and Michael I Jordan. Principled reinforcement learning with human feedback from pairwise or kk-wise comparisons. arXiv preprint arXiv:2301.11270, 2023.
  • Žiga Babnik et al. (2023) Žiga Babnik, Peter Peer, and Vitomir Štruc. Diffiqa: Face image quality assessment using denoising diffusion probabilistic models, 2023.

Appendix A Conditional Score Matching and Generation

A.1 Proof of Proposition 4.1

For any t\geq 0, it holds that \nabla_{x_{t}}\log p_{t}(x_{t}\mid y)=\nabla_{x_{t}}\log p_{t}(x_{t},y) since the gradient is taken w.r.t. x_{t} only. Plugging in this identity and expanding the squared norm on the left-hand side gives

𝔼(xt,y)Pt[xtlogpt(xt,y)s(xt,y,t)22]=𝔼(xt,y)Pt[s(xt,y,t)222xtlogpt(xt,y),s(xt,y,t)]+C.\displaystyle\mathbb{E}_{(x_{t},y)\sim P_{t}}\left[\|\nabla_{x_{t}}\log p_{t}(x_{t},y)-s(x_{t},y,t)\|_{2}^{2}\right]=\mathbb{E}_{(x_{t},y)\sim P_{t}}\big{[}\|s(x_{t},y,t)\|_{2}^{2}-2\langle\nabla_{x_{t}}\log p_{t}(x_{t},y),s(x_{t},y,t)\rangle\big{]}+C.

Then it suffices to prove

𝔼(xt,y)Pt[xtlogpt(xt,y),s(xt,y,t)]=𝔼(x,y)Pxy^𝔼x𝖭(α(t)x,h(t)I)[xϕt(xx),s(x,y,t)]\mathbb{E}_{(x_{t},y)\sim P_{t}}\left[\langle\nabla_{x_{t}}\log p_{t}(x_{t},y),s(x_{t},y,t)\rangle\right]=\mathbb{E}_{(x,y)\sim P_{x\widehat{y}}}\mathbb{E}_{x^{\prime}\sim{\sf N}(\alpha(t)x,h(t)I)}\left[\langle\nabla_{x^{\prime}}\phi_{t}(x^{\prime}\mid x),s(x^{\prime},y,t)\rangle\right]

Using integration by parts to rewrite the inner product we have

𝔼(xt,y)Pt[xtlogpt(xt,y),s(xt,y,t)]\displaystyle\mathbb{E}_{(x_{t},y)\sim P_{t}}\left[\langle\nabla_{x_{t}}\log p_{t}(x_{t},y),s(x_{t},y,t)\rangle\right] =pt(xt,y)xtlogpt(xt,y),s(xt,y,t)𝑑xt𝑑y\displaystyle=\int p_{t}(x_{t},y)\langle\nabla_{x_{t}}\log p_{t}(x_{t},y),s(x_{t},y,t)\rangle dx_{t}dy
=xtpt(xt,y),s(xt,y,t)𝑑xt𝑑y\displaystyle=\int\langle\nabla_{x_{t}}p_{t}(x_{t},y),s(x_{t},y,t)\rangle dx_{t}dy
=pt(xt,y)div(s(xt,y,t))𝑑xt𝑑y,\displaystyle=-\int p_{t}(x_{t},y)\operatorname{div}(s(x_{t},y,t))dx_{t}dy,

where \phi_{t}(x^{\prime}|x) denotes the density of {\sf N}(\alpha(t)x,h(t)I_{D}) with \alpha(t)=\exp(-t/2) and h(t)=1-\exp(-t). Then

pt(xt,y)div(s(xt,y,t))𝑑xt𝑑y\displaystyle-\int p_{t}(x_{t},y)\operatorname{div}(s(x_{t},y,t))dx_{t}dy =𝔼(x,y)Pxy^ϕt(xx)div(s(x,y,t))𝑑x\displaystyle=-\mathbb{E}_{(x,y)\sim P_{x\widehat{y}}}\int\phi_{t}(x^{\prime}\mid x)\operatorname{div}(s(x^{\prime},y,t))dx^{\prime}
=𝔼(x,y)Pxy^xϕt(xx),s(x,y,t)𝑑x\displaystyle=\mathbb{E}_{(x,y)\sim P_{x\widehat{y}}}\int\langle\nabla_{x^{\prime}}\phi_{t}(x^{\prime}\mid x),s(x^{\prime},y,t)\rangle dx^{\prime}
=𝔼(x,y)Pxy^𝔼x𝖭(α(t)x,h(t)I)[xϕt(xx),s(x,y,t)].\displaystyle=\mathbb{E}_{(x,y)\sim P_{x\widehat{y}}}\mathbb{E}_{x^{\prime}\sim{\sf N}(\alpha(t)x,h(t)I)}\left[\langle\nabla_{x^{\prime}}\phi_{t}(x^{\prime}\mid x),s(x^{\prime},y,t)\rangle\right].

Appendix B Omitted Proofs in Section 5

Additional Notations: We follow the notations in the main paper along with some additional ones. We use P_{t}^{LD}(z) to denote the low-dimensional distribution on z corrupted by diffusion noise. Formally, p_{t}^{LD}(z^{\prime})=\int\phi_{t}(z^{\prime}|z)p_{z}(z)\mathrm{d}z with \phi_{t}(\cdot|z) being the density of {\sf N}(\alpha(t)z,h(t)I_{d}). We write P^{LD}_{t_{0}}(z\mid\widehat{f}(Az)=a) for the corresponding conditional distribution given \widehat{f}(Az)=a at time t_{0}, with shorthand P^{LD}_{t_{0}}(a). We also abbreviate P_{z}(z\mid\widehat{f}(Az)=a) as P^{LD}(a). In our theorems, \mathcal{O} hides constant factors and higher-order terms in n_{1}^{-1} and n_{2}^{-1}, and \widetilde{\mathcal{O}} further hides logarithmic terms and may also hide factors in d.

B.1 Parametric Conditional Score Matching Error

The theorems presented in Section 5 are built upon the conditional score estimation error, which has been studied in Chen et al. [2023a] for general distributions; in Lemma B.1 we provide a new bound specific to our setting, where the true score is linear in the input (x_{t},\widehat{y}) due to the Gaussian design. Despite the linearity of the score in the Gaussian case, we emphasize that matching the score in (4.2) is not simply linear regression, as {\mathcal{S}} contains an encoder-decoder structure for estimating the matrix A to reduce dimension (see §E for the construction of {\mathcal{S}} and more proof details).

In the following lemma, we first present a general result for the case where the true score is within {\mathcal{S}}, which is constructed as a parametric function class. The score matching error is then bounded in terms of \mathcal{N}({\mathcal{S}},1/n_{1}), the \log covering number of {\mathcal{S}} (recall that n_{1} is the size of \mathcal{D}_{\rm unlabel}). Instantiating this general result, we derive the score matching error for the Gaussian case by upper bounding \mathcal{N}({\mathcal{S}},1/n_{1}) in this special case.

Lemma B.1.

Under Assumption 3.1, if logpt(xy)𝒮\nabla\log p_{t}(x\mid y)\in{\mathcal{S}}, where

𝒮={𝐬V,ψ(x,y,t)=1h(t)(Vψ(Vx,y,t)x)\displaystyle{\mathcal{S}}=\bigg{\{}\mathbf{s}_{V,\psi}(x,y,t)=\frac{1}{h(t)}(V\cdot\psi(V^{\top}x,y,t)-x) :VD×d,ψΨ:d+1×[t0,T]d},\displaystyle:~{}V\in\mathbb{R}^{D\times d},~{}\psi\in\Psi:\mathbb{R}^{d+1}\times[t_{0},T]\to\mathbb{R}^{d}~{}\bigg{\}}, ((4.8) revisited)

with Ψ\Psi parametric. Then for δ0\delta\geq 0, with probability 1δ1-\delta, the square score matching error is bounded by ϵdiff2=𝒪(1t0𝒩(𝒮,1/n1)(d2D)log1δn1)\epsilon^{2}_{diff}=\mathcal{O}\left(\frac{1}{t_{0}}\sqrt{\frac{\mathcal{N}({\mathcal{S}},1/n_{1})(d^{2}\vee D)\log\frac{1}{\delta}}{n_{1}}}\right), i.e.,

1Tt0t0T𝔼(xt,y)Pt[logpt(xt|y)s^(xt,y,t)22]dtϵdiff2,\frac{1}{T-t_{0}}\int_{t_{0}}^{T}\mathbb{E}_{(x_{t},y)\sim P_{t}}\left[\|\nabla\log p_{t}(x_{t}|y)-\widehat{s}(x_{t},y,t)\|_{2}^{2}\right]\mathrm{d}t\leq\epsilon_{diff}^{2}, (B.1)

The proof of this lemma can be found in Appendix E. Recall that P_{t} comes from P_{x\widehat{y}} by noising x at time t in the forward process. Under Assumption 5.3 and given \widehat{f}(x)=\widehat{\theta}^{\top}x and \widehat{y}=\widehat{f}(x)+\xi,\xi\sim{\sf N}(0,\nu^{2}), the score function \nabla\log p_{t}(x\mid\widehat{y}) to approximate is linear in x and \widehat{y}. When it is approximated by {\mathcal{S}} with \Psi linear, \mathcal{N}({\mathcal{S}},1/n_{1})=\mathcal{O}((d^{2}+Dd)\log(Ddn_{1})).
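For intuition, the parameterization (4.8) of the conditional score network can be written as a small module (a PyTorch-style sketch with our own names; here \psi is a generic two-layer network, whereas the linear-\Psi case discussed above would replace it with a linear map):

import torch
import torch.nn as nn

class ConditionalScoreNet(nn.Module):
    """s_{V,psi}(x, y, t) = (V psi(V^T x, y, t) - x) / h(t), cf. (4.8)."""
    def __init__(self, D, d, hidden=128):
        super().__init__()
        self.V = nn.Parameter(torch.randn(D, d) / D ** 0.5)   # learned subspace matrix V
        self.psi = nn.Sequential(nn.Linear(d + 2, hidden), nn.SiLU(), nn.Linear(hidden, d))

    def forward(self, x, y, t):
        h_t = 1.0 - torch.exp(-t)                              # h(t) = 1 - e^{-t}
        latent = x @ self.V                                     # encoder: V^T x
        inp = torch.cat([latent, y[:, None], t[:, None]], dim=-1)
        return (self.psi(inp) @ self.V.T - x) / h_t[:, None]    # decoder: V psi(.) - x, scaled by 1/h(t)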

To provide fidelity and reward guarantees for \widehat{P}_{a}, the generated distribution of x conditioned on \widehat{y}=a, we need the following lemma. It provides a subspace recovery guarantee between V (the score matching output) and A (the ground truth), as well as a distance bound between the distributions P_{a} and \widehat{P}_{a}, given score matching error \epsilon_{diff}.

Note PaP_{a} and P^a\widehat{P}_{a} are over xx, which admits an underlying low-dimensional structure x=Azx=Az. Thus we measure distance between PaP_{a} and P^a\widehat{P}_{a} by defining

Definition B.2.

We define TV(P^a):=dTV(Pt0LD(zf^(Az)=a),(UV)#P^a)TV(\widehat{P}_{a}):=\texttt{d}_{\rm TV}\left(P^{LD}_{t_{0}}(z\mid\widehat{f}(Az)=a),(U^{\top}V^{\top})_{\#}\widehat{P}_{a}\right) with notations:

  • \texttt{d}_{\rm TV}(\cdot,\cdot) is the TV distance between two distributions.

  • fPf_{\sharp}P denotes a push-forward measure, i.e., for any measurable Ω\Omega, (fP)(Ω)=P(f1(Ω))(f_{\sharp}P)(\Omega)=P(f^{-1}(\Omega))

  • (V)#P^a(V^{\top})_{\#}\widehat{P}_{a} pushes generated P^a\widehat{P}_{a} forward to the low dimensional subspace using learned subspace matrix VV. UU is an orthonormal matrix of dimension dd.

  • P^{LD}_{t_{0}}(z\mid\widehat{f}(Az)=a) is close to (A^{\top})_{\#}P_{a}, with t_{0} accounting for the early stopping in the backward process.

We note that there is a distribution shift between the training and the generated data, which has a profound impact on the generative performance. We quantify the influence of distribution shift by the following class restricted divergence measure.

Definition B.3.

Distribution shift between two arbitrary distributions P1P_{1} and P2P_{2} restricted under function class \mathcal{L} is defined as

\textstyle{\mathcal{T}}(P_{1},P_{2};\mathcal{L})=\sup_{l\in\mathcal{L}}\mathbb{E}_{x\sim P_{1}}[l(x)]/\mathbb{E}_{x\sim P_{2}}[l(x)].

Quantities of the form in Definition B.3 are widely used in the bandit and RL literature [Munos and Szepesvári, 2008, Liu et al., 2018, Chen and Jiang, 2019, Fan et al., 2020].

Lemma B.4.

Suppose the squared score matching error in (B.1) is upper bounded by \epsilon_{diff}^{2}, and P_{z} satisfies Assumption 5.2 with c_{0}I_{d}\preceq\mathbb{E}_{z\sim P_{z}}\left[zz^{\top}\right]. Then, for x\sim\widehat{P}_{a} and \angle({V},{A}):=\|VV^{\top}-AA^{\top}\|^{2}_{\rm F}, it is guaranteed that

(IDVV)x𝖭(0,Λ),Λct0ID,\displaystyle(I_{D}-VV^{\top})x\sim{\sf N}(0,\Lambda),\quad\Lambda\prec ct_{0}I_{D}, (B.2)
(V,A)=𝒪~(t0c0ϵdiff2).\displaystyle\angle({V},{A})=\widetilde{\mathcal{O}}\left(\frac{t_{0}}{c_{0}}\cdot\epsilon^{2}_{diff}\right). (B.3)

In addition,

TV(P^a)=𝒪~(𝒯(P(x,y^=a),Pxy^;𝒮¯)c0ϵdiff).TV(\widehat{P}_{a})=\widetilde{\mathcal{O}}\left(\sqrt{\frac{{\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}})}{c_{0}}}\cdot\epsilon_{diff}\right). (B.4)

where \bar{{\mathcal{S}}}=\left\{\frac{1}{T-t_{0}}\int_{t_{0}}^{T}\mathbb{E}_{x_{t}\mid x}\|\nabla\log p_{t}(x_{t}\mid y)-s(x_{t},y,t)\|_{2}^{2}\mathrm{d}t:s\in{\mathcal{S}}\right\}. Here TV(\widehat{P}_{a}) and {\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}}) are defined in Definitions B.2 and B.3, respectively. The proof of Lemma B.4 is in Appendix C.2.

B.2 Proof of Theorem 5.4 and Theorem 6.2

Proof of Theorem 5.4: bounding (V,A)\angle({V},{A}). By Lemma 3 of Chen et al. [2023a], we have

(V,A)=𝒪(t0c0ϵdiff2)\angle({V},{A})={\mathcal{O}}\left(\frac{t_{0}}{c_{0}}\cdot\epsilon^{2}_{diff}\right)

when the latent z satisfies Assumption 5.2 and c_{0}I_{d}\preceq\mathbb{E}_{z\sim P_{z}}\left[zz^{\top}\right]. Therefore, by (B.1), we have with high probability that

(V,A)=𝒪~(1c0𝒩(𝒮,1/n1)(Dd2)n1).\angle({V},{A})=\widetilde{\mathcal{O}}\left(\frac{1}{c_{0}}\sqrt{\frac{\mathcal{N}({\mathcal{S}},1/n_{1})(D\vee d^{2})}{n_{1}}}\right).

When Assumption 5.3 holds, plugging in c0=λminc_{0}=\lambda_{\min} and 𝒩(𝒮,1/n1)=𝒪((d2+Dd)log(Ddn1))\mathcal{N}({\mathcal{S}},1/n_{1})=\mathcal{O}((d^{2}+Dd)\log(Ddn_{1})), it gives

(V,A)=𝒪~(1λmin(Dd2)d2+(Dd2)Ddn1),\angle({V},{A})=\widetilde{\mathcal{O}}\left(\frac{1}{\lambda_{\min}}\sqrt{\frac{(D\vee d^{2})d^{2}+(D\vee d^{2})Dd}{n_{1}}}\right),

where 𝒪~\widetilde{\mathcal{O}} hides logarithmic terms. When D>d2D>d^{2}, which is often the case in practical applications, we have

(V,A)=𝒪~(1λminDd2+D2dn1).\displaystyle\angle({V},{A})=\widetilde{\mathcal{O}}\left(\frac{1}{\lambda_{\min}}\sqrt{\frac{Dd^{2}+D^{2}d}{n_{1}}}\right).

Proof of Theorem 6.2: bounding \mathbb{E}_{x\sim\widehat{P}_{a}}[\|x_{\perp}\|_{2}]. By the definition of x_{\perp}, i.e., x_{\perp}=(I_{D}-AA^{\top})x,

𝔼xP^a[x2]=𝔼xP^a[(IDAA)x2]𝔼xP^a[(IDAA)x22].\mathbb{E}_{x\sim\widehat{P}_{a}}[\|x^{\perp}\|_{2}]=\mathbb{E}_{x\sim\widehat{P}_{a}}[\|(I_{D}-AA^{\top})x\|_{2}]\leq\sqrt{\mathbb{E}_{x\sim\widehat{P}_{a}}[\|(I_{D}-AA^{\top})x\|_{2}^{2}]}.

Score matching returns VV as an approximation of AA, then

(IDAA)x2\displaystyle\|(I_{D}-AA^{\top})x\|_{2} (IDVV)x2+(VVAA)x2,\displaystyle\leq\|(I_{D}-VV^{\top})x\|_{2}+\|(VV^{\top}-AA^{\top})x\|_{2},
𝔼xP^a[(IDAA)x22]\displaystyle\mathbb{E}_{x\sim\widehat{P}_{a}}[\|(I_{D}-AA^{\top})x\|^{2}_{2}] 2𝔼xP^a[(IDVV)x22]+2𝔼xP^a[(VVAA)x22],\displaystyle\leq 2\mathbb{E}_{x\sim\widehat{P}_{a}}[\|(I_{D}-VV^{\top})x\|^{2}_{2}]+2\mathbb{E}_{x\sim\widehat{P}_{a}}[\|(VV^{\top}-AA^{\top})x\|^{2}_{2}],

where by (B.2) in Lemma B.4 we have

(IDVV)x𝖭(𝟢,Λ),Λ𝖼𝗍𝟢𝖨(I_{D}-VV^{\top})x\sim\sf N(0,\Lambda),\quad\Lambda\prec ct_{0}I

for some constant c0c\geq 0. Thus

𝔼xP^a[(IDVV)x22]=Tr(Λ)ct0D.\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|(I_{D}-VV^{\top})x\|^{2}_{2}\right]=\operatorname{Tr}(\Lambda)\leq ct_{0}D. (B.5)

On the other hand,

(VVAA)x22VVAAop2x22VVAAF2x22,\|(VV^{\top}-AA^{\top})x\|^{2}_{2}\leq\|VV^{\top}-AA^{\top}\|^{2}_{op}\|x\|^{2}_{2}\leq\|VV^{\top}-AA^{\top}\|^{2}_{F}\|x\|^{2}_{2},

where \|VV^{\top}-AA^{\top}\|^{2}_{F} has an upper bound as in (B.3) and \mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|x\|^{2}_{2}\right] is bounded in Lemma C.3 by

\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|x\|^{2}_{2}\right]=\mathcal{O}\left(ct_{0}D+M(a)\cdot(1+TV(\widehat{P}_{a}))\right).

with M(a)=O(a2βΣ+d)M(a)=O\left(\frac{a^{2}}{\|{\beta}^{*}\|_{\Sigma}}+d\right).

Therefore, combining the bounds above, we have

𝔼xP^a[x2]\displaystyle\mathbb{E}_{x\sim\widehat{P}_{a}}[\|x^{\perp}\|_{2}] 2𝔼xP^a[(IDVV)x22]+2𝔼xP^a[(VVAA)x22]\displaystyle\leq\sqrt{2\mathbb{E}_{x\sim\widehat{P}_{a}}[\|(I_{D}-VV^{\top})x\|^{2}_{2}]+2\mathbb{E}_{x\sim\widehat{P}_{a}}[\|(VV^{\top}-AA^{\top})x\|^{2}_{2}]}
ct0D+2(V,A)𝔼xP^a[x22]\displaystyle\leq c^{\prime}\sqrt{t_{0}D}+2\sqrt{\angle({V},{A})}\cdot\sqrt{\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|x\|^{2}_{2}\right]}
=𝒪(t0D+(V,A)M(a)).\displaystyle=\mathcal{O}\left(\sqrt{t_{0}D}+\sqrt{\angle({V},{A})}\cdot\sqrt{M(a)}\right).

\mathcal{O} hides a multiplicative constant and the terms \sqrt{\angle({V},{A})t_{0}D} and \sqrt{\angle({V},{A})M(a)TV(\widehat{P}_{a})}, which have higher powers of n_{1}^{-1} than the leading term.

B.3 Proof of Theorem 6.3 and 6.9

The proofs of Theorem 6.3 and Theorem 6.9 are provided in this section, which breaks down into three parts: the suboptimality decomposition, bounding \mathcal{E}_{1} by relating it to offline bandits for the two dataset types, and bounding \mathcal{E}_{2} together with the distribution shift in diffusion.

B.3.1 SubOpt(P^a;y=a)\texttt{SubOpt}(\widehat{P}_{a};y^{*}=a) Decomposition.

Recall notations P^a:=P^(|y^=a)\widehat{P}_{a}:=\widehat{P}(\cdot|\widehat{y}=a) (generated distribution) and Pa:=P(|y^=a)P_{a}:=P(\cdot|\widehat{y}=a) (target distribution) and f(x)=g(x)+h(x)f^{*}(x)=g^{*}(x_{\parallel})+h^{*}(x_{\perp}). 𝔼xP^a[f(x)]\mathbb{E}_{x\sim\widehat{P}_{a}}[f^{\star}(x)] can be decomposed into 3 terms:

𝔼xP^a[f(x)]\displaystyle\mathbb{E}_{x\sim\widehat{P}_{a}}[f^{\star}(x)]\geq 𝔼xPa[f(x)]|𝔼xP^a[f(x)]𝔼xPa[f(x)]|\displaystyle\mathbb{E}_{x\sim P_{a}}[f^{*}(x)]-\left|\mathbb{E}_{x\sim\widehat{P}_{a}}[f^{*}(x)]-\mathbb{E}_{x\sim P_{a}}[f^{*}(x)]\right|
\displaystyle\geq 𝔼xPa[f^(x)]𝔼xPa[|f^(x)f(x)|]|𝔼xP^a[f(x)]𝔼xPa[f(x)]|\displaystyle\mathbb{E}_{x\sim P_{a}}[\widehat{f}(x)]-\mathbb{E}_{x\sim P_{a}}\left[\left|\widehat{f}(x)-f^{*}(x)\right|\right]-\left|\mathbb{E}_{x\sim\widehat{P}_{a}}[f^{*}(x)]-\mathbb{E}_{x\sim P_{a}}[f^{*}(x)]\right|
\displaystyle\geq 𝔼xPa[f^(x)]𝔼xPa[|f^(x)g(x)|]1|𝔼xPa[g(x)]𝔼xP^a[g(x)]|2𝔼xP^a[h(x)]3,\displaystyle\mathbb{E}_{x\sim P_{a}}[\widehat{f}(x)]-\underbrace{\mathbb{E}_{x\sim P_{a}}\left[\left|\widehat{f}(x)-g^{*}(x)\right|\right]}_{\mathcal{E}_{1}}-\underbrace{\left|\mathbb{E}_{x\sim P_{a}}[g^{*}(x_{\parallel})]-\mathbb{E}_{x\sim\widehat{P}_{a}}[g^{*}(x_{\parallel})]\right|}_{\mathcal{E}_{2}}-\underbrace{\mathbb{E}_{x\sim\widehat{P}_{a}}[h^{*}(x_{\perp})]}_{\mathcal{E}_{3}},

where \mathbb{E}_{x\sim P_{a}}[\widehat{f}(x)]=a by the definition of P_{a} (letting the label noise \nu\to 0), and we use x=x_{\parallel} and f^{*}(x)=g^{*}(x) when x\sim P_{a}. Therefore

SubOpt(P^a;y=a)\displaystyle\texttt{SubOpt}(\widehat{P}_{a};y^{*}=a) =a𝔼xP^a[f(x)]\displaystyle=a-\mathbb{E}_{x\sim\widehat{P}_{a}}[f^{\star}(x)]
𝔼xPa[|(θ^θ)x|]1+|𝔼xPa[g(x)]𝔼xP^a[g(x)]|2+𝔼xP^a[h(x)]3.\displaystyle\leq\underbrace{\mathbb{E}_{x\sim P_{a}}\left[\left|(\widehat{\theta}-\theta^{*})^{\top}x\right|\right]}_{\mathcal{E}_{1}}+\underbrace{\left|\mathbb{E}_{x\sim P_{a}}[g^{*}(x_{\parallel})]-\mathbb{E}_{x\sim\widehat{P}_{a}}[g^{*}(x_{\parallel})]\right|}_{\mathcal{E}_{2}}+\underbrace{\mathbb{E}_{x\sim\widehat{P}_{a}}[h^{*}(x_{\perp})]}_{\mathcal{E}_{3}}.

\mathcal{E}_{1} comes from regression: it is the prediction/generalization error of \widehat{f} on P_{a}, which is independent of any distribution-estimation error incurred by the diffusion model. \mathcal{E}_{2} and \mathcal{E}_{3} do not involve the regression estimate \widehat{f}, so they are independent of the prediction error in \widehat{f} used for pseudo-labeling. \mathcal{E}_{2} measures the disparity between \widehat{P}_{a} and P_{a} on the subspace support, and \mathcal{E}_{3} measures the off-subspace component of the generated \widehat{P}_{a}.

B.3.2 Bounding 1\mathcal{E}_{1} under the Labeled Dataset Setup.

For every x_{i}\in\mathcal{D}_{\rm label}, y_{i}=f^{*}(x_{i})+\epsilon_{i}=g^{*}(x_{i})+\epsilon_{i}. Thus, trained on \mathcal{D}_{\rm label}, the prediction model \widehat{f} essentially approximates g^{*}. Estimating \theta^{*} with ridge regression on \mathcal{D}_{\rm label} gives \widehat{f}(x)=\widehat{\theta}^{\top}x with

θ^=(XX+λI)1X(Xθ+η),\widehat{\theta}=\left(X^{\top}X+\lambda I\right)^{-1}X^{\top}\left(X\theta^{*}+\eta\right), (B.6)

where X=(x1,,xi,,xn2)X^{\top}=(x_{1},\cdots,x_{i},\cdots,x_{n_{2}}) and η=(ϵ1,,ϵi,,ϵn2)\eta=(\epsilon_{1},\cdots,\epsilon_{i},\cdots,\epsilon_{n_{2}}).
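
For concreteness, the ridge estimate in (B.6) can be formed directly from the stacked data; a minimal numpy sketch on synthetic subspace data (the dimensions, noise level, and variable names are illustrative assumptions, not part of the analysis) is:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, n2, lam, sigma = 16, 3, 500, 1.0, 0.1

A, _ = np.linalg.qr(rng.standard_normal((D, d)))   # orthonormal columns, A^T A = I_d
theta_star = A @ rng.standard_normal(d)            # true parameter, supported on the subspace
theta_star /= np.linalg.norm(theta_star)

Z = rng.standard_normal((n2, d))                   # latent covariates z_i
X = Z @ A.T                                        # rows are x_i = A z_i
y = X @ theta_star + sigma * rng.standard_normal(n2)

# Ridge estimate matching (B.6): theta_hat = (X^T X + lam I)^{-1} X^T y
theta_hat = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
print(np.linalg.norm(theta_hat - theta_star))      # small for moderate n2
```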

Lemma B.5.

Under Assumptions 3.1 and 6.1, and given \epsilon_{i}\sim{\sf N}(0,\sigma^{2}), define V_{\lambda}:=X^{\top}X+\lambda I, \widehat{\Sigma}_{\lambda}:=\frac{1}{n_{2}}V_{\lambda}, and \Sigma_{P_{a}}:=\mathbb{E}_{x\sim P_{a}}xx^{\top} the (uncentered) covariance matrix of P_{a}. Taking \lambda=1, with high probability

1Tr(Σ^λ1ΣPa)𝒪(dlogn2)n2.\mathcal{E}_{1}\leq\sqrt{\operatorname{Tr}(\widehat{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}})}\cdot\frac{\mathcal{O}\left(\sqrt{d\log n_{2}}\right)}{\sqrt{n_{2}}}. (B.7)

Proof of Lemma B.5 is in Appendix C.3.

Lemma B.6.

Under Assumptions 3.1, 6.1 and 5.3, when \lambda=1, the shift of P_{a} from the empirical marginal distribution of x in the dataset is bounded by

Tr(Σ^λ1ΣPa)𝒪(a2βΣ+d).\operatorname{Tr}(\widehat{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}})\leq\mathcal{O}\left(\frac{a^{2}}{\|\beta^{*}\|_{\Sigma}}+d\right). (B.8)

provided n_{2}=\Omega(\max\{\frac{1}{\lambda_{\min}},\frac{d}{\|\beta^{*}\|^{2}_{\Sigma}}\}).

Proof of Lemma B.6 is in Appendix C.4.

B.3.3 Bounding 1\mathcal{E}_{1} under the Human Preference Setup

Recall that in the human preference setup, we use maximum likelihood estimation (MLE) to learn the underlying parameter, yielding \widehat{\theta}. For all x_{i}^{(j)}\in\mathcal{D}_{\rm label}, j\in\{1,2\}, we have f^{*}(x_{i}^{(j)})=g^{*}(x_{i}^{(j)}). Therefore, by Assumption 6.1, the MLE solves the following objective:

\displaystyle\widehat{\theta}=\operatorname{argmin}_{\theta\in\Theta}-\frac{1}{n}\sum_{i=1}^{n}\bigg\{u_{i}^{\top}\theta-\log\bigg(\exp(\theta^{\top}x_{i}^{(1)})+\exp(\theta^{\top}x_{i}^{(2)})\bigg)\bigg\}. (B.9)
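
The objective (B.9) is a Bradley–Terry-type negative log-likelihood and can be minimized with standard first-order methods; a minimal numpy sketch using plain gradient descent (the synthetic preference data, step size, and iteration count are illustrative assumptions) is:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 5
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)

x1 = rng.standard_normal((n, d))
x2 = rng.standard_normal((n, d))
# u_i is the preferred item; preferences follow a Bradley-Terry model under theta_star
p1 = 1.0 / (1.0 + np.exp(-(x1 - x2) @ theta_star))
pick1 = rng.random(n) < p1
u = np.where(pick1[:, None], x1, x2)

def neg_loglik(theta):
    # the objective in (B.9)
    return -np.mean(u @ theta - np.logaddexp(x1 @ theta, x2 @ theta))

def grad(theta):
    s1, s2 = x1 @ theta, x2 @ theta
    w1 = np.exp(s1 - np.logaddexp(s1, s2))[:, None]      # softmax weight on x^(1)
    return -np.mean(u - (w1 * x1 + (1.0 - w1) * x2), axis=0)

theta = np.zeros(d)
for _ in range(2000):                                     # plain gradient descent
    theta -= 0.5 * grad(theta)
print(neg_loglik(theta), np.linalg.norm(theta - theta_star))
```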

We are now ready to present our results.

Lemma B.7.

With probability at least 1δ1-\delta, we have

1O(κ2d+log1/δn+λTr(𝔼Pa[xx](Σλ~)1))\displaystyle\mathcal{E}_{1}\leq O\bigg{(}\sqrt{\kappa^{2}\cdot\frac{d+\log 1/\delta}{n}+\lambda}\cdot\sqrt{\operatorname{Tr}(\mathbb{E}_{P_{a}}[xx^{\top}](\widetilde{\Sigma_{\lambda}})^{-1})}\bigg{)}

where \kappa=\big(1+\exp(\sqrt{\log n/\delta})\big)^{2}.

Proof of Lemma B.7 is in Appendix C.5.

B.3.4 Bounding 2\mathcal{E}_{2} and the Distribution Shift in Diffusion

Lemma B.8.

Under Assumptions 3.1, 6.1 and 5.3, when t_{0}=\left((Dd^{2}+D^{2}d)/n_{1}\right)^{1/6}, it holds that

2=𝒪~(𝒯(P(x,y^=a),Pxy^;𝒮¯)λmin(Dd2+D2dn1)16a).\displaystyle\mathcal{E}_{2}=\widetilde{\mathcal{O}}\left(\sqrt{\frac{{\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}})}{\lambda_{\min}}}\cdot\left(\frac{Dd^{2}+D^{2}d}{n_{1}}\right)^{\frac{1}{6}}\cdot a\right).

Proof of Lemma B.8 can be found in Appendix C.6.

Note that {\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}}) depends on a and measures the distribution shift between the desired distribution P(x,\widehat{y}=a) and the data distribution P_{x\widehat{y}}. To understand this dependence on a, in what follows we abbreviate {\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}}) as \texttt{DistroShift}^{2}(a) and derive an upper bound on it in one special case of the problem.

Distribution Shift

In the special case where the covariance \Sigma of z is known and \|A-V\|_{2}^{2}=\mathcal{O}\left(\|AA^{\top}-VV^{\top}\|_{\rm F}^{2}\right), we showcase a bound on the distribution shift in \mathcal{E}_{2}, as promised in the discussion following Theorem 6.3. We have

DistroShift2(a)=𝔼Px,y^=a[(x,y;s^)]𝔼Pxy^[(x,y;s^)],\displaystyle\texttt{DistroShift}^{2}(a)=\frac{\mathbb{E}_{P_{x,\widehat{y}=a}}[\ell(x,y;\widehat{s})]}{\mathbb{E}_{P_{x\widehat{y}}}[\ell(x,y;\widehat{s})]},

where (x,y;s^)=1Tt0t0T𝔼x|xxlogϕt(x|x)s^(x,y,t)22dt\ell(x,y;\widehat{s})=\frac{1}{T-t_{0}}\int_{t_{0}}^{T}\mathbb{E}_{x^{\prime}|x}\|\nabla_{x^{\prime}}\log\phi_{t}(x^{\prime}|x)-\widehat{s}(x^{\prime},y,t)\|_{2}^{2}\mathrm{d}t. By Proposition 4.1, it suffices to bound

DistroShift2(a)=𝔼Px,y^=a[t0Tlogpt(x,y)s^(x,y,t)22dt]𝔼Pxy^[t0Tlogpt(x,y)s^(x,y,t)22dt].\displaystyle\texttt{DistroShift}^{2}(a)=\frac{\mathbb{E}_{P_{x,\widehat{y}=a}}[\int_{t_{0}}^{T}\|\nabla\log p_{t}(x,y)-\widehat{s}(x,y,t)\|_{2}^{2}\mathrm{d}t]}{\mathbb{E}_{P_{x\widehat{y}}}[\int_{t_{0}}^{T}\|\nabla\log p_{t}(x,y)-\widehat{s}(x,y,t)\|_{2}^{2}\mathrm{d}t]}.

We expand the difference logpt(x,y)s^(x,y,t)22\|\nabla\log p_{t}(x,y)-\widehat{s}(x,y,t)\|_{2}^{2} by

logpt(x,y)s^(x,y,t)22\displaystyle\|\nabla\log p_{t}(x,y)-\widehat{s}(x,y,t)\|_{2}^{2} 2h2(t)[(AV)Bt(Ax+ν2yθ)22+VBt(AV)x22]\displaystyle\leq\frac{2}{h^{2}(t)}\Big{[}\|(A-V)B_{t}(A^{\top}x+\nu^{-2}y\theta)\|_{2}^{2}+\|VB_{t}(A-V)^{\top}x\|_{2}^{2}\Big{]}
2h2(t)[AV22Bt(Ax+ν2yθ)22+AV22x22]\displaystyle\leq\frac{2}{h^{2}(t)}\Big{[}\|A-V\|_{2}^{2}\|B_{t}(A^{\top}x+\nu^{-2}y\theta)\|_{2}^{2}+\|A-V\|_{2}^{2}\|x\|_{2}^{2}\Big{]}
2h2(t)AV22(3x22+y2),\displaystyle\leq\frac{2}{h^{2}(t)}\|A-V\|_{2}^{2}(3\|x\|_{2}^{2}+y^{2}),

where we recall B_{t} is defined in (E) and, in the last inequality, we use (a+b)^{2}\leq 2a^{2}+2b^{2}. In the case where the covariance matrix \Sigma is known, i.e., B_{t} is known, we also assume that the matrix V directly matches A without rotation. Then by Chen et al. [2023a, Lemmas 3 and 17], we have \|A-V\|_{2}^{2}=\mathcal{O}\left(\|AA^{\top}-VV^{\top}\|_{\rm F}^{2}\right)=\mathcal{O}\left(t_{0}/c_{0}\cdot\mathbb{E}_{P_{x\widehat{y}}}[\ell(x,y;\widehat{s})]\right). To this end, we only need to find \mathbb{E}_{P_{x|\widehat{y}=a}}[\|x\|_{2}^{2}]. Since we consider on-support x, which can be represented as x=Az, we have \|x\|_{2}=\|z\|_{2}. Thus, we only need the conditional distribution of z given \widehat{y}=a. Fortunately, (z,\widehat{y}) is jointly Gaussian, with mean 0 and covariance

[ΣΣβ^β^Σβ^Σβ^+ν2].\displaystyle\begin{bmatrix}\Sigma&\Sigma\widehat{\beta}\\ \widehat{\beta}^{\top}\Sigma&\widehat{\beta}^{\top}\Sigma\widehat{\beta}+\nu^{2}\end{bmatrix}.

Consequently, the conditional distribution of z|y^=az|\widehat{y}=a is still Gaussian, with mean Σβ^a/(β^Σβ^+ν2)\Sigma\widehat{\beta}a/(\widehat{\beta}^{\top}\Sigma\widehat{\beta}+\nu^{2}) and covariance ΣΣβ^β^Σ/(β^Σβ^+ν2)\Sigma-\Sigma\widehat{\beta}\widehat{\beta}^{\top}\Sigma/(\widehat{\beta}^{\top}\Sigma\widehat{\beta}+\nu^{2}). Hence, we have

𝔼Pz|y^=a[z22]=1(β^Σβ^+ν2)2((a2β^Σβ^ν2)β^Σ2β^)+Tr(Σ)=𝒪(a2d).\displaystyle\mathbb{E}_{P_{z|\widehat{y}=a}}[\|z\|_{2}^{2}]=\frac{1}{(\widehat{\beta}^{\top}\Sigma\widehat{\beta}+\nu^{2})^{2}}\left((a^{2}-\widehat{\beta}^{\top}\Sigma\widehat{\beta}-\nu^{2})\widehat{\beta}^{\top}\Sigma^{2}\widehat{\beta}\right)+{\rm Tr}(\Sigma)=\mathcal{O}\left(a^{2}\vee d\right).

We integrate over t in the numerator of \texttt{DistroShift}(a) to obtain \mathbb{E}_{P_{x|\widehat{y}=a}}[\int_{t_{0}}^{T}\|\nabla\log p_{t}(x,y)-\widehat{s}(x,y,t)\|_{2}^{2}\mathrm{d}t]=\mathcal{O}\left((a^{2}\vee d)\frac{1}{c_{0}}\mathbb{E}_{P_{x\widehat{y}}}[\ell(x,y;\widehat{s})]\right). Noting the cancellation between the numerator and the denominator, we conclude

DistroShift(a)=𝒪(1c0(ad)).\displaystyle\texttt{DistroShift}(a)=\mathcal{O}\left(\frac{1}{c_{0}}(a\vee\sqrt{d})\right).

Since d is a natural upper bound on \sqrt{d} and viewing c_{0} as a constant, we have \texttt{DistroShift}(a)=\mathcal{O}\left(a\vee d\right) as desired.

Appendix C Supporting Lemmas and Proofs

C.1 Supporting Lemmas

Lemma C.1.

The estimated subspace VV satisfies

VUAF=𝒪(d32(V,A))\|VU-A\|_{F}=\mathcal{O}\left(d^{\frac{3}{2}}\sqrt{\angle({V},{A})}\right) (C.1)

for some orthogonal matrix Ud×dU\in\mathbb{R}^{d\times d}.

Proof of Lemma C.1 is in Appendix C.7.

Lemma C.2.

Suppose P_{1} and P_{2} are two distributions over \mathbb{R}^{d} and m is a function defined on \mathbb{R}^{d}; then \left|\mathbb{E}_{z\sim P_{1}}[m(z)]-\mathbb{E}_{z\sim P_{2}}[m(z)]\right| can be bounded in terms of \texttt{d}_{\rm TV}(P_{1},P_{2}). Specifically, when P_{1} and P_{2} are Gaussians and m(z)=\|z\|^{2}_{2}:

\mathbb{E}_{z\sim P_{1}}[\|z\|^{2}_{2}]=\mathcal{O}\left(\mathbb{E}_{z\sim P_{2}}[\|z\|^{2}_{2}](1+\texttt{d}_{\rm TV}(P_{1},P_{2}))\right). (C.2)

When P1P_{1} and P2P_{2} are Gaussians and m(z)=z2m(z)=\|z\|_{2}:

|𝔼zP1[z2]𝔼zP2[z2]|=𝒪((𝔼zP1[z22]+𝔼zP2[z22])dTV(P1,P2)).\left|\mathbb{E}_{z\sim P_{1}}[\|z\|_{2}]-\mathbb{E}_{z\sim P_{2}}[\|z\|_{2}]\right|=\mathcal{O}\left(\left(\sqrt{\mathbb{E}_{z\sim P_{1}}[\|z\|_{2}^{2}]}+\sqrt{\mathbb{E}_{z\sim P_{2}}[\|z\|_{2}^{2}]}\right)\cdot\texttt{d}_{\rm TV}(P_{1},P_{2})\right). (C.3)

Proof of Lemma C.2 is in Appendix C.8.

Lemma C.3.

We compute 𝔼zPLD(a)[z22],𝔼zPt0LD(a)[z22],𝔼xP^a[x22],𝔼z(UV)#P^a[z22]\mathbb{E}_{z\sim P^{LD}(a)}[\left\|z\right\|_{2}^{2}],\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}[\|z\|_{2}^{2}],\mathbb{E}_{x\sim\widehat{P}_{a}}[\|x\|_{2}^{2}],\mathbb{E}_{z\sim(U^{\top}V^{\top})_{\#}\widehat{P}_{a}}[\|z\|_{2}^{2}] in this Lemma.

𝔼zPLD(a)[z22]=β^Σ2β^(β^Σ2+ν2)2a2+trace(ΣΣβ^(β^Σβ^+ν2)1β^Σ).\mathbb{E}_{z\sim P^{LD}(a)}[\left\|z\right\|_{2}^{2}]=\frac{\widehat{\beta}^{\top}\Sigma^{2}\widehat{\beta}}{\left(\|\widehat{\beta}\|_{\Sigma}^{2}+\nu^{2}\right)^{2}}a^{2}+\operatorname{trace}(\Sigma-\Sigma\widehat{\beta}\left(\widehat{\beta}^{\top}\Sigma\widehat{\beta}+\nu^{2}\right)^{-1}\widehat{\beta}^{\top}\Sigma). (C.4)

Let M(a):=𝔼zPLD(a)[z22]M(a):=\mathbb{E}_{z\sim P^{LD}(a)}[\left\|z\right\|^{2}_{2}], which has an upper bound M(a)=O(a2βΣ+d)M(a)=O\left(\frac{a^{2}}{\|{\beta}^{*}\|_{\Sigma}}+d\right).

𝔼zPt0LD(a)[z22]M(a)+t0d.\displaystyle\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}[\|z\|^{2}_{2}]\leq M(a)+t_{0}d. (C.5)
𝔼xP^a[x22]𝒪(ct0D+M(a)(1+TV(P^a)).\displaystyle\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|x\|^{2}_{2}\right]\leq\mathcal{O}\left(ct_{0}D+M(a)\cdot(1+TV(\widehat{P}_{a})\right). (C.6)

Proof of Lemma C.3 is in Appendix C.9.

C.2 Proof of Lemma B.4

The first two assertions (B.2) and (B.3) are consequences of Chen et al. [2023a, Theorem 3, items 1 and 3]. To show (B.4), we first note that the conditional score matching error under distribution shift is

𝒯(P(x,y^=a),Pxy^;𝒮¯)ϵdiff2,\displaystyle{\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}})\cdot\epsilon_{diff}^{2},

where 𝒯(P(x,y^=a),Pxy^;𝒮¯){\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}}) accounts for the distribution shift as in the parametric case (Lemma B.8). Then we apply Chen et al. [2023a, Theorem 3, item 2] to conclude

TV(P^a)=𝒪~(𝒯(P(x,y^=a),Pxy^;𝒮¯)c0ϵdiff).\displaystyle TV(\widehat{P}_{a})=\widetilde{\mathcal{O}}\left(\sqrt{\frac{{\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}})}{c_{0}}}\cdot\epsilon_{diff}\right).

The proof is complete.

C.3 Proof of Lemma B.5

Given

\displaystyle\mathcal{E}_{1}=\mathbb{E}_{P_{a}}\left|x^{\top}(\theta^{*}-\widehat{\theta})\right|\leq\mathbb{E}_{P_{a}}\|x\|_{V_{\lambda}^{-1}}\cdot\|\theta^{*}-\widehat{\theta}\|_{V_{\lambda}},

it suffices to prove

\displaystyle\mathbb{E}_{P_{a}}\|x\|_{V_{\lambda}^{-1}}\leq\sqrt{\operatorname{trace}(V_{\lambda}^{-1}\Sigma_{P_{a}})}; (C.7)
\displaystyle\|\theta^{*}-\widehat{\theta}\|_{V_{\lambda}}\leq\mathcal{O}\left(\sqrt{d\log n_{2}}\right), (C.8)

where the second inequality is to be proven with high probability w.r.t. the randomness in \mathcal{D}_{label}. For (C.7), \mathbb{E}_{P_{a}}\|x\|_{V_{\lambda}^{-1}}\leq\sqrt{\mathbb{E}_{P_{a}}x^{\top}V_{\lambda}^{-1}x}=\sqrt{\mathbb{E}_{P_{a}}\operatorname{trace}(V_{\lambda}^{-1}xx^{\top})}=\sqrt{\operatorname{trace}(V_{\lambda}^{-1}\mathbb{E}_{P_{a}}xx^{\top})}.

For (C.8), what is new to prove compared to a classic bandit derivation is the d dependence instead of D, due to the linear subspace structure in x. From the closed-form solution of \widehat{\theta}, we have

θ^θ=Vλ1XηλVλ1θ.\widehat{\theta}-\theta^{*}=V_{\lambda}^{-1}X^{\top}\eta-\lambda V_{\lambda}^{-1}\theta^{*}. (C.9)

Therefore,

θθ^VλXηVλ1+λθVλ1,\|\theta^{*}-\widehat{\theta}\|_{V_{\lambda}}\leq\|X^{\top}\eta\|_{V_{\lambda}^{-1}}+\lambda\|\theta^{*}\|_{V_{\lambda}^{-1}}, (C.10)

where λθVλ1λθ2λ\lambda\|\theta^{*}\|_{V_{\lambda}^{-1}}\leq\sqrt{\lambda}\|\theta^{*}\|_{2}\leq\sqrt{\lambda} and

XηVλ12\displaystyle\|X^{\top}\eta\|_{V_{\lambda}^{-1}}^{2} =ηX(XX+λID)1Xη\displaystyle=\eta^{\top}X\left(X^{\top}X+\lambda I_{D}\right)^{-1}X^{\top}\eta
=ηXX(XX+λIn2)1η.\displaystyle=\eta^{\top}XX^{\top}\left(XX^{\top}+\lambda I_{n_{2}}\right)^{-1}\eta.

Let Z^{\top}=(z_{1},\cdots,z_{i},\cdots,z_{n_{2}}) be such that Az_{i}=x_{i}; then it holds that X=ZA^{\top} and XX^{\top}=ZA^{\top}AZ^{\top}=ZZ^{\top}, thus

XηVλ12\displaystyle\|X^{\top}\eta\|_{V_{\lambda}^{-1}}^{2} =ηXX(XX+λIn2)1η\displaystyle=\eta^{\top}XX^{\top}\left(XX^{\top}+\lambda I_{n_{2}}\right)^{-1}\eta
=ηZZ(ZZ+λIn2)1η\displaystyle=\eta^{\top}ZZ^{\top}\left(ZZ^{\top}+\lambda I_{n_{2}}\right)^{-1}\eta
=ηZ(ZZ+λId)1Zη\displaystyle=\eta^{\top}Z\left(Z^{\top}Z+\lambda I_{d}\right)^{-1}Z^{\top}\eta
=Zη(ZZ+λId)1.\displaystyle=\|Z^{\top}\eta\|_{\left(Z^{\top}Z+\lambda I_{d}\right)^{-1}}.
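
The dimension reduction above only uses the push-through identity X\left(X^{\top}X+\lambda I_{D}\right)^{-1}X^{\top}=Z\left(Z^{\top}Z+\lambda I_{d}\right)^{-1}Z^{\top} for X=ZA^{\top} with A^{\top}A=I_{d}; a quick numerical sanity check (the sizes below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
D, d, n2, lam = 12, 3, 40, 1.0

A, _ = np.linalg.qr(rng.standard_normal((D, d)))   # A^T A = I_d
Z = rng.standard_normal((n2, d))
X = Z @ A.T                                        # x_i = A z_i
eta = rng.standard_normal(n2)

# ||X^T eta||^2 in the V_lambda^{-1} norm vs ||Z^T eta||^2 in the (Z^T Z + lam I_d)^{-1} norm
lhs = (X.T @ eta) @ np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ eta)
rhs = (Z.T @ eta) @ np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ eta)
print(lhs, rhs)                                    # agree up to floating-point error
```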

With probability 1δ1-\delta, zi2d+dlog(2n2δ):=L2,i[n2]\|z_{i}\|^{2}\leq d+\sqrt{d\log\left(\frac{2n_{2}}{\delta}\right)}:=L^{2},\forall i\in[n_{2}]. Then applying Abbasi-Yadkori et al. [2011, Theorem 1] gives rise to

Zη(ZZ+λId)12log(2/δ)+dlog(1+n2L2/(λd))\|Z^{\top}\eta\|_{\left(Z^{\top}Z+\lambda I_{d}\right)^{-1}}\leq\sqrt{2\log(2/\delta)+d\log(1+n_{2}L^{2}/(\lambda d))}

with probability 1-\delta/2. Combining these bounds and plugging in \lambda=1 and L^{2}=d+\sqrt{d\log\left(\frac{2n_{2}}{\delta}\right)}, we have with high probability

θθ^Vλ=𝒪(dlog(n2log(n2)))=𝒪(dlogn2+12dlog(logn2))=𝒪(dlogn2).\|\theta^{*}-\widehat{\theta}\|_{V_{\lambda}}=\mathcal{O}\left(\sqrt{d\log\left(n_{2}\sqrt{\log(n_{2})}\right)}\right)=\mathcal{O}\left(\sqrt{d\log n_{2}+\frac{1}{2}d\log(\log n_{2})}\right)=\mathcal{O}\left(\sqrt{d\log n_{2}}\right).

C.4 Proof of Lemma B.6

Recall the definition of Σ^λ\widehat{\Sigma}_{\lambda} and ΣPa\Sigma_{P_{a}} that

Σ^λ\displaystyle\widehat{\Sigma}_{\lambda} =1n2XX+λn2ID,\displaystyle=\frac{1}{n_{2}}X^{\top}X+\frac{\lambda}{n_{2}}I_{D},
ΣPa\displaystyle\Sigma_{P_{a}} =𝔼xPa[xx],\displaystyle=\mathbb{E}_{x\sim P_{a}}\left[xx^{\top}\right],

where X is the stacked data matrix supported on \mathcal{A}, P_{a} is also supported on \mathcal{A}, and \mathcal{A} is the subspace encoded by the matrix A. The following lemma shows that it is equivalent to measure \operatorname{trace}(\widehat{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}}) on the subspace \mathcal{A}.

Lemma C.4.

For any P.S.D. matrices Σ1,Σ2d×d\Sigma_{1},\Sigma_{2}\in\mathbb{R}^{d\times d} and AD×dA\in\mathbb{R}^{D\times d} such that AA=IdA^{\top}A=I_{d}, we have

Tr((λID+AΣ1A)1AΣ2A)=Tr((λId+Σ1)1Σ2).\displaystyle\mathrm{Tr}\left((\lambda I_{D}+A\Sigma_{1}A^{\top})^{-1}A\Sigma_{2}A^{\top}\right)=\mathrm{Tr}\left(\left(\lambda I_{d}+\Sigma_{1}\right)^{-1}\Sigma_{2}\right).
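
Lemma C.4 is a deterministic linear-algebra identity, so it can be checked directly; a minimal numpy sketch with random PSD \Sigma_{1},\Sigma_{2} and a random orthonormal A (all sizes here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
D, d, lam = 10, 4, 0.7

A, _ = np.linalg.qr(rng.standard_normal((D, d)))        # A^T A = I_d
B1 = rng.standard_normal((d, d)); S1 = B1 @ B1.T        # PSD Sigma_1
B2 = rng.standard_normal((d, d)); S2 = B2 @ B2.T        # PSD Sigma_2

lhs = np.trace(np.linalg.solve(lam * np.eye(D) + A @ S1 @ A.T, A @ S2 @ A.T))
rhs = np.trace(np.linalg.solve(lam * np.eye(d) + S1, S2))
print(lhs, rhs)                                         # equal up to floating-point error
```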

The lemma above allows us to abuse notations Σ^λ\widehat{\Sigma}_{\lambda} and ΣPa\Sigma_{P_{a}} in the following way while keeping the same trace(Σ^λ1ΣPa)\operatorname{trace}(\widehat{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}}) value:

Σ^λ\displaystyle\widehat{\Sigma}_{\lambda} =1n2ZZ+λn2Id,\displaystyle=\frac{1}{n_{2}}Z^{\top}Z+\frac{\lambda}{n_{2}}I_{d},
ΣPa\displaystyle\Sigma_{P_{a}} =𝔼zPLD(a)[zz],\displaystyle=\mathbb{E}_{z\sim P^{LD}(a)}\left[zz^{\top}\right],

where Z^{\top}=(z_{1},\cdots,z_{i},\cdots,z_{n_{2}}) is such that Az_{i}=x_{i}, and recall the notation P^{LD}(a)=P_{z}\left(z\mid\widehat{f}(Az)=a\right).

Given z\sim{\sf N}(\mu,\Sigma), as a proof artifact, let \widehat{f}(x)=\widehat{\theta}^{\top}x+\xi with \xi\sim{\sf N}(0,\nu^{2}), where we let \nu\to 0 in the end. Letting \widehat{\beta}=A^{\top}\widehat{\theta}\in\mathbb{R}^{d}, the pair (z,\widehat{f}(Az)) has the joint distribution

(z,f^)𝖭([μβ^μ],[ΣΣβ^β^Σβ^Σβ^+ν2]).(z,\widehat{f})\sim{\sf N}\left(\begin{bmatrix}\mu\\ \widehat{\beta}^{\top}\mu\end{bmatrix},\begin{bmatrix}\Sigma&\Sigma\widehat{\beta}\\ \widehat{\beta}^{\top}\Sigma&\widehat{\beta}^{\top}\Sigma\widehat{\beta}+\nu^{2}\end{bmatrix}\right). (C.11)

Then we have the conditional distribution zf^(Az)=az\mid\widehat{f}(Az)=a following

Pz(zf^(Az)=a)=𝖭(μ+Σβ^(β^Σβ^+ν2)1(aβ^μ),Γ)P_{z}\left(z\mid\widehat{f}(Az)=a\right)={\sf N}\left(\mu+\Sigma\widehat{\beta}\left(\widehat{\beta}^{\top}\Sigma\widehat{\beta}+\nu^{2}\right)^{-1}(a-\widehat{\beta}^{\top}\mu),\Gamma\right) (C.12)

with Γ:=ΣΣβ^(β^Σβ^+ν2)1β^Σ\Gamma:=\Sigma-\Sigma\widehat{\beta}\left(\widehat{\beta}^{\top}\Sigma\widehat{\beta}+\nu^{2}\right)^{-1}\widehat{\beta}^{\top}\Sigma.

When μ=0\mu=0, we compute trace(Σ^λ1ΣPa)\operatorname{trace}(\widehat{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}}) as

trace(Σ^λ1ΣPa)\displaystyle\operatorname{trace}(\widehat{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}}) =trace(Σ^λ1Σβ^β^Σ(β^Σ2+ν2)2a2)+trace(Σ^λ1Γ)\displaystyle=\operatorname{trace}\left(\widehat{\Sigma}_{\lambda}^{-1}\frac{\Sigma\widehat{\beta}\widehat{\beta}^{\top}\Sigma}{\left(\|\widehat{\beta}\|_{\Sigma}^{2}+\nu^{2}\right)^{2}}a^{2}\right)+\operatorname{trace}\left(\widehat{\Sigma}_{\lambda}^{-1}\Gamma\right)
=trace(β^ΣΣ^λ1Σβ^(β^Σ2+ν2)2a2)+trace(Σ^λ1Σ)trace(Σ^λ1Σβ^β^Σβ^Σ2+ν2)\displaystyle=\operatorname{trace}\left(\frac{\widehat{\beta}^{\top}\Sigma\widehat{\Sigma}_{\lambda}^{-1}\Sigma\widehat{\beta}}{\left(\|\widehat{\beta}\|_{\Sigma}^{2}+\nu^{2}\right)^{2}}a^{2}\right)+\operatorname{trace}\left(\widehat{\Sigma}_{\lambda}^{-1}\Sigma\right)-\operatorname{trace}\left(\widehat{\Sigma}_{\lambda}^{-1}\frac{\Sigma\widehat{\beta}\widehat{\beta}^{\top}\Sigma}{\|\widehat{\beta}\|_{\Sigma}^{2}+\nu^{2}}\right)
\displaystyle\leq\operatorname{trace}\left(\frac{\Sigma^{1/2}\widehat{\beta}\widehat{\beta}^{\top}\Sigma^{1/2}\Sigma^{1/2}\widehat{\Sigma}_{\lambda}^{-1}\Sigma^{1/2}}{\left(\|\widehat{\beta}\|_{\Sigma}^{2}+\nu^{2}\right)^{2}}a^{2}\right)+\operatorname{trace}\left(\widehat{\Sigma}_{\lambda}^{-1}\Sigma\right)
ΣΣ^λ1Σopβ^Σ2(β^Σ2+ν2)2a2+trace(Σ12Σ^λ1Σ12)\displaystyle\leq\frac{\|\Sigma\widehat{\Sigma}_{\lambda}^{-1}\Sigma\|_{op}\cdot\|\widehat{\beta}\|_{\Sigma}^{2}}{\left(\|\widehat{\beta}\|_{\Sigma}^{2}+\nu^{2}\right)^{2}}\cdot a^{2}+\operatorname{trace}\left(\Sigma^{\frac{1}{2}}\widehat{\Sigma}_{\lambda}^{-1}\Sigma^{\frac{1}{2}}\right)

By Lemma 3 in Chen et al. [2020], it holds that

Σ12Σ^λ1Σ12Id2O(1λminn2).\|\Sigma^{\frac{1}{2}}\widehat{\Sigma}_{\lambda}^{-1}\Sigma^{\frac{1}{2}}-I_{d}\|_{2}\leq O\left(\frac{1}{\sqrt{\lambda_{\min}n_{2}}}\right). (C.13)

Therefore,

trace(Σ^λ1ΣPa)\displaystyle\operatorname{trace}(\widehat{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}}) 1+1λminn2β^Σ2a2+O(d(1+1λminn2)).\displaystyle\leq\frac{1+\frac{1}{\sqrt{\lambda_{\min}n_{2}}}}{\|\widehat{\beta}\|_{\Sigma}^{2}}\cdot a^{2}+O\left(d\left(1+\frac{1}{\sqrt{\lambda_{\min}n_{2}}}\right)\right).

Then, what is left is to lower bound \|\widehat{\beta}\|_{\Sigma}=\|\widehat{\theta}\|_{A\Sigma A^{\top}}\geq\|\theta^{*}\|_{A\Sigma A^{\top}}-\|\widehat{\theta}-\theta^{*}\|_{A\Sigma A^{\top}}, by the triangle inequality. On one hand,

θAΣA=βΣ.\|\theta^{*}\|_{A\Sigma A^{\top}}=\|\beta^{*}\|_{\Sigma}. (C.14)

On the other hand,

θ^θAΣA\displaystyle\|\widehat{\theta}-\theta^{*}\|_{A\Sigma A^{\top}} =𝒪(θ^θΣ^λ)\displaystyle=\mathcal{O}\left(\|\widehat{\theta}-\theta^{*}\|_{\widehat{\Sigma}_{\lambda}}\right)
=𝒪(θ^θVλn2)\displaystyle=\mathcal{O}\left(\frac{\|\widehat{\theta}-\theta^{*}\|_{V_{\lambda}}}{\sqrt{n_{2}}}\right)
=𝒪(dlog(n2)n2).\displaystyle=\mathcal{O}\left(\sqrt{\frac{d\log(n_{2})}{n_{2}}}\right).

with high probability. Thus when n2=Ω(dβΣ2)n_{2}=\Omega(\frac{d}{\|\beta^{*}\|^{2}_{\Sigma}})

β^Σ12βΣ.\|\widehat{\beta}\|_{\Sigma}\geq\frac{1}{2}\|\beta^{*}\|_{\Sigma}.

Therefore

trace(Σ^λ1ΣPa)𝒪(1+1λminn2βΣa2+d(1+1λminn2))=𝒪(a2βΣ+d).\operatorname{trace}(\widehat{\Sigma}_{\lambda}^{-1}\Sigma_{P_{a}})\leq\mathcal{O}\left(\frac{1+\frac{1}{\sqrt{\lambda_{\min}n_{2}}}}{\|\beta^{*}\|_{\Sigma}}\cdot a^{2}+d\left(1+\frac{1}{\sqrt{\lambda_{\min}n_{2}}}\right)\right)\\ =\mathcal{O}\left(\frac{a^{2}}{\|\beta^{*}\|_{\Sigma}}+d\right). (C.15)

when n2=Ω(max{1λmin,dβΣ2})n_{2}=\Omega(\max\{\frac{1}{\lambda_{\min}},\frac{d}{\|\beta^{*}\|^{2}_{\Sigma}}\}).

C.5 Proof of Lemma B.7

Denote the loss function in Equation (B.9) by l(θ)l(\theta). By simple algebra we have

\displaystyle\nabla l(\theta)=\frac{1}{n_{2}}\sum_{i=1}^{n_{2}}\bigg\{\frac{\exp(\theta^{\top}x_{i}^{(1)})\cdot(x_{i}^{(1)}-u_{i})+\exp(\theta^{\top}x_{i}^{(2)})\cdot(x_{i}^{(2)}-u_{i})}{\exp(\theta^{\top}x_{i}^{(1)})+\exp(\theta^{\top}x_{i}^{(2)})}\bigg\},
2l(θ)=1n2i=1n2{exp((xi(1)+xi(2))θ)(exp(θxi(1))+exp(θxi(2)))2(xi(1)xi(2))(xi(1)xi(2))},\displaystyle\nabla^{2}l(\theta)=\frac{1}{n_{2}}\sum_{i=1}^{n_{2}}\bigg{\{}\frac{\exp((x_{i}^{(1)}+x_{i}^{(2)})^{\top}\theta)}{(\exp(\theta^{\top}x_{i}^{(1)})+\exp(\theta^{\top}x_{i}^{(2)}))^{2}}\cdot(x_{i}^{(1)}-x_{i}^{(2)})(x_{i}^{(1)}-x_{i}^{(2)})^{\top}\bigg{\}},

holds for all θd\theta\in\mathbb{R}^{d}. By Assumption 5.3 and Hoeffding’s inequality, we have

\mathbb{P}\big(\max_{i\in[n_{2}]}\big\{|\theta^{*\top}x_{i}^{(1)}|,|\theta^{*\top}x_{i}^{(2)}|\big\}\leq\sqrt{\log(2n_{2}/\delta)}\big)\geq 1-\delta.

In the following proof, we will condition on this event. We have

2l(θ)\displaystyle\nabla^{2}l(\theta^{*}) =1n2i=1n2(11+exp((xi(1)xi(2))θ)11+exp((xi(2)xi(1))θ))(xi(1)xi(2))(xi(1)xi(2))\displaystyle=\frac{1}{n_{2}}\sum_{i=1}^{n_{2}}\bigg{(}\frac{1}{1+\exp((x_{i}^{(1)}-x_{i}^{(2)})^{\top}\theta^{*})}\cdot\frac{1}{1+\exp((x_{i}^{(2)}-x_{i}^{(1)})^{\top}\theta^{*})}\bigg{)}\cdot(x_{i}^{(1)}-x_{i}^{(2)})(x_{i}^{(1)}-x_{i}^{(2)})^{\top}
1n21(1+exp(log2n2/δ))2i=1n(xi(1)xi(2))(xi(1)xi(2)).\displaystyle\succeq\frac{1}{n_{2}}\cdot\frac{1}{(1+\exp(\sqrt{\log 2n_{2}/\delta}))^{2}}\sum_{i=1}^{n}(x_{i}^{(1)}-x_{i}^{(2)})(x_{i}^{(1)}-x_{i}^{(2)})^{\top}.

In the following we adopt the notation \Sigma_{x}=\frac{1}{n}\sum_{i=1}^{n}(x_{i}^{(1)}-x_{i}^{(2)})(x_{i}^{(1)}-x_{i}^{(2)})^{\top}; then \nabla^{2}l(\theta^{*})\succeq 1/\kappa\cdot\Sigma_{x}. To bound \mathbb{E}_{x\sim P_{a}}[|\widehat{f}(x)-f^{*}(x)|], we have the following inequality:

𝔼xPa[|f^(x)f(x)|]\displaystyle\mathbb{E}_{x\sim P_{a}}[|\widehat{f}(x)-f^{*}(x)|] =𝔼xPa[|x(θ^θ)|]\displaystyle=\mathbb{E}_{x\sim P_{a}}[|x^{\top}(\widehat{\theta}-\theta^{*})|]
=𝔼z(A)Pa[|z(Aθ^β)|]\displaystyle=\mathbb{E}_{z\sim(A^{\top})_{\sharp}P_{a}}[|z^{\top}(A^{\top}\widehat{\theta}-\beta)|]
𝔼z(A)Pa[z(Σz+λId)1](i)Δz(Σz+λId)1(ii)\displaystyle\leq\underbrace{\mathbb{E}_{z\sim(A^{\top})_{\sharp}P_{a}}[\|z\|_{(\Sigma_{z}+\lambda I_{d})^{-1}}]}_{\operatorname{(i)}}\cdot\underbrace{\|\Delta_{z}\|_{(\Sigma_{z}+\lambda I_{d})^{-1}}}_{\operatorname{(ii)}}

Here Δz=Aθ^β\Delta_{z}=A^{\top}\widehat{\theta}-\beta. We also denote Δx=θ^θ\Delta_{x}=\widehat{\theta}-\theta^{*}. To bound (i) note that by Jensen’s inequality, we have

\mathbb{E}_{z\sim(A^{\top})_{\sharp}P_{a}}[\|z\|_{(\Sigma_{z}+\lambda I_{d})^{-1}}]\leq\sqrt{\mathbb{E}_{z\sim(A^{\top})_{\sharp}P_{a}}\big[\|z\|^{2}_{(\Sigma_{z}+\lambda I_{d})^{-1}}\big]}=\sqrt{\operatorname{Tr}\big(\mathbb{E}_{z\sim(A^{\top})_{\sharp}P_{a}}[zz^{\top}](\Sigma_{z}+\lambda I_{d})^{-1}\big)}.

Next we bound (ii). It is easy to prove that

l(θ^)l(θ)l(θ)(θ^θ)12κΔxΣx2.l(\widehat{\theta})-l(\theta^{*})-\nabla l(\theta^{*})^{\top}(\widehat{\theta}-\theta^{*})\geq\frac{1}{2\kappa}\|\Delta_{x}\|_{\Sigma_{x}}^{2}.

By the definition of the MLE, we have l(\widehat{\theta})-l(\theta^{*})\leq 0. Therefore

12κΔxΣx2\displaystyle\frac{1}{2\kappa}\|\Delta_{x}\|_{\Sigma_{x}}^{2} |l(θ)(θ^θ)|\displaystyle\leq|\nabla l(\theta^{*})^{\top}(\widehat{\theta}-\theta^{*})|
|l(θ)(AAθ^θ)|+|l(θ)(IAA)(θ^θ)|.\displaystyle\leq|\nabla l(\theta^{*})^{\top}(AA^{\top}\widehat{\theta}-\theta^{*})|+|\nabla l(\theta^{*})^{\top}(I-AA^{\top})(\widehat{\theta}-\theta^{*})|. (C.16)

Recall that

\nabla l(\theta^{*})=\frac{1}{n_{2}}\sum_{i=1}^{n_{2}}\frac{\exp(\theta^{*\top}x_{i}^{(1)})\cdot(x_{i}^{(1)}-u_{i})+\exp(\theta^{*\top}x_{i}^{(2)})\cdot(x_{i}^{(2)}-u_{i})}{\exp(\theta^{*\top}x_{i}^{(1)})+\exp(\theta^{*\top}x_{i}^{(2)})},

since xi(1),xi(2)x_{i}^{(1)},x_{i}^{(2)} are in the image set of AA, there exists αd\alpha\in\mathbb{R}^{d} such that l(θ)=Aα\nabla l(\theta^{*})=A\alpha, and we have

l(θ)(IAA)=αA(IAA)=0.\nabla l(\theta^{*})^{\top}(I-AA^{\top})=\alpha^{\top}A^{\top}(I-AA^{\top})=0.

Therefore we have

12κΔxΣx2\displaystyle\frac{1}{2\kappa}\|\Delta_{x}\|_{\Sigma_{x}}^{2} |l(θ)(AAθ^θ)|\displaystyle\leq|\nabla l(\theta^{*})^{\top}(AA^{\top}\widehat{\theta}-\theta^{*})|
=|(Al(θ))Δz|\displaystyle=|(A^{\top}\nabla l(\theta^{*}))^{\top}\Delta_{z}|
Al(θ)(Σz+λId)1ΔzΣz+λId\displaystyle\leq\|A^{\top}\nabla l(\theta^{*})\|_{(\Sigma_{z}+\lambda I_{d})^{-1}}\cdot\|\Delta_{z}\|_{\Sigma_{z}+\lambda I_{d}} (C.17)

By AA(xi(1)xi(2))=(xi(1)xi(2))AA^{\top}(x_{i}^{(1)}-x_{i}^{(2)})=(x_{i}^{(1)}-x_{i}^{(2)}), we have

ΔxΣx2=ΔxAΣzAΔx=ΔzΣzΔz=ΔzΣz2,\displaystyle\|\Delta_{x}\|_{\Sigma_{x}}^{2}=\Delta_{x}^{\top}A\Sigma_{z}A^{\top}\Delta_{x}=\Delta_{z}^{\top}\Sigma_{z}\Delta_{z}=\|\Delta_{z}\|_{\Sigma_{z}}^{2},

plugging this into (C.17), we have

12κΔzΣz+λId2\displaystyle\frac{1}{2\kappa}\|\Delta_{z}\|_{\Sigma_{z}+\lambda I_{d}}^{2} 12κ(ΔzΣz2+λΔz22)\displaystyle\leq\frac{1}{2\kappa}(\|\Delta_{z}\|_{\Sigma_{z}}^{2}+\lambda\|\Delta_{z}\|_{2}^{2})
Al(θ)(Σz+λId)1ΔzΣz+λId+λ2κΔz22,\displaystyle\leq\|A^{\top}\nabla l(\theta^{*})\|_{(\Sigma_{z}+\lambda I_{d})^{-1}}\cdot\|\Delta_{z}\|_{\Sigma_{z}+\lambda I_{d}}+\frac{\lambda}{2\kappa}\|\Delta_{z}\|_{2}^{2},

By \|\widehat{\theta}\|_{2}\leq 1, we have \|\Delta_{z}\|_{2}=\|A^{\top}\widehat{\theta}-\beta\|_{2}\leq\|\widehat{\theta}-\theta^{*}\|_{2}\leq 2, and therefore

ΔzΣz+λId22κAl(θ)(Σz+λId)1ΔzΣz+λId+2λ.\displaystyle\|\Delta_{z}\|_{\Sigma_{z}+\lambda I_{d}}^{2}\leq 2\kappa\cdot\|A^{\top}\nabla l(\theta^{*})\|_{(\Sigma_{z}+\lambda I_{d})^{-1}}\cdot\|\Delta_{z}\|_{\Sigma_{z}+\lambda I_{d}}+{2\lambda}. (C.18)

To derive an upper bound for ΔzΣz+λId\|\Delta_{z}\|_{\Sigma_{z}+\lambda I_{d}}, we need to upper bound Al(θ)(Σz+λId)1\|A^{\top}\nabla l(\theta^{*})\|_{(\Sigma_{z}+\lambda I_{d})^{-1}}. By Bernstein’s inequality for sub-Gaussian random variables in quadratic form (e.g. see Theorem 2.1 in Hsu et al. [2011]), we have

Al(θ)(Σz+λId)1cd+log1/δn\|A^{\top}\nabla l(\theta^{*})\|_{(\Sigma_{z}+\lambda I_{d})^{-1}}\leq c\cdot\sqrt{\frac{d+\log 1/\delta}{n}}

for some constant cc with probability at least 1δ1-\delta. Then we have

ΔzΣz+λId22cκd+log1/δnΔzΣz+λId+2λ,\|\Delta_{z}\|_{\Sigma_{z}+\lambda I_{d}}^{2}\leq 2c\cdot\kappa\cdot\sqrt{\frac{d+\log 1/\delta}{n}}\cdot\|\Delta_{z}\|_{\Sigma_{z}+\lambda I_{d}}+{2\lambda},

solving this quadratic inequality in \|\Delta_{z}\|_{\Sigma_{z}+\lambda I_{d}} (if u^{2}\leq bu+c for u\geq 0, then u\leq b+\sqrt{c}) gives

(ii)O(κ2d+log1/δn+λ).\operatorname{(ii)}\leq O\bigg{(}\sqrt{\kappa^{2}\cdot\frac{d+\log 1/\delta}{n}+\lambda}\bigg{)}.

Combining (i) and (ii), we prove that

1=𝔼xPa[|f^(x)f(x)|]O(κ2d+log1/δn+λTr(𝔼Pa[zz](Σz+λId)1)).\mathcal{E}_{1}=\mathbb{E}_{x\sim P_{a}}[|\widehat{f}(x)-f^{*}(x)|]\leq O\bigg{(}\sqrt{\kappa^{2}\cdot\frac{d+\log 1/\delta}{n}+\lambda}\cdot\sqrt{\operatorname{Tr}(\mathbb{E}_{P_{a}}[zz^{\top}](\Sigma_{z}+\lambda I_{d})^{-1})}\bigg{)}.

By Lemma C.4, we have

1O(κ2d+log1/δn+λTr(𝔼Pa[xx](Σx+λID)1)),\displaystyle\mathcal{E}_{1}\leq O\bigg{(}\sqrt{\kappa^{2}\cdot\frac{d+\log 1/\delta}{n}+\lambda}\cdot\sqrt{\operatorname{Tr}(\mathbb{E}_{P_{a}}[xx^{\top}](\Sigma_{x}+\lambda I_{D})^{-1})}\bigg{)},

recall that \widetilde{\Sigma}_{\lambda}:=\frac{1}{n_{2}}(\sum_{i=1}^{n_{2}}(x_{i}^{(1)}-x_{i}^{(2)})(x_{i}^{(1)}-x_{i}^{(2)})^{\top}+\lambda I); resetting \lambda to \lambda/n_{2}, we conclude the proof.

C.6 Proof of Lemma B.8

Recall the definition of g(x)g(x) that

g(x)=θAAx,g(x)={\theta^{*}}^{\top}AA^{\top}x,

note that g(x)=θxg(x)={\theta^{*}}^{\top}x when xx is supported on 𝒜\mathcal{A}. Thus,

|𝔼xPa[g(x)]𝔼xP^a[g(x)]|\displaystyle\left|\mathbb{E}_{x\sim P_{a}}[g(x)]-\mathbb{E}_{x\sim\widehat{P}_{a}}[g(x)]\right|
=\displaystyle= |𝔼xPa[θx]𝔼xP^a[θAAx]|\displaystyle\left|\mathbb{E}_{x\sim P_{a}}[{\theta^{*}}^{\top}x]-\mathbb{E}_{x\sim\widehat{P}_{a}}[{\theta^{*}}^{\top}AA^{\top}x]\right|
\displaystyle\leq |𝔼xPa[θx]𝔼xP^a[θVVx]|+|𝔼xP^a[θVVx]𝔼xP^a[θAAx]|e1,\displaystyle\left|\mathbb{E}_{x\sim P_{a}}[{\theta^{*}}^{\top}x]-\mathbb{E}_{x\sim\widehat{P}_{a}}[{\theta^{*}}^{\top}VV^{\top}x]\right|+\underbrace{\left|\mathbb{E}_{x\sim\widehat{P}_{a}}[{\theta^{*}}^{\top}VV^{\top}x]-\mathbb{E}_{x\sim\widehat{P}_{a}}[{\theta^{*}}^{\top}AA^{\top}x]\right|}_{e_{1}},

where

e1\displaystyle e_{1} =|𝔼xP^a[θ(VVAA)x]|\displaystyle=\left|\mathbb{E}_{x\sim\widehat{P}_{a}}[{\theta^{*}}^{\top}\left(VV^{\top}-AA^{\top}\right)x]\right|
\displaystyle\leq\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\left\|(VV^{\top}-AA^{\top})x\right\|_{2}\right]
VVAAF𝔼xP^a[x22].\displaystyle\leq\|VV^{\top}-AA^{\top}\|_{F}\cdot\sqrt{\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\left\|x\right\|^{2}_{2}\right]}.

Using the notations P^{LD}(a)=P(z\mid\widehat{f}(Az)=a) and P^{LD}_{t_{0}}(a)=P_{t_{0}}(z\mid\widehat{f}(Az)=a), we have

|𝔼xPa[θx]𝔼xP^a[θVVx]|\displaystyle\left|\mathbb{E}_{x\sim P_{a}}[{\theta^{*}}^{\top}x]-\mathbb{E}_{x\sim\widehat{P}_{a}}[{\theta^{*}}^{\top}VV^{\top}x]\right|
=\displaystyle= |𝔼zP(zf^(Az)=a)[θAz]𝔼z(UV)#P^a[θVUz]|\displaystyle\left|\mathbb{E}_{z\sim P(z\mid\widehat{f}(Az)=a)}[{\theta^{*}}^{\top}Az]-\mathbb{E}_{z\sim(U^{\top}V^{\top})_{\#}\widehat{P}_{a}}[{\theta^{*}}^{\top}VUz]\right|
\displaystyle\leq |𝔼zPt0LD(a)[θAz]𝔼z(UV)#P^a[θVUz]|+|𝔼zPt0LD(a)[θAz]𝔼zPLD(a)[θAz]|e2,\displaystyle\left|\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}[{\theta^{*}}^{\top}Az]-\mathbb{E}_{z\sim(U^{\top}V^{\top})_{\#}\widehat{P}_{a}}[{\theta^{*}}^{\top}VUz]\right|+\underbrace{\left|\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}[{\theta^{*}}^{\top}Az]-\mathbb{E}_{z\sim P^{LD}(a)}[{\theta^{*}}^{\top}Az]\right|}_{e_{2}},

where

e2\displaystyle e_{2} =|α(t0)𝔼zPLD(a)[θAz]+h(t0)𝔼u𝖭(𝟢,𝖨𝖽)[θAu]𝔼zPLD(a)[θAz]|\displaystyle=\left|\alpha(t_{0})\mathbb{E}_{z\sim P^{LD}(a)}[{\theta^{*}}^{\top}Az]+h(t_{0})\mathbb{E}_{u\sim\sf N(0,I_{d})}[{\theta^{*}}^{\top}Au]-\mathbb{E}_{z\sim P^{LD}(a)}[{\theta^{*}}^{\top}Az]\right|
(1α(t0))|𝔼zPLD(a)[θAz]|\displaystyle\leq(1-\alpha(t_{0}))\left|\mathbb{E}_{z\sim P^{LD}(a)}[{\theta^{*}}^{\top}Az]\right|
t0𝔼zPLD(a)[z2].\displaystyle\leq t_{0}\cdot\mathbb{E}_{z\sim P^{LD}(a)}[\left\|z\right\|_{2}].

Then what is left to bound is

|𝔼zPt0LD(a)[θAz]𝔼z(UV)#P^a[θVUz]|\displaystyle\left|\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}[{\theta^{*}}^{\top}Az]-\mathbb{E}_{z\sim(U^{\top}V^{\top})_{\#}\widehat{P}_{a}}[{\theta^{*}}^{\top}VUz]\right|
\displaystyle\leq |𝔼zPt0LD(a)[θVUz]𝔼z(UV)#P^a[θVUz]|e3+|𝔼zPt0LD(a)[θVUz]𝔼zPt0LD(a)[θAz]|e4.\displaystyle\underbrace{\left|\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}[{\theta^{*}}^{\top}VUz]-\mathbb{E}_{z\sim(U^{\top}V^{\top})_{\#}\widehat{P}_{a}}[{\theta^{*}}^{\top}VUz]\right|}_{e_{3}}+\underbrace{\left|\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}[{\theta^{*}}^{\top}VUz]-\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}[{\theta^{*}}^{\top}Az]\right|}_{e_{4}}.

Then for term e3e_{3}, by Lemma C.2, we get

e3\displaystyle e_{3} |𝔼zPt0LD(a)[z2]𝔼z(UV)#P^a[z2]|\displaystyle\leq\left|\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}[\|z\|_{2}]-\mathbb{E}_{z\sim(U^{\top}V^{\top})_{\#}\widehat{P}_{a}}[\|z\|_{2}]\right|
=𝒪(TV(P^a)(𝔼zPt0LD(a)[z22]+𝔼xP^a[x22])),\displaystyle=\mathcal{O}\left(TV(\widehat{P}_{a})\cdot\left(\sqrt{\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}[\|z\|^{2}_{2}]}+\sqrt{\mathbb{E}_{x\sim\widehat{P}_{a}}[\|x\|^{2}_{2}]}\right)\right),

where we use 𝔼z(UV)#P^a[z2]𝔼xP^a[x2]\mathbb{E}_{z\sim(U^{\top}V^{\top})_{\#}\widehat{P}_{a}}[\|z\|_{2}]\leq\mathbb{E}_{x\sim\widehat{P}_{a}}[\|x\|_{2}]

For e4e_{4}, we have

e4\displaystyle e_{4} =|𝔼zPt0LD(a)[θ(VUA)z]|\displaystyle=\left|\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}[{\theta^{*}}^{\top}(VU-A)z]\right|
=α(t0)|𝔼zP(a)[θ(VUA)z]|\displaystyle=\alpha(t_{0})\left|\mathbb{E}_{z\sim P(a)}[{\theta^{*}}^{\top}(VU-A)z]\right|
VUAF𝔼zPLD(a)[z2].\displaystyle\leq\|VU-A\|_{F}\cdot\mathbb{E}_{z\sim P^{LD}(a)}[\left\|z\right\|_{2}].

Therefore, combining the bounds above, we have

2\displaystyle\mathcal{E}_{2}\leq e1+e2+e3+e4\displaystyle e_{1}+e_{2}+e_{3}+e_{4}
\displaystyle\leq VVAAF𝔼xP^a[x22]+(VUAF+t0)M(a)\displaystyle\|VV^{\top}-AA^{\top}\|_{F}\cdot\sqrt{\mathbb{E}_{x\sim\widehat{P}_{a}}[\|x\|^{2}_{2}]}+(\|VU-A\|_{F}+t_{0})\cdot\sqrt{M(a)}
+𝒪(TV(P^a)(M(a)+t0d+𝔼xP^a[x22])).\displaystyle+\mathcal{O}\left(TV(\widehat{P}_{a})\cdot\left(\sqrt{M(a)+t_{0}d}+\sqrt{\mathbb{E}_{x\sim\widehat{P}_{a}}[\|x\|^{2}_{2}]}\right)\right).

By Lemma B.4 and Lemma C.1, we have

TV(P^a)=𝒪~(𝒯(P(x,y^=a),Pxy^;𝒮¯)λminϵdiff),\displaystyle TV(\widehat{P}_{a})=\widetilde{\mathcal{O}}\left(\sqrt{\frac{{\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}})}{\lambda_{\min}}}\cdot\epsilon_{diff}\right),
VVAAF=𝒪~(t0λminϵdiff),\displaystyle\|VV^{\top}-AA^{\top}\|_{\rm F}=\widetilde{\mathcal{O}}\left(\frac{\sqrt{t_{0}}}{\sqrt{\lambda_{\min}}}\cdot\epsilon_{diff}\right),
VUAF=𝒪(d32VVAAF).\displaystyle\|VU-A\|_{\rm F}=\mathcal{O}(d^{\frac{3}{2}}\|VV^{\top}-AA^{\top}\|_{\rm F}).

And by Lemma C.3

\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|x\|^{2}_{2}\right]=\mathcal{O}\left(ct_{0}D+M(a)\cdot(1+TV(\widehat{P}_{a}))\right).

Therefore, the leading term in \mathcal{E}_{2} is

2=𝒪((TV(P^a)+t0)M(a)).\mathcal{E}_{2}=\mathcal{O}\left((TV(\widehat{P}_{a})+t_{0})\sqrt{M(a)}\right).

By plugging in score matching error ϵdiff2=𝒪~(1t0Dd2+D2dn1)\epsilon^{2}_{diff}=\widetilde{\mathcal{O}}\left(\frac{1}{t_{0}}\sqrt{\frac{Dd^{2}+D^{2}d}{n_{1}}}\right), we have

TV(P^a)=𝒪~(𝒯(P(x,y^=a),Pxy^;𝒮¯)λmin(Dd2+D2dn1)141t0).TV(\widehat{P}_{a})=\widetilde{\mathcal{O}}\left(\sqrt{\frac{{\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}})}{\lambda_{\min}}}\cdot\left(\frac{Dd^{2}+D^{2}d}{n_{1}}\right)^{\frac{1}{4}}\cdot\frac{1}{\sqrt{t_{0}}}\right).

When t_{0}=\left((Dd^{2}+D^{2}d)/n_{1}\right)^{1/6}, this choice admits the best trade-off, and \mathcal{E}_{2} is bounded by

2=𝒪~(𝒯(P(x,y^=a),Pxy^;𝒮¯)λmin(Dd2+D2dn1)16a).\mathcal{E}_{2}=\widetilde{\mathcal{O}}\left(\sqrt{\frac{{\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}})}{\lambda_{\min}}}\cdot\left(\frac{Dd^{2}+D^{2}d}{n_{1}}\right)^{\frac{1}{6}}\cdot a\right).

C.7 Proof of Lemma C.1

From Lemma 17 in Chen et al. [2023a], we have

UVAF=𝒪(VVAAF).\|U-V^{\top}A\|_{F}=\mathcal{O}(\|VV^{\top}-AA^{\top}\|_{F}).

Then it suffices to bound

|VUAF2UVAF2|,\left|\|VU-A\|^{2}_{F}-\|U-V^{\top}A\|^{2}_{F}\right|,

where

VUAF2=2dtrace(UVA+AVU)\displaystyle\|VU-A\|^{2}_{F}=2d-\operatorname{trace}\left(U^{\top}V^{\top}A+A^{\top}VU\right)
UVAF2=d+trace(AVVA)trace(UVA+AVU).\displaystyle\|U-V^{\top}A\|^{2}_{F}=d+\operatorname{trace}\left(A^{\top}VV^{\top}A\right)-\operatorname{trace}\left(U^{\top}V^{\top}A+A^{\top}VU\right).

Thus

|VUAF2UVAF2|=|dtrace(AVVA)|=|trace(AA(VVAA))|,\left|\|VU-A\|^{2}_{F}-\|U-V^{\top}A\|^{2}_{F}\right|=\left|d-\operatorname{trace}\left(A^{\top}VV^{\top}A\right)\right|=\left|\operatorname{trace}\left(AA^{\top}(VV^{\top}-AA^{\top})\right)\right|,

which is because \operatorname{trace}\left(A^{\top}VV^{\top}A\right) is calculated as

trace(AVVA)\displaystyle\operatorname{trace}\left(A^{\top}VV^{\top}A\right) =trace(AAVV)\displaystyle=\operatorname{trace}\left(AA^{\top}VV^{\top}\right)
=trace(AAAA)+trace(AA(VVAA))\displaystyle=\operatorname{trace}\left(AA^{\top}AA^{\top}\right)+\operatorname{trace}\left(AA^{\top}(VV^{\top}-AA^{\top})\right)
=d+trace(AA(VVAA)).\displaystyle=d+\operatorname{trace}\left(AA^{\top}(VV^{\top}-AA^{\top})\right).

Then we will bound |trace(AA(VVAA))|\left|\operatorname{trace}\left(AA^{\top}(VV^{\top}-AA^{\top})\right)\right| by VVAAF\|VV^{\top}-AA^{\top}\|_{F},

|trace(AA(VVAA))|\displaystyle\left|\operatorname{trace}\left(AA^{\top}(VV^{\top}-AA^{\top})\right)\right| trace(AA)trace(|VVAA|)\displaystyle\leq\operatorname{trace}\left(AA^{\top}\right)\operatorname{trace}\left(\left|VV^{\top}-AA^{\top}\right|\right)
dtrace(|VVAA|)\displaystyle\leq d\cdot\operatorname{trace}\left(\left|VV^{\top}-AA^{\top}\right|\right)
d2dVVAAF2.\displaystyle\leq d\cdot\sqrt{2d\left\|VV^{\top}-AA^{\top}\right\|^{2}_{F}}.

Thus, VUAF=𝒪(d32(V,A)).\|VU-A\|_{F}=\mathcal{O}\left(d^{\frac{3}{2}}\sqrt{\angle({V},{A})}\right).
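
To illustrate the alignment step numerically, one can take U to be the orthogonal Procrustes alignment of V to A (one valid choice of the orthogonal matrix in Lemma C.1) and compare \|VU-A\|_{F} with \|VV^{\top}-AA^{\top}\|_{F}; a minimal numpy sketch (the perturbation model and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
D, d, eps = 20, 4, 1e-2

A, _ = np.linalg.qr(rng.standard_normal((D, d)))
V, _ = np.linalg.qr(A + eps * rng.standard_normal((D, d)))   # small perturbation of A

# Orthogonal Procrustes alignment: U = argmin over orthogonal U of ||V U - A||_F,
# obtained from the SVD of V^T A
P, _, Qt = np.linalg.svd(V.T @ A)
U = P @ Qt

print(np.linalg.norm(V @ U - A, 'fro'))            # aligned basis error
print(np.linalg.norm(V @ V.T - A @ A.T, 'fro'))    # projection distance; same order
```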

C.8 Proof of Lemma C.2

When P1P_{1} and P2P_{2} are Gaussian, m(z)=z22m(z)=\|z\|^{2}_{2},

|𝔼zP1[m(z)]𝔼zP2[m(z)]|\displaystyle\quad\left|\mathbb{E}_{z\sim P_{1}}[m(z)]-\mathbb{E}_{z\sim P_{2}}[m(z)]\right|
=|m(z)(p1(z)p2(z))dz|\displaystyle=\left|\int m(z)\left(p_{1}(z)-p_{2}(z)\right)\mathrm{d}z\right|
|z2Rz22(p1(z)p2(z))dz|+z2>Rz22p1(z)dz+z2>Rz22p2(z)dz\displaystyle\leq\left|\int_{\|z\|_{2}\leq R}\|z\|^{2}_{2}\left(p_{1}(z)-p_{2}(z)\right)\mathrm{d}z\right|+\int_{\|z\|_{2}>R}\|z\|^{2}_{2}p_{1}(z)\mathrm{d}z+\int_{\|z\|_{2}>R}\|z\|^{2}_{2}p_{2}(z)\mathrm{d}z
R2dTV(P1,P2)+z2>Rz22p1(z)dz+z2>Rz22p2(z)dz.\displaystyle\leq R^{2}\texttt{d}_{\rm TV}(P_{1},P_{2})+\int_{\|z\|_{2}>R}\|z\|^{2}_{2}p_{1}(z)\mathrm{d}z+\int_{\|z\|_{2}>R}\|z\|^{2}_{2}p_{2}(z)\mathrm{d}z.

Since P_{1} and P_{2} are Gaussians, \int_{\|z\|_{2}>R}\|z\|^{2}_{2}p_{1}(z)\mathrm{d}z and \int_{\|z\|_{2}>R}\|z\|^{2}_{2}p_{2}(z)\mathrm{d}z are bounded by some constant C_{1} when R^{2}\geq C_{2}\max\{\mathbb{E}_{z\sim P_{1}}[\|z\|_{2}^{2}],\mathbb{E}_{z\sim P_{2}}[\|z\|_{2}^{2}]\}, as suggested by Lemma 16 in Chen et al. [2023a].

Therefore,

𝔼zP1[z22]\displaystyle\mathbb{E}_{z\sim P_{1}}[\|z\|^{2}_{2}] 𝔼zP2[z22]+C2max{𝔼P1[z22],𝔼P2[z22]}dTV(P1,P2)+2C1\displaystyle\leq\mathbb{E}_{z\sim P_{2}}[\|z\|^{2}_{2}]+C_{2}\max\{\mathbb{E}_{P_{1}}[\|z\|_{2}^{2}],\mathbb{E}_{P_{2}}[\|z\|_{2}^{2}]\}\cdot\texttt{d}_{\rm TV}(P_{1},P_{2})+2C_{1}
𝔼zP2[z22]+C2(𝔼zP1[z22]+𝔼zP2[z22])dTV(P1,P2)+2C1.\displaystyle\leq\mathbb{E}_{z\sim P_{2}}[\|z\|^{2}_{2}]+C_{2}(\mathbb{E}_{z\sim P_{1}}[\|z\|_{2}^{2}]+\mathbb{E}_{z\sim P_{2}}[\|z\|_{2}^{2}])\cdot\texttt{d}_{\rm TV}(P_{1},P_{2})+2C_{1}.

Then

𝔼zP1[z22]=𝒪(𝔼zP2[z22]+𝔼zP2[z22]dTV(P1,P2))\mathbb{E}_{z\sim P_{1}}[\|z\|^{2}_{2}]=\mathcal{O}\left(\mathbb{E}_{z\sim P_{2}}[\|z\|^{2}_{2}]+\mathbb{E}_{z\sim P_{2}}[\|z\|_{2}^{2}]\cdot\texttt{d}_{\rm TV}(P_{1},P_{2})\right)

since dTV(P1,P2)\texttt{d}_{\rm TV}(P_{1},P_{2}) decays with n1n_{1}.

Similarly, when m(z)=z2m(z)=\|z\|_{2}

|𝔼zP1[m(z)]𝔼zP2[m(z)]|\displaystyle\quad\left|\mathbb{E}_{z\sim P_{1}}[m(z)]-\mathbb{E}_{z\sim P_{2}}[m(z)]\right|
=|m(z)(p1(z)p2(z))dz|\displaystyle=\left|\int m(z)\left(p_{1}(z)-p_{2}(z)\right)\mathrm{d}z\right|
|z2Rz2(p1(z)p2(z))dz|+z2>Rz2p1(z)dz+z2>Rz2p2(z)dz\displaystyle\leq\left|\int_{\|z\|_{2}\leq R}\|z\|_{2}\left(p_{1}(z)-p_{2}(z)\right)\mathrm{d}z\right|+\int_{\|z\|_{2}>R}\|z\|_{2}p_{1}(z)\mathrm{d}z+\int_{\|z\|_{2}>R}\|z\|_{2}p_{2}(z)\mathrm{d}z
RdTV(P1,P2)+z2>Rz22p1(z)dz+z2>Rz22p2(z)dz,\displaystyle\leq R\texttt{d}_{\rm TV}(P_{1},P_{2})+\sqrt{\int_{\|z\|_{2}>R}\|z\|^{2}_{2}p_{1}(z)\mathrm{d}z}+\sqrt{\int_{\|z\|_{2}>R}\|z\|^{2}_{2}p_{2}(z)\mathrm{d}z},

where z2>Rz22p1(z)dz\int_{\|z\|_{2}>R}\|z\|^{2}_{2}p_{1}(z)\mathrm{d}z and z2>Rz22p2(z)dz\int_{\|z\|_{2}>R}\|z\|^{2}_{2}p_{2}(z)\mathrm{d}z are simultaneously bounded by a constant C1C_{1}, if the radius R2C2max{𝔼zP1[z22],𝔼zP2[z22]}R^{2}\geq C_{2}\max\{\mathbb{E}_{z\sim P_{1}}[\|z\|_{2}^{2}],\mathbb{E}_{z\sim P_{2}}[\|z\|_{2}^{2}]\} as suggested by Chen et al. [2023a, Lemma 16].

Therefore,

|𝔼zP1[z2]𝔼zP2[z2]|\displaystyle\left|\mathbb{E}_{z\sim P_{1}}[\|z\|_{2}]-\mathbb{E}_{z\sim P_{2}}[\|z\|_{2}]\right| C2max{𝔼P1[z22],𝔼P2[z22]}dTV(P1,P2)+2C1\displaystyle\leq\sqrt{C_{2}\max\{\mathbb{E}_{P_{1}}[\|z\|_{2}^{2}],\mathbb{E}_{P_{2}}[\|z\|_{2}^{2}]\}}\cdot\texttt{d}_{\rm TV}(P_{1},P_{2})+2C_{1}
(C2𝔼zP1[z22]+C2𝔼zP2[z22])dTV(P1,P2)+2C1\displaystyle\leq\left(\sqrt{C_{2}\mathbb{E}_{z\sim P_{1}}[\|z\|_{2}^{2}]}+\sqrt{C_{2}\mathbb{E}_{z\sim P_{2}}[\|z\|_{2}^{2}]}\right)\cdot\texttt{d}_{\rm TV}(P_{1},P_{2})+2C_{1}
=𝒪((𝔼zP1[z22]+𝔼zP2[z22])dTV(P1,P2)).\displaystyle=\mathcal{O}\left(\left(\sqrt{\mathbb{E}_{z\sim P_{1}}[\|z\|_{2}^{2}]}+\sqrt{\mathbb{E}_{z\sim P_{2}}[\|z\|_{2}^{2}]}\right)\cdot\texttt{d}_{\rm TV}(P_{1},P_{2})\right).

C.9 Proof of Lemma C.3

Recall from (C.12) that

PLD(a)=Pz(zf^(Az)=a)=𝖭(μ(𝖺),Γ)P^{LD}(a)=P_{z}(z\mid\widehat{f}(Az)=a)=\sf N\left(\mu(a),\Gamma\right)

with μ(a):=Σβ^(β^Σβ^+ν2)1a\mu(a):=\Sigma\widehat{\beta}\left(\widehat{\beta}^{\top}\Sigma\widehat{\beta}+\nu^{2}\right)^{-1}a, Γ:=ΣΣβ^(β^Σβ^+ν2)1β^Σ\Gamma:=\Sigma-\Sigma\widehat{\beta}\left(\widehat{\beta}^{\top}\Sigma\widehat{\beta}+\nu^{2}\right)^{-1}\widehat{\beta}^{\top}\Sigma.

𝔼zPLD(a)[z22]\displaystyle\mathbb{E}_{z\sim P^{LD}(a)}\left[\|z\|^{2}_{2}\right] =μ(a)μ(a)+trace(Γ)\displaystyle=\mu(a)^{\top}\mu(a)+\operatorname{trace}(\Gamma)
=β^Σ2β^(β^Σ2+ν2)2a2+trace(ΣΣβ^(β^Σβ^+ν2)1β^Σ)\displaystyle=\frac{\widehat{\beta}^{\top}\Sigma^{2}\widehat{\beta}}{\left(\|\widehat{\beta}\|_{\Sigma}^{2}+\nu^{2}\right)^{2}}a^{2}+\operatorname{trace}(\Sigma-\Sigma\widehat{\beta}\left(\widehat{\beta}^{\top}\Sigma\widehat{\beta}+\nu^{2}\right)^{-1}\widehat{\beta}^{\top}\Sigma)
=:M(a).\displaystyle=:M(a).
M(a)\displaystyle M(a) =𝒪(β^Σ2β^(β^Σ2)2a2+trace(Σ))\displaystyle=\mathcal{O}\left(\frac{\widehat{\beta}^{\top}\Sigma^{2}\widehat{\beta}}{\left(\|\widehat{\beta}\|_{\Sigma}^{2}\right)^{2}}a^{2}+\operatorname{trace}(\Sigma)\right)
=𝒪(a2β^Σ+d),\displaystyle=\mathcal{O}\left(\frac{a^{2}}{\|\widehat{\beta}\|_{\Sigma}}+d\right),

and by Lemma B.6

\|\widehat{\beta}\|_{\Sigma}\geq\frac{1}{2}\|{\beta}^{*}\|_{\Sigma}.

Thus 𝔼zPLD(a)[z22]=M(a),M(a)=𝒪(a2βΣ+d)\mathbb{E}_{z\sim P^{LD}(a)}\left[\|z\|^{2}_{2}\right]=M(a),M(a)=\mathcal{O}\left(\frac{a^{2}}{\|{\beta}^{*}\|_{\Sigma}}+d\right).

Thus after adding diffusion noise at t0t_{0}, we have for α(t)=et/2\alpha(t)=e^{-t/2} and h(t)=h(t)= 1et1-e^{-t}:

𝔼zPt0LD(a)[z22]\displaystyle\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}\left[\|z\|^{2}_{2}\right] =𝔼z0PLD(a)𝔼z𝖭(α(𝗍𝟢)𝗓𝟢,𝗁(𝗍𝟢)𝖨𝖽)[z22]\displaystyle=\mathbb{E}_{z_{0}\sim P^{LD}(a)}\mathbb{E}_{z\sim\sf N\left(\alpha(t_{0})\cdot z_{0},h(t_{0})\cdot I_{d}\right)}\left[\|z\|^{2}_{2}\right]
=𝔼z0PLD(a)[α2(t0)z022+dh(t0)]\displaystyle=\mathbb{E}_{z_{0}\sim P^{LD}(a)}\left[\alpha^{2}(t_{0})\|z_{0}\|^{2}_{2}+d\cdot h(t_{0})\right]
=α2(t0)𝔼z0PLD(a)[z022]+dh(t0)\displaystyle=\alpha^{2}(t_{0})\cdot\mathbb{E}_{z_{0}\sim P^{LD}(a)}\left[\|z_{0}\|^{2}_{2}\right]+d\cdot h(t_{0})
=et0𝔼z0PLD(a)[z022]+(1et0)d.\displaystyle=e^{-t_{0}}\cdot\mathbb{E}_{z_{0}\sim P^{LD}(a)}\left[\|z_{0}\|^{2}_{2}\right]+(1-e^{-t_{0}})\cdot d.

Thus 𝔼zPt0LD(a)[z22]M(a)+t0d\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}\left[\|z\|^{2}_{2}\right]\leq M(a)+t_{0}d.
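
Both closed-form moments above can be checked by simulation; a small Monte Carlo sketch (the dimension, \widehat{\beta}, \nu, a, and t_{0} below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
d, nu, a, t0, n_mc = 4, 0.3, 2.0, 0.05, 200_000

B = rng.standard_normal((d, d)); Sigma = B @ B.T + np.eye(d)   # latent covariance
beta = rng.standard_normal(d)                                   # plays the role of beta_hat

# Conditional law z | f_hat(Az) = a from (C.12) with mu = 0
denom = beta @ Sigma @ beta + nu ** 2
mu_a = Sigma @ beta * a / denom
Gamma = Sigma - np.outer(Sigma @ beta, beta @ Sigma) / denom

M_a = mu_a @ mu_a + np.trace(Gamma)                     # closed form (C.4)
z = rng.multivariate_normal(mu_a, Gamma, size=n_mc)
print(M_a, np.mean(np.sum(z ** 2, axis=1)))             # closed form vs Monte Carlo

# Forward noise at t0: z_t = alpha(t0) z_0 + sqrt(h(t0)) * N(0, I_d)
alpha, h = np.exp(-t0 / 2), 1 - np.exp(-t0)
zt = alpha * z + np.sqrt(h) * rng.standard_normal(z.shape)
print(alpha ** 2 * M_a + d * h, np.mean(np.sum(zt ** 2, axis=1)))
```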

By orthogonal decomposition we have

𝔼xP^a[x22]\displaystyle\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|x\|^{2}_{2}\right] 𝔼xP^a[(IDVV)x22]+𝔼xP^a[VVx22]\displaystyle\leq\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|(I_{D}-VV^{\top})x\|^{2}_{2}\right]+\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|VV^{\top}x\|^{2}_{2}\right]
=𝔼xP^a[(IDVV)x22]+𝔼xP^a[UVx22],\displaystyle=\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|(I_{D}-VV^{\top})x\|^{2}_{2}\right]+\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|U^{\top}V^{\top}x\|^{2}_{2}\right],

where 𝔼xP^a[(IDVV)x22]\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|(I_{D}-VV^{\top})x\|^{2}_{2}\right] is bounded by (B.5) and the distribution of UVxU^{\top}V^{\top}x, which is (UV)#P^a(U^{\top}V^{\top})_{\#}\widehat{P}_{a}, is close to LDt0(a)\mathbb{P}^{LD}_{t_{0}}(a) up to TV(P^a)TV(\widehat{P}_{a}), which is defined in Definition B.2. Then by Lemma C.2, we have

\displaystyle\mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|U^{\top}V^{\top}x\|^{2}_{2}\right]=\mathcal{O}\left(\mathbb{E}_{z\sim P^{LD}_{t_{0}}(a)}\left[\|z\|^{2}_{2}\right](1+TV(\widehat{P}_{a}))\right).

Thus \mathbb{E}_{x\sim\widehat{P}_{a}}\left[\|x\|^{2}_{2}\right]=\mathcal{O}\left(ct_{0}D+(M(a)+t_{0}d)\cdot(1+TV(\widehat{P}_{a}))\right).

C.10 Proof of Lemma C.4

Firstly, one can verify the following two equations by direct calculation:

(λID+AΣ1A)1\displaystyle(\lambda I_{D}+A\Sigma_{1}A^{\top})^{-1} =1λ(IDA(λId+Σ1)1Σ1A),\displaystyle=\frac{1}{\lambda}\left(I_{D}-A(\lambda I_{d}+\Sigma_{1})^{-1}\Sigma_{1}A^{\top}\right),
(λId+Σ1)1\displaystyle(\lambda I_{d}+\Sigma_{1})^{-1} =1λ(Id(λId+Σ1)1Σ1).\displaystyle=\frac{1}{\lambda}\left(I_{d}-(\lambda I_{d}+\Sigma_{1})^{-1}\Sigma_{1}\right).

Then we have

(λID+AΣ1A)1AΣ2A=\displaystyle(\lambda I_{D}+A\Sigma_{1}A^{\top})^{-1}A\Sigma_{2}A^{\top}= 1λ(IDA(λId+Σ1)1Σ1A)AΣ2A\displaystyle\frac{1}{\lambda}\left(I_{D}-A(\lambda I_{d}+\Sigma_{1})^{-1}\Sigma_{1}A^{\top}\right)A\Sigma_{2}A^{\top}
=\displaystyle= 1λ(AΣ2AA(λId+Σ1)1Σ1Σ2A).\displaystyle\frac{1}{\lambda}\left(A\Sigma_{2}A^{\top}-A(\lambda I_{d}+\Sigma_{1})^{-1}\Sigma_{1}\Sigma_{2}A^{\top}\right).

Therefore,

Tr((λID+AΣ1A)1AΣ2A)=\displaystyle\mathrm{Tr}\left((\lambda I_{D}+A\Sigma_{1}A^{\top})^{-1}A\Sigma_{2}A^{\top}\right)= Tr(1λ(AΣ2AA(λId+Σ1)1Σ1Σ2A))\displaystyle\mathrm{Tr}\left(\frac{1}{\lambda}\left(A\Sigma_{2}A^{\top}-A(\lambda I_{d}+\Sigma_{1})^{-1}\Sigma_{1}\Sigma_{2}A^{\top}\right)\right)
=\displaystyle= Tr(1λ(Σ2(λId+Σ1)1Σ1Σ2))\displaystyle\mathrm{Tr}\left(\frac{1}{\lambda}\left(\Sigma_{2}-(\lambda I_{d}+\Sigma_{1})^{-1}\Sigma_{1}\Sigma_{2}\right)\right)
=\displaystyle= Tr(1λ(Id(λId+Σ1)1Σ1)Σ2)\displaystyle\mathrm{Tr}\left(\frac{1}{\lambda}\left(I_{d}-(\lambda I_{d}+\Sigma_{1})^{-1}\Sigma_{1}\right)\Sigma_{2}\right)
=\displaystyle= Tr((λId+Σ1)1Σ2),\displaystyle\mathrm{Tr}\left(\left(\lambda I_{d}+\Sigma_{1}\right)^{-1}\Sigma_{2}\right),

which finishes the proof.

Appendix D Omitted Proofs in Section 6.2

D.1 Conditional Score Decomposition and Score Matching Error

Lemma D.1.

Under Assumptions 3.1, 6.5 and 6.6, with high probability

1Tt0t0Ts^(,t)logpt()L2(Pt)2dtϵdiff2(n1),\frac{1}{T-t_{0}}\int_{t_{0}}^{T}\|\widehat{s}(\cdot,t)-\nabla\log p_{t}(\cdot)\|_{L^{2}(P_{t})}^{2}\mathrm{d}t\leq\epsilon_{diff}^{2}(n_{1}),

with ϵdiff2(n1)=𝒪~(1t0(n122δ(n1)d+6+Dn1d+4d+6))\epsilon_{diff}^{2}(n_{1})=\widetilde{\mathcal{O}}\left(\frac{1}{t_{0}}\left(n_{1}^{-\frac{2-2\delta(n_{1})}{d+6}}+Dn_{1}^{-\frac{d+4}{d+6}}\right)\right) for δ(n1)=dloglogn1logn1\delta(n_{1})=\frac{d\log\log n_{1}}{\log n_{1}}.

Proof.

Chen et al. [2023a, Theorem 1] is easily adapted here to prove Lemma D.1, with input dimension d+1 and the Lipschitzness in Assumption 6.6. The network size of {\mathcal{S}} is implied by Chen et al. [2023a, Theorem 1] with \epsilon=n_{1}^{-\frac{1}{d+6}}, accounting for the additional dimension of the reward \widehat{y}, and then the score matching error follows. ∎

D.2 Proof of Theorem 6.7

Additional notations: As before, we use P_{t}^{LD}(z) to denote the low-dimensional distribution on z corrupted by diffusion noise. Formally, p_{t}^{LD}(z^{\prime})=\int\phi_{t}(z^{\prime}|z)p_{z}(z)\mathrm{d}z with \phi_{t}(\cdot|z) being the density of {\sf N}(\alpha(t)z,h(t)I_{d}). We write P^{LD}_{t_{0}}(z\mid\widehat{f}(Az)=a) for the corresponding conditional distribution given \widehat{f}(Az)=a at time t_{0}, with shorthand P^{LD}_{t_{0}}(a), and abbreviate P_{z}(z\mid\widehat{f}(Az)=a) as P^{LD}(a).

Now we are ready to begin our proof. By the same argument as in Appendix B.3.1, we have

SubOpt(P^a;y=a)\displaystyle\texttt{SubOpt}(\widehat{P}_{a};y^{*}=a) 𝔼xPa[|f(x)f^(x)|]1+|𝔼xPa[g(x)]𝔼xP^a[g(x)]|2+𝔼xP^a[h(x)]3.\displaystyle\leq\underbrace{\mathbb{E}_{x\sim P_{a}}\left[\left|f^{*}(x)-\widehat{f}(x)\right|\right]}_{\mathcal{E}_{1}}+\underbrace{\left|\mathbb{E}_{x\sim P_{a}}[g^{*}(x_{\parallel})]-\mathbb{E}_{x\sim\widehat{P}_{a}}[g^{*}(x_{\parallel})]\right|}_{\mathcal{E}_{2}}+\underbrace{\mathbb{E}_{x\sim\widehat{P}_{a}}[h^{*}(x_{\perp})]}_{\mathcal{E}_{3}}.

In the following, we will bound terms 1,2\mathcal{E}_{1},\mathcal{E}_{2} respectively. Since 3\mathcal{E}_{3} is essentially the off-support error, its magnitude is bounded by Theorem 5.4.

D.2.1 \mathcal{E}_{1}: Nonparametric Regression Induced Error.

Since PzP_{z} has a light tail due to Assumption 6.5, by union bound and Chen et al. [2023a, Lemma 16], we have

(xiwithxi2>Rfori=1,,n2)n2C1d2d/2+1C2Γ(d/2+1)Rd2exp(C2R2/2),\displaystyle\mathbb{P}(\exists~{}x_{i}~{}\text{with}~{}\|x_{i}\|_{2}>R~{}\text{for}~{}i=1,\dots,n_{2})\leq n_{2}\frac{C_{1}d2^{-d/2+1}}{C_{2}\Gamma(d/2+1)}R^{d-2}\exp(-C_{2}R^{2}/2),

where C1,C2C_{1},C_{2} are constants and Γ()\Gamma(\cdot) is the Gamma function. Choosing R=𝒪(dlogd+lognδ)R=\mathcal{O}(\sqrt{d\log d+\log\frac{n}{\delta}}) ensures (xiwithxi2>Rfori=1,,n2)<δ\mathbb{P}(\exists~{}x_{i}~{}\text{with}~{}\|x_{i}\|_{2}>R~{}\text{for}~{}i=1,\dots,n_{2})<\delta. On the event ={xi2Rfor alli=1,,n2}\mathcal{E}=\{\|x_{i}\|_{2}\leq R~{}\text{for all}~{}i=1,\dots,n_{2}\}, denoting δ(n2)=dloglogn2logn2\delta(n_{2})=\frac{d\log\log n_{2}}{\log n_{2}}, we have

ff^2L2=𝒪~(n22(αδ(n2))d+2α)\displaystyle\|f^{*}-\widehat{f}\|^{2}_{L^{2}}=\widetilde{\mathcal{O}}\left(n_{2}^{-\frac{2(\alpha-\delta(n_{2}))}{d+2\alpha}}\right)

by Nakada and Imaizumi [2020, Theorem 7] with a new covering number of 𝒮{\mathcal{S}}, when n2n_{2} is sufficiently large. The corresponding network architecture follows Chen et al. [2022a, Theorem 2].

We remark that a linear subspace is a special case of low Minkowski dimension. Moreover, \delta(n_{2}) is asymptotically negligible and accounts for the truncation radius R of the x_{i}’s (see also Chen et al. [2023a, Theorem 2 and 3]). The covering number of {\mathcal{S}} is \widetilde{\mathcal{O}}\left(d^{d/2}n^{-\frac{d}{\alpha}}(\log n_{2})^{d/2}+Dd\right), as appears in Chen et al. [2023a, Proof of Theorem 2]. Therefore

𝔼xPa[|f(x)f^(x)|]\displaystyle\mathbb{E}_{x\sim P_{a}}\left[\left|f^{*}(x)-\widehat{f}(x)\right|\right] 𝔼xPa[|f(x)f^(x)|2]\displaystyle\leq\sqrt{\mathbb{E}_{x\sim P_{a}}\left[\left|f^{*}(x)-\widehat{f}(x)\right|^{2}\right]}
𝒯(P(x|y^=a),Px;¯)ff^2L2\displaystyle\leq\sqrt{{\mathcal{T}}(P(x|\widehat{y}=a),P_{x};\bar{\mathcal{F}})\cdot\|f^{*}-\widehat{f}\|^{2}_{L^{2}}}
=𝒯(P(x|y^=a),Px;¯)𝒪~(n2αδ(n2)2α+d+D/n2).\displaystyle=\sqrt{{\mathcal{T}}(P(x|\widehat{y}=a),P_{x};\bar{\mathcal{F}})}\cdot\widetilde{\mathcal{O}}\left(n_{2}^{-\frac{\alpha-\delta(n_{2})}{2\alpha+d}}+D/n_{2}\right).

D.2.2 2\mathcal{E}_{2}: Diffusion Induced On-support Error

Suppose the L^{2} score matching error is \epsilon^{2}_{diff}(n_{1}), i.e.,

1Tt0t0T𝔼x,f^xlogpt(x,f^)sw^(x,f^,t)22dtϵ2diff(n1),\displaystyle\frac{1}{T-t_{0}}\int_{t_{0}}^{T}\mathbb{E}_{x,\widehat{f}}\|\nabla_{x}\log p_{t}(x,\widehat{f})-s_{\widehat{w}}(x,\widehat{f},t)\|_{2}^{2}\mathrm{d}t\leq\epsilon^{2}_{diff}(n_{1}),

We recall Definition B.2, measuring the distance between \widehat{P}_{a} and P_{a}:

TV(P^a):=dTV(PLDt0(zf^(Az)=a),(UV)#P^a).TV(\widehat{P}_{a}):=\texttt{d}_{\rm TV}\left(P^{LD}_{t_{0}}(z\mid\widehat{f}(Az)=a),(U^{\top}V^{\top})_{\#}\widehat{P}_{a}\right).

Lemma B.4 applies to the nonparametric setting, so we have

(IDVV)x𝖭(0,Λ),Λct0ID,\displaystyle(I_{D}-VV^{\top})x\sim{\sf N}(0,\Lambda),\quad\Lambda\prec ct_{0}I_{D}, (D.1)
(V,A)=𝒪~(t0c0ϵ2diff(n1)).\displaystyle\angle({V},{A})=\widetilde{\mathcal{O}}\left(\frac{t_{0}}{c_{0}}\cdot\epsilon^{2}_{diff}(n_{1})\right). (D.2)

In addition,

TV(P^a)=𝒪~(𝒯(P(x,y^=a),Pxy^;𝒮¯)c0ϵdiff(n1)).TV(\widehat{P}_{a})=\widetilde{\mathcal{O}}\left(\sqrt{\frac{{\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}})}{c_{0}}}\cdot\epsilon_{diff}(n_{1})\right). (D.3)

2\mathcal{E}_{2} will be bounded by

2\displaystyle\mathcal{E}_{2} =|𝔼xPa[g(x)]𝔼xP^a[g(x)]|\displaystyle=\left|\mathbb{E}_{x\sim P_{a}}[g^{*}(x)]-\mathbb{E}_{x\sim\widehat{P}_{a}}[g^{*}(x)]\right|
|𝔼xPa[g(AAx)]𝔼xP^a[g(VVx)]|+|𝔼xP^a[g(VVx)g(AAx)]|,\displaystyle\leq\left|\mathbb{E}_{x\sim P_{a}}[g^{*}(AA^{\top}x)]-\mathbb{E}_{x\sim\widehat{P}_{a}}[g^{*}(VV^{\top}x)]\right|+\left|\mathbb{E}_{x\sim\widehat{P}_{a}}[g^{*}(VV^{\top}x)-g^{*}(AA^{\top}x)]\right|,

where for |𝔼xP^a[g(VVx)g(AAx)]|\left|\mathbb{E}_{x\sim\widehat{P}_{a}}[g^{*}(VV^{\top}x)-g^{*}(AA^{\top}x)]\right|, we have

|𝔼xP^a[g(VVx)g(AAx)]|𝔼xP^a[VVxAAx2]VVAAF𝔼xP^a[x2].\left|\mathbb{E}_{x\sim\widehat{P}_{a}}[g^{*}(VV^{\top}x)-g^{*}(AA^{\top}x)]\right|\leq\mathbb{E}_{x\sim\widehat{P}_{a}}[\|VV^{\top}x-AA^{\top}x\|_{2}]\leq\|VV^{\top}-AA^{\top}\|_{F}\cdot\mathbb{E}_{x\sim\widehat{P}_{a}}[\|x\|_{2}]. (D.4)

For the other term |𝔼xPa[g(AAx)]𝔼xP^a[g(VVx)]|\left|\mathbb{E}_{x\sim P_{a}}[g^{*}(AA^{\top}x)]-\mathbb{E}_{x\sim\widehat{P}_{a}}[g^{*}(VV^{\top}x)]\right|, we will bound it with TV(P^a)TV(\widehat{P}_{a}).

|𝔼xPa[g(AAx)]𝔼xP^a[g(VVx)]|\displaystyle\left|\mathbb{E}_{x\sim P_{a}}[g^{*}(AA^{\top}x)]-\mathbb{E}_{x\sim\widehat{P}_{a}}[g^{*}(VV^{\top}x)]\right|
\displaystyle\leq |𝔼zt0(a)[g(Az)]𝔼z(V)#P^a[g(Vz)]|+|𝔼z(a)[g(Az)]𝔼zt0(a)[g(Az)]|\displaystyle\left|\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[g^{*}(Az)]-\mathbb{E}_{z\sim(V^{\top})_{\#}\widehat{P}_{a}}[g^{*}(Vz)]\right|+\left|\mathbb{E}_{z\sim\mathbb{P}(a)}[g^{*}(Az)]-\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[g^{*}(Az)]\right|

Since any z\sim\mathbb{P}_{t_{0}}(a) can be represented as \alpha(t_{0})z+\sqrt{h(t_{0})}u, where z\sim\mathbb{P}(a) and u\sim{\sf N}(0,I_{d}), we have

𝔼zt0(a)[g(Az)]\displaystyle\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[g^{*}(Az)]
=𝔼z(a),u𝖭(0,Id)[g(α(t)Az+h(t)Au))]\displaystyle=\mathbb{E}_{z\sim\mathbb{P}(a),u\sim{\sf N}(0,I_{d})}[g^{*}(\alpha(t)Az+\sqrt{h(t)}Au))]
𝔼z(a)[g(α(t0)Az))]+h(t0)𝔼u𝖭(0,Id)[Au2]\displaystyle\leq\mathbb{E}_{z\sim\mathbb{P}(a)}[g^{*}(\alpha(t_{0})Az))]+\sqrt{h(t_{0})}\mathbb{E}_{u\sim{\sf N}(0,I_{d})}[\|Au\|_{2}]
𝔼z(a)[g(Az))]+(1α(t0))𝔼z(a)[Az2]+h(t0)𝔼u𝖭(𝟢,𝖨𝖽)[Au2],\displaystyle\leq\mathbb{E}_{z\sim\mathbb{P}(a)}[g^{*}(Az))]+(1-\alpha(t_{0}))\mathbb{E}_{z\sim\mathbb{P}(a)}[\|Az\|_{2}]+\sqrt{h(t_{0})}\mathbb{E}_{u\sim\sf N(0,I_{d})}[\|Au\|_{2}],

thus

|𝔼z(a)[g(Az)]𝔼zt0(a)[g(Az)]|t0𝔼z(a)[z2]+d,\left|\mathbb{E}_{z\sim\mathbb{P}(a)}[g^{*}(Az)]-\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[g^{*}(Az)]\right|\leq t_{0}\cdot\mathbb{E}_{z\sim\mathbb{P}(a)}[\|z\|_{2}]+d,

where we further use $1-\alpha(t_{0})=1-e^{-t_{0}/2}\leq t_{0}/2$ and $h(t_{0})\leq 1$.

As for $\left|\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[g^{*}(Az)]-\mathbb{E}_{z\sim(V^{\top})_{\#}\widehat{P}_{a}}[g^{*}(Vz)]\right|$, we have

\displaystyle\left|\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[g^{*}(Az)]-\mathbb{E}_{z\sim(U^{\top}V^{\top})_{\#}\widehat{P}_{a}}[g^{*}(VUz)]\right|
\displaystyle\leq\left|\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[g^{*}(VUz)]-\mathbb{E}_{z\sim(U^{\top}V^{\top})_{\#}\widehat{P}_{a}}[g^{*}(VUz)]\right|+\left|\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[g^{*}(Az)]-\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[g^{*}(VUz)]\right|,

where

\left|\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[g^{*}(Az)]-\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[g^{*}(VUz)]\right|\leq\|A-VU\|_{F}\cdot\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[\|z\|_{2}],

and

\left|\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[g^{*}(VUz)]-\mathbb{E}_{z\sim(U^{\top}V^{\top})_{\#}\widehat{P}_{a}}[g^{*}(VUz)]\right|\leq TV(\widehat{P}_{a})\cdot\|g^{*}\|_{\infty}.

Combining the bounds above, we have

\displaystyle\mathcal{E}_{2}\leq\|VV^{\top}-AA^{\top}\|_{F}\cdot\mathbb{E}_{x\sim\widehat{P}_{a}}[\|x\|_{2}]+\|A-VU\|_{F}\cdot\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[\|z\|_{2}]
\displaystyle\quad+t_{0}\cdot\mathbb{E}_{z\sim\mathbb{P}(a)}[\|z\|_{2}]+d+TV(\widehat{P}_{a})\cdot\|g^{*}\|_{\infty}.

Similar to the parametric case, let $M(a):=\mathbb{E}_{z\sim\mathbb{P}(a)}[\|z\|^{2}_{2}]$; then

\mathbb{E}_{z\sim\mathbb{P}_{t_{0}}(a)}[\|z\|^{2}_{2}]\leq M(a)+t_{0}d,

except that in the nonparametric case we cannot compute $M(a)$ in closed form, as the distribution is no longer Gaussian. Still, with higher-order terms in $n_{1}^{-1}$ hidden, we have

\displaystyle\mathcal{E}_{2}=\mathcal{O}\left(TV(\widehat{P}_{a})\cdot\|g^{*}\|_{\infty}+t_{0}M(a)\right)
\displaystyle=\widetilde{\mathcal{O}}\left(\sqrt{\frac{{\mathcal{T}}(P(x,\widehat{y}=a),P_{x\widehat{y}};\bar{{\mathcal{S}}})}{c_{0}}}\cdot\epsilon_{diff}(n_{1})\cdot\|g^{*}\|_{\infty}+t_{0}M(a)\right).

Appendix E Parametric Conditional Score Estimation: Proof of Lemma B.1

We first derive a decomposition of the conditional score function similar to Chen et al. [2023a]. We have

\displaystyle p_{t}(x,y)=\int p_{t}(x,y|z)p_{z}(z)\mathrm{d}z
\displaystyle=\int p_{t}(x|z)p(y|z)p_{z}(z)\mathrm{d}z
\displaystyle=C\int\exp\left(-\frac{1}{2h(t)}\|x-\alpha(t)Az\|_{2}^{2}\right)\exp\left(-\frac{1}{\sigma^{2}_{y}}\left(\theta^{\top}z-y\right)^{2}\right)p_{z}(z)\mathrm{d}z
\displaystyle\overset{(i)}{=}C\exp\left(-\frac{1}{2h(t)}\|(I_{D}-AA^{\top})x\|_{2}^{2}\right)\cdot\int\exp\left(-\frac{1}{2h(t)}\|A^{\top}x-\alpha(t)z\|_{2}^{2}\right)\exp\left(-\frac{1}{\sigma^{2}_{y}}\left(\theta^{\top}z-y\right)^{2}\right)p_{z}(z)\mathrm{d}z,

where equality $(i)$ follows from the fact that $AA^{\top}x\perp(I_{D}-AA^{\top})x$, and $C$ is the normalizing constant of the Gaussian densities. Taking the logarithm and then the derivative with respect to $x$ of $p_{t}(x,y)$, we obtain

\displaystyle\nabla_{x}\log p_{t}(x,y)=\frac{\alpha(t)}{h(t)}\frac{A\int z\exp\left(-\frac{1}{2h(t)}\|A^{\top}x-\alpha(t)z\|_{2}^{2}\right)\exp\left(-\frac{1}{\sigma^{2}_{y}}\left(\theta^{\top}z-y\right)^{2}\right)p_{z}(z)\mathrm{d}z}{\int\exp\left(-\frac{1}{2h(t)}\|A^{\top}x-\alpha(t)z\|_{2}^{2}\right)\exp\left(-\frac{1}{\sigma^{2}_{y}}\left(\theta^{\top}z-y\right)^{2}\right)p_{z}(z)\mathrm{d}z}-\frac{1}{h(t)}x.

Note that the first term on the right-hand side above depends only on $A^{\top}x$ and $y$. Therefore, we can write $\nabla_{x}\log p_{t}(x,y)$ compactly as

\displaystyle\nabla_{x}\log p_{t}(x,y)=\frac{1}{h(t)}Au(A^{\top}x,y,t)-\frac{1}{h(t)}x, (E.1)

where the mapping $u$ is given by

u(A^{\top}x,y,t)=\frac{\alpha(t)\int z\exp\left(-\frac{1}{2h(t)}\|A^{\top}x-\alpha(t)z\|_{2}^{2}\right)\exp\left(-\frac{1}{\sigma^{2}_{y}}\left(\theta^{\top}z-y\right)^{2}\right)p_{z}(z)\mathrm{d}z}{\int\exp\left(-\frac{1}{2h(t)}\|A^{\top}x-\alpha(t)z\|_{2}^{2}\right)\exp\left(-\frac{1}{\sigma^{2}_{y}}\left(\theta^{\top}z-y\right)^{2}\right)p_{z}(z)\mathrm{d}z}.

We observe that (E.1) motivates our choice of the neural network architecture ${\mathcal{S}}$ in (4.8). In particular, $\psi$ attempts to estimate $u$ and the matrix $V$ attempts to estimate $A$.
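For concreteness, the following is a minimal PyTorch sketch of a score network of this form, $s_{V,\psi}(x,y,t)=\frac{1}{h(t)}\left(V\psi(V^{\top}x,y,t)-x\right)$, where a learnable matrix $V$ plays the role of $A$ and a small MLP $\psi$ plays the role of $u$. The layer widths, the schedule $h(t)=1-e^{-t}$, and the module names are illustrative assumptions, not the exact architecture used in our experiments.

# Minimal sketch of the score-network family motivated by (E.1):
# s(x, y, t) = (1/h(t)) * (V psi(V^T x, y, t) - x).
# Hidden width and h(t) = 1 - exp(-t) are illustrative assumptions.
import torch
import torch.nn as nn

class ConditionalScoreNet(nn.Module):
    def __init__(self, D: int, d: int, hidden: int = 128):
        super().__init__()
        self.V = nn.Parameter(torch.randn(D, d) / D ** 0.5)   # estimates A
        self.psi = nn.Sequential(                              # estimates u
            nn.Linear(d + 2, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, d),
        )

    def forward(self, x, y, t):
        # x: (B, D), y: (B, 1), t: (B, 1) with t >= t0 > 0
        h = 1.0 - torch.exp(-t)                      # noise variance h(t)
        z = x @ self.V                               # V^T x, shape (B, d)
        u_hat = self.psi(torch.cat([z, y, t], dim=-1))
        return (u_hat @ self.V.T - x) / h            # (1/h(t)) (V psi - x)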

In the Gaussian design case (Assumption 5.3), we instantiate $p_{z}(z)$ as the Gaussian density $(2\pi)^{-d/2}|\Sigma|^{-1/2}\exp\left(-\frac{1}{2}z^{\top}\Sigma^{-1}z\right)$. Some algebra on the Gaussian integral gives rise to

\displaystyle\nabla_{x}\log p_{t}(x,y)=\frac{\alpha(t)}{h(t)}AB_{t}\mu_{t}(x,y)-\frac{1}{h(t)}(I_{D}-AA^{\top})x-\frac{1}{h(t)}AA^{\top}x
\displaystyle=\frac{\alpha(t)}{h(t)}AB_{t}\left(\alpha(t)A^{\top}x+\frac{h(t)}{\nu^{2}}y\theta\right)-\frac{1}{h(t)}x, (E.2)

where we have denoted

\mu_{t}(x,y)=\alpha(t)A^{\top}x+\frac{h(t)}{\nu^{2}}y\theta\quad\text{and}\quad B_{t}=\left(\alpha^{2}(t)I_{d}+\frac{h(t)}{\nu^{2}}\theta\theta^{\top}+h(t)\Sigma^{-1}\right)^{-1}.
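The closed-form score (E.2) can be evaluated directly. Below is a small NumPy sketch, assuming the schedule $\alpha(t)=e^{-t/2}$, $h(t)=1-e^{-t}$; the function and variable names are ours for illustration only.

# Sketch evaluating the closed-form conditional score (E.2) in the
# Gaussian design. Schedule alpha(t)=exp(-t/2), h(t)=1-exp(-t) assumed.
import numpy as np

def gaussian_conditional_score(x, y, t, A, theta, Sigma_inv, nu2):
    """Compute grad_x log p_t(x, y) via (E.2)."""
    d = A.shape[1]
    alpha, h = np.exp(-t / 2.0), 1.0 - np.exp(-t)
    B_t = np.linalg.inv(alpha**2 * np.eye(d)
                        + (h / nu2) * np.outer(theta, theta)
                        + h * Sigma_inv)
    mu_t = alpha * (A.T @ x) + (h / nu2) * y * theta
    return (alpha / h) * (A @ (B_t @ mu_t)) - x / h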

E.1 Score Estimation Error

Recall that we estimate the conditional score function via minimizing the denoising score matching loss in Proposition 4.1. To ease the presentation, we denote

\displaystyle\ell(x,y;s)=\frac{1}{T-t_{0}}\int_{t_{0}}^{T}\mathbb{E}_{x^{\prime}|x}\|\nabla_{x^{\prime}}\log\phi_{t}(x^{\prime}|x)-s(x^{\prime},y,t)\|_{2}^{2}\mathrm{d}t

as the loss function for a pair of clean data $(x,y)$ and a conditional score function $s$. Further, we denote the population loss as

\displaystyle\mathcal{L}(s)=\mathbb{E}_{x,y}[\ell(x,y;s)],

whose empirical counterpart is denoted by $\widehat{\mathcal{L}}(s)=\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\ell(x_{i},y_{i};s)$.
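For intuition, the following is a minimal PyTorch sketch of this empirical denoising score matching objective, where the time integral and the expectation over $x'\mid x$ are replaced by a single Monte Carlo draw per sample. The schedule $\alpha(t)=e^{-t/2}$, $h(t)=1-e^{-t}$ and the function names are illustrative assumptions.

# Monte Carlo sketch of the denoising score matching loss hat-L(s):
# draw t ~ Unif[t0, T] and x'|x ~ N(alpha(t) x, h(t) I_D), then regress
# s(x', y, t) onto grad_{x'} log phi_t(x'|x) = -(x' - alpha(t) x)/h(t).
import torch

def dsm_loss(score_net, x, y, t0=1e-3, T=1.0):
    # x: (B, D), y: (B, 1); score_net(x_noisy, y, t) -> (B, D)
    B = x.shape[0]
    t = t0 + (T - t0) * torch.rand(B, 1)
    alpha, h = torch.exp(-t / 2), 1.0 - torch.exp(-t)
    eps = torch.randn_like(x)
    x_noisy = alpha * x + h.sqrt() * eps
    target = -(x_noisy - alpha * x) / h          # conditional score of the kernel
    return ((score_net(x_noisy, y, t) - target) ** 2).sum(dim=-1).mean()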

To bound the score estimation error, we begin with an oracle inequality. Let $\mathcal{L}^{\rm trunc}(s)$ denote the truncated loss function defined as

\displaystyle\mathcal{L}^{\rm trunc}(s)=\mathbb{E}[\ell(x,y;s)\mathds{1}\{\|x\|_{2}\leq R,|y|\leq R\}],

where $R>0$ is a truncation radius chosen as $\mathcal{O}(\sqrt{d\log d+\log K+\log\frac{n_{1}}{\delta}})$. Here $K$ is a uniform upper bound on $\|s(x,y,t)\mathds{1}\{\|x\|_{2}\leq R,|y|\leq R\}\|_{2}$ over $s\in{\mathcal{S}}$, i.e., $\sup_{s\in{\mathcal{S}}}\|s(x,y,t)\mathds{1}\{\|x\|_{2}\leq R,|y|\leq R\}\|_{2}\leq K$. We then have

\displaystyle\mathcal{L}(\widehat{s})=\mathcal{L}(\widehat{s})-\widehat{\mathcal{L}}(\widehat{s})+\widehat{\mathcal{L}}(\widehat{s})
\displaystyle=\mathcal{L}(\widehat{s})-\widehat{\mathcal{L}}(\widehat{s})+\inf_{s\in{\mathcal{S}}}\widehat{\mathcal{L}}(s)
\displaystyle\overset{(i)}{=}\mathcal{L}(\widehat{s})-\widehat{\mathcal{L}}(\widehat{s})
\displaystyle\leq\mathcal{L}(\widehat{s})-\mathcal{L}^{\rm trunc}(\widehat{s})+\mathcal{L}^{\rm trunc}(\widehat{s})-\widehat{\mathcal{L}}^{\rm trunc}(\widehat{s})
\displaystyle\leq\underbrace{\sup_{s}\left(\mathcal{L}^{\rm trunc}(s)-\widehat{\mathcal{L}}^{\rm trunc}(s)\right)}_{(A)}+\underbrace{\sup_{s}\left(\mathcal{L}(s)-\mathcal{L}^{\rm trunc}(s)\right)}_{(B)},

where equality $(i)$ holds since ${\mathcal{S}}$ contains the ground truth score function. We bound term $(A)$ by a PAC-learning concentration argument. Using the same argument as in Chen et al. [2023a, Theorem 2, term $(A)$], we have

\displaystyle\sup_{s\in{\mathcal{S}}}\ell^{\rm trunc}(x,y;s)=\mathcal{O}\left(\frac{1}{t_{0}(T-t_{0})}(K^{2}+R^{2})\right).

Applying the standard metric entropy and symmetrization technique, we can show

\displaystyle(A)=\mathcal{O}\left(\widehat{\mathfrak{R}}({\mathcal{S}})+\left(\frac{K^{2}+R^{2}}{t_{0}(T-t_{0})}\right)\sqrt{\frac{\log\frac{2}{\delta}}{2n_{1}}}\right),

where $\widehat{\mathfrak{R}}$ is the empirical Rademacher complexity of ${\mathcal{S}}$. Readers unfamiliar with these tools may refer to Theorem 3.3 of Foundations of Machine Learning (second edition, Mohri et al.) for details. The remaining step is to bound the Rademacher complexity by Dudley's entropy integral. Indeed, we have

\displaystyle\widehat{\mathfrak{R}}({\mathcal{S}})\leq\inf_{\epsilon}\left\{\frac{4\epsilon}{\sqrt{n_{1}}}+\frac{12}{n_{1}}\int_{\epsilon}^{K^{2}\sqrt{n_{1}}}\sqrt{\mathcal{N}({\mathcal{S}},\tau,\|\cdot\|_{2})}\mathrm{d}\tau\right\}.

We emphasize that the log covering number is taken with $x,y$ restricted to the truncated region. Taking $\epsilon=\frac{1}{n_{1}}$ gives rise to

\displaystyle(A)=\mathcal{O}\left(\left(\frac{K^{2}+R^{2}}{t_{0}(T-t_{0})}\right)\sqrt{\frac{\mathcal{N}({\mathcal{S}},1/n_{1})\log\frac{1}{\delta}}{n_{1}}}\right).

Here $K$ is instance-dependent and depends mainly on $d$. In the Gaussian design case, we can verify that $K$ is $\mathcal{O}(\sqrt{d})$. We thus deduce $(A)=\widetilde{\mathcal{O}}\left(\frac{1}{t_{0}}\sqrt{d^{2}\frac{\mathcal{N}({\mathcal{S}},1/n_{1})\log\frac{1}{\delta}}{n_{1}}}\right)$. In practice, $d$ is often much smaller than $D$ (see, for example, Pope et al. [2021], where ImageNet has intrinsic dimension no more than $43$, in contrast to an image resolution of $224\times 224\times 3$). In this way, $d^{2}$ can be upper bounded by $D$, although $d^{2}$ is often the tighter of the two.

For term $(B)$, we invoke the same upper bound as in Chen et al. [2023a, Theorem 2, term $(B)$] to obtain

\displaystyle(B)=\mathcal{O}\left(\frac{1}{n_{1}t_{0}(T-t_{0})}\right),

which is negligible compared to $(A)$. Therefore, summing up $(A)$ and $(B)$, we deduce

\displaystyle\epsilon_{diff}^{2}=\mathcal{O}\left(\frac{1}{t_{0}}\sqrt{\frac{\mathcal{N}({\mathcal{S}},1/n_{1})(d^{2}\vee D)\log\frac{1}{\delta}}{n_{1}}}\right).

E.2 Gaussian Design

We only need to find the covering number in the Gaussian design case. Using (E.2), we can construct a covering of ${\mathcal{S}}$ from coverings of the matrices $V$ and $\Sigma^{-1}$. Suppose $V_{1},V_{2}$ are two matrices with $\|V_{1}-V_{2}\|_{2}\leq\eta_{V}$ for some $\eta_{V}>0$. Meanwhile, let $\Sigma^{-1}_{1},\Sigma^{-1}_{2}$ be two inverse covariance matrices with $\|\Sigma^{-1}_{1}-\Sigma^{-1}_{2}\|_{2}\leq\eta_{\Sigma}$. Then we bound

\displaystyle\quad\sup_{\|x\|_{2}\leq R,|y|\leq R}\|s_{V_{1},\Sigma^{-1}_{1}}(x,y,t)-s_{V_{2},\Sigma^{-1}_{2}}(x,y,t)\|_{2}
\displaystyle\leq\frac{1}{h(t)}\sup_{\|x\|_{2}\leq R,|y|\leq R}\Big{[}\big{\lVert}V_{1}\psi_{\Sigma^{-1}_{1}}(V_{1}^{\top}x,y,t)-V_{1}\psi_{\Sigma^{-1}_{1}}(V_{2}^{\top}x,y,t)\big{\rVert}_{2}
\displaystyle\quad+\underbrace{\big{\lVert}V_{1}\psi_{\Sigma^{-1}_{1}}(V_{2}^{\top}x,y,t)-V_{1}\psi_{\Sigma^{-1}_{2}}(V_{2}^{\top}x,y,t)\big{\rVert}_{2}}_{(\spadesuit)}+\big{\lVert}V_{1}\psi_{\Sigma^{-1}_{2}}(V_{2}^{\top}x,y,t)-V_{2}\psi_{\Sigma^{-1}_{2}}(V_{2}^{\top}x,y,t)\big{\rVert}_{2}\Big{]}
\displaystyle\leq\frac{1}{h(t)}\left(2R\eta_{V}+2\nu^{-2}R\eta_{\Sigma}\right),

where to bound $(\spadesuit)$, we invoke the matrix inequality $\|(I+A)^{-1}-(I+B)^{-1}\|_{2}\leq\|B-A\|_{2}$, valid for positive semidefinite $A,B$. Further taking the supremum over $t\in[t_{0},T]$ leads to

\displaystyle\sup_{\|x\|_{2}\leq R,|y|\leq R}\|s_{V_{1},\Sigma^{-1}_{1}}(x,y,t)-s_{V_{2},\Sigma^{-1}_{2}}(x,y,t)\|_{2}\leq\frac{1}{t_{0}}\left(2R\eta_{V}+2\nu^{-2}R\eta_{\Sigma}\right)

for any $t\in[t_{0},T]$. Therefore, the inequality above shows that coverings of $V$ and $\Sigma^{-1}$ induce a covering of ${\mathcal{S}}$. The log covering numbers of $V$ and $\Sigma^{-1}$ can be obtained directly by a volume ratio argument; we have

\displaystyle\mathcal{N}(V,\eta_{V},\|\cdot\|_{2})\leq Dd\log\left(1+\frac{2\sqrt{d}}{\eta_{V}}\right)\quad\text{and}\quad\mathcal{N}(\Sigma^{-1},\eta_{\Sigma},\|\cdot\|_{2})\leq d^{2}\log\left(1+\frac{2\sqrt{d}}{\lambda_{\min}\eta_{\Sigma}}\right).

Thus, the log covering number of ${\mathcal{S}}$ is

\displaystyle\mathcal{N}({\mathcal{S}},\eta,\|\cdot\|_{2})\leq\mathcal{N}(V,t_{0}\eta/2R,\|\cdot\|_{2})+\mathcal{N}(\Sigma^{-1},t_{0}\nu^{2}\eta/2R,\|\cdot\|_{2})
\displaystyle\leq(Dd+d^{2})\log\left(1+\frac{dD}{t_{0}\lambda_{\min}\eta}\right),

where we have plugged $\nu^{2}=1/D$ into the last inequality. Setting $\eta=1/n_{1}$ and substituting into $\epsilon_{diff}^{2}$ yields the desired result.

We remark that the analysis here does not attempt to optimize the error bounds, but aims to provide a provable guarantee for conditional score estimation using finite samples. We expect that a sharper analysis via Bernstein-type concentration may yield a better dependence on $n_{1}$. Nonetheless, the optimal dependence should not beat a $1/n_{1}$ rate.

Appendix F Additional Experimental Results

F.1 Simulation

We generate the latent sample $z$ from the standard normal distribution $z\sim{\sf N}(0,I_{d})$ and set $x=Az$ for a randomly generated orthonormal matrix $A\in\mathbb{R}^{D\times d}$. The dimensions are set to $d=16$ and $D=64$. The reward function is $f(x)=(\theta^{*})^{\top}x_{\parallel}-5\|x_{\perp}\|^{2}_{2}$, where $\theta^{*}=A\beta^{*}$ and $\beta^{*}$ is sampled uniformly from the unit sphere. A minimal data-generation sketch is given below.
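# Sketch of the simulation data-generating process described above
# (d = 16, D = 64, x = A z with orthonormal A, f(x) = theta*^T x_par - 5 ||x_perp||^2).
# The sample size and random seed are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, D, n = 16, 64, 8192

A, _ = np.linalg.qr(rng.standard_normal((D, d)))   # orthonormal columns
beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)                       # beta* uniform on the unit sphere
theta = A @ beta                                   # theta* = A beta*

z = rng.standard_normal((n, d))                    # latent z ~ N(0, I_d)
x = z @ A.T                                        # x = A z, shape (n, D)

x_par = x @ A @ A.T                                # projection onto span(A)
x_perp = x - x_par
reward = x_par @ theta - 5.0 * np.sum(x_perp**2, axis=1)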

When estimating $\widehat{\theta}$, we set $\lambda=1.0$ (see the sketch below). The score-matching network is based on the UNet implementation from https://github.com/lucidrains/denoising-diffusion-pytorch, where we modified the class embedding so that it accepts continuous input. The predictor is trained using $8192$ samples and the score function is trained using $65536$ samples. When training the score function, we use the Adam optimizer with learning rate $8\times 10^{-5}$. We train the score function for $10$ epochs, where each epoch performs a full pass over the training dataset with batch size $32$.
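Assuming $\widehat{\theta}$ is a ridge (regularized least-squares) estimate with regularization parameter $\lambda$, consistent with the $\lambda=1.0$ above, a minimal sketch is as follows; the exact estimator is defined in the main text, so treat this as illustrative.

# Ridge-regression sketch for the reward predictor theta-hat with lambda = 1.0,
# given designs X (n x D) and noisy rewards y (n,). Illustrative implementation.
import numpy as np

def ridge_theta_hat(X, y, lam=1.0):
    """Solve (X^T X + lam I)^{-1} X^T y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)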

For evaluation, the statistics are computed using $2048$ samples generated from the diffusion model. The curves in the figures are averaged over $5$ runs.

F.2 Directed Text-to-Image Generation

Samples with high and low rewards from the ground-truth reward model. In Section 7.2, the ground-truth reward model is built by replacing the final prediction layer of an ImageNet pre-trained ResNet-18 with a randomly initialized linear layer with scalar output. To inspect what this randomly generated reward model measures, we generate images using Stable Diffusion, keep those with rewards $\geq 0.4$ (positive samples) and rewards $\leq-0.4$ (negative samples), and pick two typical images of each kind; see Figure 9, followed by a sketch of the reward-model construction. We note that in real-world use cases, ground-truth rewards are often measured and annotated by human labelers according to the task requirements.

Figure 9: Random samples with high rewards and low rewards. Panels (a) and (b) are positive samples; panels (c) and (d) are negative samples.
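The following PyTorch sketch illustrates this reward-model construction; the torchvision weights enum shown is the API of recent versions (older versions use pretrained=True), and freezing the parameters reflects that the model serves as a fixed ground-truth reward.

# Sketch of the ground-truth reward model: ImageNet pre-trained ResNet-18
# with its classification head replaced by a random scalar linear layer.
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

def build_reward_model():
    model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, 1)   # randomly initialized scalar head
    for p in model.parameters():
        p.requires_grad_(False)                     # fixed "ground-truth" reward
    return model.eval()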

Training Details. In our implementation, since the Stable Diffusion model operates on the latent space of its VAE, we build a 3-layer ConvNet with residual connections and batch normalization on top of the VAE latent space. We train the network using the Adam optimizer with learning rate $0.001$ for $100$ epochs. A sketch of this latent reward network is given below.
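A minimal PyTorch sketch of such a latent-space reward network follows; the channel widths and the assumption of a 4-channel VAE latent are illustrative choices, not the exact configuration used in our experiments.

# Sketch of a 3-layer ConvNet with residual connections and batch
# normalization on the Stable Diffusion VAE latent (assumed 4 channels).
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.bn = nn.BatchNorm2d(ch)

    def forward(self, x):
        return torch.relu(x + self.bn(self.conv(x)))    # residual connection

class LatentRewardNet(nn.Module):
    def __init__(self, in_ch: int = 4, width: int = 64):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(in_ch, width, 3, padding=1),
                                  nn.BatchNorm2d(width), nn.ReLU())
        self.blocks = nn.Sequential(ResBlock(width), ResBlock(width))
        self.head = nn.Linear(width, 1)                 # scalar reward

    def forward(self, z):                               # z: (B, 4, H, W)
        h = self.blocks(self.stem(z))
        return self.head(h.mean(dim=(2, 3)))            # global average pool

Training this network with Adam at learning rate 0.001 for 100 epochs then matches the recipe above.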