
Self-Correcting Self-Consuming Loops for Generative Model Training

Nate Gillman    Michael Freeman    Daksh Aggarwal    Chia-Hong Hsu    Calvin Luo    Yonglong Tian    Chen Sun
Abstract

As synthetic data becomes higher quality and proliferates on the internet, machine learning models are increasingly trained on a mix of human- and machine-generated data. Despite the success stories of using synthetic data for representation learning, using synthetic data for generative model training creates “self-consuming loops” which may lead to training instability or even collapse, unless certain conditions are met. Our paper aims to stabilize self-consuming generative model training. Our theoretical results demonstrate that by introducing an idealized correction function, which maps a data point to be more likely under the true data distribution, self-consuming loops can be made exponentially more stable. We then propose self-correction functions, which rely on expert knowledge (e.g. the laws of physics programmed in a simulator), and aim to approximate the idealized corrector automatically and at scale. We empirically validate the effectiveness of self-correcting self-consuming loops on the challenging human motion synthesis task, and observe that it successfully avoids model collapse, even when the ratio of synthetic data to real data is as high as 100%.

Keywords: Machine Learning, Generative Modeling, Self-Consuming Loops, Data Contamination, Deep Learning, Artificial Intelligence, Human Motion Synthesis

1 Introduction

Figure 1: What happens after iteratively training a text-conditioned generative model for human motion synthesis for 50 generations? We simulate a self-consuming loop by creating synthetic data with the latest generative model, and mixing them with the original data to continue training the next generative model. We observe that by self-correcting the synthetic data with a physics simulator, the model can successfully avoid collapse and generate high-quality human motion. Faded poses represent poses from further back in time. Our paper provides theoretical and empirical justification for the self-correcting self-consuming loop.

Generative models have been used to synthesize training data for various learning tasks, to varying degrees of success. For example, for the tasks of image classification and contrastive representation learning, recent work (Azizi et al., 2023; Tian et al., 2023) finds that using data synthesized from generative models rivals using real data. Unfortunately, there is a gloomier outlook when attempting to generalize this framework to generative model training.

On one hand, there is evidence to suggest that training a generative model with its own outputs in a self-consuming manner will lead to collapse (Alemohammad et al., 2024). For example, after 50 iterations of self-consuming training, a human motion diffusion model (Tevet et al., 2023) collapses and fails to follow the text prompts or the laws of physics (see the two examples on the left of Figure 1).

On the other hand, evidence suggests that such a framework could avoid collapse, but only when a “moderate” amount of synthetic data is used (Bertrand et al., 2024). Worse still, this self-consuming scenario might happen without our knowledge, and without our being able to quantify how much synthetic data is used during training, given how widespread AI-generated content has become on the internet.

Intuitively, model collapse might be delayed or avoided by incorporating higher quality human-generated data (Alemohammad et al., 2024), or by manually fixing the “mistakes” in machine-created data. Considering the size of datasets used in practice (Schuhmann et al., 2022), neither of these options is a scalable solution.

In this paper, we aim to provide a theoretical analysis of how certain operations can avoid collapse in self-consuming loops, without any assumptions on the “moderateness” of synthetic data corruption. We introduce the mathematical abstraction of a self-correction operation. This operation maps synthesized data sampled from the generative model to data that are better representatives of the target probability distribution that the model is attempting to approximate. Instead of training on a combination of real data and synthesized data, we propose training on a combination of real data and synthesized-and-then-self-corrected data. Note that injecting fresh human-generated data can be viewed as a special case of this operation.

Our main theoretical findings (Theorem 4.3):

  1. The self-consuming model with self-correction is exponentially more stable than the self-consuming model without any self-correction.

  2. The self-correction procedure guarantees less unwanted variance during self-consuming model training.

In our theoretical study, we assume that correction is ideal in order to obtain rigorous performance guarantees. In our empirical study, we evaluate whether the same conclusions hold for noisy self-correction functions. We propose to automate this “self-correction” process by relying on programmed expert knowledge rather than a human-in-the-loop, such that the function can be applied at scale. We focus on the human motion synthesis task (Guo et al., 2022), and implement the self-correction function with a physics simulator-based imitation model (Luo et al., 2021). Our empirical results confirm that our theoretical findings hold in practice:

  1. As illustrated in Figure 1, the self-correcting self-consuming model generates higher-quality human motion than the one without any self-correction.

  2. The self-correction function allows self-consuming loops to avoid collapse even at a high ratio of synthetic data to real data (e.g. 100%).

Our theory and experiments suggest that self-correction should stabilize self-consuming model training for any generative modeling task for which there exists a high-quality “self-correction” function. We have released all the code associated with this paper (project page: https://nategillman.com/sc-sc.html).

2 Related Work

2.1 Learning Representations with Synthetic Data

Real curated datasets are costly to obtain, so there has been much interest in generating synthetic data as training data for various vision tasks. Azizi et al. (2023) demonstrate that text-to-image diffusion models such as Imagen (Saharia et al., 2022) can generate synthetic examples that augment the ImageNet dataset for better image classification. He et al. (2023) study how synthetic data from text-to-image models, used exclusively, can serve as training data for image recognition tasks. Similarly, Tian et al. (2023) find that using synthetic outputs from a text-to-image model results in contrastive models whose downstream performance rivals that of CLIP (Radford et al., 2021) on visual recognition tasks, including dense prediction. Jahanian et al. (2022) explore methods for multi-view representation learning by using the latent space of generative models to generate multiple “views” of the synthetic data. The above works collectively provide evidence that training on synthetic data from existing generative models can yield excellent results for some representation learning tasks.

2.2 Training Generative Models on Synthetic Data

Another line of research investigates the use of synthetic data for training generative models. Shumailov et al. (2023) and Martínez et al. (2024) show that the use of model-generated content in generative model training results in model degradation, likely because self-consuming loops remove low-density areas from the estimated probability manifold. Alemohammad et al. (2024) formalize three different kinds of self-consuming generative models: the fully synthetic loop, the synthetic augmentation loop, and the fresh data loop. In all of these loops, they iteratively re-train the model from scratch for every new generation. They empirically find that only the fresh data loop avoids model degradation.

Another recent work (Bertrand et al., 2024) considers the problem of iterative fine-tuning in the context of synthetic augmentation loops. They find that self-consuming augmentation loops do not necessarily collapse, so long as the synthetic augmentation percentage is sufficiently low. The authors use techniques from the field of performative stability (Perdomo et al., 2020) to prove the existence of a convergence phenomenon in the space of model parameters. Our paper differs from prior work in that we analyze self-consuming generative model training when the synthetic data can be optionally corrected. The correction can be performed with a human-in-the-loop, or by incorporating learned or programmed expert knowledge, as explored for natural language (Saunders et al., 2022; Welleck et al., 2023; Wu et al., 2023) and human motion (Yuan et al., 2023; Xu et al., 2023). We validate our theory with practical self-correction operations designed for image generation and human motion synthesis tasks.

Algorithm 1 Iterative Fine-tuning of a Generative Model With Correction
  Input: 𝒟_real := {x_1, …, x_n}, 𝒜, 𝒜_ft, π_γ // ground truth data, learning procedure, fine-tuning procedure, correction function
  Parameters: T, λ, γ // number of retraining iterations, proportion of generated data, correction strength
  p_{θ_0} ← 𝒜(𝒟_real) // learn generative model from scratch on true data
  for t = 1 to T do
     𝒟_synth ← {π_γ(x̃_i)}_{i=1}^{⌊λ·n⌋}, with x̃_i ∼ p_{θ_{t-1}} // sample ⌊λ·n⌋ synthetic data points, pass through correction function
     p_{θ_t} ← 𝒜_ft(𝒟_real ∪ 𝒟_synth; p_{θ_{t-1}}) // fine-tune previous generation using augmented dataset
  end for
  Return [p_{θ_0}, p_{θ_1}, p_{θ_2}, …, p_{θ_T}]

3 Overall Training Procedure

We describe our proposed procedure concisely in Algorithm 1, and we explain it in more detail here. We train the zeroth generation from scratch on the ground truth dataset 𝒟_real := {x_1, …, x_n}, and we stop training when the model is close to convergence. For all following generations, we fine-tune the previous generation's latest checkpoint on a combination of the ground truth dataset 𝒟_real and ⌊λ·n⌋ synthetic data points, which are generated from the previous generation's latest checkpoint and then passed through the correction function π_γ.

The correction function π_γ is parameterized by the correction strength γ ∈ ℝ_{≥0}, which controls how strongly the correction function pushes an input data point towards higher likelihood under the target distribution. The other main hyperparameter, λ ∈ ℝ_{≥0}, is the synthetic augmentation percentage; it controls the ratio of synthetic data to real data in each iteration of fine-tuning. When γ = 0, we recover the iterative re-training with synthetic augmentation considered in (Bertrand et al., 2024). And if we choose the synthetic augmentation percentage to be λ = 0, then each generation simply corresponds to fine-tuning the model on the same dataset that it was trained on initially.

From now on we use iterative fine-tuning interchangeably with the more general term self-consuming loop. We consider an idealized correction function for our theoretical analysis, and a broader family of practical correction functions for different data types.
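To make Algorithm 1 concrete, here is a minimal Python sketch of one self-correcting self-consuming loop. The callables train_from_scratch, fine_tune, sample, and correct are placeholders for the learning procedure 𝒜, the fine-tuning procedure 𝒜_ft, sampling from p_θ, and the correction function π_γ; they are illustrative names, not part of any released codebase.

  import math

  def iterative_finetune_with_correction(d_real, train_from_scratch, fine_tune,
                                         sample, correct, T=50, lam=0.25):
      """Sketch of Algorithm 1: iterative fine-tuning with correction."""
      n = len(d_real)
      models = [train_from_scratch(d_real)]            # generation 0: trained on real data only
      for t in range(1, T + 1):
          m = math.floor(lam * n)                      # number of synthetic points per generation
          d_synth = [correct(sample(models[-1])) for _ in range(m)]  # synthesize, then self-correct
          d_aug = list(d_real) + d_synth               # augmented dataset D_real ∪ D_synth
          models.append(fine_tune(d_aug, models[-1]))  # fine-tune the previous generation
      return models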

4 Theoretical Analysis

4.1 Preliminaries

We mostly follow the notation of (Bertrand et al., 2024), except that we introduce the correction function π_γ. Let p_data denote the ground truth probability distribution that we want to train a generative model to estimate. Suppose we have some dataset 𝒟_real = {x_1, …, x_n} sampled from p_data. We write p̂_data = (1/n) ∑_{i=1}^n δ_{x_i}. More generally, we use a hat to denote the empirical distribution over finitely many samples from the corresponding distribution.

Suppose that we have a class of generative models parameterized by Θ ⊂ ℝ^d. We denote by p_θ a probability distribution in this class with model parameters θ ∈ Θ. We define the optimal model parameters within this class to be

\theta^{\star}=\operatorname*{arg\,max}_{\theta^{\prime}\in\Theta}\mathbb{E}_{x\sim p_{\mathrm{data}}}[\log p_{\theta^{\prime}}(x)], (1)

where we break ties by minimizing ‖θ*‖. Typically, such optimal parameters yield a model p_{θ*} which closely approximates the oracle ground truth distribution p_data, but does not equal it exactly; accordingly, we define the Wasserstein-2 distance between the two distributions to be

\varepsilon:=d_{W}(p_{\theta^{\star}},p_{\mathrm{data}}). (2)

The model weights for the first generation are naturally defined by the optimization

\theta_{0}^{n}:=\operatorname*{arg\,max}_{\theta^{\prime}\in\Theta}\big[\mathbb{E}_{x\sim\hat{p}_{\mathrm{data}}}[\log p_{\theta^{\prime}}(x)]\big]. (3)

This corresponds to training on the finite dataset 𝒟_real. Next, suppose that the model weights from generation t are denoted θ_t^n. We will formalize a procedure for updating these weights to obtain θ_{t+1}^n for the next generation. For this, we first define our correction function, and then use it to define the weight update.

Definition 4.1.

For any model parameters θ ∈ Θ and any correction strength γ ∈ ℝ_{≥0}, we define the correction of strength γ of the distribution p_θ to be the distribution

\pi_{\gamma}p_{\theta}(x):=\frac{p_{\theta}(x)+\gamma p_{\theta^{\star}}(x)}{1+\gamma}, (4)

where p_{θ*} is defined in (1). For any augmentation percentage λ ≥ 0, we define the weight update mapping to be

\pi_{\gamma}\mathcal{G}_{\lambda}^{n}(\theta):=\operatorname*{local\,argmax}_{\theta^{\prime}\in\Theta}\hat{\mathcal{H}}(\theta,\theta^{\prime}):=\operatorname*{local\,argmax}_{\theta^{\prime}\in\Theta}\Big[\mathbb{E}_{x\sim\hat{p}_{\mathrm{data}}}[\log p_{\theta^{\prime}}(x)]+\lambda\,\mathbb{E}_{x\sim\widehat{\pi_{\gamma}p_{\theta}}}[\log p_{\theta^{\prime}}(x)]\Big], (5)

where p̂_data and the empirical distribution of π_γ p_θ (denoted with a hat) have sizes n and ⌊λ·n⌋ respectively.

To continue our discussion from before, our iterative weight update is defined as θ_{t+1}^n := π_γ 𝒢_λ^n(θ_t^n).

Note that we use a global maximization in (3) when defining the initial parameters θ_0^n, but a local maximization when computing our parameter update in (5). This difference is analogous to the difference between initial training, where parameter updates are more global, and fine-tuning, where parameter updates are more local.

4.1.1 Understanding the correction π_γ p_θ(x)

For γ = 0, the correction mapping in (4) simplifies to π_0 p_θ = p_θ, the original distribution; this corresponds to no correction at all. For γ = 1, it is π_1 p_θ = (p_θ + p_{θ*})/2. And for γ = ∞, it is π_∞ p_θ = p_{θ*}, which corresponds to the optimal distribution. So as γ increases from 0 to ∞, the distribution π_γ p_θ has a likelihood profile that matches p_θ less and p_{θ*} more. As p_{θ*} is the optimal model in our generative model class, this means that as γ increases from 0 to ∞, the density π_γ p_θ(x) better represents the target likelihood that we want to estimate by training the generative model.
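Because π_γ p_θ is a convex combination of p_θ (weight 1/(1+γ)) and p_{θ*} (weight γ/(1+γ)), sampling from it amounts to flipping a biased coin and drawing from the chosen component. Below is a minimal Python sketch of this view; it assumes we can sample from both p_θ and p_{θ*}, which only holds in idealized settings such as the Gaussian toy example of Section 5.

  import random

  def sample_corrected(sample_p_theta, sample_p_star, gamma):
      """Draw one sample from the corrected distribution pi_gamma p_theta of Eq. (4).

      With probability gamma / (1 + gamma), draw from the optimal model p_theta*;
      otherwise draw from the current model p_theta.
      """
      if random.random() < gamma / (1.0 + gamma):
          return sample_p_star()
      return sample_p_theta()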

In our theoretical formulation, we consider correction functions that correct the probability distribution p_θ, rather than the more intuitive (and practical) case of a correction function that corrects individual points that the distribution is defined over. In Appendix C, we specify sufficient conditions under which a pointwise correction function is guaranteed to correspond to a distribution-wise correction function of the same form as those we consider in our theoretical study, and therefore can enjoy the theoretical stability guarantees we prove. We also provide a concrete example of a projection function, in the Gaussian case, which provably satisfies those conditions. We conduct a series of experiments on this toy example in Section 5.

4.1.2 Understanding the weight update π_γ 𝒢_λ^n(θ)

The weight update π_γ 𝒢_λ^n(θ) in (5) formalizes the intended output of fine-tuning p_θ on 𝒟_real ∪ 𝒟_synth, where 𝒟_real = {x_1, …, x_n} is the ground truth dataset of size n, and 𝒟_synth is the synthesized-and-corrected dataset of size ⌊λ·n⌋, whose points x̃_i are sampled from the empirical corrected distribution. In other words, in an ideal run of stochastic gradient descent fine-tuning on 𝒟_real ∪ 𝒟_synth, the model weights θ should update to π_γ 𝒢_λ^n(θ), as defined in (5).

Intuitively, the weight update θ ↦ π_γ 𝒢_λ^n(θ) avoids the loss of variance in the generated data by ensuring that, at each step, the model is trained on synthetic data which is likelier to have been sampled from the diverse target distribution. This positive phenomenon is more pronounced when the correction strength γ is larger.
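For likelihood-based models, the quantity being maximized in (5) is simply a weighted sum of two empirical log-likelihood terms, as in the sketch below. This is schematic: diffusion models such as MDM optimize a denoising surrogate rather than an exact log-likelihood, and log_prob is an assumed method on the model, not part of any particular codebase.

  def augmented_objective(model, d_real, d_synth_corrected, lam):
      """Empirical objective from Eq. (5): mean log-likelihood on real data plus
      lambda times the mean log-likelihood on synthesized-and-corrected data."""
      real_term = sum(model.log_prob(x) for x in d_real) / len(d_real)
      synth_term = sum(model.log_prob(x) for x in d_synth_corrected) / len(d_synth_corrected)
      return real_term + lam * synth_term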

4.2 Assumptions

In order to prove our main result, we need some regularity assumptions about the learning procedure. Informally speaking, we will assume that the class of generative models that we consider is smoothly parameterized by its model weights; the loss landscape is concave near the ideal model weights; and the class of generative models does an increasingly good job approximating the target data distribution as the dataset size increases. We formally quantify and state these hypotheses in Assumption 4.2.

Assumption 4.2.

The following are true.

  1. There exists some L > 0 such that, for all θ sufficiently close to θ*, the mapping x ↦ ∇²_θ log p_θ(x) is L-Lipschitz.

  2. The mapping θ ↦ 𝔼_{x∼p_data}[log p_θ(x)] is twice continuously differentiable locally around θ*, and there exists some α > 0 such that 𝔼_{x∼p_data}[∇²_θ log p_θ(x)]|_{θ*} ⪯ -α I_d ≺ 0.

  3. There exist a, b, ε_OPT ≥ 0 and a neighborhood U of θ* such that, for any δ ∈ (0,1), with probability 1-δ over the samplings, we have

\|\pi_{\gamma}\mathcal{G}_{\lambda}^{n}(\theta)-\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta)\|\leq\varepsilon_{\text{OPT}}+\frac{a}{\sqrt{n}}\sqrt{\log\frac{b}{\delta}} (6)

     for all θ ∈ U and n ∈ ℕ. Denote this bound by τ_n(δ). (Here the map π_γ 𝒢_λ^∞ is defined like π_γ 𝒢_λ^n in (5), but with p̂_data replaced by p_data and with the empirical corrected distribution replaced by π_γ p_θ; see Appendix A for more details. This estimate is identical to the analogous Assumption 3 used in (Bertrand et al., 2024), the only difference being that it applies to our iterative fine-tuning update function; see Appendix B for further discussion.)

In Assumption 4.2 (2), the notation “⪯” denotes the Loewner order on symmetric matrices: we write A ⪯ B if B − A is positive semi-definite, and A ≺ B if B − A is positive definite. In particular, Assumption 4.2 (2) implies that the matrix 𝔼_{x∼p_data}[∇²_θ log p_θ(x)]|_{θ*} is negative definite, and its largest eigenvalue is at most −α. Assumption 4.2 (3) mirrors the main assumption in (Bertrand et al., 2024); it is motivated by generalization bounds in deep learning, see e.g. (Jakubovitz et al., 2019; Ji et al., 2021). The interested reader can consult Appendix B for more details on this assumption.

4.3 Iterative Fine-Tuning with Correction

We now have the language to state our main result, which essentially says the following: if the initial parameters θ_0 are sufficiently close to the optimal model parameters θ*, and if the augmentation percentage λ is sufficiently small, then under iterative fine-tuning with correction we can expect the subsequent model parameters to stay close to θ*.

Theorem 4.3 (Stability of Iterative Fine-Tuning with Correction).

Fix an augmentation percentage λ ∈ ℝ_{>0} and a correction strength γ ∈ ℝ_{≥0}. Suppose we have an iterative fine-tuning procedure defined by the rule θ_{t+1}^n = π_γ 𝒢_λ^n(θ_t^n), and suppose that Assumption 4.2 holds. Define the constant

\rho(\lambda):=\rho(\lambda;\alpha,\varepsilon,L):=\frac{\lambda(\alpha+\varepsilon L)}{\alpha-\lambda(\alpha+\varepsilon L)}

and fix any δ ∈ (0,1). If θ_0 is sufficiently close to θ*, and if λ(1 + εL/α) < (1+γ)/(2+γ), then ρ(λ)/(1+γ) < 1, and the following stability estimate holds with probability 1−δ:

\|\theta_{t}^{n}-\theta^{\star}\|\leq\tau_{n}(\delta/t)\sum_{i=0}^{t}\left(\frac{\rho(\lambda)}{1+\gamma}\right)^{i}+\left(\frac{\rho(\lambda)}{1+\gamma}\right)^{t}\|\theta_{0}^{n}-\theta^{\star}\| (7)

for all t > 0.

We prove Theorem 4.3 in Appendix A.
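For intuition about the exponential stability claim, note that whenever r := ρ(λ)/(1+γ) < 1, the bound (7) can be relaxed with the geometric series (a short supplementary computation, not part of the original proof):

\|\theta_{t}^{n}-\theta^{\star}\|\;\leq\;\tau_{n}(\delta/t)\sum_{i=0}^{t}r^{i}+r^{t}\,\|\theta_{0}^{n}-\theta^{\star}\|\;\leq\;\frac{\tau_{n}(\delta/t)}{1-r}+r^{t}\,\|\theta_{0}^{n}-\theta^{\star}\|,\qquad r:=\frac{\rho(\lambda)}{1+\gamma},

so the influence of the initial error ‖θ_0^n − θ*‖ decays exponentially in t, and increasing γ shrinks r, which tightens both terms.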

Remark 4.4.

If we apply Theorem 4.3 with correction strength γ = 0, then the iterative fine-tuning procedure trains successively on a combination of ground truth data and raw synthetic data that has not been passed through any correction function. This is exactly the case considered in (Bertrand et al., 2024). Accordingly, the bound in (7), applied with γ = 0, exactly recovers their result.

Corollary 4.5.

Under the assumptions of Theorem 4.3, iterative fine-tuning with any amount of correction outperforms iterative fine-tuning without correction, in the sense that it is exponentially more stable and results in better model weights.

Proof of Corollary 4.5.

We apply Theorem 4.3 with γ = 0, which corresponds to no correction, as well as with γ > 0, which corresponds to any amount of correction. For any γ > 0, the RHS of (7) is strictly smaller than when γ = 0. This guarantees better stability as t → ∞, as well as model weights θ_t^n closer to θ*. ∎

Example 4.6.

If we apply Theorem 4.3 with correction strength γ → ∞, then the bound (7) in Theorem 4.3 tends to τ_n(δ/t). This implies that the practical iterate θ_t^n approaches the ideal model parameters, and is at worst some constant away, a constant that depends on the error from the optimization procedure as well as the statistical error from using only finitely many (n) ground truth data samples.

Note that Theorem 4.3 relies on the assumption that the initial model parameters θ_0 are sufficiently close to the ideal model parameters θ*, and also that the augmentation percentage λ is sufficiently small. We hypothesize that these assumptions can be relaxed when a correction function participates in the iterative fine-tuning procedure: intuitively, the correction function should compensate for errors that arise from a worse initialization θ_0^n, as well as errors that arise from incorporating more synthetic data. We frame this in the following conjecture.

Conjecture 4.7.

In the case of iterative fine-tuning with correction, we may relax how close the initial model parameters θ_0^n need to be to the optimal model parameters θ*, and also choose a larger synthetic augmentation percentage λ, while still retaining the improved stability estimate (7).

We provide empirical evidence for Conjecture 4.7 in Section 7 on the human motion synthesis task. In fact, Theorem 4.3 already represents partial progress towards this conjecture. Namely, according to Theorem 4.3, for large correction strength γ we can choose a synthetic augmentation percentage roughly twice as large as we could without any correction, and still meet the assumptions of the theorem. This is because lim_{γ→∞} (1+γ)/(2+γ) = 1, which is twice as large as the bound of 1/2 obtained when γ = 0.

5 Toy Example: Gaussian

We first assume oracle knowledge of the ground truth distribution, and use a toy example to directly demonstrate the impact of the correction strength γ on model performance and stability, as stated in Theorem 4.3 and Corollary 4.5. Our ground truth distribution is a 2-dimensional isotropic Gaussian centered at the origin, i.e., θ* = ((0,0), I_2), and our correction is “distribution-wise” in this idealized scenario. We consider the more practical setting, where we do not have oracle knowledge of the target distribution a priori and where the data correction is “point-wise”, in the empirical studies of the following two sections. Further, in Appendix C, we show that, in theory, sufficiently well-behaved pointwise correction functions indeed correspond to distribution-wise correction functions.


Figure 2: Empirical results from our Gaussian toy example. The graph demonstrates that increasing the correction strength γ, with a fixed augmentation ratio of λ = 0.5, improves performance and stability after self-consuming iterations.

Concretely, our ground truth dataset contains 50 points sampled from the target distribution, which are used to estimate θ_0^{50} = (μ_0, Σ_0) ∈ ℝ^6. We fix our synthetic augmentation percentage at λ = 0.5, and inductively synthesize a new dataset 𝒟_synth = {y_i ∼ 𝒩(μ_t, Σ_t)}_{i=1}^{25}. We implement a correction function that maps 𝒟_synth, which was sampled from p_{θ_t^{50}}, to a dataset 𝒟_corrected, which is likelier to have been sampled from the target density p_{θ*}. We do this by sampling 𝒟_corrected from the mixture density corresponding to a given correction strength γ:

\pi_{\gamma}\hat{p}_{\theta_{t}^{50}}(x):=\frac{\hat{p}_{\theta_{t}^{50}}(x)+\gamma p_{\theta^{\star}}(x)}{1+\gamma}, (8)

where p̂_{θ_t^{50}} is the empirical PDF obtained from 𝒟_synth.

We accrue synthetic data points logarithmically in order to simulate the case of fine-tuning. We obtain the updated model parameters θ_{t+1}^{50} by computing the sample mean and covariance of this augmented dataset. In Figure 2, we present the Wasserstein distance between the origin-centered isotropic Gaussian target distribution and the distribution defined by the parameters θ_t^{50} at each iteration t. Our results illustrate how increasing the correction strength γ adds stability and results in convergence near better Wasserstein scores in later generations, in accordance with Theorem 4.3. The experiments also demonstrate how even a very small increase in γ can improve performance over the baseline, in accordance with our claim of exponential improvement in Corollary 4.5.
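A minimal NumPy sketch of this toy loop is below. It simplifies our actual setup in two labeled ways: synthetic points accumulate linearly rather than on the logarithmic schedule described above, and each corrected point is drawn from either the fitted Gaussian or the true target (rather than from the empirical mixture in (8)); variable names are illustrative.

  import numpy as np

  rng = np.random.default_rng(0)
  mu_star, cov_star = np.zeros(2), np.eye(2)          # target: isotropic Gaussian at the origin
  d_real = rng.multivariate_normal(mu_star, cov_star, size=50)

  def fit(points):
      """Estimate (mu, Sigma) via the sample mean and covariance."""
      return points.mean(axis=0), np.cov(points, rowvar=False)

  def run_loop(gamma, T=50, lam=0.5):
      mu, cov = fit(d_real)                            # generation 0
      data = d_real.copy()
      for t in range(T):
          m = int(lam * len(d_real))                   # 25 synthetic points per generation
          use_star = rng.random(m) < gamma / (1.0 + gamma)          # coin flips for the mixture
          synth = np.where(
              use_star[:, None],
              rng.multivariate_normal(mu_star, cov_star, size=m),   # "corrected" draws
              rng.multivariate_normal(mu, cov, size=m),             # draws from the current model
          )
          data = np.concatenate([data, synth])         # accrue synthetic data
          mu, cov = fit(data)                          # update theta_{t+1}
      return mu, cov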

6 Toy Example: MNIST

Our proof uses the optimal target PDF p_{θ*} to define the correction function π_γ. This is empirically validated by the Gaussian toy experiment, which assumes knowledge of the true target distribution. In practice, the correction function only depends on the ability to map synthesized data to data which is likelier to have been sampled from the ground truth distribution. Crucially, this can be achieved without having a complete description of the target distribution. For example, with our human motion experiments, we will demonstrate that point-wise correction based on the laws of physics is one proxy approach to make a sample more likely, without knowing the true target distribution.

One has the freedom to explore alternative approaches to data correction for more general data types, such as images. For example, one simple heuristic is to identify the “anchor” or “exemplar” images, which are intuitively representative and likely. The correction function can then be implemented as mapping or morphing synthesized data towards its nearest anchor, to make the synthesized data more representative and likely. In this section, we implement this approach on MNIST and study its performance.


Figure 3: Empirical results from our MNIST toy example. These synthesized images demonstrate that after 50 self-consuming iterations at 150% augmentation percentage, the model which is trained using iterative fine-tuning with self-correction is able to generate higher quality samples than the model trained using iterative fine-tuning without any self-correction.

For our MNIST (LeCun et al., 1998) experiments, we train a diffusion model (Ho et al., 2020) for class-conditional image generation, using a train split of size n = 12000. For our iterative fine-tuning experiments, we train the model for 20 epochs, then synthesize λ·12000/10 images for each digit, and augment the ground truth dataset with these to train on for the next generation; every following generation follows the same procedure, but trains for only a single epoch. We vary our experiments over augmentation percentages λ ∈ {0.2, 0.5, 1.0, 1.5}. To define our self-correction operation, we first compute K-means clusters over the training split for each digit. Our iterative fine-tuning with self-correction experiments use the same setup described above, except that instead of training on the synthesized images, we train on the synthesized-and-then-corrected images, where “correcting” an image means mapping it to the nearest of the K centroids for its digit that we computed at the start of training. We swept the values K ∈ {1, 2, 4, …, 1024}, and found that any reasonably large K results in the same general trend, where self-correction improves the metrics and stability. We report our results for K = 16, which performs the best.
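The correction operation just described can be sketched in a few lines; the following uses scikit-learn's KMeans and assumes 28×28 grayscale arrays. It is an illustrative sketch of the procedure in the text, not an excerpt from our released code.

  import numpy as np
  from sklearn.cluster import KMeans

  def fit_digit_centroids(train_images, train_labels, k=16):
      """Compute K-means centroids ("anchors") separately for each digit class."""
      centroids = {}
      for digit in range(10):
          flat = train_images[train_labels == digit].reshape(-1, 28 * 28)
          centroids[digit] = KMeans(n_clusters=k, n_init=10).fit(flat).cluster_centers_
      return centroids

  def correct_image(image, digit, centroids):
      """Self-correction: map a synthesized image to its nearest class centroid."""
      flat = image.reshape(-1)
      dists = np.linalg.norm(centroids[digit] - flat, axis=1)
      return centroids[digit][np.argmin(dists)].reshape(28, 28)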

We present images synthesized using our trained models in Figure 3. These synthesized images demonstrate that iterative fine-tuning eventually generates many low quality, illegible digits, and that this problem is solved by applying our self-correction operation. Further experiment details, including graphs of the FID metrics for each generation that provide rigorous evidence for this trend across augmentation percentages, can be found in Appendix D. Our empirical results demonstrate that applying self-correction improves performance during iterative fine-tuning for our MNIST image generation task across self-consuming generations, and that this relative improvement is amplified when the augmentation percentage is larger. The behavior that we observe is consistent with our theoretical results in Section 4, as well as with our human motion experiments in Section 7.

7 Human Motion Synthesis



Figure 4: Results from our human motion experiments on iterative fine-tuning with self-correction. These graphs show evaluation metrics for the last checkpoint of every generation. This is the checkpoint used for sampling in the iterative fine-tuning experiments, and it is also the checkpoint from which training is resumed with the new partially synthesized dataset. We can see that with self-correction, the iterative fine-tuning procedure converges more stably, and more quickly, to a better FID score. When the dataset size is smaller (n = 64, above), iterative fine-tuning with no self-correction has a flat Matching score as well as diverging FID and Diversity scores, indicating model collapse. When the dataset size is larger (n = 2794, below), there is less collapse for iterative fine-tuning with no self-correction, although the variance of the FID score is worse, as is the average FID across generations. In both cases, iterative fine-tuning with self-correction outperforms iterative fine-tuning with no self-correction, and is competitive with the baseline after many generations.

Theorem 4.3 states that, in theory, iterative fine-tuning with correction should be more stable than iterative fine-tuning without correction. Crucially, the stability estimates that we prove rely on the dataset size, the synthetic augmentation percentage, how expressive the generative model class is, and having an idealized correction function. To validate how our theory works beyond toy examples, we conduct a case study on human motion synthesis with diffusion models (Tevet et al., 2023). We believe this is a natural setting to test our iterative fine-tuning with correction framework, because synthesizing natural motions is a challenging problem, yet there is a natural and intuitive way to automatically correct motions at scale: using a physics simulator.

7.1 Generative Model

For our generative model, we use the Human Motion Diffusion Model (MDM) (Tevet et al., 2023). This is a classifier-free, diffusion-based generative model for the text-to-motion generation task: the model receives as input a description of a motion sequence (e.g. “get down on all fours and crawl across the floor”), and outputs a sequence of skeleton poses which attempts to embody that prompt. Synthesizing human motion is challenging not only because of the diverse and compositional text prompts, but also because of failures to obey physics (e.g. feet skating, floating, or penetrating a surface), which deep generative models do not explicitly prevent.

7.2 Physics Simulator as Self-Correction Function

For our self-correction function, we use Universal Humanoid Control (UHC) (Luo et al., 2021), an imitation policy that operates inside the MuJoCo physics simulator (Todorov et al., 2012). Given an input sequence of humanoid skeleton poses, UHC attempts to imitate the motion sequence while constrained by the laws of physics imposed by the simulator, and it outputs a new motion sequence that is the closest physically valid approximation it can produce. For example, if an input motion sequence violates the laws of physics by having a foot penetrate the floor, then the motion sequence output by UHC will attempt to remove that physically impossible artifact while maintaining the semantic integrity of the original input motion. We use VPoser (Pavlakos et al., 2019) and SMPL (Loper et al., 2015) to translate joint representations between the human motion generator and the physics simulator.

The physics simulator allows us to self-correct a synthesized motion automatically. Our underlying assumption is that by enforcing physical plausibility (via the simulator) and closeness to the synthesized motion (via the imitation objective), the self-correction function behaves as similarly to an idealized corrector as possible.
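Schematically, the self-correction step used in our loop has the shape sketched below. Here imitate_in_simulator stands in for a call into the UHC imitation policy running inside MuJoCo, and to_simulator_format / from_simulator_format stand in for the VPoser/SMPL joint-representation conversions; all three names are illustrative placeholders rather than functions from the UHC or MDM codebases.

  def self_correct_motion(motion, imitate_in_simulator,
                          to_simulator_format, from_simulator_format):
      """Self-correct one synthesized motion sequence.

      The simulator enforces physical plausibility, while the imitation objective
      keeps the corrected motion close to the input; together they approximate
      the idealized corrector pi_gamma."""
      sim_motion = to_simulator_format(motion)      # generator's joints -> simulator representation
      imitated = imitate_in_simulator(sim_motion)   # track the motion under physics constraints
      return from_simulator_format(imitated)        # convert back to the generator's representation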


Figure 5: How does the self-correction operation affect iterative fine-tuning, qualitatively? Here we present some visualizations. The prompt which describes the ground truth motion, and which we use to generate the three other motions, is: “a person stands with feet wide, stretches both hands up over his head and then swings down by the waist and hangs arms down before standing up”. We can see that the iterative fine-tuning model produces a motion where the human moves closer to the camera than the others; this is evidence of model collapse, as moving the feet is irrelevant to the prompt. Additionally, this motion produces single frames that suddenly snap to a physically impossible position; note the leg penetrating through the ground plane. These negative artifacts do not exist in the motions synthesized from the ground truth, the baseline model, or the iterative fine-tuning with self-correction model. Lastly, we note that the iterative fine-tuning motion depicted here is semantically similar to crawling. We observe in our experiments with smaller dataset sizes that the iterative fine-tuning model generates less diverse outputs than the baseline model and the iterative fine-tuning with self-correction model, and that this crawling pattern appears more often in the latter. Each snapshot is taken at exactly frame 105 of its respective video. The two motions on the right come from models that were iteratively fine-tuned for 50 generations, with a train set of size n = 64 and a synthetic augmentation percentage of 25%. For all pictures of the human, the camera is fixed at the same position, and for consistency the images are not resized.

7.3 Experimental Setup

We preprocess the MoVi (Ghorbani et al., 2021) subset of HumanML3D (Guo et al., 2022) using the official code implementation of HumanML3D. We filter out movements involving interactions with chairs, as UHC by default does not handle human-object interactions. We take as our train split the train split from HumanML3D intersected with our filtered subset of MoVi, and likewise for the test split. This procedure yields a train set of size n = 2794 and a test set of size 546. We further randomly select smaller training sets of n ∈ {64, 128, 256} examples, to simulate the more challenging scenario where the initial generative model is sub-optimal (due to data scarcity). The smaller datasets also enable us to explore larger synthetic augmentation percentages within our compute constraints. From here, the iterative re-training procedure follows Algorithm 1; we spell it out for this concrete experimental setup below.

We first train on the ground truth train split until the model is nearly converged, using all the default hyperparameters from MDM. We evaluate and save this last checkpoint from generation 0. From here, for each generation t ∈ {1, 2, …, 50}, we run three sets of experiments.

  A. Baseline: fine-tune the latest checkpoint from generation t−1 for m batches on the ground truth dataset 𝒟_real.

  B. Iterative fine-tuning: fine-tune the latest checkpoint from generation t−1 on 𝒟_real ∪ 𝒟_synth,t−1 for m batches. Here, 𝒟_synth,t−1 is a synthetic dataset of size ⌊λ·n⌋ generated from the checkpoint for generation t−1, using randomly chosen prompts from the train split.

  C. Iterative fine-tuning with self-correction: fine-tune the latest checkpoint from generation t−1 on 𝒟_real ∪ UHC(𝒟_synth,t−1) for m batches. Here, UHC(𝒟_synth,t−1) denotes a synthetic dataset of size ⌊λ·n⌋ generated from the latest checkpoint for generation t−1, using randomly chosen prompts from the train split, and then corrected by UHC.

We experiment with synthetic augmentation percentages λ ∈ {0.05, 0.10, 0.15, 0.20, 0.25} on the larger dataset; there we set the number of batches seen during generation 0 to 3125, and the number of batches seen in each later generation to m = 625. Separately, we experiment with synthetic augmentation percentages λ ∈ {0.25, 0.50, 0.75, 1.00} on the smaller datasets; there we set the number of batches seen during generation 0 to 78·k for dataset size 64·k, and the number of batches seen in each later generation t > 0 to m = 16. We choose to control how many data points the model sees in each generation, rather than controlling some other quantity such as the number of epochs, because this allows each experiment to compare against its baseline in a controlled way, which in turn allows the experiments to be compared against each other in a controlled way.

We compute each evaluation once per checkpoint using the evaluation script provided in the original MDM codebase. Regardless of the train split size, we perform sampling for evaluation using all 546 motion sequences from the test split, since the FID score is sensitive to the size of the generated dataset. We use the same hyperparameters as MDM, including batch size 64, AdamW (Loshchilov & Hutter, 2019) with learning rate 1e-4, and classifier-free guidance parameter 2.5. For UHC, we use the uhc_explicit model for imitation.

7.4 Quantitative Analysis of Results

For each of these experiments we report the metrics from MDM, as used by (Guo et al., 2022): FID measures how similar the distribution of generated motions is to the ground truth distribution; Diversity measures the variance of the generated motions; and Matching Score measures how well the generated motions embody the given text prompt. In Figure 4 we present results from experiments on our 64-example dataset with 100% synthetic augmentation, as well as on our 2794-example dataset with 25% synthetic augmentation.

Our experimental results confirm our theoretical findings that iterative fine-tuning with self-correction outperforms iterative fine-tuning without self-correction, in the sense that its curves are generally more stable across generations and approach better evaluation metric values. In particular, Theorem 4.3 and Corollary 4.5 claim that any amount of idealized self-correction will improve the stability bound during iterative fine-tuning. Our results in Figure 4 demonstrate that the FID score is lower and more stable across generations when applying self-correction than when not, though it remains generally higher and less stable than the baseline, which involves no self-consuming training at all. We conduct experiments across multiple seeds, and we find empirically that this general phenomenon holds consistently: the self-correction technique consistently yields improved training dynamics over iterative fine-tuning with no correction. Graphs from these runs can be found in Appendix G.

Our experimental results also provide empirical evidence for Conjecture 4.7. Observe that in the baseline experiments in Figure 4, the FID score decreases across generations, which indicates that the initial model parameters θ_0^n are not that close to the optimal model parameters θ*; additionally, the augmentation percentages considered in the graphs are 25% and 100%. Conjecture 4.7 claims that performing self-correction during iterative fine-tuning improves performance even when the initial model weights are sub-optimal and, simultaneously, the synthetic augmentation percentage is large. This claim is confirmed by Figure 4. We direct the curious reader to Appendix F, where we present graphs for all of the training set sizes and augmentation percentages listed above, providing additional empirical evidence for Theorem 4.3, Corollary 4.5, and Conjecture 4.7.

7.5 Qualitative Analysis of Results

We visually inspect the generated human motion sequences in order to analyze what concrete effect the self-correction has on iterative fine-tuning. We find that the correctness and diversity of synthesized motions are improved by the self-correction procedure, in agreement with our quantitative analysis in Subsection 7.4. We present snapshots of our synthesized motions in Figure 5, and we analyze the motions in the caption. In short, we find that physics-disobeying artifacts such as floor penetration or floating become more pronounced without the self-correction. We also find that in the model without self-correction, the humanoid sometimes performs movements completely unrelated to the prompt; our model with self-correction fixes these negative phenomena. We direct the curious reader to Appendix E, where we present more examples from our qualitative analysis, as well as our project webpage, where we provide side-by-side video comparisons.

8 Conclusion

Our paper investigates the learning of generative models when the training data includes machine-generated content. We investigate how self-correction functions, which automatically correct synthesized data points to be more likely under the true data distribution, can stabilize self-consuming generative model training. Our theoretical results show that self-correction leads to exponentially more stable model training and smaller variance, which we illustrate with a Gaussian toy example. We then demonstrate how physics simulators can serve as a self-correction function for the challenging human motion synthesis task, where models trained with our self-correcting self-consuming loops generate higher quality motions and manage to avoid collapse even at a high synthetic-data-to-real-data ratio. Future work includes exploring self-correction functions for more diverse applications, such as language modeling and text-to-image generation, and investigating when self-consuming training may lead to overall better generative models.

Acknowledgments

We would like to thank Stephen H. Bach, Quentin Bertrand, Carsten Eickhoff, Gauthier Gidel, Jeff Hoffstein, Zhengyi Luo, Singh Saluja, and Ye Yuan for useful discussions. We would also like to thank the anonymous reviewers. This work is supported by the Samsung Advanced Institute of Technology, Honda Research Institute, and a Richard B. Salomon Award for Chen Sun. Our research was conducted using computational resources at the Center for Computation and Visualization at Brown University.

Impact Statement

This paper presents work whose goal is to provide theoretical analysis and practical tools to address the data contamination issue caused by machine-generated content. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

References

  • Alemohammad et al. (2024) Alemohammad, S., Casco-Rodriguez, J., Luzi, L., Humayun, A. I., Babaei, H., LeJeune, D., Siahkoohi, A., and Baraniuk, R. Self-consuming generative models go MAD. In The Twelfth International Conference on Learning Representations, 2024.
  • Azizi et al. (2023) Azizi, S., Kornblith, S., Saharia, C., Norouzi, M., and Fleet, D. J. Synthetic data from diffusion models improves imagenet classification. Transactions on Machine Learning Research, 2023. ISSN 2835-8856.
  • Bertrand et al. (2024) Bertrand, Q., Bose, A. J., Duplessis, A., Jiralerspong, M., and Gidel, G. On the stability of iterative retraining of generative models on their own data. In The Twelfth International Conference on Learning Representations, 2024.
  • Ghorbani et al. (2021) Ghorbani, S., Mahdaviani, K., Thaler, A., Kording, K., Cook, D. J., Blohm, G., and Troje, N. F. Movi: A large multi-purpose human motion and video dataset. Plos one, 16(6):e0253157, 2021.
  • Guo et al. (2022) Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., and Cheng, L. Generating diverse and natural 3d human motions from text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  5152–5161, 6 2022.
  • He et al. (2023) He, R., Sun, S., Yu, X., Xue, C., Zhang, W., Torr, P., Bai, S., and Qi, X. Is synthetic data from generative models ready for image recognition? In ICLR, 2023.
  • Ho & Salimans (2021) Ho, J. and Salimans, T. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
  • Ho et al. (2020) Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • Jahanian et al. (2022) Jahanian, A., Puig, X., Tian, Y., and Isola, P. Generative models as a data source for multiview representation learning. In International Conference on Learning Representations, 2022.
  • Jakubovitz et al. (2019) Jakubovitz, D., Giryes, R., and Rodrigues, M. R. Generalization error in deep learning. In Compressed Sensing and Its Applications: Third International MATHEON Conference 2017, pp.  153–193. Springer, 2019.
  • Ji et al. (2021) Ji, K., Zhou, Y., and Liang, Y. Understanding estimation and generalization error of generative adversarial networks. IEEE Transactions on Information Theory, 67(5):3114–3129, 2021.
  • LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Loper et al. (2015) Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., and Black, M. J. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, October 2015.
  • Loshchilov & Hutter (2019) Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
  • Luo et al. (2021) Luo, Z., Hachiuma, R., Yuan, Y., and Kitani, K. Dynamics-regulated kinematic policy for egocentric pose estimation. In Advances in Neural Information Processing Systems, 2021.
  • Martínez et al. (2024) Martínez, G., Watson, L., Reviriego, P., Hernández, J. A., Juarez, M., and Sarkar, R. Towards understanding the interplay of generative artificial intelligence and the internet. In Cuzzolin, F. and Sultana, M. (eds.), Epistemic Uncertainty in Artificial Intelligence, pp.  59–73, Cham, 2024. Springer Nature Switzerland. ISBN 978-3-031-57963-9.
  • Pavlakos et al. (2019) Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A. A. A., Tzionas, D., and Black, M. J. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019.
  • Perdomo et al. (2020) Perdomo, J., Zrnic, T., Mendler-Dünner, C., and Hardt, M. Performative prediction. In International Conference on Machine Learning, pp.  7599–7609. PMLR, 2020.
  • Radford et al. (2021) Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.  8748–8763. PMLR, 2021.
  • Saharia et al. (2022) Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  • Saunders et al. (2022) Saunders, W., Yeh, C., Wu, J., Bills, S., Ouyang, L., Ward, J., and Leike, J. Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802, 2022.
  • Schuhmann et al. (2022) Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  • Shumailov et al. (2023) Shumailov, I., Shumaylov, Z., Zhao, Y., Gal, Y., Papernot, N., and Anderson, R. The curse of recursion: Training on generated data makes models forget. arXiv preprint arxiv:2305.17493, 2023.
  • Tevet et al. (2023) Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-or, D., and Bermano, A. H. Human motion diffusion model. In The Eleventh International Conference on Learning Representations, 2023.
  • Tian et al. (2023) Tian, Y., Fan, L., Chen, K., Katabi, D., Krishnan, D., and Isola, P. Learning vision from models rivals learning vision from data. arXiv preprint arXiv:2312.17742, 2023.
  • Todorov et al. (2012) Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp.  5026–5033, 2012. doi: 10.1109/IROS.2012.6386109.
  • Welleck et al. (2023) Welleck, S., Lu, X., West, P., Brahman, F., Shen, T., Khashabi, D., and Choi, Y. Generating sequences by learning to self-correct. In The Eleventh International Conference on Learning Representations, 2023.
  • Wu et al. (2023) Wu, T.-H., Lian, L., Gonzalez, J. E., Li, B., and Darrell, T. Self-correcting llm-controlled diffusion models. arXiv preprint arXiv:2311.16090, 2023.
  • Xu et al. (2023) Xu, S., Li, Z., Wang, Y.-X., and Gui, L.-Y. Interdiff: Generating 3d human-object interactions with physics-informed diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  14928–14940, 2023.
  • Yuan et al. (2023) Yuan, Y., Song, J., Iqbal, U., Vahdat, A., and Kautz, J. Physdiff: Physics-guided human motion diffusion model. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.

Appendix A Mathematical Theory: The Proof of Theorem 4.3

In this appendix, we provide a full account of the mathematical details of the theorems and proofs appearing in the main body of the paper. Our proof follows the same framework as (Bertrand et al., 2024), because our theoretical analysis generalizes theirs to the case where a self-correction function participates in the self-consuming loop.

A.1 Mathematical Setup and Notation

Definition A.1.

Define the optimal model parameters to be

\theta^{\star}\in\operatorname*{arg\,max}_{\theta^{\prime}\in\Theta}\mathbb{E}_{x\sim p_{\mathrm{data}}}[\log p_{\theta^{\prime}}(x)], (9)

chosen so that θ* has minimal norm within this set. Let θ be any model parameters. Then the correction of strength γ of the distribution p_θ towards p_{θ*} is a new distribution, denoted π_γ p_θ, defined according to the rule

\pi_{\gamma}p_{\theta}(x):=\frac{p_{\theta}(x)+\gamma p_{\theta^{\star}}(x)}{1+\gamma}.

This is illustrated in Figure 6. Let θt\theta_{t} be the parameters of the model trained after tt generations. We define the iterative fine-tuning with correction update mapping to be

\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta):=\operatorname*{local\,argmax}_{\theta^{\prime}\in\Theta}\mathcal{H}(\theta,\theta^{\prime}):=\operatorname*{local\,argmax}_{\theta^{\prime}\in\Theta}\left[\mathbb{E}_{x\sim p_{\mathrm{data}}}[\log p_{\theta^{\prime}}(x)]+\lambda\mathbb{E}_{x\sim\pi_{\gamma}p_{\theta}}[\log p_{\theta^{\prime}}(x)]\right] (10)
\pi_{\gamma}\mathcal{G}_{\lambda}^{n}(\theta):=\operatorname*{local\,argmax}_{\theta^{\prime}\in\Theta}\hat{\mathcal{H}}(\theta,\theta^{\prime}):=\operatorname*{local\,argmax}_{\theta^{\prime}\in\Theta}\left[\mathbb{E}_{x\sim\hat{p}_{\mathrm{data}}}[\log p_{\theta^{\prime}}(x)]+\lambda\mathbb{E}_{x\sim\widehat{\pi_{\gamma}p_{\theta}}}[\log p_{\theta^{\prime}}(x)]\right]. (11)

Notice that in the finite case, we optimize using samples from an empirical distribution. In contrast, in the infinite case there is zero statistical error, since the parameter update is done with access to an infinite sampling budget at each generation t. The finite case is the practically relevant one, where we incur statistical error because we only have access to finitely many samples at each generation. Since the parameter space of the generative model class might be limited, there might be a small difference between the distribution corresponding to the optimal parameters and the target distribution p_{\mathrm{data}}; we capture this difference via the Wasserstein-2 distance and denote

\varepsilon:=d_{W}(p_{\theta^{\star}},p_{\mathrm{data}}). (12)

Let

\mathcal{H}_{1}(\theta^{\prime}):=\mathbb{E}_{x\sim p_{\mathrm{data}}}[\log p_{\theta^{\prime}}(x)],\qquad\mathcal{H}_{2}(\theta,\theta^{\prime}):=\mathbb{E}_{x\sim\pi_{\gamma}p_{\theta}}[\log p_{\theta^{\prime}}(x)], (13)

and note that \mathcal{H}(\theta,\theta^{\prime})=\mathcal{H}_{1}(\theta^{\prime})+\lambda\mathcal{H}_{2}(\theta,\theta^{\prime}).
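To make this setup concrete, the following is a minimal numerical sketch (not part of the paper's codebase; all names and constants are illustrative assumptions) of the distribution-wise correction \pi_{\gamma}p_{\theta} and of one generation of the empirical update (11), in a toy setting where p_{\theta} is a one-dimensional Gaussian fit by maximum likelihood.

import numpy as np

rng = np.random.default_rng(0)

def sample_corrected(theta, theta_star, n, gamma, rng):
    """Draw n samples from pi_gamma p_theta: with probability 1/(1+gamma) sample from
    p_theta, otherwise from p_theta_star. Here p_theta = N(mu, sigma^2), theta = (mu, sigma)."""
    use_star = rng.random(n) < gamma / (1.0 + gamma)
    mu = np.where(use_star, theta_star[0], theta[0])
    sigma = np.where(use_star, theta_star[1], theta[1])
    return rng.normal(mu, sigma)

def fit_gaussian(x):
    """Maximum-likelihood Gaussian fit, standing in for the local argmax in (11)."""
    return np.array([x.mean(), x.std() + 1e-8])

def one_generation(theta, theta_star, real_data, lam, gamma, rng):
    """One step of iterative fine-tuning with correction: real data plus a
    lambda-fraction of corrected synthetic data (empirical version of (11))."""
    n = len(real_data)
    synth = sample_corrected(theta, theta_star, int(lam * n), gamma, rng)
    return fit_gaussian(np.concatenate([real_data, synth]))

theta_star = np.array([0.0, 1.0])                      # optimal parameters, as in (9)
real_data = rng.normal(theta_star[0], theta_star[1], 1000)
theta = np.array([2.0, 3.0])                           # a poorly initialized model
for t in range(30):
    theta = one_generation(theta, theta_star, real_data, lam=0.5, gamma=1.0, rng=rng)
print(theta)                                           # drifts toward theta_star = (0, 1)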

We first establish that the correction map is indeed a mapping of probability distributions, and record some of its elementary properties.

Lemma A.2.

The correction map has the following properties.

  1. 1.

    πγpθ\pi_{\gamma}p_{\theta} is a probability distribution.

  2. 2.

Strengths 0, 1, and \infty correspond to p_{\theta}, the average of p_{\theta} and p_{\theta^{\star}}, and p_{\theta^{\star}}, respectively.

  3. 3.

    For any xnx\in\mathbb{R}^{n}, if γ>1\gamma>1, then

    πγpθ(x)pθ(x)πγpθ(x)pθ(x),\|\pi_{\gamma}p_{\theta}(x)-p_{\theta^{\star}}(x)\|\leq\|\pi_{\gamma}p_{\theta}(x)-p_{\theta}(x)\|,

    and if γ<1\gamma<1, then the inequality is flipped. In other words, πγpθ\pi_{\gamma}p_{\theta} is a better estimate of the ideal distribution pθp_{\theta^{\star}} than pθp_{\theta} is, precisely when the projection strength is more than 11.

Proof.

For the first point, \pi_{\gamma}p_{\theta} is a probability distribution because it is a convex combination of probability distributions. Indeed, we can compute that

dπγpθ𝑑x\displaystyle\int_{\mathbb{R}^{d}}\pi_{\gamma}p_{\theta}dx =11+γdpθ(x)𝑑x+γ1+γdpθ(x)𝑑x=11+γ1+γ1+γ1=1.\displaystyle=\frac{1}{1+\gamma}\int_{\mathbb{R}^{d}}{p_{\theta}}(x)dx+\frac{\gamma}{1+\gamma}\int_{\mathbb{R}^{d}}{p_{\theta^{\star}}}(x)dx=\frac{1}{1+\gamma}\cdot 1+\frac{\gamma}{1+\gamma}\cdot 1=1.

The second point follows immediately from the definition of πγpθ\pi_{\gamma}p_{\theta}. For the third point, we can estimate that

πγpθ(x)pθ(x)\displaystyle\|\pi_{\gamma}p_{\theta}(x)-p_{\theta^{\star}}(x)\| =pθ(x)+γpθ(x)1+γpθ(x)(1+γ)1+γ\displaystyle=\left\|\frac{p_{\theta}(x)+\gamma p_{\theta^{\star}}(x)}{1+\gamma}-\frac{p_{\theta^{\star}}(x)(1+\gamma)}{1+\gamma}\right\|
=11+γpθ(x)pθ(x)\displaystyle=\frac{1}{1+\gamma}\cdot\|p_{\theta}(x)-p_{\theta^{\star}}(x)\|
γ1+γpθ(x)pθ(x)\displaystyle\leq\frac{\gamma}{1+\gamma}\cdot\|p_{\theta^{\star}}(x)-p_{\theta}(x)\|
=pθ(x)+γpθ(x)1+γpθ(x)(1+γ)1+γ\displaystyle=\left\|\frac{p_{\theta}(x)+\gamma p_{\theta^{\star}}(x)}{1+\gamma}-\frac{p_{\theta}(x)(1+\gamma)}{1+\gamma}\right\|
=πγpθ(x)pθ(x)\displaystyle=\|\pi_{\gamma}p_{\theta}(x)-p_{\theta}(x)\|

when γ>1\gamma>1. The inequality flips when γ<1\gamma<1. ∎
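As a quick numerical sanity check of the third property (illustrative only, not part of the formal argument), one can compare the two pointwise distances for a pair of Gaussian densities; the densities and grid below are assumptions made for this sketch.

import numpy as np
from scipy.stats import norm

x = np.linspace(-5, 5, 201)
p_theta = norm.pdf(x, loc=2.0, scale=1.5)   # current model density
p_star = norm.pdf(x, loc=0.0, scale=1.0)    # optimal model density

for gamma in [0.5, 1.0, 2.0]:
    p_corr = (p_theta + gamma * p_star) / (1.0 + gamma)   # pi_gamma p_theta
    to_star = np.abs(p_corr - p_star)
    to_theta = np.abs(p_corr - p_theta)
    # For gamma > 1 the corrected density is pointwise closer to p_star than to p_theta;
    # for gamma < 1 the inequality is reversed, and gamma = 1 gives equality.
    print(gamma, np.all(to_star <= to_theta), np.all(to_star >= to_theta))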

Intuitively, it is clear that we cannot hope to prove general results about generative models without assuming something about the mapping \theta\mapsto p_{\theta}. We now state the two assumptions we require for our theoretical arguments; note that they are precisely the same assumptions made in (Bertrand et al., 2024). The first assumption is a local Lipschitzness property that we will exploit via Kantorovich-Rubinstein duality:

Assumption A.3.

For θ\theta close enough to θ\theta^{\star}, the mapping xθθlogpθ(x)x\mapsto\nabla_{\theta}\nabla_{\theta}\log p_{\theta}(x) is LL-Lipschitz.

The second assumption is a local regularity and concavity condition:

Assumption A.4.

The mapping θ𝔼xpdata[logpθ(x)]\theta\mapsto\mathbb{E}_{x\sim p_{\mathrm{data}}}[\log p_{\theta}(x)] is continuously twice differentiable locally around θ\theta^{\star} and 𝔼xpdata[θθlogpθ(x)]θαId0.\mathbb{E}_{x\sim p_{\text{data}}}\left[\nabla_{\theta}\nabla_{\theta}\log p_{\theta}(x)\right]_{\theta^{\star}}\preceq-\alpha I_{d}\prec 0.

We next show the existence and uniqueness of \pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta) locally around \theta^{\star}.

Proposition A.5 (The Local Maximum Likelihood Solution is Unique).

The following are true:

  1. A.

    There exists an open neighborhood UdU\subset\mathbb{R}^{d} containing θ\theta^{\star} and a continuous function g:Udg:U\to\mathbb{R}^{d} such that g(θ)=θg(\theta^{\star})=\theta^{\star}, and

    θ(θ,θ)|θ,g(θ)=0\nabla_{\theta^{\prime}}\mathcal{H}(\theta,\theta^{\prime})|_{\theta,g(\theta)}=0 (14)

    for every θU\theta\in U.

  2. B.

    Given optimal model parameters θ\theta^{\star} as in (9) that follow Assumptions A.3 and A.4, we have that, if εL<α\varepsilon L<\alpha, then for all λ>0\lambda>0 and θ\theta in a small enough neighborhood UU around θ\theta^{\star}, there exists a unique local maximizer πγ𝒢λ(θ)\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta) in UU.

Proof.

We first prove part A. It suffices to apply the Implicit Function Theorem to the map

2dd:(θ,θ)θ(θ,θ)|θ,θ\displaystyle\mathbb{R}^{2d}\to\mathbb{R}^{d}:(\theta,\theta^{\prime})\mapsto\nabla_{\theta^{\prime}}\mathcal{H}(\theta,\theta^{\prime})|_{\theta,\theta^{\prime}} (15)

in an open neighborhood of (θ,θ)(\theta^{\star},\theta^{\star}). To do this, we need to show the following:

  1. i)

    The map vanishes at (θ,θ)(\theta^{\star},\theta^{\star}), i.e.

    θ(θ,θ)|θ,θ=0.\nabla_{\theta^{\prime}}\mathcal{H}(\theta,\theta^{\prime})|_{\theta^{\star},\theta^{\star}}=0. (16)
  2. ii)

    The Jacobian matrix at (θ,θ)(\theta^{\star},\theta^{\star}) is invertible, i.e.,

    θθ(θ,θ)|θ,θis invertible.\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\mathcal{H}(\theta,\theta^{\prime})|_{\theta^{\star},\theta^{\star}}\qquad\text{is invertible.} (17)

We first prove i). Recall from the definition (10) that πγ𝒢λ(θ)=argmaxθΘ(θ,θ)\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta)=\operatorname*{arg\,max}_{\theta^{\prime}\in\Theta}\mathcal{H}(\theta,\theta^{\prime}). This means that for any θ\theta, πγ𝒢λ(θ)\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta) is the choice of θ\theta^{\prime} which maximizes (θ,θ)\mathcal{H}(\theta,\theta^{\prime}). In particular, for θ=θ\theta=\theta^{\star}, we have that θ=πγ𝒢λ(θ)\theta^{\prime}=\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta^{\star}) is the choice which maximizes (θ,θ)\mathcal{H}(\theta^{\star},\theta^{\prime}). But πγ𝒢λ(θ)=θ\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta^{\star})=\theta^{\star} by Proposition A.6. This implies that its derivative is zero at θ=θ\theta^{\prime}=\theta^{\star}, meaning θ(θ,θ)|θ,θ=0\nabla_{\theta^{\prime}}\mathcal{H}(\theta,\theta^{\prime})|_{\theta^{\star},\theta^{\star}}=0, as needed.

Now we prove ii). In order to show that the matrix (17) is invertible, it suffices to show it is close to another matrix which is invertible. A natural choice is the matrix

M=(1+λ)θθ𝔼xpdata[logpθ(x)]|θ.M=(1+\lambda)\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\mathbb{E}_{x\sim p_{\text{data}}}[\log p_{\theta^{\prime}}(x)]|_{\theta^{\star}}. (18)

First of all, note that this matrix indeed exists; by Assumption A.4, the map \theta^{\prime}\mapsto\mathbb{E}_{x\sim p_{\text{data}}}[\log p_{\theta^{\prime}}(x)] is continuously twice differentiable locally near \theta^{\star}. We can estimate that the matrices (17) and (18) are indeed close as follows:

\|\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\mathcal{H}(\theta,\theta^{\prime})|_{\theta^{\star},\theta^{\star}}-(1+\lambda)\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\mathbb{E}_{x\sim p_{\mathrm{data}}}[\log p_{\theta^{\prime}}(x)]|_{\theta^{\star}}\|
=\|\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}[\mathbb{E}_{x\sim p_{\mathrm{data}}}\log p_{\theta^{\prime}}(x)+\lambda\mathbb{E}_{x\sim\pi_{\gamma}p_{\theta}}\log p_{\theta^{\prime}}(x)]|_{\theta^{\star},\theta^{\star}}-(1+\lambda)\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\mathbb{E}_{x\sim p_{\mathrm{data}}}[\log p_{\theta^{\prime}}(x)]|_{\theta^{\star}}\|
=\lambda\|[\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\mathbb{E}_{x\sim\pi_{\gamma}p_{\theta}}\log p_{\theta^{\prime}}(x)]|_{\theta^{\star},\theta^{\star}}-\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\mathbb{E}_{x\sim p_{\mathrm{data}}}[\log p_{\theta^{\prime}}(x)]|_{\theta^{\star}}\|
=\lambda\|[\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\mathbb{E}_{x\sim p_{\theta^{\star}}}\log p_{\theta^{\prime}}(x)]|_{\theta^{\star}}-[\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\mathbb{E}_{x\sim p_{\mathrm{data}}}\log p_{\theta^{\prime}}(x)]|_{\theta^{\star}}\|
=\lambda\|[\mathbb{E}_{x\sim p_{\theta^{\star}}}\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\log p_{\theta^{\prime}}(x)]|_{\theta^{\star}}-[\mathbb{E}_{x\sim p_{\mathrm{data}}}\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\log p_{\theta^{\prime}}(x)]|_{\theta^{\star}}\|
\leq\lambda L\,d_{W}(p_{\theta^{\star}},p_{\mathrm{data}})
=\lambda\varepsilon L

where the first equality follows from the definition (10) of \mathcal{H}; the second follows after cancellation; the third uses that the first expectation does not depend on \theta, together with \pi_{\gamma}p_{\theta^{\star}}=p_{\theta^{\star}} by Lemma A.2; the fourth exchanges the derivative and the expectation via the Dominated Convergence Theorem; the fifth estimate follows from Kantorovich-Rubinstein duality, using that x\mapsto\nabla_{\theta}\nabla_{\theta}\log p_{\theta}(x) is L-Lipschitz (Assumption A.3); and the final equality is the definition (12) of the Wasserstein distance \varepsilon.

Finally, we verify that M is indeed invertible. Assumption A.4 implies that the largest eigenvalue of M is at most -(1+\lambda)\alpha. Therefore, since all eigenvalues of M are nonzero, M is invertible. We can now apply the Implicit Function Theorem to (15), and part A follows immediately.

Next, we prove part B. Let d_{U}=\sup_{\theta\in U}d_{W}(p_{\theta^{\star}},p_{\theta}). To verify that g(\theta) is a local maximizer of \theta^{\prime}\mapsto\mathcal{H}(\theta,\theta^{\prime}), it suffices to show that \nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\mathcal{H}(\theta,g(\theta))\prec 0. By Assumption A.4, we know \nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\mathcal{H}_{1}(\theta^{\star})\preceq-\alpha I_{d}, and since \mathcal{H}_{1} is continuously twice differentiable locally near \theta^{\star}, we also have \nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\mathcal{H}_{1}(g(\theta))\preceq-\alpha I_{d} for U small enough. Thus, we have

θθ(θ,g(θ))\displaystyle\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\mathcal{H}(\theta,g(\theta)) =θθ1(g(θ))+λθθ2(θ,g(θ))\displaystyle=\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\mathcal{H}_{1}(g(\theta^{\prime}))+\lambda\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\mathcal{H}_{2}(\theta,g(\theta))
=(1+λ)θθ1(g(θ))+λ(θθ2(θ,g(θ))θθ1(g(θ)))\displaystyle=(1+\lambda)\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\mathcal{H}_{1}(g(\theta))+\lambda(\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\mathcal{H}_{2}(\theta,g(\theta))-\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\mathcal{H}_{1}(g(\theta)))
α(1+λ)Id+λL(11+γdW(pθ,pθ)+ε)Id,\displaystyle\preceq-\alpha(1+\lambda)I_{d}+\lambda L\left(\frac{1}{1+\gamma}d_{W}(p_{\theta},p_{\theta^{\star}})+\varepsilon\right)I_{d},

where the last step follows from Kantorovich-Rubinstein duality:

θθ\displaystyle\|\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}} 2(θ,θ)θθ1(θ)\displaystyle\mathcal{H}_{2}(\theta,\theta^{\prime})-\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\mathcal{H}_{1}(\theta^{\prime})\|
θθ2(θ,θ)θθ2(θ,θ)+θθ2(θ,θ)θθ1(θ)\displaystyle\leq\|\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\mathcal{H}_{2}(\theta,\theta^{\prime})-\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\mathcal{H}_{2}(\theta^{\star},\theta^{\prime})\|+\|\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\mathcal{H}_{2}(\theta^{\star},\theta^{\prime})-\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\mathcal{H}_{1}(\theta^{\prime})\|
=dθθlogpθ(x)pθ(x)+γpθ(x)1+γ𝑑xdθθlogpθ(x)pθ(x)𝑑x\displaystyle=\|\int_{\mathbb{R}^{d}}\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\log p_{\theta^{\prime}}(x)\frac{p_{\theta}(x)+\gamma p_{\theta^{\star}}(x)}{1+\gamma}\,dx-\int_{\mathbb{R}^{d}}\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\log p_{\theta^{\prime}}(x)p_{\theta^{\star}}(x)\,dx\|
+𝔼xpdata[logpθ(x)]𝔼xpθ[logpθ(x)]\displaystyle\;\;\;+\|\mathbb{E}_{x\sim p_{\text{data}}}[\log p_{\theta^{\prime}}(x)]-\mathbb{E}_{x\sim p_{\theta^{\star}}}[\log p_{\theta^{\prime}}(x)]\|
11+γdθθlogpθ(x)(pθ(x)pθ(x))𝑑x+Lε\displaystyle\leq\frac{1}{1+\gamma}\|\int_{\mathbb{R}^{d}}\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\log p_{\theta^{\prime}}(x)\left(p_{\theta}(x)-p_{\theta^{\star}}(x)\right)\,dx\|+L\varepsilon
=11+γ𝔼xpθ[logpθ(x)]𝔼xpθ[logpθ(x)]+Lε\displaystyle=\frac{1}{1+\gamma}\|\mathbb{E}_{x\sim p_{\theta}}[\log p_{\theta^{\prime}}(x)]-\mathbb{E}_{x\sim p_{\theta^{\star}}}[\log p_{\theta^{\prime}}(x)]\|+L\varepsilon
L1+γdW(pθ,pθ)+Lε\displaystyle\leq\frac{L}{1+\gamma}d_{W}(p_{\theta},p_{\theta^{\star}})+L\varepsilon
L1+γdU+Lε\displaystyle\leq\frac{L}{1+\gamma}d_{U}+L\varepsilon

Thus, to have θθ(θ,g(θ))0\nabla_{\theta^{\prime}}\nabla_{\theta^{\prime}}\mathcal{H}(\theta,g(\theta))\prec 0, it is sufficient that

α(1+λ)+λL(11+γdU+ε)<0,\displaystyle-\alpha(1+\lambda)+\lambda L\left(\frac{1}{1+\gamma}d_{U}+\varepsilon\right)<0,

which is guaranteed for all \lambda>0 by \alpha>L\varepsilon and d_{U}\leq\frac{\alpha(1+\gamma)}{\lambda L}. This concludes the proof. ∎

Further, as we would expect, θ\theta^{\star} is a fixed point of πγ𝒢λ\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}:

Proposition A.6 (The optimal parametric generative model is a fixed point).

For any given data distribution pdatap_{\mathrm{data}}, any θ\theta^{\star} as defined by (9), and for all λ>0\lambda>0, we have πγ𝒢λ(θ)=θ\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta^{\star})=\theta^{\star}.

Proof.

Unpacking definition (10) shows that πγ𝒢λ(θ)=𝒢λ(θ)\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta^{\star})=\mathcal{G}_{\lambda}^{\infty}(\theta^{\star}), and we know by Proposition 4 from (Bertrand et al., 2024) that 𝒢λ(θ)=θ\mathcal{G}_{\lambda}^{\infty}(\theta^{\star})=\theta^{\star}. ∎

A.2 Convergence of Iterative Fine-tuning with Correction for Infinite Sampling

We now have the required setup to state and prove a convergence result for iterative fine-tuning, assuming infinite access to the underlying probability distributions. We need the following technical lemma, which computes the Jacobian of \pi_{\gamma}\mathcal{G}_{\lambda}^{\infty} at \theta^{\star} and provides a spectral bound; both are essential for the proof of Theorem A.8.

Lemma A.7.

We define the matrices

A\displaystyle A :=(θ,θ21(θ))|θ\displaystyle:=(\nabla_{\theta^{\prime},\theta^{\prime}}^{2}\mathcal{H}_{1}(\theta^{\prime}))|_{\theta^{\star}} (19)
B\displaystyle B :=θ,θ2𝔼xpθ[logpθ(x)]|θ,θ\displaystyle:=\nabla_{\theta,\theta^{\prime}}^{2}\mathbb{E}_{x\sim p_{\theta}}[\log p_{\theta^{\prime}}(x)]\big{|}_{\theta^{\star},\theta^{\star}} (20)
C\displaystyle C :=θ,θ2𝔼xpθ[logpθ(x)]|θ,θ\displaystyle:=\nabla_{\theta^{\prime},\theta^{\prime}}^{2}\mathbb{E}_{x\sim p_{\theta}}[\log p_{\theta^{\prime}}(x)]\big{|}_{\theta^{*},\theta^{*}} (21)

Recall the definition of πγ𝒢λ(θ)\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta) from (10). Since γ\gamma and λ\lambda are fixed, denote π𝒢(θ)=πγ𝒢λ(θ).\pi\mathcal{G}(\theta)=\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta). Finally, let 𝒥(π𝒢(θ)):=θπγ𝒢λ(θ)|θ\mathcal{J}(\pi\mathcal{G}(\theta)):=\nabla_{\theta}\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta)|_{\theta} denote the Jacobian of πγ𝒢λ(θ)\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta).

  1. I.

    There exists an open neighborhood UΘU\subseteq\Theta containing θ\theta^{\star} such that for all θU\theta\in U, we have

    𝒥(π𝒢(θ))=\displaystyle\mathcal{J}(\mathcal{\pi G}(\theta))= (θ,θ2(θ,π𝒢(θ)))1λθ,θ22(θ,π𝒢(θ)).\displaystyle\,\,-\left(\nabla^{2}_{\theta^{\prime},\theta^{\prime}}\mathcal{H}(\theta,\pi\mathcal{G}(\theta))\right)^{-1}\cdot\lambda\nabla^{2}_{\theta,\theta^{\prime}}\mathcal{H}_{2}(\theta,\pi\mathcal{G}(\theta)). (22)
  2. II.

    We have that θ,θ22(θ,θ)=B1+γ\nabla^{2}_{\theta,\theta^{\prime}}\mathcal{H}_{2}(\theta^{\star},\theta^{\star})=\frac{B}{1+\gamma}, and B=CB=-C, so the Jacobian of π𝒢\pi\mathcal{G} at θ\theta^{\star} is

    𝒥(π𝒢(θ))=(I+λA1C)1λ1+γA1C\mathcal{J}(\pi\mathcal{G}(\theta^{\star}))=(I+\lambda A^{-1}C)^{-1}\cdot\frac{\lambda}{1+\gamma}A^{-1}C (23)
  3. III.

    The spectral norm of A1CA^{-1}C can be bounded as

    A1C1+Lεα.\|A^{-1}C\|\leq 1+\frac{L\varepsilon}{\alpha}. (24)
Proof.

We first prove I. We apply Proposition A.5. Part A of that proposition gives us a function g:U\to\mathbb{R}^{d} such that \nabla_{\theta^{\prime}}\mathcal{H}(\theta,\theta^{\prime})|_{\theta,g(\theta)}=0. Part B of that proposition says that there exists a unique local maximizer inside U, and this local maximizer is \pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta). This implies that \nabla_{\theta^{\prime}}\mathcal{H}(\theta,\theta^{\prime})|_{\theta,\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta)}=0. Next, we implicitly differentiate this equation with respect to \theta. Recall that if an equation f(x,y)=0 is differentiated implicitly along y=g(x), i.e. f(x,g(x))=0, one obtains \frac{\partial f}{\partial x}+\frac{\partial f}{\partial y}\frac{\partial g}{\partial x}=0, and solving for \frac{\partial g}{\partial x} yields \frac{\partial g}{\partial x}=-\left(\frac{\partial f}{\partial y}\right)^{-1}\frac{\partial f}{\partial x}. We apply this formula with

(x,f,g)=\left(\theta,\;(\theta,\theta^{\prime})\mapsto\nabla_{\theta^{\prime}}\mathcal{H}(\theta,\theta^{\prime}),\;\theta\mapsto\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta)\right)

and obtain (22), as desired.

Now we prove II. We can compute that

θ,θ22(θ,θ)\displaystyle\nabla^{2}_{\theta^{\prime},\theta}\mathcal{H}_{2}(\theta,\theta^{\prime}) =θθ𝔼xπγpθ[logpθ(x)]\displaystyle=\nabla_{\theta^{\prime}}\nabla_{\theta}\mathbb{E}_{x\sim\pi_{\gamma}p_{\theta}}[\log p_{\theta^{\prime}}(x)] (25)
=θθxdlogpθ(x)(pθ(x)+γpθ(x)1+γ)𝑑x\displaystyle=\nabla_{\theta^{\prime}}\nabla_{\theta}\int_{x\in\mathbb{R}^{d}}\log p_{\theta^{\prime}}(x)\left(\frac{p_{\theta}(x)+\gamma p_{\theta^{\star}}(x)}{1+\gamma}\right)dx (26)
=11+γθθxdlogpθ(x)pθ(x)𝑑x\displaystyle=\frac{1}{1+\gamma}\nabla_{\theta^{\prime}}\nabla_{\theta}\int_{x\in\mathbb{R}^{d}}\log p_{\theta^{\prime}}(x)p_{\theta}(x)dx (27)
=11+γθ,θ2𝔼xpθ[logpθ(x)]\displaystyle=\frac{1}{1+\gamma}\nabla^{2}_{\theta^{\prime},\theta}\mathbb{E}_{x\sim p_{\theta}}[\log p_{\theta^{\prime}}(x)] (28)
=11+γB\displaystyle=\frac{1}{1+\gamma}B (29)

where the third equality holds because the integral containing pθp_{\theta^{\star}} is constant with respect to θ\theta. Next, we can compute that

B\displaystyle B =Xθlogpθ(x)θpθ(x)𝑑x|θ,θ\displaystyle=\int_{X}\nabla_{\theta^{\prime}}\log p_{\theta^{\prime}}(x)\nabla_{\theta}p_{\theta}(x)dx\Big{|}_{\theta^{*},\theta^{*}} (30)
=X[θlogpθ(x)][θpθ(x)]𝑑x|θ,θ\displaystyle=\int_{X}[\nabla_{\theta}\log p_{\theta}(x)][\nabla_{\theta}p_{\theta}(x)]dx\Big{|}_{\theta^{*},\theta^{*}} (31)
=Xθ[pθ(x)θlogpθ(x)]dx|θ,θXpθ(x)(θθlogpθ(x))𝑑x|θ,θ\displaystyle=\int_{X}\nabla_{\theta}[p_{\theta}(x)\nabla_{\theta}\log p_{\theta}(x)]dx\Big{|}_{\theta^{*},\theta^{*}}-\int_{X}p_{\theta}(x)(\nabla_{\theta}\nabla_{\theta}\log p_{\theta}(x))dx\Big{|}_{\theta^{*},\theta^{*}} (32)
=Xθ[pθ(x)θpθ(x)pθ(x)]dx|θ,θθ,θ2𝔼xpθ[logpθ(x)]|θ,θ\displaystyle=\int_{X}\nabla_{\theta}\left[p_{\theta}(x)\frac{\nabla_{\theta}p_{\theta}(x)}{p_{\theta}(x)}\right]dx\Big{|}_{\theta^{*},\theta^{*}}-\nabla_{\theta^{\prime},\theta^{\prime}}^{2}\mathbb{E}_{x\sim p_{\theta}}[\log p_{\theta^{\prime}}(x)]\Big{|}_{\theta^{*},\theta^{*}} (33)
=C,\displaystyle=-C, (34)

where the third equality follows from the product rule for gradients,

θ[pθ(x)θlogpθ(x)]\displaystyle\nabla_{\theta}[p_{\theta}(x)\nabla_{\theta}\log p_{\theta}(x)] =pθ(x)(θθlogpθ(x))+[θpθ(x)][θlogpθ(x)].\displaystyle=p_{\theta}(x)(\nabla_{\theta}\nabla_{\theta}\log p_{\theta}(x))+[\nabla_{\theta}p_{\theta}(x)][\nabla_{\theta}\log p_{\theta}(x)]. (35)

Note that the first integral in (33) vanishes, since it equals \nabla_{\theta}\nabla_{\theta}\int_{X}p_{\theta}(x)\,dx\big{|}_{\theta^{\star}}=\nabla_{\theta}\nabla_{\theta}(1)=0; this gives B=-C. Finally, we prove formula (23) by manipulating (22). Applying the equalities we just obtained at \theta=\theta^{\star}, we get

𝒥(π𝒢(θ))\displaystyle\mathcal{J}(\pi\mathcal{G}(\theta^{\star})) =(θ,θ2(θ,θ))1λθ,θ22(θ,θ)\displaystyle=-\left(\nabla^{2}_{\theta^{\prime},\theta^{\prime}}\mathcal{H}(\theta^{\star},\theta^{\star})\right)^{-1}\cdot\lambda\nabla^{2}_{\theta^{\prime},\theta}\mathcal{H}_{2}(\theta^{\star},\theta^{\star})
=(A+λC)1λ1+γB\displaystyle=-(A+\lambda C)^{-1}\cdot\frac{\lambda}{1+\gamma}B
=(I+λA1C)1λ1+γA1B\displaystyle=-(I+\lambda A^{-1}C)^{-1}\cdot\frac{\lambda}{1+\gamma}A^{-1}B
=(I+λA1C)1λ1+γA1C\displaystyle=(I+\lambda A^{-1}C)^{-1}\cdot\frac{\lambda}{1+\gamma}A^{-1}C

where the first equality follows from (22) along with the fixed point Proposition A.6, and we are using that A is invertible by Assumption A.4, which implies all eigenvalues of A are nonzero; in the fourth step we used that B=-C. This proves part II.

Now we prove III. We can bound the operator norm A1C\|A^{-1}C\| as follows:

A1C=I+A1(CA)I+A1CA1+α1CA,\displaystyle\|A^{-1}C\|=\|I+A^{-1}(C-A)\|\leq\|I\|+\|A^{-1}\|\cdot\|C-A\|\leq 1+\alpha^{-1}\|C-A\|, (36)

where the first estimate comes from subadditivity and submultiplicativity of the norm, and the second comes from the fact that, since A is symmetric, \|A\|=\max_{\mu\in\sigma(A)}|\mu|, where \sigma(A) is the spectrum of A. Concretely, by Assumption A.4, A has eigenvalues e_{1}<e_{2}<\dots<e_{n}\leq-\alpha<0, so |e_{n}|\geq\alpha. Therefore A^{-1} has eigenvalues 1/e_{n}<1/e_{n-1}<\dots<1/e_{1}<0, and we obtain the bound \|A^{-1}\|=1/|e_{n}|\leq 1/\alpha on the matrix norm. Next, we can estimate that

CA\displaystyle||C-A|| =θ,θ2𝔼xpθ[logpθ(x)]|θθ,θ2𝔼xpdata[logpθ(x)]|θ\displaystyle=\|\nabla^{2}_{\theta^{\prime},\theta^{\prime}}\mathbb{E}_{x\sim p_{\theta^{\star}}}[\log p_{\theta^{\prime}}(x)]|_{\theta^{\star}}-\nabla^{2}_{\theta^{\prime},\theta^{\prime}}\mathbb{E}_{x\sim p_{\text{data}}}[\log p_{\theta^{\prime}}(x)]|_{\theta^{\star}}\|
=𝔼xpθ[θ,θ2logpθ(x)]𝔼xpdata[θ,θ2logpθ(x)]\displaystyle=\|\mathbb{E}_{x\sim p_{\theta^{\star}}}[\nabla^{2}_{\theta^{\prime},\theta^{\prime}}\log p_{\theta^{\star}}(x)]-\mathbb{E}_{x\sim p_{\text{data}}}[\nabla^{2}_{\theta^{\prime},\theta^{\prime}}\log p_{\theta^{\star}}(x)]\|
LdW(pθ,pdata)\displaystyle\leq Ld_{W}(p_{\theta^{\star}},p_{\text{data}})
=Lε,\displaystyle=L\varepsilon,

where in the second equality we exchange the derivative and the expectation using the Dominated Convergence Theorem, since Assumption A.3 says that x\mapsto\nabla_{\theta}\nabla_{\theta}\log p_{\theta}(x) is L-Lipschitz; and in the last estimate, we used Kantorovich-Rubinstein duality together with the definition (12). This, combined with the estimate (36), yields the bound in (24). ∎

We are finally ready to prove our theorem guaranteeing convergence to the optimal parameters in the infinite sampling case, under certain assumptions, one being that the initial model parameters \theta_{0} are sufficiently close to \theta^{\star}:

Theorem A.8 (Convergence of Iterative Fine-tuning, Infinite Sampling Case).

Suppose we have an iterative fine-tuning procedure defined by the rule \theta_{t+1}^{\infty}=\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta_{t}^{\infty}). Let \theta^{\star} be the parameter vector for the optimal generative model, as in (9), and suppose \theta^{\star} satisfies Assumptions A.3 and A.4 (the same assumptions as in (Bertrand et al., 2024)). Suppose also that \lambda\left(1+\frac{\varepsilon L}{\alpha}\right)<\frac{1+\gamma}{2+\gamma}. Then the Jacobian of \pi_{\gamma}\mathcal{G}_{\lambda}^{\infty} satisfies the following bound:

θπγ𝒢λ(θ)2\displaystyle\|\nabla_{\theta}\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta^{\star})\|_{2} 11+γλ(α+εL)αλ(α+εL)<1.\displaystyle\leq\frac{1}{1+\gamma}\cdot\frac{\lambda(\alpha+\varepsilon L)}{\alpha-\lambda(\alpha+\varepsilon L)}<1. (37)

Consequently, there exists a \delta>0 such that if \theta_{0}\in\Theta satisfies \|\theta_{0}-\theta^{\star}\|\leq\delta, then starting training at \theta_{0} and setting \theta_{t+1}=\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta_{t}), we have \theta_{t}\to\theta^{\star} as t\to\infty. Furthermore, if we define

ρ(λ)=λ(α+εL)αλ(α+εL),\displaystyle\rho(\lambda)=\frac{\lambda(\alpha+\varepsilon L)}{\alpha-\lambda(\alpha+\varepsilon L)}, (38)

then we obtain the following asymptotic stability estimate (we note that (Bertrand et al., 2024) could have presented their results in this stronger form, without the big O notation, with very little extra work):

θtθ(ρ(λ)1+γ)tθ0θ.\displaystyle\|\theta_{t}-\theta^{\star}\|\leq\left(\frac{\rho(\lambda)}{1+\gamma}\right)^{t}\|\theta_{0}-\theta^{\star}\|. (39)
Proof.

We first prove the Jacobian bound (37). By hypothesis, we know λ(1+Lεα)<1\lambda(1+\frac{L\varepsilon}{\alpha})<1, so by Lemma A.7(III), we have λA1C<1\lambda||A^{-1}C||<1. Thus, we can write

(I+λA1C)1\displaystyle(I+\lambda A^{-1}C)^{-1} =k=0(λA1C)k\displaystyle=\sum_{k=0}^{\infty}(-\lambda A^{-1}C)^{k}

and so

(I+λA1C)1\displaystyle\|(I+\lambda A^{-1}C)^{-1}\| k=0λkA1Ck=11λA1C.\displaystyle\leq\sum_{k=0}^{\infty}\lambda^{k}||A^{-1}C||^{k}=\frac{1}{1-\lambda||A^{-1}C||}.

Applying Lemma A.7(II), we get

𝒥(G(θ))\displaystyle||\mathcal{J}(G(\theta^{\star}))|| (I+λA1C)1λ1+γA1Cλ1+γA1C1λA1C.\displaystyle\leq||(I+\lambda A^{-1}C)^{-1}||\cdot\frac{\lambda}{1+\gamma}||A^{-1}C||\leq\frac{\lambda}{1+\gamma}\cdot\frac{||A^{-1}C||}{1-\lambda||A^{-1}C||}.

Now, the RHS above is at most the bound in (37) provided \alpha\|A^{-1}C\|\leq\alpha+\varepsilon L, which holds by Lemma A.7(III). This proves the Jacobian bound (37), but does not yet show that the bound is less than 1. For this, we must show that

11+γλ(α+εL)αλ(α+εL)<1.\displaystyle\frac{1}{1+\gamma}\cdot\frac{\lambda(\alpha+\varepsilon L)}{\alpha-\lambda(\alpha+\varepsilon L)}<1. (40)

By clearing denominators and grouping like terms, we can see that this is equivalent to

λ(1+εLα)<1+γ2+γ,\displaystyle\lambda\left(1+\frac{\varepsilon L}{\alpha}\right)<\frac{1+\gamma}{2+\gamma}, (41)

which is precisely guaranteed by our hypothesis.
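For completeness, here is the algebra behind this equivalence; note that the hypothesis guarantees \alpha-\lambda(\alpha+\varepsilon L)>0, so clearing the denominator preserves the direction of the inequality:

\frac{1}{1+\gamma}\cdot\frac{\lambda(\alpha+\varepsilon L)}{\alpha-\lambda(\alpha+\varepsilon L)}<1
\iff\lambda(\alpha+\varepsilon L)<(1+\gamma)\bigl(\alpha-\lambda(\alpha+\varepsilon L)\bigr)
\iff(2+\gamma)\,\lambda(\alpha+\varepsilon L)<(1+\gamma)\,\alpha
\iff\lambda\left(1+\frac{\varepsilon L}{\alpha}\right)<\frac{1+\gamma}{2+\gamma}.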

We now apply the Jacobian bound (37) to prove the asymptotic stability estimate (39). Assume \lambda is small enough that \rho(\lambda)/(1+\gamma)<1. Then for every \rho^{\prime}\in(\rho(\lambda)/(1+\gamma),1), there exists \delta>0 small enough that every \theta_{0}\in\Theta with \|\theta_{0}-\theta^{\star}\|<\delta satisfies \|\nabla_{\theta}\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta_{0})\|_{2}<\rho^{\prime}. Because the map \pi_{\gamma}\mathcal{G}_{\lambda}^{\infty} has Jacobian norm less than 1 on the \delta-ball around \theta^{\star}, it is a contraction mapping on this neighborhood. Concretely, this means that

πγ𝒢λ(θ)πγ𝒢λ(θ)ρ(λ)1+γθθ,\|\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta)-\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta^{\prime})\|\leq\frac{\rho(\lambda)}{1+\gamma}\|\theta-\theta^{\prime}\|, (42)

for every θ,θ\theta,\theta^{\prime} in the δ\delta-ball around θ\theta^{\star}. In particular, for (θ,θ)=(θt,θ)(\theta,\theta^{\prime})=(\theta_{t},\theta^{\star}) we obtain

\|\theta_{t+1}-\theta^{\star}\|=\|\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta_{t})-\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta^{\star})\|\leq\frac{\rho(\lambda)}{1+\gamma}\cdot\|\theta_{t}-\theta^{\star}\|.

By induction, the above estimate implies that if θ0\theta_{0} is in a δ\delta-ball around θ\theta^{\star}, then so is every successive θt\theta_{t}. Therefore the desired estimate (39) now follows by induction on tt. ∎

Remark A.9.

Taking γ=0\gamma=0 recovers exactly the result in (Bertrand et al., 2024). Importantly, the correction function πγ\pi_{\gamma} provides leverage in determining how large the augmentation percentage λ\lambda can be: choosing a larger correction strength γ\gamma allows us to choose a larger augmentation percentage λ\lambda while still retaining theoretical guarantees for convergence. Additionally, for the same choice of augmentation percentage λ\lambda, a larger correction strength γ\gamma provides a guarantee for an improved rate of convergence. See Conjecture 4.7.
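To illustrate this interplay numerically, the following sketch (with illustrative constants \alpha, L, \varepsilon that are assumptions for this example, not values from the paper) evaluates the largest admissible augmentation percentage and the resulting contraction factor \rho(\lambda)/(1+\gamma) for several correction strengths.

import numpy as np

# Illustrative constants (assumptions for this sketch only).
alpha, L, eps = 1.0, 1.0, 0.1

def rho(lam):
    """rho(lambda) from (38)."""
    return lam * (alpha + eps * L) / (alpha - lam * (alpha + eps * L))

def max_lambda(gamma):
    """Largest augmentation percentage allowed by the hypothesis of Theorem A.8,
    i.e. lambda (1 + eps L / alpha) < (1 + gamma) / (2 + gamma)."""
    return (1 + gamma) / ((2 + gamma) * (1 + eps * L / alpha))

for gamma in [0.0, 1.0, 4.0, 10.0]:
    lam = 0.9 * max_lambda(gamma)            # a lambda satisfying the hypothesis
    factor = rho(lam) / (1 + gamma)          # per-generation contraction factor in (39)
    print(f"gamma={gamma:5.1f}  max lambda={max_lambda(gamma):.3f}  "
          f"contraction factor at 0.9*max lambda={factor:.3f}")

Larger \gamma both enlarges the admissible range of \lambda and shrinks the contraction factor, consistent with the remark above.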

A.3 Stability of Iterative Fine-tuning with Correction for Finite Sampling

Finally, we prove a stability result for iterative fine-tuning with correction in the presence of statistical error. To do this, we require an assumption that essentially provides a probabilistic guarantee that the chosen generative model learns the underlying distribution increasingly well as it has access to more samples:

Assumption A.10.

There exist a,b,εOPT0a,b,\varepsilon_{\text{OPT}}\geq 0 and a neighborhood UU of θ\theta^{\star} such that, for any δ(0,1)\delta\in(0,1), with probability 1δ1-\delta over the samplings, we have

(θU)(n)πγ𝒢λn(θ)πγ𝒢λ(θ)εOPT+anlogbδ.(\forall\theta\in U)(\forall n\in\mathbb{N})\qquad\|\pi_{\gamma}\mathcal{G}_{\lambda}^{n}(\theta)-\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta)\|\leq\varepsilon_{\text{OPT}}+\frac{a}{\sqrt{n}}\sqrt{\log\frac{b}{\delta}}. (43)

See Appendix B for a discussion of this assumption; we considered whether to assume a bound similar to the one assumed in (Bertrand et al., 2024), or to derive our bound from theirs. In Appendix B we show that something nearly as strong as Assumption A.10 can in fact be deduced from Assumption 3 in their paper, so we adopt Assumption A.10 for the sake of a cleaner, more parallel exposition.

Theorem A.11 (Iterative Fine-Tuning Stability Under Correction).

Suppose we have an iterative fine-tuning procedure defined by the rule \theta_{t+1}^{n}=\pi_{\gamma}\mathcal{G}_{\lambda}^{n}(\theta_{t}^{n}), with augmentation percentage \lambda\in(0,\infty) and correction strength \gamma\in[0,\infty). Under the same assumptions as Theorem A.8, together with Assumption A.10, there exist 0<\rho<1 and \delta_{1}>0 such that if \|\theta_{0}^{n}-\theta^{\star}\|\leq\delta_{1}, then for any \delta_{2}\in(0,1), with probability 1-\delta_{2}, we have

\|\theta_{t}^{n}-\theta^{\star}\|\leq\left(\varepsilon_{\text{OPT}}+\frac{a}{\sqrt{n}}\sqrt{\log\frac{bt}{\delta_{2}}}\right)\sum_{i=0}^{t}\left(\frac{\rho(\lambda)}{1+\gamma}\right)^{i}+\left(\frac{\rho(\lambda)}{1+\gamma}\right)^{t}\|\theta_{0}^{n}-\theta^{\star}\|. (44)
Proof.

By the triangle inequality, we can estimate that

θtnθ\displaystyle\|\theta_{t}^{n}-\theta^{\star}\| θtnπγ𝒢λ(θt1n))+πγ𝒢λ(θt1n)θ\displaystyle\leq\|\theta_{t}^{n}-\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta_{t-1}^{n}))\|+\|\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta_{t-1}^{n})-\theta^{\star}\|
=πγ𝒢λn(θt1n)πγ𝒢λ(θt1n)+πγ𝒢λ(θt1n)πγ𝒢λ(θ),\displaystyle=\|\pi_{\gamma}\mathcal{G}_{\lambda}^{n}(\theta_{t-1}^{n})-\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta_{t-1}^{n})\|+\|\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta_{t-1}^{n})-\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta^{\star})\|, (45)

where we applied the fixed point Proposition A.6. By Assumption A.10, the left summand in (45) is at most \varepsilon_{\text{OPT}}+\frac{a}{\sqrt{n}}\sqrt{\log\frac{b}{\delta}}, with probability 1-\delta. Next, recall that in (42) in the proof of Theorem A.8, we proved that \pi_{\gamma}\mathcal{G}_{\lambda}^{\infty} is a contraction mapping with factor \rho(\lambda)/(1+\gamma) on a sufficiently small neighborhood of \theta^{\star}; this implies that the right summand in (45) is at most \frac{\rho(\lambda)}{1+\gamma}\|\theta_{t-1}^{n}-\theta^{\star}\|. Together, these yield the recurrence estimate

(θtnθεOPT+anlogbδ+ρ(λ)1+γθt1nθ)1δ.\displaystyle\mathbb{P}\left(\|\theta_{t}^{n}-\theta^{\star}\|\leq\varepsilon_{\text{OPT}}+\frac{a}{\sqrt{n}}\sqrt{\log\frac{b}{\delta}}+\frac{\rho(\lambda)}{1+\gamma}\|\theta_{t-1}^{n}-\theta^{\star}\|\right)\geq 1-\delta. (46)

Iterating this recurrence for successive time steps yields

(θtnθ(εOPT+anlogbδ)i=0t(ρ(λ)1+γ)i+(ρ(λ)1+γ)tθ0nθ)(1δ)t.\displaystyle\mathbb{P}\left(\|\theta_{t}^{n}-\theta^{\star}\|\leq\left(\varepsilon_{\text{OPT}}+\frac{a}{\sqrt{n}}\sqrt{\log\frac{b}{\delta}}\right)\sum_{i=0}^{t}\left(\frac{\rho(\lambda)}{1+\gamma}\right)^{i}+\left(\frac{\rho(\lambda)}{1+\gamma}\right)^{t}\|\theta_{0}^{n}-\theta^{\star}\|\right)\geq(1-\delta)^{t}. (47)

Note that (47) holds for any \delta\in(0,1). In particular, we can apply (47) with \delta replaced by \delta/t, in which case the Bernoulli inequality gives (1-\delta/t)^{t}\geq 1-\delta. This completes the proof, with \delta_{2}=\delta. ∎

Remark A.12.

Theorem A.11 recovers the result from (Bertrand et al., 2024) in the case where the correction strength is γ=0\gamma=0. But for a fixed augmentation percentage λ\lambda, for any correction strength γ>0\gamma>0, this gives stronger stability guarantees than in (Bertrand et al., 2024).
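To see the effect of \gamma on the bound (44) numerically, the following sketch evaluates the right-hand side of (44) over generations; all constants (including the assumed value of \rho(\lambda)) are illustrative assumptions, not quantities estimated from any experiment.

import numpy as np

# Illustrative constants (assumptions for this sketch only).
eps_opt, a, b, n = 0.01, 1.0, 1.0, 10_000
delta2, init_err = 0.05, 1.0

def contraction(gamma, rho_lam=0.8):
    """rho(lambda)/(1+gamma), with rho(lambda)=0.8 assumed for illustration."""
    return rho_lam / (1.0 + gamma)

def bound(t, gamma):
    """Right-hand side of the stability estimate (44) at generation t."""
    stat = eps_opt + (a / np.sqrt(n)) * np.sqrt(np.log(b * t / delta2))
    q = contraction(gamma)
    return stat * (1.0 - q ** (t + 1)) / (1.0 - q) + q ** t * init_err

for gamma in [0.0, 1.0, 4.0]:
    vals = ", ".join(f"t={t}: {bound(t, gamma):.3f}" for t in (1, 5, 20, 50))
    print(f"gamma={gamma}: {vals}")

The bound decays geometrically toward an error floor governed by the statistical term, and a larger correction strength \gamma lowers that floor, in line with the remark above.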

Remark A.13.

In a previous version of this manuscript, we claimed that there was an error in the statement of the corresponding theorem in (Bertrand et al., 2024). In this version, we retract that claim; we have corresponded with those authors, and they updated their manuscript with additional details to justify their statement.

A.4 Discussion: The Main Limitation

Our empirical results are for generative modeling tasks where we have access to a "self-correction" operation that is inexpensive to compute and fully automatic; see Sections 6 and 7 for more details about these correction functions. The main limitation of our work is therefore that this procedure can only stabilize training in scenarios where such a correction function is available. For our MNIST experiments, we built a self-correction function from scratch using clustering statistics, and for our human motion experiments, we used an off-the-shelf human motion imitation model developed by other researchers.

Appendix B Discussion about Assumption 4.2

In this section, we show that, under a mild boundedness assumption on our generative model parameter update function, our Assumption A.10 (which is the same as Assumption 4.2, part 3) can be deduced from the following assumption used in (Bertrand et al., 2024).

Assumption B.1.

There exist a,b,εOPT0a,b,\varepsilon_{\text{OPT}}\geq 0 and a neighborhood UU of θ\theta^{\star} such that, for any δ(0,1)\delta\in(0,1), with probability 1δ1-\delta over the samplings, we have

(θU)(n)𝒢λn(θ)𝒢λ(θ)εOPT+anlogbδ.(\forall\theta\in U)(\forall n\in\mathbb{N})\qquad\|\mathcal{G}_{\lambda}^{n}(\theta)-\mathcal{G}_{\lambda}^{\infty}(\theta)\|\leq\varepsilon_{\text{OPT}}+\frac{a}{\sqrt{n}}\sqrt{\log\frac{b}{\delta}}. (48)

Now, if we make the additional assumption that our generative model parameter update function is locally bounded near θ\theta^{\star} then we obtain the following.

Proposition B.2.

Suppose Assumption B.1 holds. Suppose also that there exists B<B<\infty such that for all n>0n>0 and θ\theta sufficiently close to θ\theta^{\star},

𝒢λn(θ)𝒢λn(θ)<Bθθ.\displaystyle\|\mathcal{G}_{\lambda}^{n}(\theta)-\mathcal{G}_{\lambda}^{n}(\theta^{\star})\|<B\|\theta-\theta^{\star}\|.

Then there exist a,b,c,εOPT0a,b,c,\varepsilon_{\text{OPT}}\geq 0 and a neighborhood UU of θ\theta^{\star} such that, for any δ(0,1)\delta\in(0,1), with probability 1δ1-\delta over the samplings, we have

(θU)(n)πγ𝒢λn(θ)πγ𝒢λ(θ)cdU+εOPT+anlogbδ,(\forall\theta\in U)(\forall n\in\mathbb{N})\qquad\|\pi_{\gamma}\mathcal{G}_{\lambda}^{n}(\theta)-\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta)\|\leq c\cdot d_{U}+\varepsilon_{\text{OPT}}+\frac{a}{\sqrt{n}}\sqrt{\log\frac{b}{\delta}}, (49)

where dU=supθUθθ.d_{U}=\sup_{\theta\in U}\|\theta-\theta^{\star}\|.

Proof.

By the triangle inequality, we have

πγ𝒢λn(θ)πγ𝒢λ(θ)\displaystyle\|\pi_{\gamma}\mathcal{G}_{\lambda}^{n}(\theta)-\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta)\| πγ𝒢λn(θ)𝒢λn(θ)+𝒢λn(θ)𝒢λ(θ)+𝒢λ(θ)πγ𝒢λ(θ).\displaystyle\leq\|\pi_{\gamma}\mathcal{G}_{\lambda}^{n}(\theta)-\mathcal{G}_{\lambda}^{n}(\theta)\|+\|\mathcal{G}_{\lambda}^{n}(\theta)-\mathcal{G}_{\lambda}^{\infty}(\theta)\|+\|\mathcal{G}_{\lambda}^{\infty}(\theta)-\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta)\|. (50)

We bound each term on the RHS. The middle term is bounded by Assumption B.1. The first term is bounded as follows:

𝒢λn(θ)πγ𝒢λn(θ)\displaystyle\|\mathcal{G}_{\lambda}^{n}(\theta)-\pi_{\gamma}\mathcal{G}_{\lambda}^{n}(\theta)\| 𝒢λn(θ)𝒢λn(θ)+πγ𝒢λn(θ)πγ𝒢λn(θ)\displaystyle\leq\|\mathcal{G}_{\lambda}^{n}(\theta)-\mathcal{G}_{\lambda}^{n}(\theta^{\star})\|+\|\pi_{\gamma}\mathcal{G}_{\lambda}^{n}(\theta^{\star})-\pi_{\gamma}\mathcal{G}_{\lambda}^{n}(\theta)\|
Bθθ+Bθθ\displaystyle\leq B\|\theta-\theta^{\star}\|+B\|\theta-\theta^{\star}\|
2BdU,\displaystyle\leq 2Bd_{U},

where in the first step we used that 𝒢λ(θ)=πγ𝒢λ(θ)\mathcal{G}_{\lambda}^{\infty}(\theta^{\star})=\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta^{\star}). Similarly, the last term is bounded as follows:

𝒢λ(θ)πγ𝒢λ(θ)\displaystyle\|\mathcal{G}_{\lambda}^{\infty}(\theta)-\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta)\| 𝒢λ(θ)𝒢λ(θ)+πγ𝒢λ(θ)πγ𝒢λ(θ)\displaystyle\leq\|\mathcal{G}_{\lambda}^{\infty}(\theta)-\mathcal{G}_{\lambda}^{\infty}(\theta^{\star})\|+\|\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta^{\star})-\pi_{\gamma}\mathcal{G}_{\lambda}^{\infty}(\theta)\|
ρ(λ)θθ+ρ(λ)1+γθθ\displaystyle\leq\rho(\lambda)\|\theta-\theta^{\star}\|+\frac{\rho(\lambda)}{1+\gamma}\|\theta-\theta^{\star}\|
=ρ(λ)2+γ1+γθθ\displaystyle=\rho(\lambda)\frac{2+\gamma}{1+\gamma}\|\theta-\theta^{\star}\|
ρ(λ)2+γ1+γdU,\displaystyle\leq\rho(\lambda)\frac{2+\gamma}{1+\gamma}d_{U},

where in the second step we applied (42). Using these bounds in (50) and taking c=2B+ρ(λ)2+γ1+γc=2B+\rho(\lambda)\frac{2+\gamma}{1+\gamma} completes the proof. ∎

Note that the constant c\cdot d_{U} (which is less than c for U sufficiently small) can be viewed as part of the optimization constant \varepsilon_{\text{OPT}}, since it is controlled by the choice of generative model class.

Appendix C Point-wise correction corresponds to distribution-wise correction

In this section we provide a sufficient condition under which one can associate a distribution-wise correction mapping (like the one we consider in the paper, \pi_{\gamma}) to a point-wise correction mapping (which is the form one is more likely to encounter in practice).

Definition C.1.

Let X={x1,,xn}mX=\{x_{1},\dots,x_{n}\}\subset\mathbb{R}^{m} and define the empirical cumulative distribution function ΦX\Phi_{X} by

ΦX(v):=ΦX(v;{x1,,xn}):=1ni=1nχv(xi),\displaystyle\Phi_{X}(v):=\Phi_{X}(v;\{x_{1},\dots,x_{n}\}):=\frac{1}{n}\sum_{i=1}^{n}\chi_{v}(x_{i}),

where for v\in\mathbb{R}^{m}, \chi_{v}:\mathbb{R}^{m}\to\{0,1\} is the indicator function of the set \prod_{i=1}^{m}(-\infty,v_{i}]. For a continuous distribution, the cumulative distribution function is defined in the usual way.

Definition C.2.

Suppose that we have a model pθp_{\theta} and an arbitrary function Π:mm\Pi:\mathbb{R}^{m}\to\mathbb{R}^{m}. Then we say that Π\Pi is a valid point-wise correction function for pθp_{\theta} if there exists a γ[0,]\gamma\in[0,\infty] such that

\lim_{n\to\infty}\left(\mathbb{E}_{X^{n}\sim p_{\theta}^{n}}\sup_{v\in\mathbb{R}^{m}}\|\Phi_{\Pi(X^{n})}(v)-\Phi_{\pi_{\gamma}p_{\theta}}(v)\|\right)=0, (51)

almost surely, where the expectation is over all samplings Xn={x1,,xn}X^{n}=\{x_{1},\dots,x_{n}\} of size nn from pθp_{\theta}.

Intuition C.3.

This is saying that the CDFs for πγpθ\pi_{\gamma}p_{\theta} and Π(Xpθn)\Pi(X\sim p_{\theta}^{n}) are equal in expectation, for large enough nn. This is one way of saying that πγpθ\pi_{\gamma}p_{\theta} and Π(Xpθn)\Pi(X\sim p_{\theta}^{n}), for large enough nn, are nearly identical probability distributions.

Definition C.4.

If the limit in (51) exists, then we define the distribution-wise projection function corresponding to Π\Pi to be

πγpθ=11+γpθ+γ1+γpθ,\pi_{\gamma}p_{\theta}=\frac{1}{1+\gamma}p_{\theta}+\frac{\gamma}{1+\gamma}p_{\theta^{\star}}, (52)

and we define the projection strength of the point-wise correction function \Pi to be \gamma. Intuitively, (51) says that \Pi maps samples from p_{\theta} so that they look like samples from the mixture of p_{\theta} and p_{\theta^{\star}} given by \pi_{\gamma}p_{\theta}, at least at the level of CDFs.

Remark C.5.

Such a γ\gamma, if it exists, is unique. Furthermore, if pθ=pθp_{\theta}=p_{\theta^{\star}}, then γ=\gamma=\infty.

The limit condition in Definition C.2 is abstract, so we present a concrete instance: a simple point-wise correction for the Gaussian toy example considered in Section 5, whose corresponding distribution-wise correction is exactly what one would expect, namely the weighted average of the corresponding Gaussians. Recall that we demonstrated empirically in Figure 2 that Theorem 4.3 holds for that example. The projection function is depicted in Figure 6.

Example C.6.

Let G1(x)G_{1}(x) be the pdf of 𝒩(0,σ12Id)\mathcal{N}(0,\sigma_{1}^{2}I_{d}) (initial distribution, corresponds to θ\theta) and G2(x)G_{2}(x) the pdf of 𝒩(0,σ22Id)\mathcal{N}(0,\sigma_{2}^{2}I_{d}) (target distribution, corresponds to θ\theta^{\star}). Given x1,,xnG1x_{1},\dots,x_{n}\sim G_{1}, we define Πγ\Pi^{\gamma} as follows: Fix any γ0\gamma\in\mathbb{R}_{\geq 0}, and let y1,,yn(G^1(n)(x)+γG2(x))/(1+γ)y_{1},\dots,y_{n}\sim(\hat{G}_{1}^{(n)}(x)+\gamma G_{2}(x))/(1+\gamma), where G^1(n)\hat{G}_{1}^{(n)} is the PDF of the empirical distribution defined by {x1,,xn}\{x_{1},\dots,x_{n}\}; in practice we implement G^1(n)\hat{G}_{1}^{(n)} as a histogram. Then choose a random σSn\sigma\in S_{n} (SnS_{n} = group of permutations on nn symbols). Finally, we define Πγ(xi):=yσ(i)\Pi^{\gamma}(x_{i}):=y_{\sigma(i)} for 1in1\leq i\leq n.

Next, we define the projected set \Pi X^{(n)}:=\{\Pi^{\gamma}(x_{i})\}_{1\leq i\leq n}, define the PDF \pi_{\gamma}\hat{G}_{1}^{(n)}(x):=\frac{1}{1+\gamma}\hat{G}_{1}^{(n)}(x)+\frac{\gamma}{1+\gamma}G_{2}(x), and let \Phi_{\pi_{\gamma}\hat{G}_{1}^{(n)}} denote the cumulative distribution function of \pi_{\gamma}\hat{G}_{1}^{(n)}. Then, since \Pi^{\gamma}(x_{i})\sim\pi_{\gamma}\hat{G}_{1}^{(n)}, we have by the uniform law of large numbers that

\lim_{n\to\infty}\left(\mathbb{E}_{\{x_{i}\sim G_{1}\}_{i=1}^{n}}\sup_{v\in\mathbb{R}^{m}}\left\|\Phi_{\Pi X^{(n)}}(v)-\Phi_{\pi_{\gamma}G_{1}}(v)\right\|\right)=0 (53)

almost surely. Therefore Πγ\Pi^{\gamma} is a valid point-wise correction function, and its corresponding distribution-wise projection function is πγ\pi_{\gamma}.
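A minimal simulation of this construction is sketched below; the histogram-free resampling implementation of \hat{G}_{1}^{(n)} (resampling with replacement from \{x_{i}\}), the constants, and all names are assumptions made for illustration.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma1, sigma2, gamma, n = 2.0, 1.0, 1.5, 5000

# Samples x_1, ..., x_n from G_1 = N(0, sigma1^2).
x = rng.normal(0.0, sigma1, n)

# Draw y_1, ..., y_n from (G_1_hat + gamma * G_2) / (1 + gamma): with probability
# 1/(1+gamma) resample from the empirical distribution of {x_i}, otherwise from G_2.
from_empirical = rng.random(n) < 1.0 / (1.0 + gamma)
y = np.where(from_empirical, rng.choice(x, size=n), rng.normal(0.0, sigma2, n))

# Pi^gamma(x_i) := y_{sigma(i)} for a random permutation sigma.
corrected = y[rng.permutation(n)]

# Compare the empirical CDF of the corrected samples with the CDF of pi_gamma G_1.
grid = np.linspace(-8, 8, 161)
ecdf = (corrected[None, :] <= grid[:, None]).mean(axis=1)
target = (norm.cdf(grid, 0, sigma1) + gamma * norm.cdf(grid, 0, sigma2)) / (1 + gamma)
print(np.max(np.abs(ecdf - target)))   # small for large n, consistent with (53)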

Remark C.7.

In the example considered in Section 5, we could also have imposed a condition that the total distance traveled be minimized, but the argument above does not need that hypothesis. (In the proof, this would correspond to the additional assumption that \sigma\in S_{n} is chosen so that \sum_{i=1}^{n}\|x_{i}-y_{\sigma(i)}\| is minimized.) This shows that different point-wise correction functions can correspond to the same distribution-wise correction function.


Refer to caption

Figure 6: Illustration of the distribution-wise projection function, like in our Gaussian toy example. Correcting one Gaussian in the direction of another, like we consider in Section 5, corresponds to finding the “(weighted) average Gaussian” that lives between the two.

Appendix D More MNIST Experiment Details

We train a Denoising Diffusion Probabilistic Model (DDPM) (Ho et al., 2020) on 20% of the MNIST dataset (LeCun et al., 1998). We use classifier-free guidance (Ho & Salimans, 2021) with guidance parameter 0.5 and 400 diffusion steps, with a batch size of 256. We train generation 0 for 20 epochs, with a linearly decaying learning rate schedule starting at 1e-4 and ending at (1e-4)/20. We train each subsequent generation for a single epoch, with a fixed learning rate of (1e-4)/20^2.
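As a hypothetical sketch of this learning rate schedule in PyTorch (the denoiser and the training loop are placeholders, not the actual DDPM used in our experiments):

import torch

model = torch.nn.Linear(784, 784)                 # placeholder for the DDPM denoiser
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Generation 0: linear decay from 1e-4 down to (1e-4)/20 over 20 epochs.
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lr_lambda=lambda epoch: 1.0 - (epoch / 20.0) * (1.0 - 1.0 / 20.0))
for epoch in range(20):
    # ... one epoch of DDPM training would go here ...
    sched.step()

# Generations 1, 2, ...: a single epoch each at the fixed rate (1e-4)/20^2.
for group in opt.param_groups:
    group["lr"] = 1e-4 / 20**2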

To compute our metrics, we first train a LeNet model (LeCun et al., 1998) on MNIST, and then we sample an equal number of digits from each class using the checkpoint we are evaluating. To compute the FID score, we extract embeddings from the last fully connected LeNet layer for the synthesized examples as well as for the held-out test examples, and compute the FID score as usual, as the Wasserstein-2 distance between the Gaussians fitted to the two sets of embeddings. Note that we use embeddings from a LeNet trained on MNIST, rather than the Inception network trained on ImageNet, because MNIST does not consist of natural images. This is consistent with the convention in (Alemohammad et al., 2024).
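A minimal sketch of this FID computation from two sets of embeddings is given below; the embedding extraction itself is omitted, the random arrays are stand-ins for LeNet embeddings, and the function name and dimensions are illustrative assumptions.

import numpy as np
from scipy import linalg

def fid_from_embeddings(real_emb, fake_emb):
    """Frechet distance between Gaussians fitted to two embedding sets (rows = samples)."""
    mu_r, mu_f = real_emb.mean(axis=0), fake_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_f = np.cov(fake_emb, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):          # small imaginary parts can appear numerically
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

# Example with random stand-ins (84 is the width of the classic LeNet-5 penultimate layer).
rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 84))
fake = rng.normal(loc=0.1, size=(2000, 84))
print(fid_from_embeddings(real, fake))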

For the self-correction operation, we compute K-means clusters (K=16 for each digit class, cf. Figure 7) once at the start of training, and we "correct" a synthesized image by mapping it to the nearest cluster centroid corresponding to its digit; a sketch of this correction follows below. In Figure 7 we present the clusters, and in Figure 8 we present graphs of our FID scores across augmentation percentages.
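The sketch below is illustrative only (in our experiments the clustering is fit once on the real training images; the function names and the random stand-in data are assumptions):

import numpy as np
from sklearn.cluster import KMeans

def fit_centroids(images, labels, k=16):
    """Fit K-means per digit class on flattened images; returns dict label -> (k, 784) centroids."""
    return {d: KMeans(n_clusters=k, n_init=10, random_state=0)
                  .fit(images[labels == d].reshape(-1, 784)).cluster_centers_
            for d in range(10)}

def correct(sample, digit, centroids):
    """Map a synthesized image to the nearest centroid of its digit class."""
    c = centroids[digit]
    idx = np.argmin(((c - sample.reshape(1, -1)) ** 2).sum(axis=1))
    return c[idx].reshape(28, 28)

# Usage sketch with random stand-ins for MNIST images and a synthesized digit.
rng = np.random.default_rng(0)
images = rng.random((1000, 28, 28))
labels = rng.integers(0, 10, 1000)
centroids = fit_centroids(images, labels)
fixed = correct(rng.random((28, 28)), digit=3, centroids=centroids)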


Refer to caption

Figure 7: For every digit, we perform K-means clustering with K=16. We show here the cluster centroids, which intuitively are anchor images within the manifold of all possible images.

Refer to caption Refer to caption
Refer to caption Refer to caption
Figure 8: Results from MNIST experiments with iterative fine-tuning with and without self-correction. These graphs show the FID score on the last checkpoint for every generation; this is the checkpoint used for sampling in the self-consuming loop experiments, and it is also the checkpoint where training is resumed with this new partially synthesized dataset. These results demonstrate that iterative fine-tuning with self-correction generally outperforms iterative fine-tuning.

Appendix E Additional Human Motion Generation Qualitative Results

In Figures 9,  10, and  11, we present additional qualitative observations and analysis of our synthesized motions. We present more evidence that iterative fine-tuning with self-correction yields physically plausible motions comparable to the baseline, whereas iterative fine-tuning without self-correction yields motions that are incorrect for various reasons. See the captions of the referenced figures for analysis of some characteristic failure modes of the iterative fine-tuning loop without self-correction.

A technical note: for all figures, we render the motions from the same environment and camera position. We consolidate each render into the same image without resizing it; this means that if a figure appears larger relative to the others, the human moved closer to the camera. Some motions have transparent frames of past positions; the more transparent the image, the farther back in the past it was in the motion sequence. Finally, in each figure, the text prompt for all generated motions was the same: the prompt associated with the ground truth motion in the HumanML3D (Guo et al., 2022) training data, which we also visualize. Note that the coloring of the humanoid figures corresponds to the coloring in the graphs.

Refer to caption

Figure 9: Here we see the negative floating phenomenon exacerbated by iterative fine-tuning, whereas iterative fine-tuning with self-correction generates a motion with floor contact integrity comparable to the ground truth and baseline. The float metric is formally defined in (Yuan et al., 2023) as the distance between the lowest vertex on the human mesh and the floor plane. All three sequences were generated using the same prompt: person got down and is crawling across the floor. Each snapshot was taken at exactly frame 87. The green figure appears larger than the other two only because it is closer to the camera. The two motions on the right were synthesized after 50 generations of training with 25% synthetic augmentation, trained on n=64 data points.

Refer to caption

Figure 10: All four of the above motions correspond to the prompt: a person raises right hand to face looks around and puts hand down back to side. The model trained with iterative fine-tuning outputs spurious motion that slides the figure to the right, and in the video for this example, the human rotates their forearm unnaturally and forcefully. In contrast, the motions from the baseline and from iterative fine-tuning with self-correction both accurately embody the prompt. Each generated snapshot is taken at exactly frame 142, while the ground truth image is frame 70 of its sequence. The two motions on the right were synthesized after 42 generations with 10% synthetic augmentation, where the ground truth dataset has size n=2794.

Refer to caption

Figure 11: Here we observe that iterative fine-tuning fails to produce any meaningful motion sequence, while the iterative fine-tuning with self-correction and baseline models generate results consistent with the prompt: walks side ways but back and forth. Each snapshot for the generated motions was taken at exactly frame 120, while the ground truth image is a snapshot from frame 69. These images were synthesized after 50 generations of the model that was trained on n=64 data points at 25% synthetic augmentation.

Appendix F Additional Human Motion Generation Quantitative Results

See Figures 12, 13, and 14 for results when the dataset size is n\in\{64,128,256\} and the synthetic augmentation percentage is \lambda\in\{0.25,0.50,0.75,1.00\}, and see Figures 15 and 16 for additional results on our iterative fine-tuning experiments when the dataset size is n=2794 and the synthetic augmentation percentage is \lambda\in\{0.05,0.10,0.15,0.20,0.25\}. The graphs provide evidence, across 17 experiment settings, that our iterative fine-tuning procedure with self-correction yields better training performance than iterative fine-tuning with no self-correction for the motion synthesis task, in accordance with Theorem 4.3.

Refer to caption

Refer to caption

Refer to caption

Refer to caption

Figure 12: Results from our human motion experiments with iterative fine-tuning with and without self-correction, where the training set has size 64. These are graphs of evaluation metrics on the last checkpoint of every generation; this is the checkpoint used for sampling in the self-consuming loop experiments, and it is also the checkpoint from which training is resumed with the new partially synthesized dataset. These results demonstrate that iterative fine-tuning with self-correction generally outperforms iterative fine-tuning, and is sometimes even competitive with baseline performance.

Refer to caption

Refer to caption

Refer to caption

Refer to caption

Figure 13: Results from our human motion experiments with iterative fine-tuning with and without self-correction, where the training set has size 128. These are graphs of evaluation metrics on the last checkpoint of every generation; this is the checkpoint used for sampling in the self-consuming loop experiments, and it is also the checkpoint from which training is resumed with the new partially synthesized dataset. These results demonstrate that iterative fine-tuning with self-correction generally outperforms iterative fine-tuning, and is sometimes even competitive with baseline performance. Notably, the performance gain of iterative fine-tuning with self-correction over iterative fine-tuning is less pronounced than when the dataset size is n=64.

Refer to caption

Refer to caption

Refer to caption

Refer to caption

Figure 14: Results from our human motion experiments with iterative fine-tuning with and without self-correction, where the training set has size 256. These are graphs of evaluation metrics on the last checkpoint of every generation; this is the checkpoint used for sampling in the self-consuming loop experiments, and it is also the checkpoint from which training is resumed with the new partially synthesized dataset. These results demonstrate that iterative fine-tuning with self-correction generally outperforms iterative fine-tuning, and is sometimes even competitive with baseline performance.

Refer to caption

Refer to caption

Refer to caption

Refer to caption

Refer to caption

Figure 15: Results from our human motion experiments on iterative fine-tuning with dataset size n=2794. These are graphs of evaluation metrics on the last checkpoint of every generation; this is the checkpoint used for sampling in the augmentation loop experiments, and it is also the checkpoint from which training is resumed with the new synthesized dataset. In these results, it appears as though iterative fine-tuning with self-correction has less variance during training than iterative fine-tuning with no self-correction, and generally has better FID scores later in training. Notably, these two curves are closer together than they were in the cases n\in\{64,128,256\}.

Refer to caption

Refer to caption

Refer to caption

Refer to caption

Refer to caption

Figure 16: Results from our human motion experiments on iterative fine-tuning with dataset size n=2794. These are graphs of the average evaluation metrics for every generation. Graphing the average evaluation metrics makes the trend in the training dynamics over time clearer. With this additional smoothing, it is more apparent that iterative fine-tuning with self-correction outperforms iterative fine-tuning with no self-correction, and is competitive with the baseline after many generations; in fact, it appears to converge to the baseline (on average) for every synthetic augmentation percentage that we considered.

Appendix G Consistency Across Seeds: Additional Human Motion Generation Quantitative Results

In Figures 17, 18, 19, and 20, we present experimental results from runs across three more seeds for our human motion experiments with dataset size n=64. We find that the self-correction technique consistently yields improved training dynamics over iterative fine-tuning without correction.


Refer to caption

Figure 17: Results from our human motion experiments on iterative fine-tuning, with dataset size n=64 and 25% augmentation percentage. Each row corresponds to a different random seed. We can see that iterative fine-tuning with self-correction consistently outperforms iterative fine-tuning with no self-correction, and the FID score appears to converge to the baseline after many generations.

Refer to caption


Figure 18: Results from our human motion experiments on iterative fine-tuning, with dataset size n=64 and 50% augmentation percentage. Each row corresponds to a different random seed. We can see that iterative fine-tuning with self-correction consistently outperforms iterative fine-tuning with no self-correction, and the FID score appears to converge to the baseline after many generations.

Refer to caption

Figure 19: Results from our human motion experiments on iterative fine-tuning, with dataset size n=64 and 75% augmentation percentage. Each row corresponds to a different random seed. We can see that iterative fine-tuning with self-correction consistently outperforms iterative fine-tuning with no self-correction, and the FID score appears to converge near the baseline after many generations.

Refer to caption

Figure 20: Results from our human motion experiments on iterative fine-tuning, with dataset size n=64 and 100% augmentation percentage. Each row corresponds to a different random seed. We can see that iterative fine-tuning with self-correction consistently outperforms iterative fine-tuning with no self-correction. However, we see less stability than in the runs with a lower augmentation percentage, in accordance with Theorem 4.3.