
Boosting Data Analytics with Synthetic Volume Expansion

Xiaotong Shen (ORCID: 0000-0003-1300-1451) and Yifei Liu, School of Statistics, University of Minnesota, Twin Cities
Rex Shen, Department of Statistics, Stanford University
Abstract

Synthetic data generation, a cornerstone of Generative Artificial Intelligence, promotes a paradigm shift in data science by addressing data scarcity and privacy while enabling unprecedented performance. As synthetic data becomes more prevalent, concerns emerge regarding the accuracy of statistical methods when applied to synthetic data in contrast to raw data. This article explores the effectiveness of statistical methods on synthetic data and the privacy risks of synthetic data. Regarding effectiveness, we present the Synthetic Data Generation for Analytics framework. This framework applies statistical approaches to high-quality synthetic data produced by generative models like tabular diffusion models, which, initially trained on raw data, benefit from insights from pertinent studies through transfer learning. A key finding within this framework is the generational effect, which reveals that the error rate of statistical methods on synthetic data decreases with the addition of more synthetic data but may eventually rise or stabilize. This phenomenon, stemming from the challenge of accurately mirroring raw data distributions, highlights a “reflection point”—an ideal volume of synthetic data defined by specific error metrics. Through three case studies, sentiment analysis, predictive modeling of structured data, and inference in tabular data, we validate the superior performance of this framework compared to conventional approaches. On privacy, synthetic data imposes lower risks while supporting the differential privacy standard. These studies underscore synthetic data’s untapped potential in redefining data science’s landscape.

Keywords: Generative Machine Intelligence, Large Language Models, Knowledge Transfer, Pretrained Transformers, Tabular Diffusion, Unstructured

1 Introduction

The advent of synthetic data generation, fueled by generative artificial intelligence (GAI), has initiated a paradigm shift in data analytics, pivoting towards the adoption of synthetic data. Synthetic data, designed to simulate real-world scenarios, presents a viable alternative to the challenges associated with traditional data collection, sharing, and analysis, especially when data are scarce or sensitive. As predicted by Gartner, synthetic data will account for 60% of the data used in AI and analytics projects by 2024, surpassing the utilization of raw data in AI models by 2030 (Gartner, 2022; Eastwood, 2023). Highlighting its transformative impact, a Forbes article by Toews (2022) predicts that synthetic data technology "will reshape the world of AI in the years ahead, scrambling competitive landscapes and redefining technology stacks." Furthermore, Jordon et al. (2022) provide an exhaustive review of synthetic data, highlighting its considerable promise as a technology. This evolution signifies a shift away from the traditional reliance on raw data, opening a new chapter in data generation and its application to analytics.

This paper investigates the challenges of efficacy and privacy in synthetic data utilization, emphasizing the role of synthetic data in enhancing data analytics. Specifically, we introduce the Synthetic Data Generation for Analytics (Syn) framework, aimed at increasing the precision of statistical methods applied to high-fidelity synthetic data that accurately replicates raw data through transfer learning. This framework mirrors the statistical properties of raw data through advanced knowledge transfer techniques while offering a potential solution for data sharing without compromising data privacy. Synthetic data provides two main benefits for data analytics: it mitigates data scarcity and addresses privacy issues (Ghalebikesabi et al., 2023).

In the Syn framework, a generative model, informed by raw data, is used to produce synthetic data that replicates its distribution, incorporating insights from pre-trained models in relevant studies through transfer learning. The Syn framework enables the integration of both pre-trained and generative models of various kinds that share comparable latent structures (Akbar, Wang and Eklund, 2023). It employs an array of generative models tailored for different domains: image diffusion (Sohl-Dickstein et al., 2015; Wei et al., 2021), text diffusion models (Ho, Jain and Abbeel, 2020), text-to-image diffusion models (Zhang et al., 2023a), time-series diffusion models (Lin et al., 2023), spatio-temporal diffusion models (Yuan et al., 2023), and tabular diffusion models (Kotelnikov et al., 2023; Zheng and Charoenphakdee, 2022; Kim, Lee and Park, 2022; Lee, Kim and Park, 2023). Moreover, Syn embraces advanced models such as the Reversible Generative Models (Kingma and Dhariwal, 2018). These flow-based models capture the raw data distribution and can estimate both conditional and marginal distributions.

Through Syn exploration, we address a critical issue of efficacy: Can high-fidelity synthetic data boost the effectiveness of statistical methods solely reliant on raw data? If so, how may we implement such enhancements? Recent research offers diverging viewpoints on this issue. In some cases, synthetic X-ray images improve the accuracy of machine learning models (Gao et al., 2023), whereas, in others, training on synthetic data may compromise performance for some machine learning models (Kotelnikov et al., 2023).

Our investigation reveals that synthetic data, when accurately generated, can boost the accuracy of a statistical method by augmenting the sample size of raw data. However, a significant caveat emerges: a statistical method applied to low-fidelity synthetic data could yield unreliable outcomes. Often, a generational effect emerges, whereby as the size of synthetic data grows, the precision gain of a method may diminish or even plateau. This phenomenon has been illustrated in our study for structured data prediction in Section 3.2. This challenge arises from generation errors or discrepancies between the data-generation distributions of synthetic and raw data. Fundamentally, the generational effect underscores a key concern: regardless of the size of synthetic data, generation errors can compromise the accuracy of a statistical method.

While evaluating a supervised task is straightforward, hypothesis testing presents the challenge of regulating the Type-I error while enhancing the power of a test. To address this challenge, we introduce “Syn-Test,” a test that augments the sample size of raw data by applying synthetic data. For clarity, “Syn-A” denotes Method A within the Syn framework throughout this article. Syn-Test determines the ideal size of synthetic data required to manage the empirical Type-I error while performing a test for finite inference samples using Monte Carlo methods. Our research indicates that the ideal size of synthetic data can heighten the accuracy. Moreover, our theoretical investigation sheds light on the generational effect, precision, and the size of synthetic data.

Additionally, we introduce Syn-Slm, a streamlined approach enhancing Syn’s usability in various applications. This method bypasses the training of conventional predictive models in supervised learning tasks. Instead, it directly models the outcome’s conditional distribution through advanced generative models. The effectiveness of Syn-Slm is demonstrated with two examples: sentiment analysis using text data and regression analysis with tabular data.

To highlight the capabilities of the Syn framework, we explore sentiment analysis, predictive modeling for structured data, and tabular data inference. In these domains, methods using high-fidelity synthetic data outperform those with raw data, attributed to the synthetic data generation by diffusion models obtained by fine-tuning relevant pre-trained models on raw data. This fine-tuning boosts efficiency and accuracy by leveraging existing knowledge akin to a scientist solving specialized problems. Specifically, it updates a generative model’s architecture, like adjusting a neural network’s weights or training the entire network or a subnetwork, with the rest remaining unchanged. Furthermore, one may augment network architectures with adapters to accommodate various input dimensions and tasks. For example, a tabular diffusion model pre-trained on an Adult-Male dataset can be fine-tuned on an Adult-Female dataset with the same features, as discussed in Section 3.2.1. Similarly, a diffusion model initially trained with X-ray images (Akbar, Wang and Eklund, 2023) could undergo fine-tuning, including potential modifications to the network architecture, for CT-scan images. Despite originating from distinct domains, they exhibit visual resemblances.

In the first area, we contrast three models: OpenAI’s Generative Pre-trained Transformer (GPT)-3.5 for illustrating Syn-Slm, DistilBERT (Sanh et al., 2019) using transfer learning, and LSTM in the traditional framework for analyzing consumer reviews from the IMDB movie dataset. In this context, Syn’s generative capability using GPT-3.5 significantly outperforms the LSTM approach in the traditional setup. Yet, DistilBERT in the transfer learning setup, although trailing, demonstrates notable competitiveness compared to GPT-3.5. To further validate Syn-Slm’s application, we provide a simulated example with tabular data in Appendix B.

In the second area, we introduce Syn-Boost, a version of CatBoost (Dorogush, Ershov and Gulin, 2018)—a gradient boosting algorithm (Schapire, 1990)—trained on synthetic data. Syn-Boost bolsters the precision of CatBoost for both regression and classification tasks across eight real-world datasets, utilizing a refined tabular diffusion model (Kotelnikov et al., 2023). Statistically, the error trajectory of Syn-Boost exhibits either a U-shaped or L-shaped pattern, determined by the generation errors and the volume of synthetic data used. Moreover, when employing the same knowledge transfer methods, Syn-Boost surpasses traditional feed-forward networks trained on raw data in six of eight cases. These observations highlight that the generative approach of Syn offers a predictive advantage even against a top predictive methodology given the same data input.

In the third domain, we explore feature significance tests for discerning feature relevance in regression and classification using CatBoost. Such inference within black-box models remains largely uncharted territory. Recently, Dai, Shen and Pan (2024) introduce an asymptotic test through sample splitting. To augment its statistical power, we employ the Syn-Test, capitalizing on pre-trained generative models and knowledge transfer, as demonstrated in two distinct scenarios. In one scenario, we leverage a pre-trained model to ensure smooth knowledge transfer from the male data to the female data while improving generation fidelity and test accuracy for female data, especially when male and female data distributions present distinct characteristics. These observations emphasize the importance of knowledge transfer in mitigating disparities in domains such as healthcare and social science, particularly when data for specific subgroups, like minorities, are limited. Moreover, we shed light on the “generational effect”, accentuating the interplay of synthetic data generation and knowledge transfer in hypothesis testing, a domain rarely explored in current literature.

This article consists of six sections. Section 2 explores Syn’s role in enhancing the accuracy of statistical methods and emphasizes the importance of knowledge transfer in synthetic data generation. Section 3 provides illustrative examples, exploring the generational effect in predictive modeling and inference. Section 4 focuses on the privacy issue of synthetic data. In Section 5, we discuss the implications of generating synthetic data for data science. The Appendix contains technical details.

2 Enhancing Statistical Accuracy

2.1 Synthetic Data

The Syn framework empowers data analytics by applying statistical methods on a synthetic sample $\tilde{\bm{Z}}^{(m)}=(\tilde{\bm{Z}}_{i})_{i=1}^{m}$. This sample is generated by a generative model trained on raw data $\bm{Z}^{(n)}=(\bm{Z}_{i})_{i=1}^{n}$ through fine-tuning a pre-trained model, leveraging insights from various similar studies. These models include GPT (Brown et al., 2020; Bubeck et al., 2023; OpenAI, 2023), diffusion models (Ho, Jain and Abbeel, 2020; Song, Meng and Ermon, 2020), normalizing flows (Dinh, Krueger and Bengio, 2014; Dinh, Sohl-Dickstein and Bengio, 2016; Kingma and Dhariwal, 2018), and GANs (Frid-Adar et al., 2018; Liu et al., 2020). For an in-depth understanding of the generation processes for diffusion models and flows, readers may refer to Liu, Shen and Shen (2023). In this framework, the cumulative distribution function (CDF) $\tilde{F}$ of $\tilde{\bm{Z}}^{(m)}$ estimates the CDF $F$ of $\bm{Z}^{(n)}$. To produce high-fidelity synthetic data, fine-tuning a pre-trained generative model is recommended, which involves transferring knowledge from previous studies. If pre-trained models are unsuitable, constructing a generative model from scratch is a viable, though less preferred, alternative. The quality of the generated data hinges on the choice of generative model and the effectiveness of knowledge transfer from similar studies. For an illustration, we detail the synthetic data generation process using a diffusion model in Figure 1. Furthermore, to demonstrate the importance of knowledge transfer, we fine-tune a pre-trained tabular diffusion model (Shwartz-Ziv and Armon, 2022) on the Adult-Male dataset to apply this knowledge to the Adult-Female dataset in Section 3.3, where male and female distributions exhibit distinct differences. For a detailed explanation of the impact of knowledge transfer, readers may refer to Section 2.5.

Figure 1: Illustration of denoising diffusion probabilistic model (Ho, Jain and Abbeel, 2020). In the forward process, noise $\epsilon_{t}$ sequentially corrupts the sample $\bm{X}_{t}$, evolving from the original sample $\bm{X}_{0}$ to a target, such as random noise, over $t=0,\ldots,T$. Conversely, the backward process employs a neural network $\epsilon_{\theta}(\bm{X}_{t},t)$ to predict $\epsilon_{t}$, starting from the random state. This network, fine-tuned from similar pre-trained models, denoises $\bm{X}_{t}$, from $t=T,\ldots,0$, to generate a synthetic sample $\tilde{\bm{X}}_{0}$ replicating $\bm{X}_{0}$.
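The forward (noising) half of Figure 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the generator used in this paper: it uses the standard closed-form DDPM forward sample $\bm{X}_t=\sqrt{\bar{\alpha}_t}\,\bm{X}_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon$, and the linear noise schedule `betas`, the function name `forward_noise`, and the toy data are all hypothetical choices made for the example.

```python
import numpy as np

def forward_noise(x0, t, betas, rng):
    """Draw X_t given X_0 in closed form for a DDPM-style forward process.

    betas is an assumed noise schedule of length T; t runs from 1 to T.
    """
    alphas_bar = np.cumprod(1.0 - betas)   # cumulative fraction of signal retained
    a_bar = alphas_bar[t - 1]
    eps = rng.standard_normal(x0.shape)    # the noise a network eps_theta would learn to predict
    xt = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return xt, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)      # assumed linear schedule, T = 1000
x0 = rng.standard_normal((5, 3))           # toy stand-in for a raw sample
xT, eps = forward_noise(x0, 1000, betas, rng)
```

At $t=T$ almost no signal from `x0` remains, so `xT` is nearly pure noise; the backward process in Figure 1 starts from such a state and denoises step by step.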

It is imperative to underscore the pivotal role of knowledge encapsulated in pre-trained models for improving generation accuracy through fine-tuning. Nevertheless, directly accessing the pre-training data for pre-trained models is often infeasible, impeded by privacy concerns, extensive storage needs, and data inconsistencies (Sweeney, 2002; Quionero-Candela et al., 2009; Labrinidis and Jagadish, 2012; Carroll, 2006), as seen in models like GPT-4. Furthermore, the distributions of pre-training datasets, especially from different sources, might not always mirror that of raw data, as illustrated by the Adult-Male and Adult-Female examples in Section 3.3. Considering these challenges and the nuances of real-world scenarios, we omit pre-training data from our raw data composition throughout this article.

To yield $\tilde{F}$ directly, one can utilize reversible generative models such as normalizing flows and Roundtrip GAN (Liu et al., 2020), which act as a nonparametric estimate of $F$. For other generative models, such as diffusion models and GPT, one can typically obtain $\tilde{F}$ from synthetic data employing Monte Carlo methods, which we elaborate on in Section 2.4. In the numerical examples presented in this paper, we utilize a tabular diffusion model (TDM, Kotelnikov et al. (2023)) and GPT for synthetic data generation.

Subsequently, we explore the precision advantages offered by the Syn framework.

2.2 Optimal Synthetic Size for Estimation and Prediction

In estimation and prediction, applying a statistical method to a synthetic sample yields an estimate, denoted $\widehat{\bm{\theta}}(\tilde{\bm{Z}}^{(m)})$, of a parameter vector $\bm{\theta}$. The effectiveness of this method is gauged through a specific error metric $\text{R}(\widehat{\bm{\theta}}(\tilde{\bm{Z}}^{(m)}))=\operatorname{E}L(\widehat{\bm{\theta}}(\tilde{\bm{Z}}^{(m)}),\bm{\theta})$. In theory, the statistical accuracy of $\widehat{\bm{\theta}}(\tilde{\bm{Z}}^{(m)})$ would keep improving with an unlimited amount of synthetic data, provided $\tilde{F}$ perfectly replicates $F$. Here, $\operatorname{E}$ denotes the expectation under $F$, while $L(\cdot,\cdot)$ is a loss function quantifying the discrepancy between $\widehat{\bm{\theta}}$ and $\bm{\theta}$. However, numerical insights from Section 3.2 reveal the existence of a reflection point, denoted $m_{0}=\arg\min_{m\geq 1}\text{R}(\widehat{\bm{\theta}}(\tilde{\bm{Z}}^{(m)}))$, which delineates a relationship between the synthetic sample size $m$ and the accuracy gain for this method. This point $m_{0}$ is governed by the generation error, measured by metrics such as the total variation distance between $\tilde{F}$ and $F$, defined as $\text{TV}(\tilde{F},F)=\sup_{B}|P_{F}(B)-P_{\tilde{F}}(B)|$, where $P_{F}$ and $P_{\tilde{F}}$ are the probability measures induced by $F$ and $\tilde{F}$ and $B$ is any event.
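For intuition about the total variation distance, the following sketch computes it for two discrete distributions on a common finite support, where the supremum over events reduces to half the $L_1$ distance between the probability vectors. The function name `total_variation` and the example vectors are illustrative assumptions; the distributions in this paper are generally continuous and high-dimensional, where TV is not available in such a simple form.

```python
import numpy as np

def total_variation(p, q):
    """TV distance between two discrete distributions on the same finite support.

    For probability vectors, sup_B |P(B) - Q(B)| equals 0.5 * sum_i |p_i - q_i|.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * float(np.abs(p - q).sum())

tv = total_variation([0.5, 0.5], [0.9, 0.1])   # the event B = {first outcome} attains the supremum
```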

To estimate $m_{0}$, we optimize its empirical risk measure across $m$ on an independent cross-validation sample from the original resources, which yields an optimizer $\hat{m}$ as an estimate of $m_{0}$. For instance, in scenarios where the risk measure is the generalization error in binary classification, its empirical risk is the test error on a test dataset obtained from the original resources, approximating a classifier’s generalizability.
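The tuning of $\hat{m}$ described above can be sketched as a grid search over synthetic sample sizes, scoring each fit on a held-out raw validation sample. Everything in this example is a hypothetical stand-in: `synthetic_sample` mimics an imperfect generator by perturbing the true slope, `fit_slope` is a simple least-squares estimator, and the grid of sizes is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def raw_sample(n):
    # Toy raw data: Y = 2 X + noise.
    x = rng.uniform(-1.0, 1.0, n)
    return x, 2.0 * x + rng.normal(0.0, 1.0, n)

def synthetic_sample(m, slope_bias=0.05):
    # Hypothetical imperfect generator: a slightly wrong slope mimics generation error.
    x = rng.uniform(-1.0, 1.0, m)
    return x, (2.0 + slope_bias) * x + rng.normal(0.0, 1.0, m)

def fit_slope(x, y):
    # Least squares through the origin.
    return float(x @ y / (x @ x))

x_val, y_val = raw_sample(2000)        # independent validation sample from the raw source
grid = [50, 200, 1000, 5000, 20000]    # candidate synthetic sizes m
val_err = {}
for m in grid:
    xs, ys = synthetic_sample(m)
    slope = fit_slope(xs, ys)
    val_err[m] = float(np.mean((y_val - slope * x_val) ** 2))  # empirical risk at size m

m_hat = min(val_err, key=val_err.get)  # estimate of the reflection point m_0 on this grid
```

Plotting `val_err` against `m` is a simple way to see whether the error curve is U-shaped or L-shaped for a given generator.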

To investigate the theoretical aspects of the generational effect concerning the accuracy of $\widehat{\bm{\theta}}(\tilde{\bm{Z}}^{(m)})$, we clarify the notation: Let $\text{Gr}^{(m)}=\text{R}(\widehat{\bm{\theta}}(\tilde{\bm{Z}}^{(m)}))-\text{R}(\widehat{\bm{\theta}}(\bm{Z}^{(m)}))$ represent the discrepancy between the synthetic and raw errors for a sample size $m$, where $\text{R}(\widehat{\bm{\theta}}(\bm{Z}^{(m)}))=\operatorname{E}L(\widehat{\bm{\theta}}(\bm{Z}^{(m)}),\bm{\theta})$ denotes the risk incurred when employing a raw sample $\bm{Z}^{(m)}$ of size $m$.

Theorem 2.1 (Reflection Point).

Suppose $\text{R}(\widehat{\bm{\theta}}(\bm{Z}^{(m)}))=C_{\bm{\theta}}m^{-\alpha}$ for some constant $\alpha>0$. Assume that $\text{Gr}^{(m)}\geq mf(\text{TV}(\tilde{F},F))$ if $m\leq m^{*}$ and $\text{Gr}^{(m)}\geq\text{Gr}^{(m^{*})}$ if $m>m^{*}$ for some finite index $m^{*}$, where $f(\cdot)>0$ is a function and $f(\text{TV}(\tilde{F},F))$ indicates the magnitude of generation error. Then,

  (i) The optimal synthetic size $m_{0}<m^{*}$ is finite.

  (ii) For any $m>m^{*}$, $\text{R}(\widehat{\bm{\theta}}(\tilde{\bm{Z}}^{(m)}))>\text{R}(\widehat{\bm{\theta}}(\tilde{\bm{Z}}^{(m_{0})}))$ provided that $f(\text{TV}(\tilde{F},F))$ is larger than $\text{R}(\widehat{\bm{\theta}}(\tilde{\bm{Z}}^{(m_{0})}))/m^{*}$.

The premise regarding the growth of $\text{Gr}^{(m)}$ in $m$ in Theorem 2.1 suggests that the divergence between the synthetic risk and the raw risk grows at least linearly in $m$ when $m\leq m^{*}$, beyond which it remains at least as large as $\text{Gr}^{(m^{*})}$. This condition implies a substantial generation error. According to Theorem 2.1, a considerable generation error results in the synthetic risk $\text{R}(\widehat{\bm{\theta}}(\tilde{\bm{Z}}^{(m)}))$ reaching its minimum at a specific $m_{0}$, beyond which the synthetic error begins to worsen. This outcome indicates that an increase in synthetic sample size does not necessarily improve estimation or prediction accuracy in the presence of significant generation error, a phenomenon further illustrated in Section 3.2. In contrast, when the generation error is minimal, the synthetic risk remains controlled, and the optimal $m_{0}$ tends to be substantially large or potentially infinite, as suggested by Theorem 2.2.

Next, we establish a bound on the synthetic risk to offer insight into the conditions under which accuracy improvements may arise. Assume that $\bm{Z}^{(n)}$ is independently and identically distributed according to $F$ while $\tilde{\bm{Z}}^{(m)}$ is independently and identically distributed according to a conditional distribution $\tilde{F}\equiv F_{\bm{Z}\mid\bm{Z}^{(n)}}$ given $\bm{Z}^{(n)}$.

Theorem 2.2 (Accuracy Gain).

Let $L$ be a nonnegative loss function upper-bounded by $U>0$. For any $m\geq 1$,

$$\text{R}(\widehat{\bm{\theta}}(\tilde{\bm{Z}}^{(m)}))\leq\text{R}(\widehat{\bm{\theta}}(\bm{Z}^{(m)}))+2Um\,\text{TV}(\tilde{F},F). \tag{1}$$

Moreover, if $\text{R}(\widehat{\bm{\theta}}(\bm{Z}^{(m)}))=C_{\bm{\theta}}m^{-\alpha}$, then

$$\text{R}(\widehat{\bm{\theta}}(\tilde{\bm{Z}}^{(m_{0})}))\leq\text{R}(\widehat{\bm{\theta}}(\tilde{\bm{Z}}^{(s_{0})}))\leq\left(\alpha^{\frac{-\alpha}{1+\alpha}}+\alpha^{\frac{1}{1+\alpha}}\right)(2U)^{\frac{\alpha}{1+\alpha}}C_{\bm{\theta}}^{\frac{1}{1+\alpha}}\cdot\text{TV}(\tilde{F},F)^{\frac{\alpha}{1+\alpha}}, \tag{2}$$

when $s_{0}=\left(\frac{2U}{\alpha C_{\bm{\theta}}}\right)^{\frac{\alpha}{1+\alpha}}\cdot\text{TV}(\tilde{F},F)^{-\frac{1}{1+\alpha}}$. Hence, $\text{R}(\widehat{\bm{\theta}}(\tilde{\bm{Z}}^{(m_{0})}))\leq\text{R}(\widehat{\bm{\theta}}(\bm{Z}^{(n)}))$ achieves an accuracy gain under Syn when the total variation $\text{TV}(\tilde{F},F)$ is sufficiently small. For example, this occurs when $\text{TV}(\tilde{F},F)\leq C_{\bm{\theta}}(2U)^{-1}\left(\alpha^{\frac{-\alpha}{1+\alpha}}+\alpha^{\frac{1}{1+\alpha}}\right)^{-\frac{1+\alpha}{\alpha}}\cdot n^{-(1+\alpha)}$.

Theorem 2.2 posits that a method trained on synthetic data can achieve an accuracy gain, provided that the generation error that governs the synthetic risk $\text{R}(\widehat{\bm{\theta}}(\tilde{\bm{Z}}^{(m)}))$ is small. Hence, high-fidelity data can mitigate the synthetic risk in a method, thereby enabling a large optimal synthetic size $m_{0}$ for further amplifying the precision.
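To see where the constant in (2) comes from, one can minimize the right-hand side of (1) under the rate assumption $\text{R}(\widehat{\bm{\theta}}(\bm{Z}^{(m)}))=C_{\bm{\theta}}m^{-\alpha}$. The following is a sketch of that calculation, not the proof in the Appendix: it treats $m$ as a continuous variable, writes $\text{TV}$ for $\text{TV}(\tilde{F},F)$, and uses $g$ and $m_{\star}$ as notation introduced only for this illustration. The stationary point obtained here is an outcome of this sketch and is worth comparing with the expression for $s_{0}$ stated in the theorem.

```latex
\begin{align*}
g(m) &= C_{\bm{\theta}}\, m^{-\alpha} + 2U m\,\text{TV}, \qquad m>0, \\
g'(m) &= -\alpha C_{\bm{\theta}}\, m^{-\alpha-1} + 2U\,\text{TV} = 0
  \;\Longrightarrow\;
  m_{\star} = \left(\frac{\alpha C_{\bm{\theta}}}{2U\,\text{TV}}\right)^{\frac{1}{1+\alpha}}, \\
C_{\bm{\theta}}\, m_{\star}^{-\alpha}
  &= \alpha^{\frac{-\alpha}{1+\alpha}} (2U)^{\frac{\alpha}{1+\alpha}}
     C_{\bm{\theta}}^{\frac{1}{1+\alpha}}\,\text{TV}^{\frac{\alpha}{1+\alpha}}, \\
2U m_{\star}\,\text{TV}
  &= \alpha^{\frac{1}{1+\alpha}} (2U)^{\frac{\alpha}{1+\alpha}}
     C_{\bm{\theta}}^{\frac{1}{1+\alpha}}\,\text{TV}^{\frac{\alpha}{1+\alpha}}, \\
g(m_{\star})
  &= \left(\alpha^{\frac{-\alpha}{1+\alpha}} + \alpha^{\frac{1}{1+\alpha}}\right)
     (2U)^{\frac{\alpha}{1+\alpha}} C_{\bm{\theta}}^{\frac{1}{1+\alpha}}\,
     \text{TV}^{\frac{\alpha}{1+\alpha}}.
\end{align*}
```

Since $g$ is convex on $m>0$, $m_{\star}$ is its minimizer, and $g(m_{\star})$ coincides with the right-hand side of (2). Requiring $g(m_{\star})\leq C_{\bm{\theta}}n^{-\alpha}$ and solving for $\text{TV}$ gives the threshold of order $n^{-(1+\alpha)}$ quoted at the end of Theorem 2.2.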

The existing literature provides insights into the magnitude of generation error as represented by $\text{TV}(\tilde{F},F)$. For instance, Theorem 5.1 in Oko, Akiyama and Suzuki (2023) specifies bounds for a diffusion model, given that the data-generating distribution is a member of a Besov space. It is worth noting that the bound in Oko, Akiyama and Suzuki (2023) pertains to $\operatorname{E}\text{TV}(\tilde{F},F)$, with $\operatorname{E}$ denoting the expectation relative to $F$. However, one may extend this to the in-probability convergence rate for the random quantity $\text{TV}(\tilde{F},F)$ by leveraging Markov’s inequality, which provides the convergence rates for $\text{TV}(\tilde{F},F)$ based on the raw sample size $n$ and/or the pre-training sample size.
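The Markov step mentioned above can be written out as follows. This is a sketch in which $r_{n}$ is a hypothetical name for whatever bound on $\operatorname{E}\text{TV}(\tilde{F},F)$ the cited result supplies, and $\delta\in(0,1)$ is an arbitrary probability level.

```latex
\[
\mathbb{P}\big(\text{TV}(\tilde{F},F)\geq t\big)
  \leq \frac{\operatorname{E}\text{TV}(\tilde{F},F)}{t}
  \quad\text{for any } t>0,
\qquad\text{so}\qquad
\operatorname{E}\text{TV}(\tilde{F},F)\leq r_{n}
  \;\Longrightarrow\;
\mathbb{P}\Big(\text{TV}(\tilde{F},F)\geq \frac{r_{n}}{\delta}\Big)\leq \delta .
\]
```

In other words, a bound of order $r_{n}$ in expectation yields $\text{TV}(\tilde{F},F)=O_{p}(r_{n})$.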

2.3 Optimal Synthetic Size for Hypothesis Testing

We now introduce Syn-Test, a novel inference tool using high-fidelity synthetic data to boost any test’s power by expanding the sample size of raw data. Syn-Test yields two distinct advantages. First, it employs synthetic data to gauge the null distribution of any test statistic by Monte Carlo methods as in the bootstrap approach (Efron, 1992), circumventing analytical derivations. This methodology proves particularly powerful for unstructured data inferences, including texts and images (Liu, Shen and Shen, 2023). Second, Syn-Test identifies the optimal synthetic data size, optimizing a test’s power while maintaining a suitable control of Type-I errors. For illustration, we refer to Section 3.3.

Given a raw sample, Syn-Test employs two nearly equal-sized subsamples $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$, partitioned from a training sample for fine-tuning a pre-trained generative model. It also uses a separate validation sample $\mathcal{S}_{3}$ for validating model training. One generative model generates synthetic data using $\mathcal{S}_{1}$ for null distribution estimation, while the other uses $\mathcal{S}_{2}$ for computing the test statistic. Figure 7 illustrates the splitting scheme and Syn-Test process. Syn-Test also empirically determines the optimal synthetic data size $m_{0}$ to control the Type-I error. By swapping the roles of $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$, Syn-Test can de-randomize the partition, transitioning the original inference sample size $n$ to a synthetic inference sample size of $m$. This swapping mechanism proves especially advantageous in scenarios with low generation error. Crucially, abundant synthetic data can effectively enlarge the raw sample, even when sample splitting results in a reduced inference sample size (Wasserman and Roeder, 2009; Wasserman, Ramdas and Balakrishnan, 2020).

Syn-Test encompasses four steps, using a significance level $\alpha$, a tolerance error $\varepsilon$, and a Monte Carlo size $D$. Syn-Test proceeds as follows:

Step 1: Controlling Type-I Error. Generate $D$ distinct synthetic samples $(\tilde{\bm{Z}}_{1}^{(m,d)})_{d=1}^{D}$, each of size $m$, by refining a pre-trained generative model with $\mathcal{S}_{1}$ under $H_{0}$, using $\mathcal{S}_{3}$ as a validation set for model tuning to avoid overfitting. Compute the empirical distribution of the test statistic $T$ using $(T(\tilde{\bm{Z}}_{1}^{(m,d)}))_{d=1}^{D}$. Define a rejection region $C_{m}$ at a significance level $\alpha>0$.

Step 2: Optimizing Synthetic Size through Tuning. Execute Step 1, but use $\mathcal{S}_{2}$ instead of $\mathcal{S}_{1}$ to produce $\tilde{\bm{Z}}^{(m,d)}_{2}$. Utilize the empirical distribution from $(T(\tilde{\bm{Z}}_{2}^{(m,d)}))_{d=1}^{D}$ to find the empirical Type-I error, denoted $\tilde{P}(C_{m})$, for the $C_{m}$ created in Step 1. To effectively control the Type-I error, we propose two distinct strategies for identifying $\hat{m}$: an aggressive and a conservative approach. The aggressive approach selects the largest $m$ that maintains the estimated Type-I error within the desired limit. In contrast, the conservative one chooses the smallest $m$ at which the estimated Type-I error is still controlled but ceases to be controlled at $m+1$. Mathematically,

  1. Aggressive: $\hat{m}=\max\{m:\tilde{P}(C_{m})\leq\alpha+\varepsilon\}$.

  2. Conservative: $\hat{m}=\min\{m:\tilde{P}(C_{m})\leq\alpha+\varepsilon\ \text{ and }\ \tilde{P}(C_{m+1})>\alpha+\varepsilon\}$.

In practice, we recommend adopting a conservative approach, as it more effectively manages Type-I errors, although it may be less powerful compared to a more aggressive strategy.
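The two selection rules in Step 2 can be sketched as a small function over a table of empirical Type-I errors. This is an illustration under assumptions: the function name `select_m` and the example numbers are hypothetical, candidate sizes are taken on a grid so that "$m+1$" is read as the next grid point, and the fallback used when the Type-I error is controlled at every candidate size is a choice made for this sketch.

```python
def select_m(type1_by_m, alpha, eps, strategy="conservative"):
    """Pick m_hat from empirical Type-I errors indexed by candidate synthetic size m.

    type1_by_m: dict {m: empirical Type-I error of the rejection region C_m}.
    Returns None if no candidate keeps the error within alpha + eps.
    """
    limit = alpha + eps
    sizes = sorted(type1_by_m)
    controlled = [m for m in sizes if type1_by_m[m] <= limit]
    if not controlled:
        return None
    if strategy == "aggressive":
        return max(controlled)            # largest m with controlled error
    # Conservative: smallest controlled m whose successor on the grid is not controlled.
    for cur, nxt in zip(sizes, sizes[1:]):
        if type1_by_m[cur] <= limit and type1_by_m[nxt] > limit:
            return cur
    return max(controlled)                # assumed fallback: error never exceeds the limit

# Hypothetical empirical Type-I errors at five candidate sizes.
errors = {100: 0.04, 200: 0.05, 300: 0.08, 400: 0.05, 500: 0.09}
m_aggressive = select_m(errors, alpha=0.05, eps=0.01, strategy="aggressive")
m_conservative = select_m(errors, alpha=0.05, eps=0.01, strategy="conservative")
```

With these numbers the aggressive rule picks 400 and the conservative rule picks 200, which shows how the two rules differ when the empirical Type-I error is not monotone in $m$.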

Step 3: Calculating the P-value. With the determined $\hat{m}$, produce synthetic data $\tilde{\bm{Z}}_{2}^{(\hat{m})}$ by fine-tuning the pre-trained generative model using $\mathcal{S}_{2}$ with $\mathcal{S}_{3}$ as a validation set. Calculate the test statistic $T(\tilde{\bm{Z}}_{2}^{(\hat{m})})$ and determine the P-value, $P^{1}$, leveraging the null CDF estimated from $\mathcal{S}_{1}$ in Step 1.
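The Monte Carlo pieces of Steps 1 and 3 can be sketched as follows, assuming a one-sided test that rejects for large values of $T$ (an assumption of this example; the text does not fix the form of the rejection region). The helper names `critical_value` and `mc_p_value` are hypothetical, and the chi-square draws merely stand in for $T$ evaluated on $D$ synthetic null samples.

```python
import numpy as np

def critical_value(null_stats, alpha):
    """Upper-tail critical value: reject when the statistic exceeds the (1 - alpha) quantile."""
    return float(np.quantile(null_stats, 1.0 - alpha))

def mc_p_value(null_stats, observed):
    """Monte Carlo P-value: share of null draws at least as large as the observed statistic."""
    null_stats = np.asarray(null_stats, dtype=float)
    return float(np.mean(null_stats >= observed))

rng = np.random.default_rng(1)
null_stats = rng.chisquare(df=1, size=5000)   # placeholder for T on D = 5000 synthetic null samples
threshold = critical_value(null_stats, alpha=0.05)
p1 = mc_p_value(null_stats, observed=6.0)     # hypothetical observed value of T
```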

Step 4: Combining the P-values. Repeat Step 3 by interchanging the roles of $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$ to compute the P-value $P^{2}$. Combine P-values via Hommel’s weighted average (Hommel, 1983):

$$\bar{P}=\min\Big(C\min_{1\leq q\leq 2}\frac{2}{q}P^{(q)},\,1\Big),$$

where $C=\sum_{q=1}^{2}\frac{1}{q}=3/2$ and $P^{(q)}$ is the $q$-th order statistic of $\{P^{1},P^{2}\}$.

Hommel’s method excels in controlling the Type-I error relative to many of its peers, ensuring that $\mathbb{P}\big(\bar{P}\leq\alpha\big)\leq\alpha$ under $H_{0}$. While effective, there are also alternative strategies such as the Cauchy combination (Liu and Xie, 2020). To expedite the search for the estimate $\hat{m}$ of $m_{0}$, we may consider techniques such as Bisection (Burden and Faires, 2001) or Fibonacci Search (Kiefer, 1953).
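The combination rule in Step 4 is short enough to write out directly. The sketch below implements the displayed formula for two P-values; the function name `hommel_combine` is a label chosen for this example.

```python
def hommel_combine(p1, p2):
    """Combine two P-values as min(C * min_q (2/q) * P^(q), 1) with C = 1 + 1/2."""
    ordered = sorted([p1, p2])                      # order statistics P^(1) <= P^(2)
    k = len(ordered)
    c = sum(1.0 / q for q in range(1, k + 1))       # C = 3/2 for two P-values
    inner = min(k / q * p for q, p in enumerate(ordered, start=1))
    return min(c * inner, 1.0)

p_bar = hommel_combine(0.01, 0.04)
```

For $P^{1}=0.01$ and $P^{2}=0.04$, the inner minimum is $\min(2\times 0.01,\ 1\times 0.04)=0.02$, so $\bar{P}=1.5\times 0.02=0.03$.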

Theorem 2.3 (Validity and power of Syn-Test).

Assume that $\tilde{\bm{Z}}^{(m,d)}$ is an i.i.d. sample of size $m$ following $\tilde{F}=F_{\tilde{\bm{Z}}\mid\bm{Z}^{(n)}}$ given $\bm{Z}^{(n)}$. Let $\tilde{F}_{T}$ and $F_{T}$ be the synthetic and raw distributions of $T$ calculated on a sample of size $m$. Then, the estimation error of the null distribution is governed by the Monte Carlo error and the generation error:

$$\sup_{x}\left|F_{\tilde{T}}(x)-F_{T}(x)\right|\leq\sqrt{\frac{\log\frac{2}{\delta}}{2D}}+\text{TV}(\tilde{\bm{Z}}^{(m)},\bm{Z}^{(m)}). \tag{3}$$

As a result, Syn-Test offers a valid test as long as $\text{TV}(\tilde{\bm{Z}}^{(m)},\bm{Z}^{(m)})=m\cdot\text{TV}(\tilde{F},F)\rightarrow 0$ in probability and $D\rightarrow\infty$ (see Footnote 1). Moreover, let the power function $\phi_{m,\alpha}$ be $P(T(\bm{Z}^{(m)})\in R_{\alpha}|H_{a})$ for rejection region $R_{\alpha}$ at significance level $\alpha$, and $\tilde{\phi}_{m,\alpha}$ analogously with $\tilde{\bm{Z}}^{(m)}$. If for some $m>n$, $\Delta=\phi_{m,\alpha}-\phi_{n,\alpha}>0$, then $\tilde{\phi}_{m,\alpha}>\phi_{n,\alpha}$ when $\text{TV}(\tilde{F},F)<\Delta/m$, indicating that Syn-Test enhances power if the generation error is small.

Footnote 1: Assuming $\tilde{F}$ is derived from pre-training data of size $N$, then $m\,\text{TV}(\tilde{F},F)$ decreases as $N\rightarrow\infty$, given certain regularity conditions. To meet the requirement, $\text{TV}(\tilde{F},F)$ must reduce at a rate quicker than the growth of $m$, underscoring the significance of extensive pre-training data size, a common practice in transfer learning.

Syn-Test facilitates valid inference without necessitating many model assumptions, specific data distributions, or an infinitely large inference sample. Its validity and power primarily hinge on a small generation error, a condition usually met by generative models trained on sufficiently large datasets. Additionally, Syn-Test permits a fine-tuned generator to estimate the bias of a test statistic through a Monte Carlo approach. This estimation can serve various purposes in downstream analysis, including debiasing. We defer this methodology to future study.
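The two terms on the right-hand side of (3) are easy to evaluate numerically, which helps when choosing the Monte Carlo size $D$. The sketch below is illustrative only: it reads $\delta$ as a confidence parameter (an assumption, since $\delta$ is not defined in the statement above), uses the identification $\text{TV}(\tilde{\bm{Z}}^{(m)},\bm{Z}^{(m)})=m\cdot\text{TV}(\tilde{F},F)$ from Theorem 2.3, and the function names and the TV value plugged in are hypothetical.

```python
import math

def monte_carlo_term(D, delta):
    """First term in (3): sqrt(log(2 / delta) / (2 D))."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * D))

def null_error_bound(D, delta, m, tv):
    """Right-hand side of (3), with the generation term written as m * TV(F_tilde, F)."""
    return monte_carlo_term(D, delta) + m * tv

mc_small = monte_carlo_term(D=1000, delta=0.05)
mc_large = monte_carlo_term(D=100000, delta=0.05)
bound = null_error_bound(D=1000, delta=0.05, m=500, tv=1e-5)  # tv is a made-up value
```

Increasing $D$ shrinks only the first term; the second term grows with $m$ unless the generation error is correspondingly small, which is the trade-off behind the choice of $\hat{m}$ in Step 2.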

2.4 Syn-Slm: Streamlined Approach

The Syn framework enables synthetic data generation that mirrors raw data distributions. It is adept at tackling statistical challenges in both unsupervised and supervised realms. It can also derive $\tilde{F}$ directly from some generative models such as normalizing flows (Dinh, Krueger and Bengio, 2014; Dinh, Sohl-Dickstein and Bengio, 2016; Kingma and Dhariwal, 2018). Next, we introduce Syn-Slm, a streamlined approach for supervised learning tasks. Unlike the methods discussed in Section 2.2, Syn-Slm directly models the conditional distribution of the outcome given predictors, without relying on an additional predictive model trained on synthetic data.

Supervised learning aims to predict an outcome based on a set of predictors, denoted as $\bm{X}$. Consider a scenario with a one-dimensional outcome variable $Y$ and define $\bm{Z}=(Y,\bm{X})$. In this context, the goal is to estimate a statistical functional, represented as $\phi(F_{Y|\bm{X}})$, where $F_{Y|\bm{X}}$ is the conditional CDF of $Y$ given $\bm{X}$. Our Syn-Slm method introduces a plug-in estimate $\phi(\tilde{F}_{Y|\bm{X}})$, where $\tilde{F}_{Y|\bm{X}}$ represents the conditional CDF of a synthetic $Y$ given $\bm{X}$. This approach, focusing on estimating $\tilde{F}_{Y|\bm{X}}$, is notably distinct from conventional conditional generation, which often estimates $\tilde{F}_{\bm{X}|Y}$, such as generating images from a specific class. As hypothesized by Zhang et al. (2023b), this type of generative modeling could outperform direct predictive modeling, as demonstrated in their work using imputation techniques. To illustrate, if $\phi(F_{Y|\bm{X}})=\operatorname{E}(Y|\bm{X})$ signifies the conditional expectation of $Y$, then $\phi(\tilde{F}_{Y|\bm{X}})=\operatorname{E}(\tilde{Y}|\bm{X})$, where $\tilde{Y}$ is drawn from $\tilde{F}_{Y|\bm{X}}$, resulting in a Syn-Slm estimate. An example is provided in Section 3.1, and a simulated example is presented in Appendix B.

In contrast, the Syn prediction method described in Section 2.2 employs a two-step process: it first generates joint synthetic data $\tilde{\bm{Z}}=(\tilde{Y},\tilde{\bm{X}})$ and then trains a predictive model on $\tilde{\bm{Z}}$. For instance, Syn-Boost in Section 3.2 utilizes TDM (Kotelnikov et al., 2023) as the generator and CatBoost (Dorogush, Ershov and Gulin, 2018) for prediction. While straightforward to implement, methods akin to Syn-Boost may incur prediction errors from predictive modeling in addition to generation errors from generating $\bm{Z}$. Syn-Slm instead models the conditional generation of $Y|\bm{X}$ directly and estimates $\tilde{F}_{Y|\bm{X}}$ via a Monte Carlo approach, offering a fully non-parametric solution. Additionally, Syn-Slm enables the estimation of various statistical functionals $\phi(F_{Y|\bm{X}})$, such as $\operatorname{E}(Y|\bm{X})$, $\text{Var}(Y|\bm{X})$, and quantiles of $Y|\bm{X}$, whereas methods like Syn-Boost require training different predictive models under varying assumptions to estimate these quantities. A detailed comparison between Syn-Boost and Syn-Slm is presented in Table 1.
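The Monte Carlo step of Syn-Slm can be illustrated with a minimal sketch. The function `sample_y_given_x` is a hypothetical stand-in for a fitted conditional generator (here a toy Gaussian law, not the generator used in the paper); the point is only that one set of draws from $\tilde{F}_{Y|\bm{X}}$ yields several functionals at once.

```python
import numpy as np

def sample_y_given_x(x, size, rng):
    """Hypothetical stand-in for a fitted conditional generator of Y | X = x."""
    # Toy conditional law: Y | x ~ N(sum(x), 0.5^2); a real generator would replace this.
    return rng.normal(loc=np.sum(x), scale=0.5, size=size)

def syn_slm_functionals(x, n_draws=20000, seed=0):
    """Monte Carlo plug-in estimates of several functionals of F_{Y|X=x}."""
    rng = np.random.default_rng(seed)
    y_tilde = sample_y_given_x(x, n_draws, rng)
    return {
        "mean": float(np.mean(y_tilde)),           # estimate of E(Y | x)
        "var": float(np.var(y_tilde, ddof=1)),     # estimate of Var(Y | x)
        "q10": float(np.quantile(y_tilde, 0.10)),  # conditional 10% quantile
        "q90": float(np.quantile(y_tilde, 0.90)),  # conditional 90% quantile
    }

est = syn_slm_functionals(np.array([0.2, 0.3, 0.5]))
```

In this toy setting the true conditional mean is 1.0 and the true conditional variance is 0.25, so the estimates can be checked directly.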

Table 1: Comparison between Syn-Boost and Syn-Slm in supervised learning.
| | Syn-Boost | Syn-Slm |
| --- | --- | --- |
| Generative modeling | Joint distribution of $(Y,\bm{X})$ | Conditional distribution of $Y\mid\bm{X}$ |
| Estimation of $\phi(F_{Y\mid\bm{X}})$ | Predictive modeling on $(\tilde{Y},\tilde{\bm{X}})$ | MC computation with $\tilde{Y}\mid\bm{X}$ |
| Pros | Generative modeling is easy; prediction is fast. | Fully non-parametric; estimate different $\phi$ once and for all. |
| Cons | Additional prediction error; needs to tune synthetic size. | Generative modeling is challenging; prediction relies on MC simulation. |

2.5 Generative Model and Knowledge Transfer

Knowledge transfer elevates generation accuracy by infusing task-specific generative models with pre-trained knowledge from relevant studies. From the perspective of dimension reduction, we dissect knowledge transfer in two scenarios. In the first situation, consider a generative model $g_{\bm{\theta}}$ parametrized by $\bm{\theta}$. Originally trained on an extensive dataset for a generation task, the model undergoes subsequent fine-tuning on a smaller but similar dataset to account for distribution shift, resulting in model $g_{\bm{\theta}^{\prime}}$. The architecture remains the same across both models, and knowledge is transmitted through the transition from $\bm{\theta}$ to $\bm{\theta}^{\prime}$ during fine-tuning. In the other situation, a robust pre-trained model undergoes training across multiple tasks, characterized as $(f_{1},\ldots,f_{t})\circ h$. Here, $f_{i}$ defines the output function tied to the $i$-th task, $h$ is the shared representation function, and $\circ$ denotes functional composition. Given a learned representation $h$, one fine-tunes only the output function $f_{0}$ of the new task during its optimization phase (Tripuraneni, Jordan and Jin, 2020). As the generative model refines its precursor, it absorbs the precisely calibrated representation $h$. This knowledge transfer can thus improve generation precision: with $h$ fixed, fine-tuning reduces the dimension of the problem and yields a more accurate estimate of $f_{0}$. Note that within this configuration, $f_{0}\circ h$ and $f_{i}\circ h$, $i=1,\ldots,t$, share the same architecture only in $h$. An alternate strategy entails concurrent fine-tuning of $f_{0}$ and $h$ to derive a representation explicitly for $f_{0}$.
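The second scenario can be sketched in a few lines, under strong simplifying assumptions: the shared representation $h$ is a fixed random feature map standing in for a pre-trained network, the new task's output function $f_{0}$ is linear, and only $f_{0}$ is fitted. All names and dimensions below are hypothetical and serve only to show the dimension reduction from $d$ input coordinates to $k$ head parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 3                                  # input dimension and representation dimension
W = rng.normal(size=(k, d)) / np.sqrt(d)      # hypothetical pre-trained weights of h (kept frozen)

def h(X):
    """Shared representation, assumed to be learned on the pre-training tasks."""
    return np.tanh(X @ W.T)

# Small target-task sample whose outcome depends on X only through h(X).
n = 50
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.5])
y = h(X) @ beta_true + 0.1 * rng.normal(size=n)

# Fine-tune only the head f_0: k parameters are estimated instead of d.
H = h(X)
beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
```

With $h$ frozen, 50 observations suffice to recover the three head coefficients closely, whereas fitting all $d=20$ input coordinates from scratch would be a harder problem at this sample size.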

The Syn framework capitalizes on knowledge transfer to bolster its overall efficacy, streamlining the synthetic data generation process. For a visual representation of how a pre-trained generative model is fine-tuned on a specific dataset to facilitate knowledge transfer, refer to Figure 1. In Section 3.3, we illustrate knowledge transfer in generative models using a pre-trained model based on adult male data (Kohavi et al., 1996), subsequently fine-tuned with adult female data for downstream analysis. As demonstrated in Figures 3, 4, and Table 3, the fine-tuned model adeptly captures the data distribution, even with a limited size of raw samples.

3 Case Studies

3.1 Sentiment Analysis

This subsection presents sentiment classification applied to the benchmark IMDB dataset (Maas et al., 2011). The task is to classify the opinion expressed in each review as positive or negative. The dataset comprises 50,000 polarized movie reviews labeled “positive” or “negative”. Negative and positive labels correspond to movie scores below four and above seven out of ten, respectively, and no movie has more than 30 reviews to prevent significant class imbalance. Here, we use 49,000 of these reviews as our training data, reserving 1,000 reviews for testing. We compare Syn’s generative approach against its conventional counterpart in a downstream prediction task, utilizing three models: GPT-3.5, DistilBERT, and LSTM (Chen et al., 2023).

GPT-3.5 functions primarily as a text completion model, predicting the succeeding token as the sentiment label. Although essentially a completion model, GPT-3.5 is a conditional generative model that aligns with Syn-Slm, which we adapt for predictive tasks. In contrast, DistilBERT generates a fixed-size embedding of a review, which is then passed to an appended classification head to deduce sentiment likelihood. Unlike GPT-3.5’s token generation approach, DistilBERT’s technique aligns more closely with traditional predictive modeling with transfer learning for supervised tasks. Additionally, we train a traditional LSTM model from scratch, eschewing prior knowledge. Like DistilBERT, the LSTM processes an embedding and feeds it into a classification head, rendering it a predictive model.

Table 2 compares GPT-3.5, DistilBERT, and LSTM on six performance metrics. The extensive collection of pre-trained models likely contributes to GPT-3.5’s outstanding performance. On the other hand, LSTM’s underperformance likely stems from its inability to transfer knowledge, which suggests that knowledge transfer plays a crucial role in model performance. For details on training configurations, refer to the Supplementary Materials.

Table 2: Performance comparison of fine-tuned GPT-3.5 (Syn-classification with knowledge transfer), DistilBERT (Classification with knowledge transfer), and traditional LSTM (Classification without knowledge transfer) on IMDB sentiment analysis. Metrics evaluated include Accuracy, Precision, Recall, Area Under ROC, Area Under PRC, and F1-Score, all assessed on an independent test set.
| Model | Training approach | Accuracy | Precision | Recall | AUROC | AUPRC | F1-score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-3.5 | Fine-tuning, generative model | 0.970 | 0.967 | 0.975 | 0.991 | 0.989 | 0.971 |
| DistilBERT | Fine-tuning, predictive model | 0.939 | 0.930 | 0.954 | 0.985 | 0.984 | 0.942 |
| LSTM | Training, predictive model | 0.854 | 0.885 | 0.819 | 0.939 | 0.943 | 0.851 |

3.2 Prediction for Structured Data

This subsection investigates the generational effect and the challenges associated with enhancing the precision of gradient boosting for regression and classification (Breiman, 1997; Friedman, 2002). It also focuses on the implications for the quality of synthetic data generation. Within the Syn framework, we designate the boosting method tailored for synthetic data as Syn-Boost. Despite the surge of diverse predictive models, the capabilities of the Syn framework in predictive modeling remain largely untapped. To highlight this potential, we contrast Syn-Boost with two traditional supervised counterparts: CatBoost (Dorogush, Ershov and Gulin, 2018), a boosting algorithm, and FNN, a fully connected neural network that leverages insights from a pre-trained model. Syn-Boost presents a strategy to harness knowledge transfer in boosting, effectively addressing the transfer learning challenge inherent to the boosting method.

3.2.1 Real-Benchmark Examples

To closely emulate real-world scenarios, we investigate situations where available pre-trained models have incorporated insights from relevant studies. For this study, we utilize five classification and three regression benchmark datasets (Kotelnikov et al., 2023), each encompassing three subsets: pre-training, raw, and test data. A detailed description of these datasets can be found in Table 6.

In the Syn-Boost framework, we utilize a tabular diffusion model, TDM (Kotelnikov et al., 2023), to generate synthetic data of mixed types that closely match the distribution of the original data. TDM employs multinomial and Gaussian diffusion processes to simulate categorical and continuous attributes, respectively. The procedure starts by training a TDM model on pre-training data and then fine-tuning it with raw data. Subsequently, we apply CatBoost to the synthetic data of size $m$ created by TDM for classification and regression tasks. To identify the best $m$ for Syn-Boost’s synthetic data, we evaluate the error as a function of the synthetic-to-raw data ratio, ranging from $1$ to $30$ with a step size of $1$.
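The tuning of the synthetic size can be sketched as a simple loop over ratios. In the sketch below, `generate_synthetic` is a hypothetical stand-in for a fine-tuned generator (its small constant shift mimics generation error), and a polynomial least-squares fit stands in for CatBoost; neither is the paper's actual component, and the toy data-generating process is an assumption made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_synthetic(m, rng):
    """Hypothetical fine-tuned generator; the +0.05 shift mimics generation error."""
    x = rng.uniform(0, 1, size=m)
    y = np.sin(2 * np.pi * x) + 0.05 + 0.3 * rng.normal(size=m)
    return x, y

def fit_predict(x_train, y_train, x_new, degree=5):
    """Stand-in learner (polynomial least squares) in place of CatBoost."""
    coef = np.polyfit(x_train, y_train, degree)
    return np.polyval(coef, x_new)

# Held-out raw validation sample from the true distribution.
n = 50                                   # raw sample size
x_val = rng.uniform(0, 1, size=200)
y_val = np.sin(2 * np.pi * x_val) + 0.3 * rng.normal(size=200)

ratios = range(1, 31)                    # synthetic-to-raw ratios m/n = 1, ..., 30
errors = []
for r in ratios:
    x_syn, y_syn = generate_synthetic(r * n, rng)
    pred = fit_predict(x_syn, y_syn, x_val)
    errors.append(float(np.sqrt(np.mean((pred - y_val) ** 2))))

best_ratio = list(ratios)[int(np.argmin(errors))]   # estimated ratio m_hat / n
```

The selected `best_ratio` plays the role of the tuned synthetic-to-raw ratio; in practice the validation error curve would be smoothed before taking the minimum.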

For FNN, we engage in transfer learning, starting with pre-existing data and later fine-tuning with raw data. This technique is consistent with TDM’s training approach, ensuring that both models have a harmonized foundation for effective knowledge transfer.

Knowledge Transfer from Identical Distributions. For each dataset mentioned, we utilize pre-training data to train pre-trained models, enabling the downstream knowledge transfer on raw data with either Syn-Boost or FNN. We use the test data solely for evaluating performance. It is important to note that in this study, the pre-training and raw data used for fine-tuning share the same distribution.

Figure 2: Comparative error analysis of CatBoost, Syn-Boost, and FNN, with Syn-Boost and FNN applying transfer learning with the same distributions across eight benchmarks (Kotelnikov et al., 2023), measured at various synthetic-to-raw data ratios. The stars indicate the size of the pre-training data used to obtain pre-trained models and the tuned sample size for Syn-Boost. The performance for classification and regression tasks is measured by misclassification rate and RMSE, respectively. Point-wise standard errors, derived from smoothing spline models (Hastie et al., 2009), are also depicted to illustrate the variation in error.

Figure 2 underscores the contribution of Syn in bolstering CatBoost’s efficacy in classification and regression. While there is a minor boost in “FB comments”, the extent of improvement via Syn-Boost is diverse. Classification and regression enhancements span 0.6% to 17.4% and 0.03% to 12.3%, respectively, against CatBoost’s baseline. The magnitude of these enhancements varies by scenario. For instance, datasets like “Gesture” and “House”, which have a larger predictor count, show more significant leaps. In contrast, datasets like “Adult” and “California” with fewer predictors demonstrate modest gains.

Figure 2 also highlights the generational effect as the size of synthetic data increases. Accuracy gains plateau after reaching the estimated reflection point $\hat{m}$, an estimated optimal size of synthetic data. This point represents the peak of statistical accuracy and is consistently greater than raw sample sizes, often by at least five-fold. In scenarios like “Gesture”, “Adult”, “California”, “House”, and “Insurance”, Syn-Boost surpasses CatBoost even when raw and synthetic data sizes are equal ($m=n$). This observation suggests that the efficacy of Syn-Boost is also rooted in the increased sample size $m$, as evidenced by the “California” and “House” datasets that utilized an optimized synthetic-to-raw data ratio of 25:1 or higher. Typically, error curves form a U-shape around a moderate $\hat{m}$ but shift to an L-shape when $\hat{m}$ is exceptionally large.

In a supervised setting, Syn-Boost, which employs CatBoost on synthetic data, typically outperforms FNN, except on the “FB comments” and “Abalone” datasets. When the generation errors are modest, the improvement of Syn-Boost over FNN ranges from 11.1% to 14.6% in classification and from 7.2% to 16.3% in regression. The reduced performance on the “FB comments” and “Abalone” datasets, with a decline ranging from 1.4% to 6.6%, is chiefly attributed to non-negligible generation errors from the pre-trained TDM (Kotelnikov et al., 2023). In both scenarios, Syn-Boost may not surpass CatBoost when $m=n$. A similar phenomenon also occurs for other models including decision trees, random forests, and logistic regression without knowledge transfer (Kotelnikov et al., 2023). We speculate that the generation error in the “FB comments” dataset arises from the model architecture’s inability to handle large pre-training instances, while the “Abalone” dataset’s underperformance could be due to insufficient pre-training size. Notably, while FNN outperforms CatBoost in the “Abalone”, “FB comments”, and “Gesture” datasets, it lags in others. These findings highlight Syn-Boost as a strong competitor against well-established predictive models.

Knowledge Transfer Across Distinct Distributions. The Adult dataset (Kohavi et al., 1996), derived from the 1994 census, includes data on 32,650 males and 16,192 females, featuring six numerical and eight nominal attributes such as age, work class, education years, marital status, weekly working hours, and native country. Our goal is to predict whether an adult female’s income surpasses $50K annually using a generative model pre-trained on male data to facilitate knowledge transfer, capitalizing on the similarities between the two groups despite their differing distributions.

As Figure 3 depicts, pronounced differences across genders exist in the distributions of features like income, age, marital status, occupation, and relationship. A pertinent question arises: can adult male data augment the synthetic data generation and subsequent classification task on adult female data? To harness this knowledge transfer, we pre-train a TDM solely on the adult-male data with 32,650 observations. We utilize a subset of size 1,350 from the adult-female data as our raw data, and another independent subset of the same size as the test data for evaluation purposes.

Table 3: Distribution distance comparisons using FID scores and 1- and 2-Wasserstein distances between the actual female sample and various synthetic samples. The terms “raw”, “pre-trained”, and “fine-tuned” denote the raw data, synthetic data from a pre-trained model, and synthetic data from a fine-tuned pre-trained model, respectively. The evaluation sample size for “Female (raw)” is 1,350, while for “Male (raw)”, “Male (pre-trained)”, “Female (pre-trained)”, and “Female (fine-tuned)” it is 32,650.
| Comparison | FID (Gaussian) | 1-Wasserstein | 2-Wasserstein |
| --- | --- | --- | --- |
| Female (raw) vs Male (raw) | 1.971 | 1.968 | 2.125 |
| Female (raw) vs Male (pre-trained) | 2.051 | 1.967 | 2.127 |
| Female (raw) vs Female (pre-trained) | 0.457 | 1.292 | 1.536 |
| Female (raw) vs Female (fine-tuned) | 0.249 | 1.170 | 1.399 |

Figures 3 and 4 demonstrate that the synthetic female data produced by TDM aligns more closely with the adult-female dataset than with the adult-male dataset. This alignment is evidenced by the Fréchet Inception Distance (FID), which measures the distributional difference between the generated and raw data vectors under a Gaussian assumption. The 1- and 2-Wasserstein distances between the empirical distributions of the two datasets provide further evidence (note that FID is the squared 2-Wasserstein distance under the Gaussian assumption). Notably, as shown in Table 3, the male-focused pre-trained TDM, once fine-tuned with a smaller female dataset, produces synthetic female data with a smaller distributional distance than models trained solely on the adult-male data or solely on the adult-female data. This result indicates that leveraging pre-trained adult-male data with a somewhat distinct distribution can enhance the TDM’s generation precision for females. It also underscores the value of fine-tuning a pre-trained model to maximize knowledge transfer and improve generation accuracy.
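For reference, the Gaussian form of the Fréchet distance can be computed directly from sample means and covariances. The sketch below assumes the data have already been encoded as numeric feature vectors (the encoding used for the mixed-type Adult data is not specified here) and uses the closed form $\|\mu_a-\mu_b\|^2+\operatorname{Tr}(\Sigma_a+\Sigma_b-2(\Sigma_a\Sigma_b)^{1/2})$.

```python
import numpy as np

def frechet_distance_gaussian(a, b):
    """Frechet distance between Gaussians fitted to the rows of a and b.

    Equals ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^{1/2}).
    """
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    s_a = np.cov(a, rowvar=False)
    s_b = np.cov(b, rowvar=False)
    # Tr((S_a S_b)^{1/2}) is the sum of square roots of the eigenvalues of S_a S_b,
    # which are real and non-negative for covariance matrices (up to rounding).
    eig = np.linalg.eigvals(s_a @ s_b)
    tr_sqrt = np.sum(np.sqrt(np.clip(eig.real, 0.0, None)))
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(s_a) + np.trace(s_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
x = rng.normal(size=(5000, 4))
shift = np.array([1.0, 0.0, 0.0, 0.0])
d_same = frechet_distance_gaussian(x, x)            # identical samples
d_shift = frechet_distance_gaussian(x, x + shift)   # same covariance, mean shifted by 1
```

Two identical samples give a distance of zero, and a pure mean shift of length one gives a distance of one, which matches the squared form of the 2-Wasserstein distance between Gaussians.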

Figure 3: Marginal distributions of datasets categorized as female, synthetic female, and male are illustrated, with legends arranged from top to bottom. Normalized bar and kernel density plots represent categorical and numerical features, respectively.

The result of Syn-Boost prediction is illustrated in Figure 5. We observe that simply increasing the synthetic size does not necessarily improve the prediction performance, as depicted at the beginning of the Syn-Boost tuning. The optimal empirical ratio is obtained at $m/n=23$, which is about the same size as the pre-training adult-male data. Moreover, Syn-Boost achieves a $5.32\%$ improvement over CatBoost in misclassification error and an $11.00\%$ improvement over the FNN approach, indicating the potential of this method.

Both case studies highlight the effectiveness of the Syn framework in enhancing statistical accuracy through synthetic data generation. Syn’s robust performance primarily stems from the generative capabilities of diffusion models coupled with the application of knowledge transfer. These elements enhance the generative model’s generation accuracy by accurately estimating the distribution of raw data over low-dimensional manifolds (Oko, Akiyama and Suzuki, 2023). However, it is crucial to recognize that the Syn framework’s success also hinges on thoughtful modeling and predictor selection. The case studies also explain the results of Kotelnikov et al. (2023) regarding the potential pitfalls in a prediction task when employing synthetic data to train machine learning models, a concern mentioned in the Introduction. Low-fidelity synthetic data, resulting from substantial generation errors, can negatively impact statistical accuracy. The study suggests that employing knowledge transfer from relevant studies is a strategy to reduce generation errors.

Figure 4: Pairwise correlation plots between raw and synthetic female datasets, compared with those between raw female and male datasets, accompanied by their differences. Dark cells in the difference plots signify pronounced deviations from the female distribution. Pearson’s correlation, Correlation Ratio, and Theil’s U measure continuous-continuous, categorical-continuous, and categorical-categorical correlations, respectively.

Figure 5: Comparative error analysis of CatBoost, Syn-Boost, and FNN, with Syn-Boost and FNN utilizing transfer learning with distinct distributions on the Adult-Female dataset (Kohavi et al., 1996), with Adult-Male data serving as pre-training data, across various synthetic-to-raw data ratios. Stars indicate the pre-training data size and the tuned sample size for Syn-Boost. The vertical bars, calculated using smoothing spline models (Hastie et al., 2009), represent the pointwise standard error.

3.2.2 Simulation

To investigate how generation errors impact the efficacy of Syn-Boost, we conducted simulations with access to ground truth data. We consider a regression model:

$$\bm{Y}=8+\bm{X}_{1}^{2}+\bm{X}_{2}\bm{X}_{3}+\cos(\bm{X}_{4})+\exp(\bm{X}_{5}\bm{X}_{6})+0.1\bm{X}_{7}+\bm{\epsilon}, \tag{4}$$

where $\bm{X}=(\bm{X}_{1},\ldots,\bm{X}_{7})$ is uniformly distributed over $[0,1]^{7}$ and $\bm{\epsilon}$ follows a normal distribution with mean zero and standard deviation $0.2$, that is, $N(0,0.2^{2})$. From (4), we generate a dataset of 700 samples, dividing it into 500 for training and 200 for validation. To demonstrate the impact of effective versus ineffective generators on downstream tasks, we pre-train tabular diffusion models (Kotelnikov et al., 2023) with two pre-training sizes, $1000$ and $5000$. It is noteworthy that pre-trained models typically use considerably larger training sizes. To evaluate the distribution discrepancy between raw and synthetic samples, we employ the 2-Wasserstein distance, determined by solving an optimal transport problem using appropriate metrics (see Footnote 2), as reported in Table 4.

Footnote 2: https://pythonot.github.io/quickstart.html#computing-wasserstein-distance
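The data-generating process in (4) is straightforward to reproduce. The sketch below draws the 700 samples and applies the 500/200 split described above; the function name, the seed, and the choice to split by row order are assumptions made for illustration.

```python
import numpy as np

def simulate_model_4(n, seed=0):
    """Draw n samples (X, Y) from regression model (4); also return the noiseless signal."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 7))
    eps = rng.normal(0.0, 0.2, size=n)
    signal = (8 + X[:, 0] ** 2 + X[:, 1] * X[:, 2] + np.cos(X[:, 3])
              + np.exp(X[:, 4] * X[:, 5]) + 0.1 * X[:, 6])
    return X, signal + eps, signal

X, y, signal = simulate_model_4(700)
X_train, y_train = X[:500], y[:500]     # 500 training samples
X_val, y_val = X[500:], y[500:]         # 200 validation samples

# RMSE of the oracle predictor E(Y | X); it approximates the Bayes RMSE of 0.2.
oracle_rmse = float(np.sqrt(np.mean((y - signal) ** 2)))
```

The oracle RMSE gives the floor against which the RMSEs of CatBoost and Syn-Boost are compared.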

We evaluate Syn-Boost’s root mean square error (RMSE) on the prediction performance of synthetic data generated from a pre-trained model, both with and without fine-tuning on raw training data. These scenarios represent the outcomes with ineffective and effective generators, respectively. For comparative purposes, we also assess the RMSE of CatBoost, trained on raw data, and provide the square root of Bayes error, which is 0.2 by design.

As depicted in Figure 6, Syn-Boost attains an RMSE closer to the Bayes error when employing an effective pre-trained generator, thus outperforming CatBoost trained on raw data. In contrast, with an ineffective pre-trained generator, the RMSE of Syn-Boost is similar to the CatBoost error, far from the Bayes error. In practice, we recommend fine-tuning a pre-trained model rather than using it directly. As illustrated in Figure 6, fine-tuning pre-trained generators can further improve the performance of Syn-Boost. Table 4 supports the observed generation effects, which supplements Figure 6.

Table 4: The impact of generation error on Syn-Boost’s RMSE in regression. The generation error is quantified using the 2-Wasserstein distance, comparing a synthetic sample of size 50,000 with an independent sample of the same size from the raw distribution in (4). The terms “CatBoost”, “Syn-Boost”, and “Bayes” represent the RMSE for CatBoost applied to raw data, CatBoost applied to synthetic data, and the Bayes error (in square root), respectively. “Syn-Raw ratio” indicates the ratio of synthetic to raw data. The CatBoost and Bayes values do not depend on the generator and are shared across rows.
| Pre-training size | Fine-tuning | 2-Wasserstein | Syn-Raw ratio | Syn-Boost | CatBoost | Bayes |
| --- | --- | --- | --- | --- | --- | --- |
| 1000 | No | 0.340 | 7 | 0.220 | 0.236 | 0.200 |
| 1000 | Yes | 0.259 | 28 | 0.211 | 0.236 | 0.200 |
| 5000 | No | 0.241 | 19 | 0.206 | 0.236 | 0.200 |
| 5000 | Yes | 0.233 | 27 | 0.203 | 0.236 | 0.200 |

Figure 6: RMSEs of Syn-Boost as a function of the synthetic-to-raw data ratio in simulations described in (4). It contrasts scenarios with large (pre-training size of 1,000) and small (pre-training size of 5,000) generation errors, as well as the generation with (dotted lines) and without (solid lines) fine-tuning on raw data. The grey dashed lines denote the RMSE of CatBoost, while the black solid lines indicate the square root of the Bayes error.

3.3 Feature Relevance for Tabular Data

This subsection concerns testing the relevance of features for predicting the outcome of the response variable $Y$ by a machine learner using a candidate feature vector $\bm{X}$.

Define the subvector $\bm{X}_{S}=(X_{j}:j\in S)$, where $S$ is a subset of the features. Our objective is significance testing of $\bm{X}_{S}$ in its functional relevance to $Y$. To assess the influence of $\bm{X}_{S}$, we use the differenced risk $R(f^{*})-R(f_{S^{c}}^{*})$. Here, $f_{S^{c}}=f(\bm{X}_{S^{c}})$, and $f^{*}$ and $f^{*}_{S^{c}}$ are the optimal prediction functions in the population, defined as $f^{*}=\operatorname*{\arg\min}_{f}R(f)$ and $f^{*}_{S^{c}}=\operatorname*{\arg\min}_{f_{S^{c}}}R(f_{S^{c}})$. The risks $R(f)$ and $R(f_{S^{c}})$ are given by $R(f)=\operatorname{E}\big(L(f(\bm{X}),Y)\big)$ and $R(f_{S^{c}})=\operatorname{E}\big(L(f(\bm{X}_{S^{c}}),Y)\big)$, where $\operatorname{E}$ represents the expectation and $L$ denotes the loss function. We now introduce the null and alternative hypotheses $H_{0}$ and $H_{a}$:

$$H_{0}:R(f^{*})-R(f^{*}_{S^{c}})=0,\ \text{versus}\ H_{a}:R(f^{*})-R(f^{*}_{S^{c}})<0. \tag{5}$$

Rejecting $H_{0}$ at a given significance level implies feature relevance of $\bm{X}_{S}$ for predicting $Y$. It is worth mentioning that we target the population-level functions $f^{*}$ and $f^{*}_{S^{c}}$ in (5).

For the hypotheses in (5), Dai, Shen and Pan (2024) developed an asymptotic test tailored to black-box learning models. Building upon this foundation, we illustrate how Syn-Test can bolster the power of this traditional test on raw samples by enlarging the synthetic data size while circumventing the necessity to derive the asymptotic distribution of a test statistic.

For Syn-Test, we adhere to Steps 1-4, as delineated in Section 2.3, to examine the relevance of feature set $\bm{X}_{S}$ to outcome $Y$, employing CatBoost (Dorogush, Ershov and Gulin, 2018) as the learning algorithm. Here, we engage a diffusion model, TDM (Kotelnikov et al., 2023), to generate synthetic data. Initially, we adapt the original test statistic from Dai, Shen and Pan (2024) to suit synthetic data as follows:

$$T=\frac{R_{m}(\tilde{f}_{S^{c}})-R_{m}(\tilde{f})}{SE(R_{m}(\tilde{f}_{S^{c}})-R_{m}(\tilde{f}))}, \tag{6}$$

where $R_{m}(\cdot)$ denotes the empirical risk, evaluated on an inference sample $\tilde{\bm{Z}}_{1}^{(m,d)}$ in Step 1 and $\tilde{\bm{Z}}_{1}^{(m)}$ with $m=\hat{m}$ in Step 3 of Syn-Test, $\tilde{f}$ and $\tilde{f}_{S^{c}}$ denote the estimated predictive functions with and without $\bm{X}_{S}$, and $SE$ denotes the standard error. Note that a large value of $T$ indicates the relevance of the tested features to the response.
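The statistic in (6) can be computed from per-observation losses of the two fitted models on the inference sample. The sketch below assumes a paired form of the standard error, namely the sample standard deviation of the loss differences divided by $\sqrt{m}$; this specific form is an assumption for illustration and may differ from the estimator used by Dai, Shen and Pan (2024).

```python
import numpy as np

def risk_difference_statistic(loss_null, loss_full):
    """T = mean(loss_null - loss_full) / SE, with SE = sd(differences) / sqrt(m)."""
    diff = np.asarray(loss_null, dtype=float) - np.asarray(loss_full, dtype=float)
    m = diff.size
    se = diff.std(ddof=1) / np.sqrt(m)   # assumed paired standard error
    return float(diff.mean() / se)

# Toy per-observation losses: the model without X_S (null) loses more than the full model.
loss_full = np.array([0.10, 0.20, 0.15, 0.05])
loss_null = np.array([0.30, 0.25, 0.35, 0.20])
T = risk_difference_statistic(loss_null, loss_full)
```

Here the mean loss difference is 0.15 and its standard error is about 0.0354, so $T$ is positive and large, pointing toward relevance of the dropped features.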

Based on (6), we reject $H_{0}$ if $T$ is large. To compute the test statistic values in Steps 1 and 3, we generate an additional synthetic sample of size $2N$ and split it evenly into two subsamples. Using the first subsample, we train a CatBoost model $\operatorname{E}(Y|\bm{X})=f(\bm{X})$ to predict $Y$ from all features $\bm{X}$, resulting in $\tilde{f}$. In parallel, we train another CatBoost model $\operatorname{E}(Y|\bm{X}_{S^{c}})=f(\bm{X}_{S^{c}})$ using features $\bm{X}_{S^{c}}$, yielding $\tilde{f}_{S^{c}}$. By employing the synthetic sample to compute the full and null predictive models $\tilde{f}$ and $\tilde{f}_{S^{c}}$, we can mitigate the intrinsic bias and asymptotic issues highlighted in Dai, Shen and Pan (2024) that stem from a limited inference size. This behavior is demonstrated in Figure 8 (middle row).

To refine a generative model under the null hypothesis, researchers often employ permutation by replacing redundant predictor vectors $\bm{X}_{S}$ with irrelevant values (Dai, Shen and Pan, 2024). However, this approach may not preserve the correlation structures between $\bm{X}_{S}$ and $\bm{X}_{S^{c}}$. Addressing this issue, we introduce a novel method that maintains these correlation structures while ensuring the feature irrelevance of $\bm{X}_{S}$ on $Y$ given the rest of the features. Our procedure involves two steps:

  1. We first train a predictive model to estimate $\operatorname{E}(\bm{X}_{S}|\bm{X}_{S^{c}})=g(\bm{X}_{S^{c}})$.

  2. We generate synthetic data tuples $(Y,\bm{X}_{S},\bm{X}_{S^{c}})$ using this model. Then, we modify these tuples by replacing $\bm{X}_{S}$ with the predicted values $g(\bm{X}_{S^{c}})$, resulting in new tuples $(Y,g(\bm{X}_{S^{c}}),\bm{X}_{S^{c}})$. This process ensures compliance with a specific subclass under the risk invariance of $H_{0}$ by creating conditionally independent tuples, as described by Dai, Shen and Pan (2024).

Consequently, we obtain modified tuples $\tilde{\bm{Z}}^{(m)}=(\tilde{Y}_{i},\tilde{\bm{X}}_{iS},\tilde{\bm{X}}_{iS^{c}})_{i=1}^{m}$, which adhere to the feature irrelevance hypothesis $H_{0}$.
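The two-step construction of null tuples can be sketched as follows. For simplicity, $g$ is taken to be a linear least-squares fit, which is an assumption made for illustration (any predictive model could be used), and the toy data stand in for generated synthetic tuples.

```python
import numpy as np

def make_null_tuples(y, X, S):
    """Replace X_S by a fitted g(X_{S^c}); here g is linear least squares (an assumption)."""
    S = list(S)
    Sc = [j for j in range(X.shape[1]) if j not in S]
    A = np.column_stack([np.ones(len(X)), X[:, Sc]])      # intercept + X_{S^c}
    coef, *_ = np.linalg.lstsq(A, X[:, S], rcond=None)    # step 1: estimate E(X_S | X_{S^c})
    X_null = X.copy()
    X_null[:, S] = A @ coef                               # step 2: plug in g(X_{S^c})
    return y, X_null

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 0] += 0.8 * X[:, 1]                 # feature 0 is correlated with feature 1
y = X[:, 0] + X[:, 2] + 0.1 * rng.normal(size=200)
y_null, X_null = make_null_tuples(y, X, S=[0])
corr_kept = float(np.corrcoef(X_null[:, 0], X[:, 1])[0, 1])
```

After the replacement, the tested column is a function of the remaining features only, so it carries no additional information about $Y$, while its dependence on the remaining features is retained.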

Finally, we designate an MC size of $D=1,000$ for estimating both the null distribution and the Type-I error in Step 1 of Syn-Test. The parameters are set as $\alpha=0.05$ and $\varepsilon=0.01$, and the optimal $\hat{m}$ is tuned over the ratios $m/n\in\{1,2,\ldots,20\}$.
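A rough sketch of how these settings might be used is given below. It assumes a one-sided test whose critical value is the upper-$\alpha$ quantile of $D$ Monte Carlo null statistics, and a selection rule that keeps the largest ratio whose empirical Type-I error stays within $\alpha+\varepsilon$. Both the rule and the Type-I error values in the example are illustrative assumptions, not results or the exact procedure of Section 2.3.

```python
import numpy as np

def mc_critical_value(null_stats, alpha=0.05):
    """Upper-alpha quantile of D Monte Carlo null statistics (one-sided test)."""
    return float(np.quantile(null_stats, 1.0 - alpha))

def empirical_type1(null_stats_eval, critical_value):
    """Share of independent null statistics that exceed the critical value."""
    return float(np.mean(np.asarray(null_stats_eval) > critical_value))

def select_ratio(type1_by_ratio, alpha=0.05, eps=0.01):
    """Largest ratio m/n whose empirical Type-I error stays within alpha + eps."""
    ok = [r for r, err in type1_by_ratio.items() if err <= alpha + eps]
    return max(ok) if ok else None

rng = np.random.default_rng(0)
D = 1000
crit = mc_critical_value(rng.normal(size=D))        # toy null: T ~ N(0, 1)
err = empirical_type1(rng.normal(size=D), crit)

# Made-up Type-I errors per ratio, for illustration only.
best = select_ratio({1: 0.048, 2: 0.052, 3: 0.059, 4: 0.072})
```

With these made-up values the rule returns a ratio of 3, since the error at ratio 4 exceeds $\alpha+\varepsilon=0.06$.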

3.3.1 Real-Benchmark Examples

Knowledge transfer profoundly impacts the behavior of synthetic data, affecting critical downstream tasks, including inference. To illuminate this relationship, we employ Syn-Test to assess feature relevance using the gradient boosting method, CatBoost (Dorogush, Ershov and Gulin, 2018). We explore this in a regression context with the California dataset (Pace and Barry, 1997) and a classification context using the Adult dataset (Kohavi et al., 1996), maintaining the experimental setup detailed in Section 3.2. Within these frameworks, we examine the influence of knowledge transfer on synthetic data generation. Concurrently, we evaluate the efficacy of Syn-Test in identical scenarios and those that are distinct yet closely related.

To contrast Syn-Test with its traditional counterpart, consider the significance test in (6). When the finite-sample null distribution of TT is unknown, the asymptotic distribution of the test statistic may require stringent assumptions (Dai, Shen and Pan, 2024). To circumvent this, we use synthetic samples generated from TDM to approximate the null distribution, as in (Liu, Shen and Shen, 2023). Contrary to that approach, we refrain from using data perturbation, thus eliminating the requirement to maintain the rank property for privacy protection.

Knowledge Transfer from Identical Distributions. Drawing from the 1990 U.S. census, the California dataset offers a glimpse into median house values through eight specific attributes. These include the longitude and latitude of the property, its median age, the total room count, bedroom count, block population, household count within the block, and the median household income.

To facilitate knowledge transfer, we initially pre-train a TDM using the pre-training split with 13,209 observations. For significance testing in (5), we adapt the one-split black-box test statistic in Dai, Shen and Pan (2024) with a training sample of size 6,605 and an inference sample of size 826. To perform the Syn-Test, we follow the splitting scheme illustrated in Figure 7. In detail, we divide the training sample equally into two subsets, $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$, for fine-tuning purposes. Additionally, we utilize $\mathcal{S}_{3}$ as a validation sample for both model training and fine-tuning. More details on the Syn-Test method can be found in Section 2.3.

Figure 7: Illustration of Syn-Test in comparison with traditional feature significance test for black-box models. For the traditional approach, one needs a training split to train the black box model independently (Dai, Shen and Pan, 2024), which we split evenly into two subsamples for fine-tuning. More details can be found in Section 2.3.

As depicted in Figure 8 (top left), the empirical Type-I error initially descends, then rises, ultimately surpassing the $\alpha=0.05$ level as the synthetic-to-raw size ratio $\hat{m}/n$ grows. The estimated maximum ratio $\hat{m}/n=6$ preserves the Type-I error control. In other words, we can augment the sample size to sixfold the raw sample size $n$. This observation aligns with the generational effect we observed in predictive modeling for classification and regression. Consequently, $\hat{m}=6n$ notably enhances the power of Syn-Test, as indicated in Figure 8 (bottom left), where the test statistic distribution shifts to the right, increasing the power to reject $H_{0}$ when it is false.

Knowledge Transfer Across Distinct Distributions. In the study of the Adult dataset (Kohavi et al., 1996), we conduct a significance test to assess the relevance of three features—age, education years, and weekly working hours—in predicting whether an adult female’s income exceeds $50K annually.

To enhance the test's power for females, we generate synthetic data for females through knowledge transfer from the male dataset. This methodology aligns with the approach in the latter part of Section 3.2.1. Here, we pre-train a TDM on a dataset comprising 32,650 adult male records. For our hypothesis testing in (5) focused on adult females, we use a training set of 2,700, split equally between $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$. Additionally, we use an inference sample of 300 that also serves as a validation set $\mathcal{S}_{3}$ in the Syn-Test. This approach, as outlined in Section 2.3 and illustrated in Figure 7, involves fine-tuning the pre-trained TDM with $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$ while utilizing $\mathcal{S}_{3}$ to prevent overfitting during model training and refinement. Table 3 and Figures 3 and 4 show the effectiveness of the fine-tuned TDM and address the importance of knowledge transfer for reducing generation error.

Figure 8: Evaluation of Syn-Test for feature relevance using CatBoost, with parameters $\alpha=0.05$, $\varepsilon=0.01$, and $D=1000$. Left: Analysis on the California dataset (Pace and Barry, 1997) using a sample of $n=826$, focusing on the feature MedInc for predicting MedHouseVal. Right: Study on the Adult dataset (Kohavi et al., 1996) with $n=300$, examining the features Age, Education-num, and Hours-per-week for classifying income over 50K. Top: Type-I error curves display Syn-Test errors (blue) and the threshold of $\alpha=0.05$ (red dashed). Middle: Null distributions, shown with five synthetic-to-raw data ratios $(1,5,10,15,20)$, suggest small bias in estimating $f^{*}$ and $f^{*}_{S^{c}}$. Bottom: Power analysis compares the Syn-Test statistic distribution (blue kernel, size $\hat{m}$) with the traditional test's statistic value (red cross, size $n$) under the alternative hypothesis $H_{a}$, highlighting greater power.

Figure 8 (right column) highlights two key observations. First, the top panel demonstrates that $\hat{m}=6n$ consistently controls the Type-I error estimated from Syn-Test. Second, the bottom panel exhibits a pronounced rightward shift in the test statistic distribution computed from synthetic data relative to the statistic derived from raw data. This shift indicates an enhancement in the test's power, directly attributable to the augmentation of the sample size.

3.3.2 Simulation

In this section, we evaluate the effectiveness of Syn-Test in controlling Type-I errors through simulation studies. We use the model in (4) with a modification: an additional feature, $\bm{X}_{8}$, is included but does not contribute to the model. Here, $\bm{X}=(\bm{X}_{1},\ldots,\bm{X}_{7},\bm{X}_{8})$ is distributed uniformly on $[0,1]^{8}$, and $\bm{\epsilon}$ follows a normal distribution $N(0,0.2^{2})$. We assess the relevance of the feature $\bm{X}_{8}$ in prediction, thereby examining how well Syn-Test controls the Type-I error. We use the one-split test statistic proposed by Dai, Shen and Pan (2024) in Syn-Test, although we do not rely on their asymptotic theory for the test statistic.

For our experiments, we split a raw sample into a training set and an inference set with 1000 and 200 samples, respectively. Additionally, we use a distinct pre-training sample of 10000 to train the TDM on $(\bm{Y},\bm{X})$. This pre-trained model is subsequently fine-tuned on the raw training set according to the Syn-Test procedure. The inference sample, consisting of 200 instances, is utilized to validate the training of $\tilde{f}$ and $\tilde{f}_{S^{c}}$ for the test statistic in (6).

Moreover, we employ a Monte Carlo (MC) size of $D=1000$ with parameters $\alpha=0.05$ and $\epsilon=0.01$, and explore synthetic-to-raw ratios in $\{1,2,\ldots,20\}$ to tune the synthetic size $m$.

As shown in Figure 9, the tuning curve of Syn-Test with synthetic data demonstrates a comparable performance to that of the same test when using independent raw data, particularly in terms of generational effect, while exhibiting similar patterns of variation. This finding indicates that Syn-Test effectively controls the Type-I error with synthetic data, aligning with observations from raw data. Finally, we adopt the conservative approach in selecting $m$, choosing $\hat{m}/n=18$, in accordance with the guidelines in Section 2.3.
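The tuning step can be illustrated with a minimal sketch. It assumes the selection rule is "the largest candidate ratio whose empirical Type-I error stays within $\alpha+\epsilon$"; the actual rule is the one given in Section 2.3, and the function name `select_ratio` and the error values below are hypothetical, not results from the paper.

```python
def select_ratio(type1_errors, alpha=0.05, eps=0.01):
    """Return the largest synthetic-to-raw ratio whose empirical Type-I
    error does not exceed alpha + eps, or None if no ratio qualifies.

    type1_errors maps each candidate ratio m/n to its empirical Type-I error.
    """
    admissible = [r for r, err in type1_errors.items() if err <= alpha + eps]
    return max(admissible) if admissible else None


# Hypothetical tuning curve (illustrative numbers only).
errors = {1: 0.058, 2: 0.049, 4: 0.041, 6: 0.052, 8: 0.067, 10: 0.081}
best_ratio = select_ratio(errors)
print("selected ratio m/n:", best_ratio)
```

A stricter variant could also require every smaller ratio to be admissible, which would guard against a noisy dip in the tuning curve.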

Figure 9: Tuning curves of Syn-Test using synthetic data versus the same test using independent raw data for selecting the synthetic size $\hat{m}$ in testing the null feature $\bm{X}_{8}$ with $\alpha=0.05$ and $\epsilon=0.01$.

Figure 10: Comparative distributions of the test statistic (left) and P-values (right) under $H_{0}$ for testing $\bm{X}_{8}$ using Syn-Test with synthetic data versus raw data. Note that the P-value distributions approximate the uniform distribution $U[0,1]$, apart from the point mass at $1$ due to Hommel's combination technique.

As illustrated in Figure 10, the estimated null distribution curve derived from synthetic data with $\hat{m}/n=18$ closely resembles that based on independent raw data, albeit with a slight shift. This observation suggests minor generation errors by our generators. Furthermore, the distribution of P-values from Syn-Test using synthetic data under the null hypothesis $H_{0}$ with $\hat{m}/n=18$ aligns well with that based on raw data. These plots demonstrate the effectiveness of Syn-Test in controlling Type-I error.

4 Data Privacy

The Syn framework can address privacy concerns using synthetic data generated by generative models trained to mimic the distribution of raw data. Unlike raw data, synthetic data imposes fewer privacy risks. However, it’s not entirely immune to reverse engineering attacks. These vulnerabilities mainly stem from the model’s parameters, which attackers could potentially infer from the synthetic data. In real-world applications like healthcare and finance, where data sensitivity is paramount, ensuring robust privacy measures is crucial.

To bolster data privacy, one may consider a privacy protection standard known as $(\varepsilon,\delta)$-differential privacy (Dwork et al., 2006), recognized as a gold standard in data privacy. Notably, it was implemented in the 2020 U.S. decennial census, demonstrating its practical applicability and effectiveness. This approach effectively safeguards against various privacy threats, including reverse engineering, re-identification, and inference attacks.

The definition of $(\varepsilon,\delta)$-differential privacy considers an adjacent realization $\bm{z}^{\prime}$ that differs from a realization $\bm{z}$ of the original sample $\bm{Z}=(\bm{Z}_{i})_{i=1}^{n}$ in just one observation. It revolves around a privatization mechanism $\bm{M}$, which maps the original dataset $\bm{Z}=(\bm{Z}_{i})_{i=1}^{n}$ into a privatized version $\tilde{\bm{Z}}^{(m)}=(\tilde{\bm{Z}}_{i})_{i=1}^{m}$. The mechanism $\bm{M}$ is $(\varepsilon,\delta)$-differentially private (Dwork et al., 2006) if, for small $\varepsilon>0$ and $\delta\geq 0$ and any measurable set $B$,

$$P\big(\bm{M}(\bm{Z})\in B \mid \bm{Z}=\bm{z}\big)\leq e^{\varepsilon}\,P\big(\bm{M}(\bm{Z})\in B \mid \bm{Z}=\bm{z}^{\prime}\big)+\delta, \tag{7}$$

where $\varepsilon>0$ is the privacy budget, controlling the strength of privacy protection. Smaller values of $\varepsilon$ indicate stronger privacy, while $\delta$ is a small probability allowance for the privacy guarantee, acknowledging minimal inherent risk. This definition accommodates $\varepsilon$-differential privacy with $\delta=0$.
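To make inequality (7) concrete, the following sketch checks it exhaustively for a toy mechanism. It uses randomized response on a single private bit, a standard textbook mechanism that is not part of the Syn framework; the function names are hypothetical. The mechanism reports the true bit with probability $e^{\varepsilon}/(1+e^{\varepsilon})$, and the check enumerates every output set $B$ for both orderings of the two adjacent inputs.

```python
import math


def rr_output_probs(bit, eps):
    """Output distribution of randomized response on one private bit:
    the true bit is reported with probability e^eps / (1 + e^eps)."""
    p_truth = math.exp(eps) / (1.0 + math.exp(eps))
    return {bit: p_truth, 1 - bit: 1.0 - p_truth}


def satisfies_dp(mech_eps, claimed_eps, delta=0.0, tol=1e-12):
    """Check inequality (7) with budget (claimed_eps, delta) for randomized
    response run with parameter mech_eps, over all output sets B."""
    events = [set(), {0}, {1}, {0, 1}]
    for z, z_adj in [(0, 1), (1, 0)]:
        p = rr_output_probs(z, mech_eps)
        q = rr_output_probs(z_adj, mech_eps)
        for B in events:
            lhs = sum(p[b] for b in B)
            rhs = math.exp(claimed_eps) * sum(q[b] for b in B) + delta
            if lhs > rhs + tol:  # tol absorbs floating-point rounding
                return False
    return True


print(satisfies_dp(1.0, 1.0))             # mechanism meets its own budget
print(satisfies_dp(1.0, 0.5))             # a tighter budget is violated
print(satisfies_dp(1.0, 0.5, delta=0.3))  # a large enough delta restores (7)
```

The third call shows the role of $\delta$: a tighter $\varepsilon$ can still be certified when a larger additive slack is allowed.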

To generate synthetic data satisfying (7), we may employ techniques such as Differentially Private Stochastic Gradient Descent (DP-SGD). DP-SGD clips per-example gradients and injects calibrated Gaussian noise into the gradient updates during model training, ensuring that the model satisfies the desired privacy guarantees (Abadi et al., 2016). Differentially private diffusion models (Ghalebikesabi et al., 2023) are a case in point. Further exploration is warranted.
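A single DP-SGD update can be sketched as follows, in the spirit of Abadi et al. (2016): clip each per-example gradient, sum, add Gaussian noise, average, and take a gradient step. This is an illustrative sketch in plain Python, not the training code used for the TDMs; the function name and arguments are hypothetical, and the privacy accounting that converts the noise multiplier into an $(\varepsilon,\delta)$ guarantee is omitted.

```python
import math
import random


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update on a list of parameters.

    Each per-example gradient is rescaled to have L2 norm at most clip_norm,
    the clipped gradients are summed, independent Gaussian noise with standard
    deviation noise_multiplier * clip_norm is added to every coordinate, and
    the noisy sum is averaged over the batch before the descent step.
    """
    dim = len(params)
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(x * x for x in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for j in range(dim):
            total[j] += scale * grad[j]
    batch_size = len(per_example_grads)
    sigma = noise_multiplier * clip_norm
    noisy_mean = [(total[j] + rng.gauss(0.0, sigma)) / batch_size for j in range(dim)]
    return [params[j] - lr * noisy_mean[j] for j in range(dim)]


# Toy usage: the first gradient (norm 5) is clipped, the second (norm 0.5) is not.
rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]
no_noise = dp_sgd_step([0.0, 0.0], grads, lr=0.1, clip_norm=1.0,
                       noise_multiplier=0.0, rng=rng)
with_noise = dp_sgd_step([0.0, 0.0], grads, lr=0.1, clip_norm=1.0,
                         noise_multiplier=1.0, rng=rng)
print(no_noise, with_noise)
```

Clipping bounds the influence of any single observation on the update, which is what allows the added noise to be calibrated to a privacy budget.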

Unlike traditional differentially private samples, differentially private synthetic samples generated by these methods preserve the distributional characteristics of the original data. This preservation is advantageous as it can boost statistical analysis through sample size augmentation while enhancing privacy. However, this approach often requires more extensive model training due to the added noise, which can be a computational challenge.

5 Discussion: Future of Data Science

This article unveils the Syn paradigm—a novel approach to data analytics using high-fidelity synthetic data derived from real-world insights. By addressing challenges in traditional data analytics, such as data scarcity and privacy issues, this paradigm underscores that high-fidelity synthetic data can amplify the precision and efficiency of data analytics with sample size augmentation, as evidenced by the case studies herein. However, it is crucial to acknowledge the generational effect inherent to the Syn framework. Consequently, a fusion of raw and synthetic data is essential for unlocking their potential, offering fresh perspectives for both scientific and engineering domains.

In this study, although our primary focus is on tabular data—the predominant form in statistical analysis—we emphasize that the Syn framework’s utility extends across disparate types of data given available tools such as multimodal diffusion models (Xu et al., 2023). Its utility is reinforced by recent progress in generative modeling for vision, natural language, and multimodal data processing. This adaptability allows the Syn framework to perform comprehensive analyses of complex, multimodal data, such as electronic health records, which typically include a mix of standard tabular data, image scans, and unstructured text.

The importance of synthetic data in driving progress in data science and AI is increasingly recognized. However, a comprehensive evaluation of the Syn framework across a broad spectrum of applications is imperative. One key strategy for assessing the synthetic sample’s fidelity or reliability is comparing it with an independent validation sample that reflects the original data’s distribution. This assessment can be conducted through (i) exploratory analysis, like histograms and correlation matrices for visual inspection, and (ii) distributional distances, such as the Wasserstein distance, for a quantitative evaluation. Furthermore, in supervised learning tasks, the model’s performance can be scientifically validated using cross-validation metrics. Overall, validating generative models for more complex applications presents unique challenges, making the exploration of model robustness and data privacy critical areas for future research.
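As one concrete instance of the distributional check in (ii), the sketch below computes the empirical 1-Wasserstein distance between a single numerical feature of a synthetic sample and the same feature in a validation sample. It assumes equal sample sizes, in which case the distance reduces to the mean absolute difference of the order statistics; the data are simulated stand-ins, not from the paper. For unequal sizes, a library routine such as `scipy.stats.wasserstein_distance` could be used instead.

```python
import random


def wasserstein_1d(u, v):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples,
    computed as the mean absolute difference of their sorted values."""
    if len(u) != len(v):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)


rng = random.Random(0)
n = 2000
validation = [rng.gauss(0.0, 1.0) for _ in range(n)]  # stand-in for held-out raw data
faithful = [rng.gauss(0.0, 1.0) for _ in range(n)]    # generator matching the target
shifted = [rng.gauss(0.5, 1.0) for _ in range(n)]     # generator with a location bias

d_faithful = wasserstein_1d(faithful, validation)
d_shifted = wasserstein_1d(shifted, validation)
print(d_faithful, d_shifted)
```

Applying the same computation feature by feature gives a simple marginal fidelity report, although it does not capture dependence between features.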

The future of data science may pivot on our capability to harness raw and synthetic data. Large pre-trained generative models, equipped with extensive knowledge, offer a promising pathway. These models distill domain knowledge, as evidenced by the successes of Generative Pre-trained Transformers in text and imagery contexts. As the development of domain-centric generative models gains momentum, these models promise significant enhancements in synthetic data generation, paving the way for breakthroughs across a wide range of disciplines.

Appendix A Proofs

A.1 Proof of Theorem 2.1

Proof.

Let $\text{LTV}\doteq f(\text{TV}(\tilde{F},F))$ and $g(m)=C_{\bm{\theta}}m^{-\alpha}+m\,\text{LTV}$. The assumption that $\text{LTV}>\text{R}(\widehat{\bm{\theta}}(\tilde{\bm{Z}}^{(m_{0})}))/m^{*}$ implies that $g(m^{*})-C_{\bm{\theta}}(m^{*})^{-\alpha}>\text{R}(\widehat{\bm{\theta}}(\tilde{\bm{Z}}^{(m_{0})}))$. Moreover, note that $\text{Gr}^{(m)}\geq\text{Gr}^{(m^{*})}$ if $m>m^{*}$. Hence, when $m>m^{*}$,

$$
\begin{aligned}
\text{R}(\widehat{\bm{\theta}}(\tilde{\bm{Z}}^{(m)})) &\geq C_{\bm{\theta}}(m^{*})^{-\alpha}+\text{Gr}^{(m)}+C_{\bm{\theta}}(m^{-\alpha}-(m^{*})^{-\alpha}) \\
&\geq C_{\bm{\theta}}(m^{*})^{-\alpha}+\text{Gr}^{(m^{*})}+C_{\bm{\theta}}(m^{-\alpha}-(m^{*})^{-\alpha}) \\
&\geq g(m^{*})-C_{\bm{\theta}}(m^{*})^{-\alpha}>\text{R}(\widehat{\bm{\theta}}(\tilde{\bm{Z}}^{(m_{0})})).
\end{aligned}
$$

This completes the proof of (ii). Moreover, by the definition of $m_{0}$ and (ii), $m_{0}\leq m^{*}<+\infty$. This completes the proof of (i). ∎
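The function $g(m)=C_{\bm{\theta}}m^{-\alpha}+m\,\text{LTV}$ used in the proof above makes the generational effect easy to visualize numerically: the first term shrinks with $m$ while the second grows. The sketch below evaluates $g$ on a grid and locates its minimizer; the constants are hypothetical and chosen only to make the shape visible.

```python
def g(m, c_theta, alpha, ltv):
    """g(m) = C_theta * m^(-alpha) + m * LTV, as defined in the proof."""
    return c_theta * m ** (-alpha) + m * ltv


# Hypothetical constants (not estimated from any dataset in the paper).
c_theta, alpha, ltv = 1.0, 0.5, 1e-4
grid = range(1, 5001)
m_best = min(grid, key=lambda m: g(m, c_theta, alpha, ltv))

print("grid minimizer:", m_best)
print("g at m=1, minimizer, m=5000:",
      g(1, c_theta, alpha, ltv),
      g(m_best, c_theta, alpha, ltv),
      g(5000, c_theta, alpha, ltv))
```

Under these assumed constants the curve first falls and then rises, mirroring the decrease-then-increase pattern of the error described in the main text; a smaller LTV pushes the minimizer to larger $m$.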

A.2 Proof of Theorem 2.2

Proof.

By definition,

$$
\begin{aligned}
|\text{Gr}_{m}| &=\left|\int L(\widehat{\bm{\theta}}_{m}(\bm{t}),\bm{\theta})\,(dF(\bm{t})-d\tilde{F}(\bm{t}))\right| \\
&\leq 2U\,\text{TV}(F_{\bm{Z}^{(m)}},F_{\tilde{\bm{Z}}^{(m)}})\leq 2mU\,\text{TV}(\tilde{F},F),
\end{aligned}
$$

yielding (1). This, together with the assumption that $\text{R}(\widehat{\bm{\theta}}(\bm{Z}^{(m)}))=C_{\bm{\theta}}m^{-\alpha}$, yields that

$$\text{R}(\widehat{\bm{\theta}}(\tilde{\bm{Z}}^{(m_{0})}))\leq\text{R}(\widehat{\bm{\theta}}(\tilde{\bm{Z}}^{(s_{0})}))$$

when $s_{0}=\left(\frac{2U}{\alpha C_{\bm{\theta}}}\right)^{\frac{\alpha}{1+\alpha}}\cdot\text{TV}(\tilde{F},F)^{-\frac{1}{1+\alpha}}$, yielding the right-hand side of (1). This gives an optimal upper bound of $\text{R}(\widehat{\bm{\theta}}(\tilde{\bm{Z}}^{(m_{0})}))$ in (2). Moreover, (2) implies that $\text{R}(\widehat{\bm{\theta}}(\tilde{\bm{Z}}^{(m_{0})}))\leq\text{R}(\widehat{\bm{\theta}}(\bm{Z}^{(n)}))$ provided that $\text{TV}(\tilde{F},F)$ is sufficiently small; a sufficient condition is $\text{TV}(\tilde{F},F)\leq C_{\bm{\theta},\alpha,U}\cdot n^{-1-\alpha}$, where $C_{\bm{\theta},\alpha,U}=C_{\bm{\theta}}(2U)^{-1}\left(\alpha^{\frac{-\alpha}{1+\alpha}}+\alpha^{\frac{1}{1+\alpha}}\right)^{-\frac{1+\alpha}{\alpha}}$. This completes the proof. ∎

A.3 Proof of Theorem 2.3

Proof.

Let $\tilde{\bm{Z}}$ be a random vector following $\tilde{F}=F_{\tilde{\bm{Z}}\mid\bm{Z}^{(n)}}$ and $\bm{Z}$ be one following $F$. First, we bound the empirical distribution $F^{D}_{\tilde{T}}(x)=D^{-1}\sum_{d=1}^{D}X^{(m,d)}$ in (8), where $X^{(m,d)}=I(T(\tilde{\bm{Z}}^{(m,d)})\leq x)\in[0,1]$ for any $x\in\mathcal{R}$. Let $F_{\tilde{T}}(x)=\operatorname{E}_{\tilde{\bm{Z}}^{(m,1)}\mid\bm{Z}^{(n)}}X^{(m,1)}$. Note that $\tilde{\bm{Z}}^{(m,d)}$ is a conditionally independent sample of size $m$ given $\bm{Z}^{(n)}$ following $\tilde{F}=F_{\tilde{\bm{Z}}\mid\bm{Z}^{(n)}}$. By Hoeffding's Lemma, $\operatorname{E}_{\tilde{\bm{Z}}^{(m,d)}\mid\bm{Z}^{(n)}}\exp(s(X^{(m,d)}-\operatorname{E}_{\tilde{\bm{Z}}^{(m,d)}\mid\bm{Z}^{(n)}}X^{(m,d)}))\leq\exp(s^{2}/8)$ a.s. for any $s>0$, where $\operatorname{E}_{\tilde{\bm{Z}}^{(m,d)}\mid\bm{Z}^{(n)}}$ is the conditional expectation with respect to $\tilde{\bm{Z}}^{(m,d)}$ given $\bm{Z}^{(n)}$. By Markov's inequality and the conditional independence between $\tilde{\bm{Z}}^{(m,1)},\ldots,\tilde{\bm{Z}}^{(m,D)}$ given $\bm{Z}^{(n)}$, for any $t>0$ and $s=4t$,

$$
\begin{aligned}
&\Pr\Big(\left|F^{D}_{\tilde{T}}(x)-F_{\tilde{T}}(x)\right|\geq t\Big) \\
&\quad=\Pr\Big(\big|D^{-1}\sum_{d=1}^{D}(X^{(m,d)}-\operatorname{E}_{\tilde{\bm{Z}}^{(m,d)}\mid\bm{Z}^{(n)}}X^{(m,d)})\big|\geq t\Big) \\
&\quad\leq 2\exp(-sDt)\operatorname{E}_{\bm{Z}^{(n)}}\prod_{d=1}^{D}\operatorname{E}_{\tilde{\bm{Z}}^{(m,d)}\mid\bm{Z}^{(n)}}\big(\exp\big(s(X^{(m,d)}-\operatorname{E}_{\tilde{\bm{Z}}^{(m,d)}\mid\bm{Z}^{(n)}}X^{(m,d)})\big)\big) \\
&\quad\leq 2\exp(-sDt)\exp(Ds^{2}/8)\leq 2\exp(-2Dt^{2}),
\end{aligned}
$$

where $\Pr$ denotes the probability, taking into account all sources of randomness.

For any $\delta\in(0,1)$, by choosing $t=\sqrt{\frac{\log\frac{2}{\delta}}{2D}}$, we obtain that $\left|F^{D}_{\tilde{T}}(x)-F_{\tilde{T}}(x)\right|\leq\sqrt{\frac{\log\frac{2}{\delta}}{2D}}$ with probability at least $1-\delta$. On the other hand, $\left|F_{\tilde{T}}(x)-F_{T}(x)\right|\leq\text{TV}(\tilde{\bm{Z}}^{(m)},\bm{Z}^{(m)})$ given $\bm{Z}^{(n)}$ for any $x\in\mathcal{R}$, where $\bm{Z}^{(m)}$ is a sample of size $m$ from $F$. Using the union bound to combine these results, we obtain that

$$\sup_{\bm{x}}\left|F^{D}_{\tilde{T}}(\bm{x})-F_{T}(\bm{x})\right|\leq\sqrt{\frac{\log\frac{2}{\delta}}{2D}}+\text{TV}(\tilde{\bm{Z}}^{(m)},\bm{Z}^{(m)}), \tag{8}$$

with probability at least $1-\delta$. Note that $\text{TV}(\tilde{\bm{Z}}^{(m)},\bm{Z}^{(m)})=m\cdot\text{TV}(\tilde{F},F)$.

Consequently, as $m\cdot\text{TV}(\tilde{F},F)\rightarrow 0$ in probability and $D\rightarrow\infty$, $\sup_{\bm{x}}\left|F^{D}_{\tilde{T}}(\bm{x})-F_{T}(\bm{x})\right|\rightarrow 0$ in probability. Hence, Syn-Test controls the empirical Type-I error at the $\alpha$ level as if we were using the true null distribution.

Concerning the power gain, note that $\tilde{\phi}_{m,\alpha}-\phi_{n,\alpha}=\tilde{\phi}_{m,\alpha}-\phi_{m,\alpha}+\Delta$. By definition and assumption, $|\tilde{\phi}_{m,\alpha}-\phi_{m,\alpha}|\leq\text{TV}(\tilde{\bm{Z}}^{(m)},\bm{Z}^{(m)})=m\,\text{TV}(\tilde{F},F)<\Delta$, indicating that $\tilde{\phi}_{m,\alpha}-\phi_{n,\alpha}>0$. This completes the proof. ∎
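The Monte Carlo term $\sqrt{\log(2/\delta)/(2D)}$ in the proof above can be evaluated directly to see how the MC size $D$ drives the approximation error of the empirical null distribution at a fixed point. The sketch below computes the bound and inverts it for $D$; the function names are hypothetical, and the generation-error term $\text{TV}(\tilde{\bm{Z}}^{(m)},\bm{Z}^{(m)})$ is not included.

```python
import math


def mc_error_bound(D, delta):
    """Hoeffding bound t = sqrt(log(2/delta) / (2D)) on the pointwise gap
    between the empirical and exact distribution of the synthetic statistic,
    holding with probability at least 1 - delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * D))


def required_mc_size(t, delta):
    """Smallest integer D for which mc_error_bound(D, delta) <= t."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * t * t))


bound_at_1000 = mc_error_bound(1000, 0.05)  # D = 1000 as in the experiments
size_for_1pct = required_mc_size(0.01, 0.05)
print(bound_at_1000, size_for_1pct)
```

With $D=1000$ and $\delta=0.05$ the bound is roughly 0.043, so the Monte Carlo error is of the same order as $\alpha=0.05$ in the worst case; the bound is conservative, and the error decays at rate $D^{-1/2}$.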

Appendix B Additional Experiment

In this section, we conduct a simulation study on traditional tabular data to demonstrate the efficacy of the Syn-Slm method, as discussed in Section 2.4. Using the model in equation (4) as the basis for data generation, we create training data of 1,000 instances and test data of 500 instances. For the traditional analysis, we train a CatBoost regression model on the training dataset. For the Syn-Slm approach, a TDM pre-trained on pre-training data of size 2,000 is fine-tuned on the training set. This fine-tuned model is then used for conditional generation (Zhang et al., 2023b) to approximate $\operatorname{E}(\bm{Y}|\bm{X})$ via a Monte Carlo method. Specifically, for any given input $\bm{x}$, we generate a sample $(\tilde{\bm{Y}}^{(d)})_{d=1}^{D}$ from the conditional distribution $\tilde{\bm{Y}}|\bm{X}=\bm{x}$ as learned by the TDM, and compute the estimate as $D^{-1}\sum_{d=1}^{D}\tilde{\bm{Y}}^{(d)}$. A notable advantage of Syn-Slm is its flexibility in handling various supervised learning tasks simultaneously without needing to retrain for different loss functions. For instance, one may use $\tilde{\bm{Y}}|\bm{X}=\bm{x}$ to estimate prediction intervals (Liu, Shen and Shen, 2023), quantiles, or other statistical measures with relative ease. The performance of both methods is assessed on the test dataset using RMSE as the metric. As indicated in Table 5, the Syn-Slm approach outperforms the traditional CatBoost method at every noise level $\sigma$ considered, showcasing its superior predictive capability.
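The Monte Carlo step described above can be sketched as follows. The conditional sampler here is a hypothetical stand-in (a noisy linear function) for the fine-tuned TDM, and `conditional_summary` is an illustrative name; the point is only that one set of conditional draws yields the mean, the median, and a prediction interval without retraining.

```python
import random
import statistics


def conditional_summary(sample_y_given_x, x, D=1000, level=0.9):
    """Summarize Y | X = x from D conditional draws: Monte Carlo mean,
    median, and an equal-tailed interval from empirical order statistics."""
    draws = sorted(sample_y_given_x(x) for _ in range(D))
    lo_idx = int((1 - level) / 2 * (D - 1))
    hi_idx = int((1 + level) / 2 * (D - 1))
    return {
        "mean": statistics.fmean(draws),
        "median": statistics.median(draws),
        "interval": (draws[lo_idx], draws[hi_idx]),
    }


# Hypothetical sampler standing in for the TDM's conditional generation.
rng = random.Random(0)


def toy_sampler(x):
    return 2.0 * x + rng.gauss(0.0, 0.2)


summary = conditional_summary(toy_sampler, x=0.5)
print(summary)
```

Repeating the call with fresh draws and recording the spread of the resulting means is one way to obtain a standard error from repeated conditional generations.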

Table 5: RMSE of CatBoost and Syn-Slm across different noise magnitudes $\sigma\in\{0.1, 0.2, 0.5, 1.0\}$. Numbers in parentheses are standard errors from repeated conditional generations.

| Method | $\sigma=0.1$ | $\sigma=0.2$ | $\sigma=0.5$ | $\sigma=1.0$ |
|---|---|---|---|---|
| CatBoost | 0.1227 | 0.2245 | 0.5271 | 1.0437 |
| Syn-Slm | 0.1202 (0.0005) | 0.2171 (0.0008) | 0.5148 (0.0020) | 1.0161 (0.0038) |

Funding

This work was supported in part by NSF grant DMS-1952539 and NIH grants R01AG069895, R01AG065636, R01AG074858, and U01AG073079 (corresponding author: Xiaotong Shen).

Supplementary Material

The code to reproduce the results presented in this paper is available at https://github.com/yifei-liu-stat/syn.

Datasets. Information concerning datasets in Section 3: Case studies.

Table 6: Summaries of eight benchmark datasets (Kotelnikov et al., 2023).

| Dataset | FB comments | Adult | House | California | Churn2 | Gesture | Abalone | Insurance |
|---|---|---|---|---|---|---|---|---|
| Source | 2015 | 1996 | OpenML | 1997 | Kaggle | 2013 | OpenML | Kaggle |
| Sample size | 197080 | 48842 | 22784 | 20640 | 10000 | 9873 | 4177 | 1338 |
| Pre-training size | 157638 (80%) | 26048 (53%) | 14581 (64%) | 13209 (64%) | 6400 (64%) | 6318 (64%) | 2672 (64%) | 856 (64%) |
| Raw size | 19720 (10%) | 16281 (34%) | 4557 (20%) | 4128 (20%) | 2000 (20%) | 1975 (20%) | 836 (20%) | 268 (20%) |
| Test size | 19722 (10%) | 6513 (13%) | 3646 (16%) | 3303 (16%) | 1600 (16%) | 1580 (16%) | 669 (16%) | 214 (16%) |
| # Numerical features | 36 | 6 | 16 | 8 | 7 | 32 | 7 | 3 |
| # Nominal features | 15 | 8 | 0 | 0 | 4 | 0 | 1 | 3 |

Training Configurations for Sentiment Analysis. GPT-3.5 is fine-tuned using the text-embedding-Ada-002 configuration, following OpenAI's recommended practices (OpenAI GPT fine-tuning: https://platform.openai.com/docs/guides/fine-tuning). DistilBERT is fine-tuned with the distilbert-base-uncased model, available from HuggingFace (https://huggingface.co/distilbert-base-uncased), adopting a batch size of 16, a duration of 10 epochs, and the Adam optimizer with default decay parameters and a learning rate of $1\times 10^{-5}$. The LSTM setup is adapted from a Kaggle notebook (https://www.kaggle.com/code/pawan2905/imbd-sentiment-analysis-using-pytorch-lstm), utilizing the same model architecture and hyperparameters for our custom-split datasets.

Training Configurations for Syn-Boost and Syn-Test. In accordance with the guidelines of Kotelnikov et al. (2023), we process the data and design the model architecture and configurations to train our TDMs. Training is conducted on a single TITAN RTX GPU.

Prior to input into the diffusion model, tabular data undergoes a transformation process. Continuous variables are normalized through quantile transformation, while categorical variables receive one-hot encoding. These processed variables are then concatenated to form the diffusion model’s initial input vector.
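The transformation described above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' pipeline: the quantile transformation is taken to map ranks to standard normal scores (a Gaussian target is an assumption), ties are broken by position, and the helper names are hypothetical.

```python
from statistics import NormalDist


def quantile_normalize(values):
    """Rank-based quantile transformation of a numeric column to
    approximately standard normal scores (ties broken by position)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        out[i] = NormalDist().inv_cdf((rank + 0.5) / n)
    return out


def one_hot(values):
    """One-hot encode a categorical column, categories in sorted order."""
    cats = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in cats] for v in values]


def build_input(numeric_cols, categorical_cols):
    """Concatenate transformed numeric and one-hot columns row by row."""
    num = [quantile_normalize(col) for col in numeric_cols]
    cat = [one_hot(col) for col in categorical_cols]
    n = len(numeric_cols[0]) if numeric_cols else len(categorical_cols[0])
    rows = []
    for i in range(n):
        row = [col[i] for col in num]
        for col in cat:
            row.extend(col[i])
        rows.append(row)
    return rows


# Toy table with one numeric and one categorical feature.
ages = [23, 45, 31, 60]
sex = ["F", "M", "F", "M"]
rows = build_input([ages], [sex])
print(rows)
```

In practice the transformation would be fitted on the training split only and then applied to the other splits, and its inverse is needed to map generated vectors back to the original scale.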

For training the models in our examples, we employ an FNN as the backbone of the diffusion model, along with a cosine scheduler (Nichol and Dhariwal, 2021) and the AdamW optimizer (Loshchilov and Hutter, 2017) for optimization. In real-world scenarios involving eight benchmark datasets in Sections 3.2.1 and 3.3.1, we develop pre-trained models using the configurations in Table 7. Model fine-tuning follows the same architectural design but typically involves lighter optimization: a reduced learning rate and fewer training iterations. For the simulation examples in Sections 3.2.1 and 3.3.1, we derive pre-trained models using the same pre-training configuration as that on the California dataset but with 50,000 training iterations. Fine-tuning in these cases is carried out with 1,000 iterations and a learning rate of $3\times 10^{-6}$.
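Assuming the cosine scheduler refers to the cosine noise schedule of Nichol and Dhariwal (2021), it can be sketched as below. The offset $s=0.008$ and the clipping of $\beta_{t}$ at 0.999 follow that paper; whether the implementation used here adopts exactly these values is an assumption, and the function name is hypothetical.

```python
import math


def cosine_betas(T, s=0.008, max_beta=0.999):
    """Cosine noise schedule with T diffusion steps.

    alpha_bar(t) = f(t) / f(0), with f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2),
    and beta_t = 1 - alpha_bar(t) / alpha_bar(t-1), clipped at max_beta.
    """
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2

    alpha_bar = [f(t) / f(0) for t in range(T + 1)]
    return [min(1.0 - alpha_bar[t] / alpha_bar[t - 1], max_beta)
            for t in range(1, T + 1)]


betas_1000 = cosine_betas(1000)  # step count used for several datasets in Table 7
betas_100 = cosine_betas(100)
print(betas_1000[0], betas_1000[500], betas_1000[-1])
```

The schedule adds noise slowly at first and faster near the end; with fewer steps, each step carries a larger noise increment.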

Table 7: Model design and training details on eight real-world benchmark datasets. Adult-Male uses the same pre-training configuration as Adult does.

| Dataset | FNN dimensions | Diffusion timesteps | Iterations | Learning rate | Batch size |
|---|---|---|---|---|---|
| FB comments | (512, 1024) | 1000 | 30000 | $6.3\times 10^{-4}$ | 4096 |
| Adult | (256, 1024, 1024, 1024, 1024, 256) | 100 | 30000 | $2.0\times 10^{-3}$ | 4096 |
| House | (128, 512, 512, 512, 512, 256) | 1000 | 30000 | $1.4\times 10^{-3}$ | 4096 |
| California | (512, 256, 256, 256, 256, 128) | 1000 | 30000 | $1.3\times 10^{-4}$ | 4096 |
| Churn2 | (512, 1024, 1024, 1024, 1024, 512) | 100 | 30000 | $8.1\times 10^{-4}$ | 4096 |
| Gesture | (128, 512, 512, 1024) | 1000 | 30000 | $2.8\times 10^{-3}$ | 4096 |
| Abalone | (512, 128) | 100 | 30000 | $3.4\times 10^{-4}$ | 4096 |
| Insurance | (256, 512, 512, 512, 512, 256) | 100 | 30000 | $1.1\times 10^{-3}$ | 4096 |

References

  • Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K. and Zhang, L. (2016). Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security 308–318.
  • Akbar, M. U., Wang, W. and Eklund, A. (2023). Beware of Diffusion Models for Synthesizing Medical Images-a Comparison with Gans in Terms of Memorizing Brain MRI and Chest X-Ray Images. Available at SSRN 4611613.
  • Breiman, L. (1997). Arcing the edge. Technical Report, Citeseer.
  • Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A. et al. (2020). Language models are few-shot learners. Advances in neural information processing systems 33 1877–1901.
  • Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S. et al. (2023). Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.
  • Burden, R. and Faires, J. (2001). Numerical analysis 7th ed., brooks/cole, thomson learning.
  • Carroll, M. W. (2006). The movement for open access law. Law Library Journal 92 315.
  • Chen, X., Ye, J., Zu, C., Xu, N., Zheng, R., Peng, M., Zhou, J., Gui, T., Zhang, Q. and Huang, X. (2023). How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks. arXiv preprint arXiv:2303.00293.
  • Dai, B., Shen, X. and Pan, W. (2024). Significance tests of feature relevance for a black-box learner. IEEE Transactions on Neural Networks and Learning Systems 35 1898–1911.
  • Dinh, L., Krueger, D. and Bengio, Y. (2014). Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
  • Dinh, L., Sohl-Dickstein, J. and Bengio, S. (2016). Density estimation using real nvp. arXiv preprint arXiv:1605.08803.
  • Dorogush, A. V., Ershov, V. and Gulin, A. (2018). CatBoost: unbiased boosting with categorical features. arXiv preprint arXiv:1810.11363.
  • Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I. and Naor, M. (2006). Our data, ourselves: Privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques 486–503.
  • Eastwood, B. (2023). What is synthetic data — and how can it help you competitively? MIT Sloan School.
  • Efron, B. (1992). Bootstrap methods: another look at the jackknife. In Breakthroughs in statistics 569–593. Springer.
  • Frid-Adar, M., Diamant, I., Klang, E., Amitai, M., Goldberger, J. and Greenspan, H. (2018). GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing 321 321–331.
  • Friedman, J. H. (2002). Stochastic gradient boosting. Computational statistics & data analysis 38 367–378.
  • Gao, C., Killeen, B. D., Hu, Y., Grupp, R. B., Taylor, R. H., Armand, M. and Unberath, M. (2023). Synthetic data accelerates the development of generalizable learning-based algorithms for X-ray image analysis. Nature Machine Intelligence 5 294–308.
  • Gartner (2022). Is Synthetic Data the Future of AI? Gartner Newsroom Q&A.
  • Ghalebikesabi, S., Berrada, L., Gowal, S., Ktena, I., Stanforth, R., Hayes, J., De, S., Smith, S. L., Wiles, O. and Balle, B. (2023). Differentially private diffusion models generate useful synthetic images. arXiv preprint arXiv:2302.13861.
  • Hastie, T., Tibshirani, R. and Friedman, J. H. (2009). The elements of statistical learning: data mining, inference, and prediction 2. Springer.
  • Ho, J., Jain, A. and Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33 6840–6851.
  • Hommel, G. (1983). Tests of the overall hypothesis for arbitrary dependence structures. Biometrical Journal 25 423–430.
  • Jordon, J., Szpruch, L., Houssiau, F., Bottarelli, M., Cherubin, G., Maple, C., Cohen, S. N. and Weller, A. (2022). Synthetic Data–what, why and how? arXiv preprint arXiv:2205.03257.
  • Kiefer (1953) {barticle}[author] \bauthor\bsnmKiefer, \bfnmJack\binitsJ. (\byear1953). \btitleSequential minimax search for a maximum. \bjournalProceedings of the American mathematical society \bvolume4 \bpages502–506. \endbibitem
  • Kim, Lee and Park (2022) {barticle}[author] \bauthor\bsnmKim, \bfnmJayoung\binitsJ., \bauthor\bsnmLee, \bfnmChaejeong\binitsC. and \bauthor\bsnmPark, \bfnmNoseong\binitsN. (\byear2022). \btitleStasy: Score-based tabular data synthesis. \bjournalarXiv preprint arXiv:2210.04018. \endbibitem
  • Kingma and Dhariwal (2018) {barticle}[author] \bauthor\bsnmKingma, \bfnmDurk P\binitsD. P. and \bauthor\bsnmDhariwal, \bfnmPrafulla\binitsP. (\byear2018). \btitleGlow: Generative flow with invertible 1x1 convolutions. \bjournalAdvances in neural information processing systems \bvolume31. \endbibitem
  • Kohavi et al. (1996) {binproceedings}[author] \bauthor\bsnmKohavi, \bfnmRon\binitsR. \betalet al. (\byear1996). \btitleScaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In \bbooktitleKdd \bvolume96 \bpages202–207. \endbibitem
  • Kotelnikov et al. (2023) {binproceedings}[author] \bauthor\bsnmKotelnikov, \bfnmAkim\binitsA., \bauthor\bsnmBaranchuk, \bfnmDmitry\binitsD., \bauthor\bsnmRubachev, \bfnmIvan\binitsI. and \bauthor\bsnmBabenko, \bfnmArtem\binitsA. (\byear2023). \btitleTabddpm: Modelling tabular data with diffusion models. In \bbooktitleInternational Conference on Machine Learning \bpages17564–17579. \bpublisherPMLR. \endbibitem
  • Labrinidis and Jagadish (2012) {barticle}[author] \bauthor\bsnmLabrinidis, \bfnmAlexandros\binitsA. and \bauthor\bsnmJagadish, \bfnmHV\binitsH. (\byear2012). \btitleChallenges and opportunities with big data. \bjournalProceedings of the VLDB Endowment \bvolume5 \bpages2032–2033. \endbibitem
  • Lee, Kim and Park (2023) {barticle}[author] \bauthor\bsnmLee, \bfnmChaejeong\binitsC., \bauthor\bsnmKim, \bfnmJayoung\binitsJ. and \bauthor\bsnmPark, \bfnmNoseong\binitsN. (\byear2023). \btitleCoDi: Co-evolving Contrastive Diffusion Models for Mixed-type Tabular Synthesis. \bjournalarXiv preprint arXiv:2304.12654. \endbibitem
  • Lin et al. (2023) {barticle}[author] \bauthor\bsnmLin, \bfnmLequan\binitsL., \bauthor\bsnmLi, \bfnmZhengkun\binitsZ., \bauthor\bsnmLi, \bfnmRuikun\binitsR., \bauthor\bsnmLi, \bfnmXuliang\binitsX. and \bauthor\bsnmGao, \bfnmJunbin\binitsJ. (\byear2023). \btitleDiffusion models for time series applications: A survey. \bjournalarXiv preprint arXiv:2305.00624. \endbibitem
  • Liu, Shen and Shen (2023) {barticle}[author] \bauthor\bsnmLiu, \bfnmYifei\binitsY., \bauthor\bsnmShen, \bfnmRex\binitsR. and \bauthor\bsnmShen, \bfnmXiaotong\binitsX. (\byear2023). \btitlePerturbation-Assisted Sample Synthesis: A Novel Approach for Uncertainty Quantification. \bjournalRevised for IEEE Transactions on Pattern Analysis and Machine Intelligencei. arXiv preprint arXiv:2305.18671. \endbibitem
  • Liu and Xie (2020) {barticle}[author] \bauthor\bsnmLiu, \bfnmYaowu\binitsY. and \bauthor\bsnmXie, \bfnmJun\binitsJ. (\byear2020). \btitleCauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures. \bjournalJournal of the American Statistical Association \bvolume115 \bpages393–402. \endbibitem
  • Liu et al. (2020) {barticle}[author] \bauthor\bsnmLiu, \bfnmQiao\binitsQ., \bauthor\bsnmXu, \bfnmJiaze\binitsJ., \bauthor\bsnmJiang, \bfnmRui\binitsR. and \bauthor\bsnmWong, \bfnmWing Hung\binitsW. H. (\byear2020). \btitleRoundtrip: A Deep Generative Neural Density Estimator. \bjournalarXiv preprint arXiv:2004.09017. \endbibitem
  • Loshchilov and Hutter (2017) {barticle}[author] \bauthor\bsnmLoshchilov, \bfnmIlya\binitsI. and \bauthor\bsnmHutter, \bfnmFrank\binitsF. (\byear2017). \btitleDecoupled weight decay regularization. \bjournalarXiv preprint arXiv:1711.05101. \endbibitem
  • Maas et al. (2011) {binproceedings}[author] \bauthor\bsnmMaas, \bfnmAndrew\binitsA., \bauthor\bsnmDaly, \bfnmRaymond E\binitsR. E., \bauthor\bsnmPham, \bfnmPeter T\binitsP. T., \bauthor\bsnmHuang, \bfnmDan\binitsD., \bauthor\bsnmNg, \bfnmAndrew Y\binitsA. Y. and \bauthor\bsnmPotts, \bfnmChristopher\binitsC. (\byear2011). \btitleLearning word vectors for sentiment analysis. In \bbooktitleProceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies \bpages142–150. \endbibitem
  • Madeo, Lima and Peres (2013) {binproceedings}[author] \bauthor\bsnmMadeo, \bfnmRenata CB\binitsR. C., \bauthor\bsnmLima, \bfnmClodoaldo AM\binitsC. A. and \bauthor\bsnmPeres, \bfnmSarajane M\binitsS. M. (\byear2013). \btitleGesture unit segmentation using support vector machines: segmenting gestures from rest positions. In \bbooktitleProceedings of the 28th Annual ACM Symposium on Applied Computing \bpages46–52. \endbibitem
  • Nichol and Dhariwal (2021) {binproceedings}[author] \bauthor\bsnmNichol, \bfnmAlexander Quinn\binitsA. Q. and \bauthor\bsnmDhariwal, \bfnmPrafulla\binitsP. (\byear2021). \btitleImproved denoising diffusion probabilistic models. In \bbooktitleInternational Conference on Machine Learning \bpages8162–8171. \bpublisherPMLR. \endbibitem
  • Oko, Akiyama and Suzuki (2023) {barticle}[author] \bauthor\bsnmOko, \bfnmKazusato\binitsK., \bauthor\bsnmAkiyama, \bfnmShunta\binitsS. and \bauthor\bsnmSuzuki, \bfnmTaiji\binitsT. (\byear2023). \btitleDiffusion Models are Minimax Optimal Distribution Estimators. \bjournalarXiv preprint arXiv:2303.01861. \endbibitem
  • OpenAI (2023) {bmisc}[author] \bauthor\bsnmOpenAI (\byear2023). \btitleGPT-4 Technical Report. \endbibitem
  • Pace and Barry (1997) {barticle}[author] \bauthor\bsnmPace, \bfnmR Kelley\binitsR. K. and \bauthor\bsnmBarry, \bfnmRonald\binitsR. (\byear1997). \btitleSparse spatial autoregressions. \bjournalStatistics & Probability Letters \bvolume33 \bpages291–297. \endbibitem
  • Quionero-Candela et al. (2009) {bbook}[author] \beditor\bsnmQuionero-Candela, \bfnmJoaquin\binitsJ., \beditor\bsnmSugiyama, \bfnmMasashi\binitsM., \beditor\bsnmSchwaighofer, \bfnmAnton\binitsA. and \beditor\bsnmLawrence, \bfnmNeil D\binitsN. D., eds. (\byear2009). \btitleDataset shift in machine learning. \bpublisherThe MIT Press. \endbibitem
  • Sanh et al. (2019) {barticle}[author] \bauthor\bsnmSanh, \bfnmVictor\binitsV., \bauthor\bsnmDebut, \bfnmLysandre\binitsL., \bauthor\bsnmChaumond, \bfnmJulien\binitsJ. and \bauthor\bsnmWolf, \bfnmThomas\binitsT. (\byear2019). \btitleDistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. \bjournalarXiv preprint arXiv:1910.01108. \endbibitem
  • Schapire (1990) {barticle}[author] \bauthor\bsnmSchapire, \bfnmRobert E\binitsR. E. (\byear1990). \btitleThe strength of weak learnability. \bjournalMachine learning \bvolume5 \bpages197–227. \endbibitem
  • Shwartz-Ziv and Armon (2022) {barticle}[author] \bauthor\bsnmShwartz-Ziv, \bfnmRavid\binitsR. and \bauthor\bsnmArmon, \bfnmAmitai\binitsA. (\byear2022). \btitleTabular data: Deep learning is not all you need. \bjournalInformation Fusion \bvolume81 \bpages84–90. \endbibitem
  • Singh, Sandhu and Kumar (2015) {binproceedings}[author] \bauthor\bsnmSingh, \bfnmKamaljot\binitsK., \bauthor\bsnmSandhu, \bfnmRanjeet Kaur\binitsR. K. and \bauthor\bsnmKumar, \bfnmDinesh\binitsD. (\byear2015). \btitleComment volume prediction using neural networks and decision trees. In \bbooktitleIEEE UKSim-AMSS 17th International Conference on Computer Modelling and Simulation, UKSim2015 (UKSim2015). \endbibitem
  • Sohl-Dickstein et al. (2015) {binproceedings}[author] \bauthor\bsnmSohl-Dickstein, \bfnmJascha\binitsJ., \bauthor\bsnmWeiss, \bfnmEric\binitsE., \bauthor\bsnmMaheswaranathan, \bfnmNiru\binitsN. and \bauthor\bsnmGanguli, \bfnmSurya\binitsS. (\byear2015). \btitleDeep unsupervised learning using nonequilibrium thermodynamics. In \bbooktitleInternational Conference on Machine Learning \bpages2256–2265. \bpublisherPMLR. \endbibitem
  • Song, Meng and Ermon (2020) {barticle}[author] \bauthor\bsnmSong, \bfnmJiaming\binitsJ., \bauthor\bsnmMeng, \bfnmChenlin\binitsC. and \bauthor\bsnmErmon, \bfnmStefano\binitsS. (\byear2020). \btitleDenoising Diffusion Implicit Models. \bjournalarXiv:2010.02502. \endbibitem
  • Sweeney (2002) {barticle}[author] \bauthor\bsnmSweeney, \bfnmLatanya\binitsL. (\byear2002). \btitlek-anonymity: A model for protecting privacy. \bjournalInternational Journal of Uncertainty, Fuzziness and Knowledge-Based Systems \bvolume10 \bpages557–570. \endbibitem
  • Toews (2022) {bmisc}[author] \bauthor\bsnmToews, \bfnmRob\binitsR. (\byear2022). \btitleSynthetic data is about to transform artificial intelligence. \endbibitem
  • Tripuraneni, Jordan and Jin (2020) {barticle}[author] \bauthor\bsnmTripuraneni, \bfnmNilesh\binitsN., \bauthor\bsnmJordan, \bfnmMichael\binitsM. and \bauthor\bsnmJin, \bfnmChi\binitsC. (\byear2020). \btitleOn the theory of transfer learning: The importance of task diversity. \bjournalAdvances in neural information processing systems \bvolume33 \bpages7852–7862. \endbibitem
  • Wasserman, Ramdas and Balakrishnan (2020) {barticle}[author] \bauthor\bsnmWasserman, \bfnmLarry\binitsL., \bauthor\bsnmRamdas, \bfnmAaditya\binitsA. and \bauthor\bsnmBalakrishnan, \bfnmSivaraman\binitsS. (\byear2020). \btitleUniversal inference. \bjournalProceedings of the National Academy of Sciences \bvolume117 \bpages16880–16890. \endbibitem
  • Wasserman and Roeder (2009) {barticle}[author] \bauthor\bsnmWasserman, \bfnmLarry\binitsL. and \bauthor\bsnmRoeder, \bfnmKathryn\binitsK. (\byear2009). \btitleHigh dimensional variable selection. \bjournalAnnals of statistics \bvolume37 \bpages2178. \endbibitem
  • Wei et al. (2021) {barticle}[author] \bauthor\bsnmWei, \bfnmDonglai\binitsD., \bauthor \betalet al. (\byear2021). \btitleDiffusion Models Beat GANs on Image Synthesis. \bjournalPreprint arXiv. \endbibitem
  • Xu et al. (2023) {binproceedings}[author] \bauthor\bsnmXu, \bfnmXingqian\binitsX., \bauthor\bsnmWang, \bfnmZhangyang\binitsZ., \bauthor\bsnmZhang, \bfnmGong\binitsG., \bauthor\bsnmWang, \bfnmKai\binitsK. and \bauthor\bsnmShi, \bfnmHumphrey\binitsH. (\byear2023). \btitleVersatile diffusion: Text, images and variations all in one diffusion model. In \bbooktitleProceedings of the IEEE/CVF International Conference on Computer Vision \bpages7754–7765. \endbibitem
  • Yuan et al. (2023) {barticle}[author] \bauthor\bsnmYuan, \bfnmYuan\binitsY., \bauthor\bsnmDing, \bfnmJingtao\binitsJ., \bauthor\bsnmShao, \bfnmChenyang\binitsC., \bauthor\bsnmJin, \bfnmDepeng\binitsD. and \bauthor\bsnmLi, \bfnmYong\binitsY. (\byear2023). \btitleSpatio-temporal Diffusion Point Processes. \bjournalarXiv preprint arXiv:2305.12403. \endbibitem
  • Zhang et al. (2023a) {barticle}[author] \bauthor\bsnmZhang, \bfnmChenshuang\binitsC., \bauthor\bsnmZhang, \bfnmChaoning\binitsC., \bauthor\bsnmZhang, \bfnmMengchun\binitsM. and \bauthor\bsnmKweon, \bfnmIn So\binitsI. S. (\byear2023a). \btitleText-to-image diffusion model in generative ai: A survey. \bjournalarXiv preprint arXiv:2303.07909. \endbibitem
  • Zhang et al. (2023b) {barticle}[author] \bauthor\bsnmZhang, \bfnmHengrui\binitsH., \bauthor\bsnmZhang, \bfnmJiani\binitsJ., \bauthor\bsnmSrinivasan, \bfnmBalasubramaniam\binitsB., \bauthor\bsnmShen, \bfnmZhengyuan\binitsZ., \bauthor\bsnmQin, \bfnmXiao\binitsX., \bauthor\bsnmFaloutsos, \bfnmChristos\binitsC., \bauthor\bsnmRangwala, \bfnmHuzefa\binitsH. and \bauthor\bsnmKarypis, \bfnmGeorge\binitsG. (\byear2023b). \btitleMixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space. \bjournalarXiv preprint arXiv:2310.09656. \endbibitem
  • Zheng and Charoenphakdee (2022) {barticle}[author] \bauthor\bsnmZheng, \bfnmShuhan\binitsS. and \bauthor\bsnmCharoenphakdee, \bfnmNontawat\binitsN. (\byear2022). \btitleDiffusion models for missing value imputation in tabular data. \bjournalarXiv preprint arXiv:2210.17128. \endbibitem