
Nonparametric Estimation of Repeated Densities with Heterogeneous Sample Sizes

Jiaming Qiu    Xiongtao Dai    Zhengyuan Zhu
Department of Statistics, Iowa State University
Abstract

We consider the estimation of densities in multiple subpopulations, where the available sample size in each subpopulation greatly varies. This problem occurs in epidemiology, for example, where different diseases may share similar pathogenic mechanisms but differ in their prevalence. Without specifying a parametric form, our proposed method pools information from the population and estimates the density in each subpopulation in a data-driven fashion. Drawing from functional data analysis, low-dimensional approximating density families in the form of exponential families are constructed from the principal modes of variation in the log-densities. Subpopulation densities are subsequently fitted in the approximating families based on likelihood principles and shrinkage. The approximating families increase in flexibility as the number of components increases and can approximate arbitrary infinite-dimensional densities. We also derive convergence results for the density estimates given discrete observations. The proposed methods are shown to be interpretable and efficient in simulations as well as in applications to electronic medical record and rainfall data.

Keywords: Sparse functional data analysis; Functional principal component analysis; Shrinkage estimation; Dimension reduction; Object data analysis

1 Introduction

Density estimation is one of the most fundamental tasks in statistics and machine learning. We consider the problem of repeated density estimation, namely, estimating the distributions of a variable in multiple subpopulations of a similar nature. For example, the distributions of interest can be the age distributions of patients with different diseases, or the precipitation distributions in multiple regions. Repeated density functions can be modeled as realizations of a random density element that is subject to the positivity and unit integral constraints. Moreover, density functions are in practice unobserved and are accessed through discrete samples, creating additional difficulties for data analysis. The primary objective of this work is to address these difficulties and derive reliable yet flexible density estimates for the subpopulations given discrete observations, in a fully data-driven and efficient manner.

Most of the existing work considers random densities that are either completely observed or reconstructed from densely sampled observations. This type of density object was first analyzed by Kneip and Utikal (2001) by directly applying functional principal component analysis (FPCA), a technique that has later been applied to analyze time-varying densities (Huynh et al. 2011; Tsay 2016; Chu et al. 2018). Recognizing the constraints on density functions, Delicado (2011) and Petersen and Müller (2016) proposed to analyze densities by utilizing one-to-one transformations to map them to an unconstrained space, where the transformed densities are represented using FPCA and subsequently back-transformed into densities. Distinct geometries are generated on the space of densities by different transformations, such as the log-hazard and log-quantile density transformations (Petersen and Müller 2016; Kim et al. 2020); log transformations for analyzing densities as compositional data (Egozcue et al. 2006; Hron et al. 2016); and the square root transformation with the Riemannian logarithm map (Srivastava et al. 2007; Dai 2020), which generates a Hilbert sphere geometry.

We instead consider fitting sparsely observed random densities with as few as a handful of realizations available. Subpopulations with small sample sizes arise in a number of situations when the sampling mechanism dictates heterogeneous sample sizes. A first situation is when the availability and cost of observation vary greatly among subpopulations (Wakefield and Elliott 1999), with some groups easy but the rest hard to reach (Bonevski et al. 2014). This scenario is illustrated by the Medical Information Mart for Intensive Care (MIMIC) data (Johnson et al. 2016) in our first data application, where patient records are ample for common diseases such as diabetes and coronary heart disease but scarce for uncommon conditions. A second generating mechanism is that the data collection process has been established for different durations in different subpopulations (Fithian and Wager 2015), which is illustrated by our second data application considering rainfall recorded by weather stations around the Haihe River basin in northern China, which differ in the number of years for which data are available.

The literature on sparsely observed random densities is scant. A closely related topic is replicated point processes, whose intensity functions can be normalized into densities. Panaretos and Zemel (2016) considered time-warping of densely sampled Poisson point processes under a transportation geometry. Given sparsely sampled replicated point processes, Wu et al. (2013) proposed to pool information over multiple realizations and directly model the intensity functions using FPCA. Gervini (2016) and Gervini and Khanal (2019) considered semi-parametric models to decompose the intensity and the log-intensity functions, respectively.

Our proposal is a nonparametric approach that directly targets the repeated densities, which are modeled as realizations of a density-valued random element, while the discrete observations from each subpopulation are modeled as i.i.d. realizations from the corresponding density. Information is pooled over all subpopulations in a two-step fashion. First, we identify low-dimensional structures that parsimoniously approximate the densities using data from densely sampled subpopulations, by applying FPCA to the centralized log-transformed densities that lie in an unconstrained $L^{2}$ space. When transformed back to densities, the low-dimensional structures make up exponential families, where the sufficient statistics are the eigenfunctions of the transformed densities, and each density element is compactly represented by its natural parameter or, equivalently, its moment coordinate (Amari 2016). Second, a new and potentially sparsely observed subpopulation is fitted within the approximating family, obtaining the maximum likelihood estimate (MLE), and the coordinate estimate is then shrunk towards the population using distributional information, further enforcing information borrowing. Two shrinkage estimates, namely the maximum a posteriori (MAP) estimate and the best linear unbiased prediction (BLUP), are proposed based on the natural parameter and the moment coordinate, respectively.

To illustrate the idea, suppose that random densities are generated from the normal distribution family $\mathcal{P}=\{p_{\mu,\sigma}:\mu\in\mathds{R},\sigma>0\}$ with densities $p_{\mu,\sigma}(x)\propto\exp\left(x(\mu/\sigma^{2})-x^{2}/(2\sigma^{2})\right)$ truncated to $x\in[-1,1]$. Observe that a random element $p$ taking values in $\mathcal{P}$ satisfies $\log p\in\operatorname{span}\{1,x,x^{2}\}$ and thus lies in a low-dimensional space. The low-dimensional structure can be recovered by applying dimension reduction techniques such as FPCA to $\log p$. When inferring the distribution of a subpopulation, information borrowing is performed within the low-dimensional space and is analogous to that for the classical linear mixed effects model. Remarkably, the proposed methods estimate the low-dimensional structure and pool information fully nonparametrically, without imposing parametric shape assumptions. A worked simulation example is included in LABEL:supp-example:rand.intercept of the Supplementary Materials.
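
A quick numerical check of this low-dimensional structure is sketched below (a minimal illustration; the grid, seed, and parameter ranges are our own arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(-1, 1, 101)

# Centralized log-densities of random truncated normals on [-1, 1]
F = []
for _ in range(50):
    mu, sigma = rng.uniform(-1, 1), rng.uniform(0.5, 2.0)
    log_p = t * (mu / sigma**2) - t**2 / (2 * sigma**2)  # up to a constant
    F.append(log_p - np.trapz(log_p, t) / 2.0)           # center; |T| = 2
F = np.array(F)

# After subtracting the mean function, the trajectories vary only in the
# directions generated by x and x^2: the numerical rank is 2
print(np.linalg.matrix_rank(F - F.mean(axis=0), tol=1e-8))
```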

The proposed methods combine the flexibility of non-parametric density estimation with the statistical efficiency and interpretability offered by parametric inference and information borrowing. Ultimately, fitting a density with sparse observations follows likelihood principles in an estimated exponential family, so the number of observations required by our proposed methods can be substantially smaller than what would otherwise be required by non-parametric estimates. Moreover, the exponential family approximation has advantages in computation and interpretability. The approximating family increases in flexibility as its dimension increases and can approximate an arbitrarily complex density. Rates of convergence of the proposed MLE under the Kullback–Leibler (KL) divergence and the $L^{1}$ norm are established under asymptotics in which the number of training subpopulations, the sample size of each training subpopulation, and the sample size of the new subpopulation diverge to infinity. This result holds for general infinite-dimensional random densities under smoothness and boundedness conditions. If the random density actually lies in a finite-dimensional exponential family, then the proposed method converges under the KL divergence to the truth at a near-parametric rate. If the sample size in the new subpopulation is fixed, the proposed MLE converges to the MLE in the best approximating exponential family.

The proposed models and estimation procedures are introduced in Section 2. Numerical properties of the proposed methods are investigated and compared to alternative parametric and nonparametric density estimates in Section 3. Data illustrations with the age-at-admission of ICU patients and rainfall data in the Haihe River basin are included in Section 4. Theoretical results are stated in Section 5. Additional simulations, discussion, and detailed proofs are included in the Supplementary Materials.

2 Models and Methods

2.1 Data Setup and Goals

Let $\{X_{ij}:j=1,\dots,N_{i},\,i=1,\dots,n\}$ be the observations of a continuous variable in subpopulations $i=1,\dots,n$; e.g., the age-at-admission to the Intensive Care Unit (ICU) described in Subsection 4.1, grouped by primary diagnosis. Here $X_{i1},\dots,X_{iN_{i}}$ are i.i.d. observations from the $i$th subpopulation with density $p_{i}$. The densities $p_{1},\dots,p_{n}$ are modeled as i.i.d. realizations of a random density $p$ taking values in a family $\mathcal{P}$, where $\mathcal{P}$ consists of densities supported on a common compact interval $\mathcal{T}\subset\mathds{R}$. For example, there may exist latent characteristics that determine the susceptibility age profiles of different diseases, so the age distribution of a disease can be regarded as a realization of a density-valued random element. To emphasize the grouping structure, we refer to $p_{i}$ as the density of the $i$th subpopulation and to the distribution of the random density $p$ as the population.

Our goal is to estimate the density $p_{0}$ of a subpopulation from a few i.i.d. observations $X_{0j}$, $j=1,\ldots,N$, where the number of observations $N$ can potentially be small. For example, this problem occurs when one wants to estimate the age distribution for an uncommon disease. We will first obtain low-dimensional structures of $\mathcal{P}$ from subpopulations where ample observations are available, e.g., from records of more common conditions, and then apply these structures to estimate the density in the subpopulation with a small sample size.

2.2 Proposed Repeated Density Estimation

2.2.1 Constructing Approximating Family

We propose to approximate the distribution family $\mathcal{P}$ with low-rank exponential families by utilizing a functional data analytic approach. The scenario where the distribution of the random density $p$ is given is discussed here; the scenario where the distribution is not directly available is described in Subsubsection 2.2.4.

The collection of densities forms an infinite-dimensional constrained manifold instead of a Hilbert space. To borrow from the rich literature studying functional data as objects in a Hilbert space, we follow a transformation approach (Petersen and Müller 2016) and analyze the transformed densities as Hilbertian elements. The centralized log-transformation (Hron et al. 2016) $\psi:\mathcal{P}\rightarrow L^{2}(\mathcal{T})$ maps a positive density $q\in\mathcal{P}$ into the $L^{2}$ space, obtaining

(\psi q)(t)=\log q(t)-\frac{1}{\left\lvert\mathcal{T}\right\rvert}\int_{\mathcal{T}}\log q(x)\,dx,\quad t\in\mathcal{T}. (2.1)
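
On a discretization grid, the transformation amounts to subtracting the average log-density; a minimal sketch (function and variable names are our own, and the integral is approximated by the trapezoid rule):

```python
import numpy as np

def clr_transform(q, t):
    """Centralized log-transform (2.1) of a positive density q
    evaluated on a grid t over the domain T."""
    log_q = np.log(q)
    return log_q - np.trapz(log_q, t) / (t[-1] - t[0])  # subtract mean over T

# Example: truncated standard normal on T = [-1, 1]
t = np.linspace(-1, 1, 201)
q = np.exp(-t**2 / 2)
q /= np.trapz(q, t)            # normalize to unit integral
f = clr_transform(q, t)        # psi(q); integrates to approximately zero
```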

To derive low-rank approximations to the densities, we apply the Karhunen–Loève expansion to the transformed trajectories $f=\psi p$ in $L^{2}(\mathcal{T})$, obtaining

f(t)=(\psi p)(t)=\mu(t)+\sum_{k=1}^{\infty}\eta_{k}\varphi_{k}(t),\quad t\in\mathcal{T}, (2.2)

where $\mu(t)=Ef(t)$ is the mean function, $\eta_{k}=\int_{\mathcal{T}}(f(t)-\mu(t))\varphi_{k}(t)\,dt$ is the $k$th component score, and $(\lambda_{k},\varphi_{k})$ is the $k$th eigenvalue–eigenfunction pair of the covariance function $G(s,t)=\operatorname{cov}(f(s)-\mu(s),f(t)-\mu(t))$ satisfying $\lambda_{k}\varphi_{k}(t)=\int_{\mathcal{T}}G(s,t)\varphi_{k}(s)\,ds$ for $t\in\mathcal{T}$ and $k=1,2,\dots$. The eigenfunctions are orthonormal, and the eigenvalues are non-negative and non-increasing, satisfying $\lambda_{1}\geq\lambda_{2}\geq\dots\geq 0$.
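
The expansion can be estimated from discretized trajectories via an eigendecomposition of the sample covariance; a sketch under the assumption of a regular grid (the names and the simple discretization scheme are our own choices):

```python
import numpy as np

def fpca(F, t, K):
    """FPCA of trajectories F (n x len(t)) on a regular grid t: sample
    mean, leading eigenvalue-eigenfunction pairs of the discretized
    covariance operator, and the component scores."""
    dt = t[1] - t[0]
    mu = F.mean(axis=0)
    R = F - mu
    G = R.T @ R / len(F)                  # covariance function on the grid
    lam, V = np.linalg.eigh(G * dt)       # discretized integral operator
    lam, V = lam[::-1][:K], V[:, ::-1][:, :K]
    phi = V / np.sqrt(dt)                 # normalize to unit L^2 norm
    scores = R @ phi * dt                 # eta_ik = <f_i - mu, phi_k>
    return mu, lam, phi, scores
```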

Low-rank approximating density families for $\mathcal{P}$ are then obtained by the back-transformation $\psi^{-1}$ and take the form of exponential families. For $K=1,2,\dots$, the $K$-dimensional approximation $\mathcal{P}_{K}$ to $\mathcal{P}$ is the exponential family

\mathcal{P}_{K}=\left\{p_{\theta}:\theta\in\mathds{R}^{K},\,B_{K}(\theta)<\infty\right\}\quad\text{with} (2.3)
p_{\theta}(t)=\exp\left(\mu(t)+\sum_{k=1}^{K}\theta_{k}\varphi_{k}(t)-B_{K}(\theta)\right),\quad t\in\mathcal{T}, (2.4)

where $B_{K}(\theta)=\log\left(\int_{\mathcal{T}}\exp\left(\mu(t)+\sum_{k=1}^{K}\theta_{k}\varphi_{k}(t)\right)dt\right)$ is the normalizing constant. The family $\mathcal{P}_{K}$ is identifiable, as shown in LABEL:supp-prop:identifiable of the Supplementary Materials. The random density $p$ is thus approximated by its truncated version

p_{K}(t)\propto\exp\left(\mu(t)+\sum_{k\leq K}\eta_{k}\varphi_{k}(t)\right),\quad t\in\mathcal{T}, (2.5)

lying in $\mathcal{P}_{K}$.
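
The normalizing constant and the density itself can be evaluated by numerical integration; a minimal sketch, assuming mu and phi hold $\mu$ and the $K$ eigenfunctions on a grid t:

```python
import numpy as np

def log_partition(theta, mu, phi, t):
    """Normalizing constant B_K(theta) in (2.4) by numerical integration;
    phi has shape (len(t), K)."""
    return np.log(np.trapz(np.exp(mu + phi @ theta), t))

def density(theta, mu, phi, t):
    """Density p_theta in the K-dimensional approximating family (2.4)."""
    return np.exp(mu + phi @ theta - log_partition(theta, mu, phi, t))
```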

The model components in (2.4) enjoy interpretations within an exponential family: $\theta_{k}$ is the natural parameter, which compactly describes the density; the eigenfunctions $\varphi_{k}$ form the sufficient statistic; and the mean $\mu$ determines the baseline measure. Analogous to the case of FPCA, the leading eigenfunctions $\varphi_{k}$, $k=1,\dots,K$, encode the principal modes of variation in the log-densities, and the $\eta_{k}$ are the scores explaining the most variation; $\mathcal{P}_{K}$ provides the most parsimonious $K$-dimensional description of the random (log-)densities. Hence, performing density estimation within $\mathcal{P}_{K}$ effectively borrows information from the typical shapes of the densities displayed in the leading eigenfunctions, i.e., the sufficient statistic.

2.2.2 Fitting Densities within Approximating Families

Given a small sample $X_{0j}$, $j=1,\ldots,N$, from a new subpopulation, an estimate of the underlying density $p_{0}$ is obtained as the best fit within the approximating family $\mathcal{P}_{K}$. To assess goodness of fit, various approaches are available from information theory (e.g., Amari and Nagaoka 2000; Amari 2016); we adopt the information loss as quantified by the Kullback–Leibler (KL) divergence, defined for two positive densities $q_{1}$, $q_{2}$ on $\mathcal{T}$ as

D(q_{1}\;\|\;q_{2})=\int_{\mathcal{T}}q_{1}(t)\log\left(\frac{q_{1}(t)}{q_{2}(t)}\right)dt. (2.6)

The KL divergence $D(q_{1}\;\|\;q_{2})$ from $q_{1}$ to $q_{2}$ quantifies the information loss when one approximates $q_{1}$ by $q_{2}$. To estimate $p_{0}$, we propose to use the best approximating element in $\mathcal{P}_{K}$, namely the one minimizing the KL divergence from the empirical distribution. This leads to the maximum likelihood estimate (MLE) $p_{\mathring{\theta}}\in\mathcal{P}_{K}$, where

\mathring{\theta}=\operatorname*{arg\,max}_{\theta:B_{K}(\theta)<\infty}\left(\theta^{T}\bar{\varphi}_{0}-B_{K}(\theta)\right), (2.7)

and $\bar{\varphi}_{0}\in\mathds{R}^{K}$ is the sufficient statistic constructed from the sample $X_{0j}$, whose $k$th element is $N^{-1}\sum_{j\leq N}\varphi_{k}(X_{0j})$, $k=1,\ldots,K$.
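
For illustration, a sketch of this fit using a generic convex solver (evaluating the eigenfunctions off-grid by linear interpolation is our own device, not prescribed by the method):

```python
import numpy as np
from scipy.optimize import minimize

def mle_theta(x_new, mu, phi, t):
    """MLE (2.7): the negative log-likelihood theta -> B_K(theta) -
    theta' phi_bar is convex, so a quasi-Newton solver suffices.
    phi has shape (len(t), K)."""
    # sample sufficient statistic: average of phi_k at the observations
    phi_bar = np.array([np.interp(x_new, t, phi[:, k]).mean()
                        for k in range(phi.shape[1])])
    def neg_loglik(theta):
        B = np.log(np.trapz(np.exp(mu + phi @ theta), t))
        return B - theta @ phi_bar
    return minimize(neg_loglik, np.zeros(phi.shape[1]), method="BFGS").x
```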

2.2.3 Incorporating Population-level Information

In addition to the typical shapes of the random densities, we borrow information from the distribution of these random densities in the population to reduce variation when fitting a new density. This results in a significant improvement, especially when the sample size in the new subpopulation is small.

The distribution of $p$ is captured by those of the component scores $\eta_{k}$, $k=1,\dots,K$. Incorporating such distributional information into the likelihood function, we propose the maximum a posteriori (MAP) estimate $\mathring{p}_{\text{MAP}}=p_{\mathring{\theta}_{\text{MAP}}}\in\mathcal{P}_{K}$, where the parameter estimate $\mathring{\theta}_{\text{MAP}}$ is defined as

\mathring{\theta}_{\text{MAP}}=\operatorname*{arg\,max}_{\theta:B_{K}(\theta)<\infty}\left(\theta^{T}\bar{\varphi}_{0}-B_{K}(\theta)+\log\pi(\theta)\right), (2.8)

and $\pi$ is the unconditional density of the $\eta_{k}$. Analogous to the PACE procedure (Yao et al. 2005), $\pi$ is constructed as a product of independent normal marginals with means $E\eta_{k}=0$ and variances $\operatorname{Var}\eta_{k}=\lambda_{k}$, $k=1,\dots,K$. Here $\mathring{\theta}_{\text{MAP}}$ is a shrinkage of the MLE $\mathring{\theta}$ towards the population distribution $\pi$.
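
Under this Gaussian working prior, (2.8) differs from (2.7) only by a ridge-type penalty; a minimal sketch reusing the setup of the MLE sketch above (names are our own):

```python
import numpy as np
from scipy.optimize import minimize

def map_theta(phi_bar, mu, phi, t, lam):
    """MAP (2.8): the prior prod_k N(0, lambda_k) adds the penalty
    sum_k theta_k^2 / (2 lambda_k) to the negative objective of (2.7),
    shrinking the fit toward the population mean log-density."""
    def neg_obj(theta):
        B = np.log(np.trapz(np.exp(mu + phi @ theta), t))
        return B - theta @ phi_bar + 0.5 * np.sum(theta**2 / lam)
    return minimize(neg_obj, np.zeros(len(phi_bar)), method="BFGS").x
```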

To accommodate non-normal unconditional distributions of the $\eta_{k}$, we propose a second approach to incorporate the population-level knowledge by working with the moment parameterization. The moment parameter, also referred to as the moment coordinate, of an element $p_{\theta}$ in the exponential family $\mathcal{P}_{K}$ is $\xi=(\xi_{1},\dots,\xi_{K})\in\mathds{R}^{K}$, where $\xi_{k}=\int_{\mathcal{T}}\varphi_{k}(t)p_{\theta}(t)dt$. The moment parameter is in one-to-one correspondence with the natural parameter (see, e.g., Amari 2016), noting

\xi=\partial_{\theta}B_{K}(\theta). (2.9)

For brevity, we use $\xi(\cdot):\theta\mapsto\xi(\theta)$ and $\theta(\cdot):\xi\mapsto\theta(\xi)$ to denote the one-to-one mappings derived from (2.9).
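
Since $\partial_{\theta}B_{K}(\theta)=E_{\theta}\varphi(X)$ and $\partial_{\theta}^{2}B_{K}(\theta)=\operatorname{var}_{\theta}\varphi(X)$, the map $\theta(\cdot)$ can be evaluated numerically by Newton's method; a sketch (an undamped iteration for simplicity; step-size control would add robustness):

```python
import numpy as np

def theta_from_xi(xi, mu, phi, t, steps=25):
    """Invert the moment map (2.9): solve grad B_K(theta) = xi with
    grad = E_theta[phi(X)] and Hessian = var_theta(phi(X)), both
    evaluated by numerical integration on the grid t."""
    theta = np.zeros(len(xi))
    for _ in range(steps):
        p = np.exp(mu + phi @ theta)
        p /= np.trapz(p, t)                           # p_theta on the grid
        grad = np.trapz(phi * p[:, None], t, axis=0)  # E_theta phi
        c = phi - grad                                # centered phi
        hess = np.trapz(c[:, :, None] * c[:, None, :] * p[:, None, None],
                        t, axis=0)                    # var_theta phi
        theta = theta + np.linalg.solve(hess, xi - grad)
    return theta
```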

Motivated by the fact that the MLE of the natural parameter is the method-of-moments estimate under the moment parameterization, we consider a shrinkage estimate based on the moment parametrization. Let $\eta_{0}$ be the natural parameter of the truncated version $p_{0,K}\in\mathcal{P}_{K}$ of $p_{0}$ according to (2.5), and let $\tau_{0}=\xi(\eta_{0})\in\mathds{R}^{K}$ be the corresponding moment coordinate. We propose to estimate the moment parameter $\tau_{0}$ by the best linear unbiased predictor (BLUP) of $\tau_{0}$ given the sample sufficient statistic $\bar{\varphi}_{0}$, defined as

\mathring{\xi}_{\text{BLUP}}=\operatorname{BLUP}(\tau_{0}\mid\bar{\varphi}_{0})\overset{\triangle}{=}\operatorname{cov}(\tau_{0},\bar{\varphi}_{0})\operatorname{var}(\bar{\varphi}_{0})^{-1}(\bar{\varphi}_{0}-E\bar{\varphi}_{0})+E\tau_{0}
=\operatorname{var}(\tau_{0})\operatorname{var}(\bar{\varphi}_{0})^{-1}(\bar{\varphi}_{0}-E\tau_{0})+E\tau_{0}, (2.10)

where the last equality follows from $E(\bar{\varphi}_{0}\mid\tau_{0})=\tau_{0}$ and the law of total expectation. Note that the covariance and expectation here are computed under the joint distribution of $(\tau_{0},\bar{\varphi}_{0})$. We thus obtain a plug-in estimate $\mathring{\theta}_{\text{BLUP}}=\theta(\mathring{\xi}_{\text{BLUP}})$ of the natural parameter.

The MAP (2.8) and BLUP (2.10) estimates not only borrow the typical shapes of the densities via the sufficient statistic $\varphi$, analogous to the MLE (2.7), but also take advantage of the distribution of the random density through the component scores. We refer to the information embedded in the component scores of the log-densities as population-level information. The MAP (2.8) and BLUP (2.10) are, respectively, shrinkages of the MLE (2.7) towards the mean under the natural and the moment coordinates. As will be demonstrated in our numerical investigation later, this extra shrinkage is valuable for reducing information loss, especially given small samples.

2.2.4 Pre-smoothing and Unknown Distribution of $p$

Oftentimes the random densities $p_{1},\dots,p_{n}$ and their distribution are unavailable, and only discrete samples $\{X_{ij}:j=1,\dots,N_{i};\,i=1,\dots,n\}$ are accessible. In this case, we utilize pre-smoothing to construct pilot density estimates from the samples, analogous to the approach taken by Petersen and Müller (2016) and Han et al. (2020). While many density estimators may be applied, for example the kernel density estimate (KDE) and logspline (Kooperberg and Stone 1991), for theoretical considerations we adopt the weighted KDE

\check{p}_{i}(t)=\left(\int_{\mathcal{T}}\sum_{j=1}^{N_{i}}\kappa\left(\frac{x-X_{ij}}{h}\right)w(x,h)\,dx\right)^{-1}\sum_{j=1}^{N_{i}}\kappa\left(\frac{t-X_{ij}}{h}\right)w(t,h),\quad t\in\mathcal{T}, (2.11)

to handle boundary bias, where $\kappa$ is a kernel function and $w(t,h)=1/\int_{(t-b)/h}^{(t-a)/h}\kappa(u)\,du$ is the boundary weight for $\mathcal{T}=[a,b]$. In the pre-smoothing step, the pilot density estimator is applied to each sample to obtain the pre-smoothed densities $\check{p}_{1},\dots,\check{p}_{n}$.
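
A sketch of this estimator with a Gaussian kernel (the kernel choice is ours for illustration; any kernel with an available CDF works the same way):

```python
import numpy as np
from scipy.stats import norm

def weighted_kde(x_obs, t, h, a, b):
    """Boundary-corrected weighted KDE (2.11) on T = [a, b]; the weight
    w(t, h) renormalizes the kernel mass that falls inside [a, b]."""
    w = 1.0 / (norm.cdf((t - a) / h) - norm.cdf((t - b) / h))
    kern = norm.pdf((t[:, None] - x_obs[None, :]) / h).sum(axis=1)
    dens = kern * w
    return dens / np.trapz(dens, t)   # overall normalization in (2.11)
```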

FPCA is then performed on the transformed densities $\check{f}_{i}=\psi\check{p}_{i}$, $i=1,\dots,n$, to obtain the sample mean $\hat{\mu}(t)=n^{-1}\sum_{i=1}^{n}\check{f}_{i}(t)$, the sample covariance function $\hat{G}(s,t)=n^{-1}\sum_{i=1}^{n}(\check{f}_{i}(s)-\hat{\mu}(s))(\check{f}_{i}(t)-\hat{\mu}(t))$, the associated eigenvalues $\hat{\lambda}_{k}$ and eigenfunctions $\hat{\varphi}_{k}$, and the component scores $\hat{\eta}_{i}\in\mathds{R}^{K}$ whose $k$th element is $\hat{\eta}_{ik}=\langle\check{f}_{i}-\hat{\mu},\hat{\varphi}_{k}\rangle$, $k=1,\dots,n-1$. A sample version $\hat{\mathcal{P}}_{K}$ of the approximating families is obtained as

\hat{\mathcal{P}}_{K}=\left\{\hat{p}_{\theta}:\theta\in\mathds{R}^{K},\,\hat{B}_{K}(\theta)<\infty\right\},\quad\text{with} (2.12)
\hat{p}_{\theta}(t)=\exp\left(\hat{\mu}(t)+\sum_{k=1}^{K}\theta_{k}\hat{\varphi}_{k}(t)-\hat{B}_{K}(\theta)\right),\quad t\in\mathcal{T}, (2.13)

where $\hat{B}_{K}(\theta)=\log\left(\int_{\mathcal{T}}\exp\left(\hat{\mu}(t)+\sum_{k\leq K}\theta_{k}\hat{\varphi}_{k}(t)\right)dt\right)$ is the normalizing constant. Inference for a new sample is conducted within $\hat{\mathcal{P}}_{K}$ with the sufficient statistic $\hat{\varphi}_{0,\text{ave}}\in\mathds{R}^{K}$, whose $k$th element is $N^{-1}\sum_{j=1}^{N}\hat{\varphi}_{k}(X_{0j})$; to avoid double overscripts, we use the subscript “ave” to denote the average. The distribution of the random densities is accessed via the empirical distribution of the sample component scores. The fully sample-based versions of the proposed estimators are then defined using the estimated quantities. The MLE and MAP within $\hat{\mathcal{P}}_{K}$ are $\hat{p}_{\hat{\theta}_{\text{MLE}}}$ and $\hat{p}_{\hat{\theta}_{\text{MAP}}}$, respectively, where

\hat{\theta}_{\text{MLE}}=\operatorname*{arg\,max}_{\theta:\hat{B}_{K}(\theta)<\infty}\left(\theta^{T}\hat{\varphi}_{0,\text{ave}}-\hat{B}_{K}(\theta)\right), (2.14)
\hat{\theta}_{\text{MAP}}=\operatorname*{arg\,max}_{\theta:\hat{B}_{K}(\theta)<\infty}\left(\theta^{T}\hat{\varphi}_{0,\text{ave}}-\hat{B}_{K}(\theta)+\log\hat{\pi}(\theta)\right), (2.15)

in parallel with (2.7) and (2.8), where $\hat{\pi}$ is the product of normal densities with mean 0 and variances $s^{2}_{\hat{\eta},k}$, the sample variances of $\{\hat{\eta}_{ik}:i=1,\dots,n\}$, for $k=1,\dots,K$.

For the BLUP, $E\tau_{0}$ and $\operatorname{var}(\tau_{0})$ in (2.10) are estimated by the sample mean $\hat{\tau}_{\text{ave}}=n^{-1}\sum_{i\leq n}\hat{\tau}_{i}$ and the sample covariance matrix $\Sigma_{\hat{\tau}}$ of the moment coordinates $\{\hat{\tau}_{i}=\xi(\hat{\eta}_{i}):i=1,\dots,n\}$ in $\hat{\mathcal{P}}_{K}$. Further decompose $\operatorname{var}(\bar{\varphi}_{0})=E\operatorname{var}(\bar{\varphi}_{0}\mid\tau_{0})+\operatorname{var}(\tau_{0})$, and estimate the first term via integration by

\Sigma_{\hat{\varphi}_{\text{ave}}}=\frac{1}{nN}\sum_{i=1}^{n}\int_{\mathcal{T}}\left(\hat{\varphi}(t)-\hat{\tau}_{i}\right)\left(\hat{\varphi}(t)-\hat{\tau}_{i}\right)^{T}\check{p}_{i}(t)\,dt,

where $\hat{\varphi}:t\mapsto(\hat{\varphi}_{1}(t),\dots,\hat{\varphi}_{K}(t))\in\mathds{R}^{K}$ is the sufficient statistic. Combining these, the plug-in BLUP estimate of the moment parameter within $\hat{\mathcal{P}}_{K}$ is obtained as

\hat{\xi}_{\text{BLUP}}=\Sigma_{\hat{\tau}}\left(\Sigma_{\hat{\varphi}_{\text{ave}}}+\Sigma_{\hat{\tau}}\right)^{-1}(\hat{\varphi}_{0,\text{ave}}-\hat{\tau}_{\text{ave}})+\hat{\tau}_{\text{ave}}, (2.16)

and the associated natural parameter $\hat{\theta}_{\text{BLUP}}=\hat{\theta}(\hat{\xi}_{\text{BLUP}})$ is the solution to $\hat{\xi}_{\text{BLUP}}=\partial_{\theta}\hat{B}_{K}(\theta)$.
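
A minimal sketch of the plug-in computation (2.16); the natural parameter is then recovered by inverting the moment map, e.g., with the Newton sketch given after (2.9) (names are our own):

```python
import numpy as np

def blup_xi(phi0_ave, tau_hat, Sigma_phi_ave):
    """Plug-in BLUP (2.16): shrink the sample sufficient statistic of
    the new subpopulation toward the training mean in the moment
    coordinates. tau_hat is the n x K matrix of training moment
    coordinates; Sigma_phi_ave estimates E var(phi_bar | tau)."""
    tau_ave = tau_hat.mean(axis=0)
    Sigma_tau = np.cov(tau_hat, rowvar=False)
    gain = Sigma_tau @ np.linalg.inv(Sigma_phi_ave + Sigma_tau)
    return gain @ (phi0_ave - tau_ave) + tau_ave
```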

When the random densities $p_{1},\ldots,p_{n}$ are observed in their entirety, a scenario considered in the interest of theory and referred to as the subpopulations being completely observed, it suffices to skip the pre-smoothing step and proceed with the observed densities. In this case, FPCA is performed on the transformed trajectories $f_{i}=\psi p_{i}$, $i=1,\dots,n$, to obtain the sample mean $\tilde{\mu}(t)=n^{-1}\sum_{i\leq n}f_{i}(t)$ and covariance $\tilde{G}(s,t)=n^{-1}\sum_{i\leq n}(f_{i}(s)-\tilde{\mu}(s))(f_{i}(t)-\tilde{\mu}(t))$, the eigenvalues $\tilde{\lambda}_{k}$ and eigenfunctions $\tilde{\varphi}_{k}$, and the sample component scores $\tilde{\eta}_{ik}=\langle f_{i}-\tilde{\mu},\tilde{\varphi}_{k}\rangle$, $k=1,\dots,n-1$. The approximating family $\tilde{\mathcal{P}}_{K}$ with completely observed densities is defined as

\tilde{\mathcal{P}}_{K}=\left\{\tilde{p}_{\theta}:\theta\in\mathds{R}^{K},\,\tilde{B}_{K}(\theta)<\infty\right\},\quad\text{with} (2.17)
\tilde{p}_{\theta}(t)=\exp\left(\tilde{\mu}(t)+\sum_{k=1}^{K}\theta_{k}\tilde{\varphi}_{k}(t)-\tilde{B}_{K}(\theta)\right),\quad t\in\mathcal{T}, (2.18)

for $K=1,2,\dots$, where $\tilde{B}_{K}(\theta)=\log\int_{\mathcal{T}}\exp\left(\tilde{\mu}(t)+\sum_{k\leq K}\theta_{k}\tilde{\varphi}_{k}(t)\right)dt$ is the normalizing constant. The proposed MLE, MAP, and BLUP are then defined analogously to those within $\hat{\mathcal{P}}_{K}$. We use tilde overscripts to denote quantities estimated from fully observed densities.

We refer to the process of constructing approximating exponential families from the discrete observations of $p_{1},\dots,p_{n}$ as the training step, which emphasizes that the $K$-dimensional families $\hat{\mathcal{P}}_{K}$ are flexibly derived from the data despite themselves being parametric families compactly described by a few natural parameters. The random densities $p_{1},\ldots,p_{n}$ are referred to as the training densities associated with the training subpopulations, and the corresponding samples as the training samples. For performance assessment, we call the process of constructing density estimates from new observations within the approximating families the fitting or testing step, and correspondingly define the testing subpopulation, density, and sample.

The number of components $K$ controls the flexibility of $\hat{\mathcal{P}}_{K}$ and is the key tuning parameter trading off bias and variance when fitting a density. We propose to select $K$ using a penalized approach based on the Akaike information criterion (AIC), setting $K=K^{*}$ to minimize $\operatorname{AIC}(K)\overset{\triangle}{=}2K-2\log\hat{p}_{K}$, where $\log\hat{p}_{K}$ denotes the log-likelihood of the fitting sample under the density estimate within $\hat{\mathcal{P}}_{K}$ using either (2.14), (2.15), or (2.16). In practice, we observe that the optimal $K^{*}$ often increases as the fitting sample size increases. Intuitively, the proposed component selection automatically utilizes a more flexible model given a larger sample, in which case the additional variance that comes with the flexibility becomes more affordable.
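
A sketch of this selection rule, treating the fitting routine as a black box (the callable interface is our own device; fit(x, K) is assumed to return a fitted density function within the K-dimensional family):

```python
import numpy as np

def select_K(x_new, fit, K_max):
    """AIC selection of the number of components:
    AIC(K) = 2K - 2 * log-likelihood of the fitting sample."""
    aic = [2 * K - 2 * np.sum(np.log(fit(x_new, K)(x_new)))
           for K in range(1, K_max + 1)]
    return int(np.argmin(aic)) + 1
```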

The proposed procedure is computationally efficient: the proposed MLE, MAP, and BLUP estimates are computed within exponential families, so the optimization tasks in obtaining (2.14)–(2.16) are convex, and the discrete samples are succinctly summarized by their sufficient statistics.

3 Simulation Studies

We study the numerical properties of the proposed methods and of alternative parametric and non-parametric density estimators under various scenarios. Independent training samples $\{X_{ij}:j=1,\ldots,N_{i}\}$ were drawn from densities $p_{i}$, $i=1,\dots,n$, respectively, where the $p_{i}$ are independent copies of a random density $p$. Independent testing samples $\{X^{*}_{lj}:j=1,\ldots,N_{l}^{*}\}$ were then drawn from additional independent copies $p_{l}^{*}$ of $p$ for $l=1,\dots,n^{*}$. Typically, the training sample sizes were larger, e.g., $N_{i}=200$, while the testing samples were smaller, e.g., $N_{l}^{*}=10$. The small testing sample sizes pose a challenge for the density estimators compared. The underlying family $\mathcal{P}$ and the distribution of $p\in\mathcal{P}$ varied between scenarios. The experiment under each scenario was repeated 500 times.

The proposed methods (2.14), (2.15), and (2.16) are referred to as FPCA_MLE, FPCA_MAP, and FPCA_BLUP, respectively. Our proposals were compared against the maximum likelihood estimate within the ground-truth family $\mathcal{P}$ as an idealized parametric approach, referred to as MLE; non-parametric methods including KDE (Sheather and Jones 1991) and logspline (Kooperberg and Stone 1991); and the repeated point processes (RPP) approach proposed by Gervini and Khanal (2019). These methods do not require training.

The mean KL divergence of the density estimates was used to assess the overall estimation performance. For the density estimate $\hat{p}^{*}_{l}$ of the $l$th testing sample, the information loss was evaluated by the KL divergence $D(p^{*}_{l}\;\|\;\hat{p}^{*}_{l})$ defined in (2.6), and we examined the average

\text{MKL}=\frac{1}{n^{*}}\sum_{l=1}^{n^{*}}D(p^{*}_{l}\;\|\;\hat{p}^{*}_{l})

over the $n^{*}$ testing samples to quantify performance. A smaller MKL indicates better estimation performance. Results for an alternative error metric, the mean integrated squared error (MISE), and additional simulation cases are included in the Supplementary Materials.
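
For reference, both quantities are straightforward to compute on a common grid; a minimal sketch (trapezoidal integration and names are our own choices):

```python
import numpy as np

def kl_divergence(p, q, t):
    """KL divergence (2.6) between positive densities p, q on grid t."""
    return np.trapz(p * np.log(p / q), t)

def mean_kl(true_dens, est_dens, t):
    """MKL: average KL divergence over the n* testing samples."""
    return np.mean([kl_divergence(p, q, t)
                    for p, q in zip(true_dens, est_dens)])
```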

3.1 Flexible Exponential Families

We first set the density families to be either the truncated normal or the bimodal distributions. For the truncated normal scenario, each sample was generated from $N(\mu,\sigma^{2})$ truncated to $[-3,3]$ with $\mu\sim\text{Unif}(-2,2)$ and $\sigma\sim\text{Unif}(2,4)$. For the bimodal scenario, each sample was generated according to the density $p(x)\propto\exp((4+\theta)x-(26.5+\theta)x^{2}+47x^{3}-25x^{4})$, $x\in[0,1]$, with $\theta\sim\text{Unif}(0,10)$. In each Monte Carlo experiment, $n=50$ training samples of size $N_{i}=200$ and $n^{*}=100$ testing samples of size $N^{*}=10$ or 50 were created.

The mean KL divergence in Figure 3.1 shows that the proposed FPCA_MLE, whether using KDE as the pre-smoother or given completely observed training densities, produces estimates comparable to the MLE without requiring explicit specification of a parametric family. Among all the non-parametric density estimators, the proposed methods are stable and best-performing, especially when the testing samples are small, in which case the proposed methods outperformed the MLE by borrowing strength across samples, whereas the MLE does not pool information. In particular, for the truncated normal family with $N^{*}=10$, the average information loss of the proposed MAP and BLUP was around 20% less than that of the MLE within the ground-truth family, and more than 40% less than that of KDE.

Figure 3.1: Boxplots of mean KL divergence. RPP, repeated point processes approach (Gervini and Khanal 2019); locfit, local polynomial density estimate (Loader 1996); logspline, adaptive logspline (Kooperberg and Stone 1991). Both discrete training samples of sizes $N_{i}=200$ with KDE pre-smoothing and completely observed densities are considered for our proposed approaches, namely FPCA_MLE, FPCA_MAP, and FPCA_BLUP.

3.2 Gaussian Mixture Family

Contrary to the previous scenarios, where the underlying families were exponential families, we next estimate densities from a family consisting of mixtures of Gaussian distributions, which forms a multi-modal, non-exponential family. This further demonstrates the flexibility of the proposed methods in a “mis-specified” case as well as the data-adaptive selection of the number of components.

The samples were generated from a mixture of three Gaussian distributions with density $p(x)\propto\sum_{l=1}^{3}\theta_{l}\phi((x-\mu_{l})/\sigma_{l})$ truncated to $x\in[-3,3]$, where $\phi$ is the standard normal density and the mixture probabilities $(\theta_{1},\theta_{2},\theta_{3})$ were generated from $\text{Dirichlet}(1/3,1/3,1/3)$. For $l=1,2,3$, we generated the mixture mean $\mu_{l}$ from $\text{Unif}(-5,5)$ and the standard deviation $\sigma_{l}$ from $\text{Unif}(0.5,5)$. Here, instead of performing a direct MLE, which is known to be unstable for mixture distributions, the expectation–maximization (EM) algorithm in the variant of Lee and Scott (2012) is used as the baseline parametric density estimator.

The left panel of Figure 3.2 compares the relative errors of various density estimates to that of the EM algorithm as a baseline reference. The proposed methods enjoyed the least information loss for all but the largest sample size $N^{*}=500$, where the EM method became the best. When the testing sample size was $N^{*}=25$, the average information loss of the proposed MAP and BLUP was 20% less than that of RPP, and all were at least five times better than the EM algorithm. Remarkably, this is achieved by utilizing only the first few components on average, as demonstrated in the right panel of Figure 3.2. This shows that the proposed methods are effective alternatives to the EM algorithm, overcoming local maximum issues by performing convex optimization within approximating exponential families. In addition, using the proposed AIC method for component selection, the proposed estimates utilized more components as the size of the fitting sample increased, suggesting that the AIC selection adapts to the increased complexity of the underlying family as unveiled by a larger sample. Indeed, the larger the sample size, the more representative the sample is of the underlying distribution, which has 8 free parameters in our setup. By using more components, the proposed methods can accommodate the increased complexity shown by the data, such as multi-modality, which may otherwise be hidden when fitting with fewer components.

Figure 3.2: The left panel shows the ratios of the average mean KL divergence of the compared methods over that of the EM algorithm, where a ratio smaller than 1 indicates better performance than EM. The right panel displays the average number of components $K$ selected by the proposed AIC method, with ribbons indicating one standard deviation around the means. Axes in both panels are drawn in $\log_{10}$ scale. Training samples had sizes $N_{i}=50$, and logspline was used as the pre-smoother. RPP, repeated point processes approach (Gervini and Khanal 2019); locfit, local polynomial density estimate (Loader 1996); logspline, adaptive logspline (Kooperberg and Stone 1991); FPCA_MLE, FPCA_MAP, and FPCA_BLUP, the proposed methods in the MLE, MAP, and BLUP variants.

4 Real Data Applications

4.1 MIMIC Data

We consider the age-at-admission distributions of patients with critical conditions in the Medical Information Mart for Intensive Care (MIMIC) data (Johnson et al. 2016), collected from patients admitted to critical care units at a large hospital in Boston between 2001 and 2012. The numbers of patients under different primary diagnoses vary greatly. Ample data points are available to establish a precise non-parametric density estimate for common diseases, as shown in the left panels of Figure 4.1, whereas observations for uncommon conditions are scarce and cannot support flexible non-parametric approaches such as KDE, as displayed in the right panels of Figure 4.1. Estimating the densities in the hard-to-observe subpopulations of patients with uncommon conditions is the emphasis of our analysis.

Figure 4.1: Each panel shows the age-at-admission distribution of either male or female patients with a different primary diagnosis, each defining a subpopulation. Four training subpopulations with ample observations are shown in the left panels, and four testing subpopulations with scarce observations are shown in the right panels. Along with the KDE, the proposed density estimates MLE, MAP, and BLUP are displayed for the testing subpopulations. KDE, kernel density estimate; MLE, MAP, and BLUP, the proposed density estimates.

Gender and diagnosis code (ICD9) combinations were used to define the subpopulations, where we excluded birth-related, unclassified, or unspecified diagnoses. A total of 940 subpopulations with more than 5 observations each were analyzed, among which the 75 subpopulations with at least 75 observations were used for training and the rest for testing. For example, the upper left panel in Figure 4.1 shows a training sample formed by 109 female patients with juvenile type diabetes with ketoacidosis (DMI ketoacd uncontrold), while the bottom right panel shows a testing sample consisting of 27 male patients with abdominal aortic aneurysm rupture (Rupt abd aortic aneurysm). We focused on estimating densities supported on the age range $[14,85]$. KDE was utilized as the pre-smoother, and the bandwidth was set to the median of the bandwidths in the training samples selected by the method of Sheather and Jones (1991).

As illustrated in the right panels of Figure 4.1, the proposed density estimates produced sensible results given small testing samples; they were smoother than the KDE estimates and avoided multiple spurious modes.

The modes of variation in the density functions, obtained by varying each of the first three natural parameters of $\hat{\mathcal{P}}_{K}$, are shown in the left panels of Figure 4.2. The first three modes of variation correspond to shifts in the age distribution towards the old, the elderly, and the middle-aged, respectively. This suggests the existence of latent disease characteristics that affect the age profiles of different conditions.

The effects of shrinkage in the proposed MAP and BLUP estimators are demonstrated in the right panels of Figure 4.2 for the four rare conditions displayed in the right panels of Figure 4.1. Subpopulations with smaller sample sizes were typically shrunk more towards the population mean by the MAP and BLUP methods, which pool population information from the training samples, in which case the estimates differed more notably from the MLE within the approximating family.

To numerically assess the density estimates, we consider the cross-entropy, which also targets the information loss, instead of the KL divergence, since the underlying densities are unknown. Given a density estimator $\hat{p}$, we approximate the cross-entropy $H(p,\hat{p})\overset{\triangle}{=}-E_{p}\log\hat{p}$, the part of $D(p\;\|\;\hat{p})$ that depends on the estimate $\hat{p}$, by the unbiased leave-one-out estimate

H_{\text{LOO}}(p,\hat{p})\overset{\triangle}{=}-\frac{1}{N}\sum_{j\leq N}\log\hat{p}_{-j}(X_{j}), (4.1)

where $\hat{p}_{-j}$ is the density estimate constructed with all but the $j$th observation. A summary of the cross-entropies of the proposed methods and KDE applied on 844 testing samples is reported in Table 1, where a smaller cross-entropy indicates a better estimate. Occasionally KDE resulted in non-finite cross-entropy, leading to a smaller number of samples reported.
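
A sketch of (4.1), treating the density estimator as a black box (the callable interface is our own device; fit_density maps a sample to a fitted density function):

```python
import numpy as np

def loo_cross_entropy(x, fit_density):
    """Leave-one-out cross-entropy (4.1): refit with each observation
    held out and score the estimate on that observation."""
    logs = [np.log(fit_density(np.delete(x, j))(x[j]))
            for j in range(len(x))]
    return -np.mean(logs)
```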

Table 1 shows that the proposed methods produced density estimates that better reflect the sample information than KDE, having smaller mean cross-entropies as well as smaller standard deviations. To quantify the sensitivity with respect to the training subpopulations, we partitioned the training samples into 5 folds, held out one fold at a time in the training step, fitted the densities for the testing samples, and report the range of the five mean cross-entropies as a sensitivity measure. The proposed methods are insensitive to varying training subpopulations, which suggests that the selected training samples are sufficient for our methods to capture the major structure of the underlying family.

Figure 4.2: Left: modes of variation obtained by varying each of the first three natural parameters of the estimated family $\hat{\mathcal{P}}_{K}$, where blue and red stand for lower and higher values, respectively. Right: for the four sparsely observed subpopulations shown in the right panels of Figure 4.1, the MLE parameter estimates (triangles) are shrunk in the natural parameter space (upper panel) and the expectation coordinate space (lower panel) w.r.t. the population distribution of the training component scores (elliptical contours), producing the corresponding MAP (dots, upper right) and BLUP (dots, lower right) estimates.
Table 1: Cross-entropy (4.1) of different estimators applied on the MIMIC data.

size      n*    estimator  mean  sd    sensitivity
(0,10]    423   MLE        4.60  1.62  0.062
          423   MAP        4.13  0.42  0.013
          423   BLUP       4.11  0.36  0.039
          402   KDE        5.56  2.42  -
(10,35]   343   MLE        4.03  0.32  0.015
          343   MAP        3.97  0.25  0.005
          343   BLUP       3.97  0.24  0.017
          341   KDE        4.16  0.37  -
(35,75]   78    MLE        3.93  0.22  0.013
          78    MAP        3.92  0.21  0.004
          78    BLUP       3.92  0.21  0.008
          78    KDE        3.99  0.22  -

The results are stratified by the size of the testing samples, with the number of testing samples involved listed as $n^{*}$. The sensitivity of the proposed methods is the range of the mean cross-entropies across 5 experiments, each holding out one of five folds of the training samples.

4.2 Precipitation Return Level Estimation

We then analyze precipitation data from the Haihe region to illustrate the versatility of the proposed methods in providing good estimates of both the body of the density function and the tail probabilities. We demonstrate that our proposed methods can borrow information from stations with longer records to construct approximating distribution families for the annual maxima of hourly rainfall, which leads to better estimates of densities and return levels for stations with shorter records.

Predicting the likelihood of extreme precipitation is of great importance because of the serious damage such events can cause to the economy and social life. We are in particular interested in heavy rainfall during short time periods. The likelihood of an extreme precipitation event in a certain region can be quantified by the $T$-year return level, defined as the level expected to be exceeded once in $T$ years or, equivalently, the $1-1/T$ quantile of the marginal distribution under stationarity and independence assumptions. The sample quantile is a straightforward estimate for small $T$, referred to as the empirical return level, but this approach requires more than $T$ data points being available. Otherwise, distribution models are necessary for estimating return levels over longer return periods.
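
Given a fitted density on a grid, the return level follows by inverting the numerical CDF; a minimal sketch (names and the cumulative trapezoid scheme are our own):

```python
import numpy as np

def return_level(dens, t, T):
    """T-year return level: the (1 - 1/T) quantile of a fitted
    annual-maximum density dens evaluated on grid t."""
    cdf = np.concatenate(([0.0],
                          np.cumsum((dens[1:] + dens[:-1]) / 2
                                    * np.diff(t))))
    cdf /= cdf[-1]                 # guard against integration rounding
    return np.interp(1.0 - 1.0 / T, cdf, t)
```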

We focus on estimating the 5- to 30-year return levels at various locations in the Haihe River basin, a vital region in northern China that includes mega-cities such as Beijing and Tianjin. Precipitation data from 232 weather stations around Haihe River basin from 1961 to 2012 are available from the National Meteorological Information Center (NMIC) and the China Meteorological Administration (CMA) and are used for this analysis. All the 232 stations investigated have at least 30 years of records.

Our target is the distribution of the annual maximum rainfall per hour, where each station defines a subpopulation. As the distribution of precipitation is heavily right-skewed, the pre-smoothed density estimates could be zero within the working domain, creating an issue when we take the log-density. To address this issue and the positivity constraint, the log-precipitation is used as the working variable. See LABEL:supp-appendix:log.inf of the Supplementary Materials for details.

We compared the return level estimates and the information loss of the proposed methods with standard approaches. To imitate the situation of newer weather stations having shorter records, we randomly selected $n=50$ stations as training sites and retained the full records at these stations for training, and another $n^{*}=100$ testing sites where records from 10 randomly chosen consecutive years were used for fitting. Estimates of the density and the 5- to 30-year return levels were obtained from non-parametric approaches, including our proposed methods and the standard KDE, as well as the parametric maximum likelihood estimates within the Gamma and GEV families. The performance metric for density estimation was the cross-entropy (4.1), which measures the overall fit. For the return level estimates, we used the relative difference between the estimates and the empirical return levels obtained from the full records at the testing stations, which measures the fit in the right tail. The entire process was repeated 10 times, and the overall average cross-entropies and the relative differences of the return levels averaged over the stations are reported in Figure 4.3.

The proposed methods in the MLE, MAP, and BLUP variants are all significantly better than the KDE and GEV approaches in terms of information loss. The estimates using the Gamma family were comparable to ours but underperformed the MAP and BLUP methods, which borrow information from the population. For return level estimation, the proposed MAP and BLUP methods had the least variation compared with the MLE, Gamma, and GEV methods, the last three of which do not pool information. For shorter (5- and 10-year) return levels, the biases of all methods were comparable, but for longer (20- and 30-year) return levels, the biases of our methods were significantly smaller than those of the Gamma and GEV methods. In summary, our proposed methods also have the best performance in estimating return levels. Hence, the proposed methods provide better fits than classical approaches in both the bulk and the tail of the distributions, and lead to better quantification of distributional characteristics of interest, such as return levels, from a small number of discrete observations.

These numerical results come as a positive surprise: we used only 10 years of records at the testing sites and yet obtained good estimates of 20- and even 30-year return levels. The proposed approaches are able to estimate the tail structure from the data rather than relying on assumptions, and they borrow information from the population through parameter shrinkage.

Figure 4.3: Left panel: cross-entropies of different estimators applied to the Haihe rainfall data, computed by (4.1), where each method was fitted with decennial records over 10 repeated experiments. MLE, MAP, and BLUP, the proposed methods; KDE, kernel density estimate; Gamma, density fit within a Gamma distribution; GEV, density fit within a GEV family. Right panel: relative differences of the estimated return levels using decennial records compared with the empirical ones using the full records ($\geq 30$ years) at the testing weather stations. Each solid dot represents the average relative difference at a station averaged over the Monte Carlo experiments, while each hollow circle represents the mean over all stations. Estimates closer to the horizontal zero-reference line are better. The GEV had some extreme over-estimated outliers, pulling its mean errors outside the plotting range.

5 Theoretical Results

We state the theoretical results for the proposed methods, including the identifiability of the approximating families, the consistency of the MLE within those families, and the consistency of the component scores. While the major motivation and justification for the proposed methods is their small-sample performance, as demonstrated in the simulations and real data applications, in this section we also establish large-sample properties of the proposed estimators to justify their broad validity. Subsections 5.1 and 5.2 investigate the large-testing-sample scenario where the testing sample size $N$, the number of training samples $n$, and the training sample sizes $N_{i}$ all diverge, while Subsection 5.3 considers the small-testing-sample case where $N$ is small and kept fixed but $n$ and the $N_{i}$ diverge. We present theoretical properties of the MLE as justification for the proposed methods, since for a large testing sample the BLUP and MAP estimates emphasize the observations rather than the population-level information and will nearly degenerate to the MLE. Proofs and additional discussions are included in the Supplementary Materials.

For theoretical development, densities in the underlying family are required to be smooth and uniformly bounded away from zero and infinity. This condition is standard when analyzing densities as functional data given discrete observations; see, e.g., Petersen and Müller (2016); Han et al. (2020).

  1. (A1)

     All densities $q\in\mathcal{P}$ are continuously differentiable. Moreover, there is a constant $M>1$ such that, for all $q\in\mathcal{P}$, $\lVert q\rVert_{\infty}$, $\lVert 1/q\rVert_{\infty}$, and $\lVert q^{\prime}\rVert_{\infty}$ are bounded above by $M$.

Under (A1), the approximating families $\mathcal{P}_{K}$, $\tilde{\mathcal{P}}_{K}$, and $\hat{\mathcal{P}}_{K}$ as defined in (2.3), (2.17), and (2.12) are identifiable, thanks to the fact that the entries of the sufficient statistic are orthogonal to each other and to the constant function, as shown in LABEL:supp-prop:identifiable of the Supplementary Materials.

5.1 Consistency of the MLE as $N$ Diverges

The proposed MLE within the low-dimensional approximating family $\hat{\mathcal{P}}_{K}$ is shown to be consistent. The MLE is computed from an additional i.i.d. sample $X_{1},\dots,X_{N}$ of size $N$, whose density $p$ is an independent copy of the random density. The approximating family $\hat{\mathcal{P}}_{K}$ is obtained from $n$ discretely observed training subpopulations, each with $N_{i}\geq m$ observations. Theoretical results for this practical scenario are presented here; results for the MLE obtained within $\mathcal{P}_{K}$ and $\tilde{\mathcal{P}}_{K}$ are included in the Supplementary Materials. For brevity, we use $\hat{p}=\hat{p}_{N,K}\in\hat{\mathcal{P}}_{K}$ to denote the MLE density estimate with parameter $\hat{\theta}=\hat{\theta}_{N,K}$ defined in (2.14).

The consistency results rely on the following conditions imposed on the covariance of the log-transformed random density.

  1. (C1)

     For the process $f=\psi p\in L^{2}(\mathcal{T})$, the eigenvalues and eigenfunctions of the integral operator $\mathcal{G}$ associated with the covariance function $G$ satisfy, as $J\rightarrow\infty$,

     $\left(\sum_{k=1}^{J}\|\varphi_{k}^{\prime}\|_{\infty}\right)\left(\sum_{k>J}\lambda_{k}\right)=O(1)$ and $\left(\sum_{k=1}^{J}\|\varphi_{k}^{\prime}\|_{\infty}\right)^{2/3}\left(\sum_{k>J}\lambda_{k}\right)=o(1)$.

  2. (C2)

     For all $k=1,\dots,\operatorname{rank}\mathcal{G}$, the non-zero eigenvalues are distinct and satisfy $\delta_{k}\overset{\triangle}{=}\inf_{l\neq k}\lvert\lambda_{k}-\lambda_{l}\rvert/2>0$; the estimated eigenfunctions are properly aligned so that $\left\langle\varphi_{k},\tilde{\varphi}_{k}\right\rangle\geq 0$ and $\left\langle\varphi_{k},\hat{\varphi}_{k}\right\rangle\geq 0$.

Condition (C1) ensures that the high-frequency eigenfunctions are not overly bumpy relative to the amount of information they carry, as quantified by the eigenvalues, so that the trajectories can be reasonably represented by the first few components. Condition (C2) is standard and simplifies the presentation. We assume $\operatorname{rank}\mathcal{G}=\infty$ in the discussion hereafter unless otherwise stated.

Conditions (A1), (C1), and (C2) together are sufficient for the consistency of the MLE within $\mathcal{P}_{K}$ and $\tilde{\mathcal{P}}_{K}$, as shown in LABEL:supp-thm:kldiv.all of the Supplementary Materials. When discrete training samples and pre-smoothing are involved, additional conditions (D1), (D2), and (S1), in parallel to those in Petersen and Müller (2016), are required as listed in the Appendix. These conditions essentially require that the pilot density estimator performs well uniformly, provided the sample sizes are sufficiently large. Here (D1) is a uniform MISE requirement, (D2) requires the $L^{\infty}$ error to be $O_{p}(a_{m})$, and (S1) is a requirement on the rate of the sample sizes as the number of samples increases, so that one can derive uniform rates of convergence for the pre-smoothed training densities and log-densities under the $L^{\infty}$ norm. Conditions (D1), (D2), and (S1) can be satisfied by the weighted kernel density estimate (KDE) in (2.11) with proper bandwidths and kernels, e.g., $h\asymp m^{-1/3}$ and the Gaussian kernel. See LABEL:supp-sec:presmooth of the Supplementary Materials for details.

5.1.1 Main Consistency Results

We have the following result for the proposed MLE within $\hat{\mathcal{P}}_{K}$. Recall that $n$ is the number of training subpopulations, $m$ is the minimal training sample size in (S1), $N$ is the size of the fitting sample, and the sequences $\{a_{m}\}$ and $\{b_{m}\}$ are those in (D1) and (D2).

Theorem 5.1.

Under (A1), (C1), (C2), (D1), (D2), and (S1), if $n$, $N$, and $K$ increase to infinity in such a way that

A_{K}\sqrt{K/N}\rightarrow 0,\quad KA_{K}(1/\sqrt{n}+b_{m})\rightarrow 0,\quad\sum_{k\leq K}\lambda_{k}^{-1}(1/\sqrt{n}+a_{m})=O(1),
A_{K}\sum_{k\leq K}\delta_{k}^{-1}(1/\sqrt{n}+b_{m})\rightarrow 0,\qquad\sum_{k\leq K}(\delta_{k}\lambda_{k})^{-1}(1/\sqrt{n}+b_{m})=O(1),

where

$$A_{K}=2\max\Big(\lvert\mathcal{T}\rvert^{-1/2},\ \Big(\sum_{k=1}^{K}\lVert\varphi_{k}^{\prime}\rVert_{\infty}\Big)^{1/3}\Big),\quad(5.1)$$

then

$$D(p\,\|\,\hat{p}_{N,K})=O_{p}\left(\Big(\frac{1}{n}+b_{m}^{2}\Big)\Big(K+\sum_{k\leq K}\delta_{k}^{-1}\Big)^{2}+\sum_{k>K}\lambda_{k}+\frac{K}{N}\right).$$
Proof.

See LABEL:supp-thm:kldiv.all in the Supplementary Materials. ∎

Note that the result is highly flexible and applies to general infinite-dimensional families; in particular, we do not assume the underlying $\mathcal{P}$ to be a finite-dimensional exponential family. The three summands in the rate correspond to the training error, a bias term due to the approximating family, and a variance term from fitting with $N$ observations. Varying the number of components $K$ trades off the errors from these three terms, as a more complex model with a larger $K$ has larger training and fitting errors but smaller approximation bias. The quantity $A_{K}$ is closely related to (C1) and to the equivalence of the $L^{\infty}$ and $L^{2}$ norms on the eigenspaces of the covariance operator; see LABEL:supp-prop:eigenspace.lipschitz in the Supplementary Materials.
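As a purely illustrative view of this trade-off, the following snippet evaluates the three terms of the rate under the polynomial-decay setting $\lambda_{k}\asymp k^{-r}$ of Proposition 5.1 below, with the eigengaps $\delta_{k}$ approximated by half the adjacent spacings; the constants and sample sizes are arbitrary.

```python
# Illustrative only: evaluating the three terms in the rate of Theorem 5.1
# under lambda_k = k^{-r} (cf. Proposition 5.1), with the eigengaps delta_k
# approximated by half the adjacent spacings.
import numpy as np

def rate_terms(K, n, N, bm, r=3.0):
    lam = np.arange(1, K + 2, dtype=float) ** (-r)    # lambda_1..lambda_{K+1}
    delta = -np.diff(lam) / 2                         # (lam_k - lam_{k+1}) / 2
    train = (1 / n + bm ** 2) * (K + np.sum(1 / delta)) ** 2
    bias = (K + 1.0) ** (-(r - 1)) / (r - 1)          # ≈ sum_{k > K} lambda_k
    var = K / N
    return train, bias, var

for K in (2, 4, 8, 16):                               # bias falls, others grow
    print(K, rate_terms(K, n=10_000, N=200, bm=1e-3))
```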

An immediate corollary is the convergence of the densities in the $L^{1}$ norm, by Pinsker’s inequality (Wainwright 2019), which states that the KL divergence dominates the squared $L^{1}$ distance up to a constant factor.

Corollary 5.1.

Under the conditions of Theorem 5.1,

$$\lVert\hat{p}_{N,K}-p\rVert_{1}^{2}=O_{p}\left(\Big(\frac{1}{n}+b_{m}^{2}\Big)\Big(K+\sum_{k\leq K}\delta_{k}^{-1}\Big)^{2}+\sum_{k>K}\lambda_{k}+\frac{K}{N}\right).$$
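The inequality underlying the corollary is easy to verify numerically; the snippet below checks $\lVert p-q\rVert_{1}^{2}\leq 2D(p\,\|\,q)$ for two discretized densities, purely as an illustration.

```python
# A quick numerical check of Pinsker's inequality, ||p - q||_1^2 <= 2 D(p||q),
# on two discretized (truncated-Gaussian) densities; purely illustrative.
import numpy as np

grid = np.linspace(0, 1, 1001)
p = np.exp(-0.5 * ((grid - 0.4) / 0.10) ** 2); p /= np.trapz(p, grid)
q = np.exp(-0.5 * ((grid - 0.5) / 0.15) ** 2); q /= np.trapz(q, grid)

l1_sq = np.trapz(np.abs(p - q), grid) ** 2
kl = np.trapz(p * np.log(p / q), grid)
assert l1_sq <= 2 * kl                      # KL dominates the squared L1 norm
print(l1_sq, 2 * kl)
```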

Our general result directly implies the next corollary, in which the underlying family is actually a $K_{0}$-dimensional exponential family for some finite $K_{0}$, so that truncating the log-densities incurs no approximation bias. Note that (C1) is not required, and by convention we set $\lambda_{k}$ and $\varphi_{k}$ to zero for $k>K_{0}$.

Corollary 5.2.

Under (A1), (C2), (D1), (D2), and (S1), if $\mathcal{P}$ is a $K_{0}$-dimensional exponential family for some finite $K_{0}$, then for any fixed $K\geq K_{0}$,

$$D(p\,\|\,\hat{p}_{N,K})=O_{p}\Big(\frac{1}{n}+b_{m}^{2}+\frac{1}{N}\Big),$$

as $n,N\rightarrow\infty$ and $m=m(n)\rightarrow\infty$.

Proof.

See LABEL:supp-coro:kldiv.finiteK in the Supplementary Materials. ∎

5.1.2 Selecting KK

The following proposition provides an example in which the requirements of Theorem 5.1 are satisfied.

Proposition 5.1.

Under (C2), if

  1. (1)

    the eigenfunctions coincide with the polynomial or trigonometric basis so that $\lVert\varphi_{k}^{\prime}\rVert_{\infty}\asymp k$ for large $k$;

  2. (2)

    the eigenvalues have polynomial decay: $\lambda_{k}\asymp k^{-r}$ with $r\geq 3$ as $k\rightarrow\infty$;

  3. (3)

    the minimum size of the $n$ training samples satisfies $m(n)=n^{r_{m}}$ for some $r_{m}>0$;

  4. (4)

    the weighted KDE (2.11) is used as the pre-smoother with $h\asymp m^{-1/3}=n^{-r_{m}/3}$ as $n\rightarrow\infty$; and

  5. (5)

    for any fixed $0<r_{a}<1/6$, the number of components $K$ used for fitting depends on $r$, $n$, $N$, $r_{m}$, and $r_{a}$ through (a hypothetical implementation of this selection rule is sketched after the proof)

    $$K=\hat{K}^{*}\asymp\min\Big(n^{\frac{1}{4(r+1)}},\,n^{\frac{r_{a}r_{m}}{r+1}},\,N^{1/r}\Big)\text{ as }n,N\rightarrow\infty,$$

then the conditions for Theorem 5.1 are satisfied with $a_{m}=n^{-r_{a}r_{m}}$ and $b_{m}=n^{-r_{m}/3}$. As $n,N\rightarrow\infty$,

$$D(p\,\|\,\hat{p}_{N,\hat{K}^{*}})=O_{p}\big(n^{\gamma_{1}}+N^{-1+1/r}\big),$$

where $\gamma_{1}=r_{a}r_{m}(1-r)/(r+1)$ if $r_{m}\leq 1/(4r_{a})$, and $\gamma_{1}=-1/4+1/(2r+2)$ if $r_{m}>1/(4r_{a})$.

Proof.

See LABEL:supp-prop:optimal.K.all in the Supplementary Materials. ∎
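The selection rule in condition (5) admits a direct implementation; the following hypothetical helper, with an illustrative constant c and the name choose_K, is a sketch only.

```python
# A hypothetical helper implementing the selection rule of Proposition 5.1;
# the constant c and the name choose_K are illustrative.
import numpy as np

def choose_K(n, N, r, r_m, r_a, c=1.0):
    """K-hat-star ≍ min(n^{1/(4(r+1))}, n^{r_a r_m / (r+1)}, N^{1/r})."""
    K = c * min(n ** (1 / (4 * (r + 1))),
                n ** (r_a * r_m / (r + 1)),
                N ** (1 / r))
    return max(1, int(np.floor(K)))

# e.g., choose_K(n=5000, N=100, r=3, r_m=1.0, r_a=0.15)
```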

Such $\hat{K}^{*}$ is derived by finding the optimal feasible $K$ for the dominating terms in the convergence rate. Compared with the optimal $K$ without pre-smoothing, given in LABEL:supp-prop:optimal.K.all of the Supplementary Materials, this example suggests that pre-smoothing does not slow down the convergence rate of the proposed MLE when the training subpopulations are densely observed so that $r_{m}$ is large, e.g., $r_{m}>3/2$: indeed, if $r_{m}>1/(4r_{a})$, then $D(p\,\|\,\tilde{p}_{N,\tilde{K}^{*}})\asymp D(p\,\|\,\hat{p}_{N,\hat{K}^{*}})$, where $\tilde{K}^{*}$ is the optimal $K$ for the MLE within $\tilde{\mathcal{P}}_{K}$.

This result is particularly interesting, since it suggests that slight over-smoothing may be preferred when pre-smoothing with the KDE in order to obtain a better convergence rate. Indeed, since $\gamma_{1}<0$ attains its most negative value once $r_{a}r_{m}>1/4$, it is desirable to have a larger $r_{a}$ so that a smaller $r_{m}$ suffices for $r_{a}r_{m}>1/4$, meaning that smaller training samples can be considered “dense”. As required in LABEL:supp-prop:preS.ass.good, the only restrictions on $r_{a}$ are $0<r_{h}+r_{a}<1/2$ and $0<r_{a}<r_{h}$; hence, if we choose a smaller $r_{h}$ just above $1/4$, a larger $r_{a}$ becomes possible, making $r_{a}r_{m}>1/4$ easier to satisfy.

5.2 Consistency of Component Scores

Another interesting question is the convergence of the component scores of the pre-smoothed training trajectories $\check{f}_{i}=\psi\check{p}_{i}$, $i=1,\ldots,n$, which are used to capture the distribution of the random densities in the MAP (2.15) and BLUP (2.16) estimates. In terms of notation, let $\check{f}_{i,K}=\hat{\mu}+\sum_{k\leq K}\check{\eta}_{ik}\hat{\varphi}_{k}$ be the truncated pre-smoothed log-density with $\check{\eta}_{ik}=\langle\check{f}_{i}-\hat{\mu},\hat{\varphi}_{k}\rangle$ for $k=1,\ldots,K$ and $i=1,\dots,n$. The next proposition shows that the estimated component scores $\check{\eta}_{ik}$ converge uniformly to the truth $\eta_{ik}=\langle f_{i}-\mu,\varphi_{k}\rangle$.

Proposition 5.2.

Under (A1), (D1), (D2), (S1), and (C2), for any $K<\infty$, as $n\rightarrow\infty$, we have

$$\sup_{i\leq n}\sup_{k\leq K}\lvert\check{\eta}_{ik}-\eta_{ik}\rvert=O_{p}\Big(a_{m}+(n^{-1/2}+b_{m})\sup_{k\leq K}\delta_{k}^{-1}\Big).$$
Proof.

See LABEL:supp-prop:comp.score.all in the Supplementary Materials. ∎

With the previous proposition, it is easy to see that $\sup_{i\leq n}\sup_{k\leq K}\lvert\check{\eta}_{ik}-\eta_{ik}\rvert=O_{p}(a_{m}+n^{-1/2})$ as $n\rightarrow\infty$ if $K$ is fixed. To obtain a similar result with $K\rightarrow\infty$ as well, we need $n,K\rightarrow\infty$ in such a way that $(n^{-1/2}+b_{m})\sup_{k\leq K}\delta_{k}^{-1}\rightarrow 0$. Under the conditions of Proposition 5.1, we can set $K\asymp n^{\beta}$ for any $\beta<\max(1/2,r_{m}/3)/(1+r)$, and then the sample component scores are uniformly consistent. In particular, $K=\hat{K}^{*}$ satisfies the requirement.
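As a concrete illustration of the quantities in Proposition 5.2, the scores $\check{\eta}_{ik}=\langle\check{f}_{i}-\hat{\mu},\hat{\varphi}_{k}\rangle$ can be computed by numerical quadrature when the pre-smoothed log-densities are held on a grid; a minimal sketch, with illustrative array names, is given below.

```python
# A minimal sketch of the component scores eta_check[i, k] =
# <f_check_i - mu_hat, phi_hat_k>, computed by trapezoidal quadrature;
# array names are illustrative.
import numpy as np

def component_scores(f_check, mu_hat, phi_hat, grid):
    """f_check: (n, G) pre-smoothed log-densities on the grid;
    mu_hat: (G,); phi_hat: (K, G); returns an (n, K) score matrix."""
    centered = f_check - mu_hat                                  # (n, G)
    integrand = centered[:, None, :] * phi_hat[None, :, :]       # (n, K, G)
    return np.trapz(integrand, grid, axis=-1)
```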

5.3 Consistency of MLE in a Small and Fixed Sample

The following result further characterizes the asymptotic behavior of the proposed MLE for a given new fitting sample with a small and fixed sample size $N$, where $K$ is finite and fixed and there is a large number of observations from the training subpopulations.

Proposition 5.3.

Under (A1), (D1), (D2), (S1), and a fixed $K$, for a given sample $X_{1},\dots,X_{N}$ drawn from a density $p\in\mathcal{P}$, assume the MLE $\mathring{p}_{N,K}\in\mathcal{P}_{K}$ exists and is unique with corresponding parameter $\mathring{\theta}\in\mathbb{R}^{K}$. Then the MLE $\hat{p}_{N,K}\in\hat{\mathcal{P}}_{K}$ exists and is unique with probability tending to 1 as $n\rightarrow\infty$, where $n$ is the number of training subpopulations used to construct $\hat{\mathcal{P}}_{K}$. The corresponding parameter $\hat{\theta}_{n}\in\mathbb{R}^{K}$ converges to $\mathring{\theta}$ in probability, with

$$\lVert\hat{\theta}_{n}-\mathring{\theta}\rVert_{2}=O_{p}(1/\sqrt{n}+a_{m}+b_{m}),$$
$$D(\mathring{p}_{N,K}\,\|\,\hat{p}_{N,K})=O_{p}\big(1/\sqrt{n}+a_{m}+b_{m}\big).$$
Proof.

See LABEL:supp-prop:rate.training.thetahat and LABEL:supp-coro:rate.training.kldiv in the Supplementary Materials. ∎

Appendix A Assumptions for Pre-smoothing

When discrete training samples and pre-smoothing are involved, the following conditions, in parallel to (D1), (D2), and (S1) in Petersen and Müller (2016), are also required. Let $q\in\mathcal{P}$ be a fixed density; the conditions here are imposed on the family $\mathcal{P}$, not on the distribution of the random density $p$. Further, let $\check{q}$ be the corresponding pre-smoothed density estimate, e.g., the weighted KDE defined in (2.11).

  1. (D1)

    There exists a positive sequence $b_{N}=o(1)$ as $N\rightarrow\infty$ such that, based on an i.i.d. sample of size $N$ from a density $q\in\mathcal{P}$, the density estimator $\check{q}$ satisfies $\check{q}\geq 0$, $\int_{\mathcal{T}}\check{q}=1$, and

    $$\sup_{q\in\mathcal{P}}E\big(\lVert\check{q}-q\rVert_{2}^{2}\big)=O(b_{N}^{2}).$$
  2. (D2)

    There exist a positive sequence $a_{N}=o(1)$ as $N\rightarrow\infty$ and some $R>0$ such that, based on an i.i.d. sample of size $N$ from a population with density $q\in\mathcal{P}$, the density estimator $\check{q}$ satisfies $\check{q}\geq 0$, $\int_{\mathcal{T}}\check{q}=1$, and

    $$\sup_{q\in\mathcal{P}}P\big(\lVert\check{q}-q\rVert_{\infty}>Ra_{N}\big)\rightarrow 0\ \text{ as }N\rightarrow\infty.$$
  3. (S1)

    Let $\check{q}$ be a density estimator satisfying (D2) and let $N_{i}=N_{i}(n)$, $i=1,\ldots,n$, be the sample sizes in the training subpopulations. There exists a sequence of lower bounds $m(n)\leq\min_{1\leq i\leq n}N_{i}$ such that $m(n)\to\infty$ as $n\rightarrow\infty$ and

    $$n\sup_{q\in\mathcal{P}}P\big(\lVert\check{q}-q\rVert_{\infty}>Ra_{m}\big)\to 0,$$

    where $\check{q}$ is obtained with sample size $N(n)\geq m(n)$.

SUPPLEMENTARY MATERIAL

The Supplementary Materials include additional simulation results and details of the theoretical developments.

References

  • Amari (2016) Amari, S.-i. (2016). Information Geometry and Its Applications, Volume 194 of Applied mathematical sciences. Springer, [Tokyo].
  • Amari and Nagaoka (2000) Amari, S.-i. and H. Nagaoka (2000). Methods of Information Geometry, Volume 191 of Translations of mathematical monographs. American Mathematical Society, Providence, RI; Oxford University Press, Oxford.
  • Bonevski et al. (2014) Bonevski, B., M. Randell, C. Paul, K. Chapman, L. Twyman, J. Bryant, I. Brozek, and C. Hughes (2014). Reaching the hard-to-reach: a systematic review of strategies for improving health and medical research with socially disadvantaged groups. BMC medical research methodology 14(1), 42.
  • Chu et al. (2018) Chu, B., K. Huynh, D. Jacho-Chávez, and O. Kryvtsov (2018). On the evolution of the United Kingdom price distributions. Ann. Appl. Stat. 12(4), 2618–2646.
  • Dai (2020) Dai, X. (2020). Statistical inference on the Hilbert sphere with application to random densities. ArXiv.
  • Delicado (2011) Delicado, P. (2011). Dimensionality reduction when data are density functions. Comput. Statist. Data Anal. 55(1), 401–420.
  • Egozcue et al. (2006) Egozcue, J. J., J. L. Díaz-Barrero, and V. Pawlowsky-Glahn (2006). Hilbert space of probability density functions based on Aitchison geometry. Acta Math. Sin. (Engl. Ser.) 22(4), 1175–1182.
  • Fithian and Wager (2015) Fithian, W. and S. Wager (2015). Semiparametric exponential families for heavy-tailed data. Biometrika 102(2), 486–493.
  • Gervini (2016) Gervini, D. (2016). Independent component models for replicated point processes. Spat. Stat. 18(part B), 474–488.
  • Gervini and Khanal (2019) Gervini, D. and M. Khanal (2019). Exploring patterns of demand in bike sharing systems via replicated point process models. J. R. Stat. Soc. Ser. C. Appl. Stat. 68(3), 585–602.
  • Han et al. (2020) Han, K., H.-G. Müller, and B. U. Park (2020). Additive functional regression for densities as responses. J. Amer. Statist. Assoc. 115(530), 997–1010.
  • Hron et al. (2016) Hron, K., A. Menafoglio, M. Templ, K. Hrůzová, and P. Filzmoser (2016, February). Simplicial principal component analysis for density functions in Bayes spaces. Computational Statistics & Data Analysis 94, 330–350.
  • Huynh et al. (2011) Huynh, K. P., D. T. Jacho-Chávez, R. J. Petrunia, and M. Voia (2011). Functional principal component analysis of density families with categorical and continuous data on Canadian entrant manufacturing firms. J. Amer. Statist. Assoc. 106(495), 858–878.
  • Johnson et al. (2016) Johnson, A. E. W., T. J. Pollard, L. Shen, L.-w. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark (2016, May). MIMIC-III, a freely accessible critical care database. Scientific Data 3(1), 1–9.
  • Kim et al. (2020) Kim, D., Y. K. Lee, and B. U. Park (2020, January). Principal component analysis for Hilbertian functional data. Communications for Statistical Applications and Methods 27(1), 149–161.
  • Kneip and Utikal (2001) Kneip, A. and K. J. Utikal (2001). Inference for density families using functional principal component analysis. J. Amer. Statist. Assoc. 96(454), 519–542.
  • Kooperberg and Stone (1991) Kooperberg, C. and C. J. Stone (1991). A study of logspline density estimation. Comput. Statist. Data Anal. 12(3), 327–347.
  • Lee and Scott (2012) Lee, G. and C. Scott (2012). EM algorithms for multivariate Gaussian mixture models with truncated and censored data. Comput. Statist. Data Anal. 56(9), 2816–2829.
  • Loader (1996) Loader, C. R. (1996). Local likelihood density estimation. Ann. Statist. 24(4), 1602–1618.
  • Panaretos and Zemel (2016) Panaretos, V. M. and Y. Zemel (2016). Amplitude and phase variation of point processes. Ann. Statist. 44(2), 771–812.
  • Petersen and Müller (2016) Petersen, A. and H.-G. Müller (2016). Functional data analysis for density functions by transformation to a Hilbert space. Ann. Statist. 44(1), 183–218.
  • Sheather and Jones (1991) Sheather, S. J. and M. C. Jones (1991). A reliable data-based bandwidth selection method for kernel density estimation. J. Roy. Statist. Soc. Ser. B 53(3), 683–690.
  • Srivastava et al. (2007) Srivastava, A., I. Jermyn, and S. Joshi (2007). Riemannian analysis of probability density functions with applications in vision. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp.  1–8.
  • Tsay (2016) Tsay, R. S. (2016). Some methods for analyzing big dependent data. J. Bus. Econom. Statist. 34(4), 673–688.
  • Wainwright (2019) Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint (First ed.). Cambridge University Press.
  • Wakefield and Elliott (1999) Wakefield, J. and P. Elliott (1999). Issues in the statistical analysis of small area health data. Statistics in Medicine 18(17-18), 2377–2399.
  • Wu et al. (2013) Wu, S., H.-G. Müller, and Z. Zhang (2013). Functional data analysis for point processes with rare events. Statist. Sinica 23(1), 1–23.
  • Yao et al. (2005) Yao, F., H.-G. Müller, and J.-L. Wang (2005). Functional data analysis for sparse longitudinal data. J. Amer. Statist. Assoc. 100(470), 577–590.