
Nonparametric Estimation of Repeated Densities with Heterogeneous Sample Sizes

Jiaming Qiu    Xiongtao Dai    Zhengyuan Zhu
Department of Statistics, Iowa State University
Abstract

We consider the estimation of densities in multiple subpopulations, where the available sample size in each subpopulation greatly varies. This problem occurs in epidemiology, for example, where different diseases may share similar pathogenic mechanisms but differ in their prevalence. Without specifying a parametric form, our proposed method pools information from the population and estimates the density in each subpopulation in a data-driven fashion. Drawing from functional data analysis, low-dimensional approximating density families in the form of exponential families are constructed from the principal modes of variation in the log-densities. Subpopulation densities are subsequently fitted in the approximating families based on likelihood principles and shrinkage. The approximating families increase in flexibility as the number of components increases and can approximate arbitrary infinite-dimensional densities. We also derive convergence results for the density estimates given discrete observations. The proposed methods are shown to be interpretable and efficient in simulations as well as in applications to electronic medical record and rainfall data.

Keywords: Sparse functional data analysis; Functional principal component analysis; Shrinkage estimation; Dimension reduction; Object data analysis

1 Introduction

Density estimation is one of the most fundamental tasks in statistics and machine learning. We consider the problem of repeated density estimation, namely, estimating the distributions of a variable in multiple subpopulations of a similar nature. For example, the distributions of interest can be the age distributions of patients with different diseases, or the precipitation distributions in multiple regions. Repeated density functions can be modeled as realizations of a random density element that is subject to the positivity and unit integral constraints. Moreover, density functions are in practice unobserved and are accessed through discrete samples, creating additional difficulties for data analysis. The primary objective of this work is to address these difficulties and derive reliable yet flexible density estimates for the subpopulations given discrete observations, in a fully data-driven and efficient manner.

Most of the existing work considers random densities that are either completely observed or reconstructed from densely sampled observations. This type of density object was first analyzed by Kneip and Utikal (2001) by directly applying functional principal component analysis (FPCA), a technique that has later been applied to analyze time-varying densities (Huynh et al. 2011; Tsay 2016; Chu et al. 2018). Recognizing the constraints on density functions, Delicado (2011) and Petersen and Müller (2016) proposed to analyze densities by utilizing one-to-one transformations to map them to an unconstrained space, where the transformed densities are represented using FPCA and subsequently back-transformed into densities. Distinct geometries are generated on the space of densities by different transformations, such as the log-hazard and log-quantile density transformations (Petersen and Müller 2016; Kim et al. 2020); log transformations for analyzing densities as compositional data (Egozcue et al. 2006; Hron et al. 2016); and the square root transformation with the Riemannian logarithm map (Srivastava et al. 2007; Dai 2020), which generates a Hilbert sphere geometry.

We instead consider fitting sparsely observed random densities with as few as a handful of realizations available. Subpopulations with small sample sizes arise in a number of situations when the sampling mechanism dictates heterogeneous sample sizes. A first situation is when the availability and cost of observation vary greatly among subpopulations (Wakefield and Elliott 1999), with some groups easy but the rest hard to reach (Bonevski et al. 2014). This scenario is illustrated by the Medical Information Mart for Intensive Care (MIMIC) data (Johnson et al. 2016) in our first data application, where patient records are ample for common diseases such as diabetes and coronary heart disease but scarce for uncommon conditions. A second generating mechanism is that the data collection process has been established for different durations in different subpopulations (Fithian and Wager 2015), which is illustrated by our second data application considering rainfall recorded by weather stations around the Haihe River basin in northern China, which differ in the number of years for which data are available.

The literature on sparsely observed random densities is scant. A closely related topic is replicated point processes, whose intensity functions can be normalized into densities. Panaretos and Zemel (2016) considered time-warping of densely sampled Poisson point processes under a transportation geometry. Given sparsely sampled replicated point processes, Wu et al. (2013) proposed to pool information over multiple realizations and directly model the intensity functions using FPCA. Gervini (2016) and Gervini and Khanal (2019) considered semi-parametric models to decompose the intensity and the log-intensity functions, respectively.

Our proposal is a nonparametric approach that directly targets the repeated densities, which are modeled as realizations of a density-valued random element, while the discrete observations from each subpopulation are modeled as i.i.d. realizations from the corresponding density. Information is pooled over all subpopulations in a two-step fashion. First, we identify low-dimensional structures that parsimoniously approximate the densities using data from densely sampled subpopulations, by applying FPCA to the centralized log-transformed densities that lie in an unconstrained $L^{2}$ space. When transformed back to densities, the low-dimensional structures make up exponential families, where the sufficient statistics are the eigenfunctions of the transformed densities, and each density element is compactly represented by its natural parameter or, equivalently, its moment coordinate (Amari 2016). Second, a new and potentially sparsely observed subpopulation is fitted within the approximating family, obtaining the maximum likelihood estimate (MLE), and the coordinate estimate is then shrunk towards the population using distributional information, further enforcing information borrowing. Two shrinkage estimates, namely the maximum a posteriori (MAP) estimate and the best linear unbiased prediction (BLUP), are proposed based on the natural parameter and the moment coordinate, respectively.

To illustrate the idea, suppose that random densities are generated from the normal distribution family $\mathcal{P}=\{p_{\mu,\sigma}:\mu\in\mathds{R},\sigma>0\}$ with densities $p_{\mu,\sigma}(x)\propto\exp\left(x(\mu/\sigma^{2})-x^{2}/(2\sigma^{2})\right)$ truncated to $x\in[-1,1]$. Observe that a random element $p$ taking values in $\mathcal{P}$ satisfies $\log p\in\operatorname{span}\{1,x,x^{2}\}$ and thus lies in a low-dimensional space. The low-dimensional structure can be recovered by applying dimension reduction techniques such as FPCA to $\log p$. When inferring the distribution of a subpopulation, information borrowing is performed within the low-dimensional space and is analogous to that for the classical linear mixed effects model. Remarkably, the proposed methods estimate the low-dimensional structure and pool information fully nonparametrically, without imposing parametric shape assumptions. A worked simulation example is included in LABEL:supp-example:rand.intercept of the Supplementary Materials.
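
A quick numerical check of this low-dimensional structure is sketched below (a minimal illustration; the grid, seed, and parameter ranges are our own arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(-1, 1, 101)

# Centralized log-densities of random truncated normals on [-1, 1]
F = []
for _ in range(50):
    mu, sigma = rng.uniform(-1, 1), rng.uniform(0.5, 2.0)
    log_p = t * (mu / sigma**2) - t**2 / (2 * sigma**2)  # up to a constant
    F.append(log_p - np.trapz(log_p, t) / 2.0)           # center; |T| = 2
F = np.array(F)

# After subtracting the mean function, the trajectories vary only in the
# directions generated by x and x^2: the numerical rank is 2
print(np.linalg.matrix_rank(F - F.mean(axis=0), tol=1e-8))
```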

The proposed methods combine the flexibility of non-parametric density estimation with the statistical efficiency and interpretability offered by parametric inference and information borrowing. Ultimately, fitting a density with sparse observations follows likelihood principles in an estimated exponential family, so the number of observations required by our proposed methods can be substantially smaller than what would otherwise be required by non-parametric estimates. Moreover, the exponential family approximation has advantages in computation and interpretability. The approximating family increases in flexibility as its dimension increases and can approximate an arbitrarily complex density. Rates of convergence of the proposed MLE under the Kullback–Leibler (KL) divergence and the $L^{1}$ norm are established under asymptotics in which the number of training subpopulations, the sample size of each training subpopulation, and the sample size of the new subpopulation diverge to infinity. This result holds for general infinite-dimensional random densities under smoothness and boundedness conditions. If the random density actually lies in a finite-dimensional exponential family, then the proposed method converges under the KL divergence to the truth at a near-parametric rate. If the sample size in the new subpopulation is fixed, the proposed MLE converges to the MLE in the best approximating exponential family.

The proposed models and estimation procedures are introduced in Section 2. Numerical properties of the proposed methods are investigated and compared to alternative parametric and nonparametric density estimates in Section 3. Data illustrations with the age-at-admission of ICU patients and rainfall data in the Haihe River basin are included in Section 4. Theoretical results are stated in Section 5. Additional simulations, discussion, and detailed proofs are included in the Supplementary Materials.

2 Models and Methods

2.1 Data Setup and Goals

Let $\{X_{ij}:j=1,\dots,N_{i},\,i=1,\dots,n\}$ be the observations of a continuous variable in subpopulations $i=1,\dots,n$; e.g., the age-at-admission to the Intensive Care Unit (ICU) described in Subsection 4.1, grouped by primary diagnosis. Here $X_{i1},\dots,X_{iN_{i}}$ are i.i.d. observations from the $i$th subpopulation with density $p_{i}$. The densities $p_{1},\dots,p_{n}$ are modeled as i.i.d. realizations of a random density $p$ taking values in a family $\mathcal{P}$, where $\mathcal{P}$ consists of densities supported on a common compact interval $\mathcal{T}\subset\mathds{R}$. For example, there may exist latent characteristics that determine the susceptibility age profiles of different diseases, so the age distribution of a disease can be regarded as a realization of a density-valued random element. To emphasize the grouping structure, we refer to $p_{i}$ as the density of the $i$th subpopulation and to the distribution of the random density $p$ as the population.

Our goal is to estimate the density $p_{0}$ of a subpopulation from a few i.i.d. observations $X_{0j}$, $j=1,\ldots,N$, where the number of observations $N$ can potentially be small. For example, this problem occurs when one wants to estimate the age distribution for an uncommon disease. We will first obtain low-dimensional structures of $\mathcal{P}$ from subpopulations where ample observations are available, e.g., from records of more common conditions, and then apply these structures to estimate the density in the subpopulation with a small sample size.

2.2 Proposed Repeated Density Estimation

2.2.1 Constructing Approximating Family

We propose to approximate the distribution family $\mathcal{P}$ with low-rank exponential families by utilizing a functional data analytic approach. The scenario where the distribution of the random density $p$ is given is discussed here; the scenario where the distribution is not directly available is described in Subsubsection 2.2.4.

The collection of densities forms an infinite-dimensional constrained manifold instead of a Hilbert space. To borrow from the rich literature studying functional data as objects in a Hilbert space, we follow a transformation approach (Petersen and Müller 2016) and analyze the transformed densities as Hilbertian elements. The centralized log-transformation (Hron et al. 2016) $\psi:\mathcal{P}\rightarrow L^{2}(\mathcal{T})$ maps a positive density $q\in\mathcal{P}$ into the $L^{2}$ space, obtaining

(\psi q)(t)=\log q(t)-\frac{1}{\left\lvert\mathcal{T}\right\rvert}\int_{\mathcal{T}}\log q(x)\,dx,\quad t\in\mathcal{T}. (2.1)
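
On a discretization grid, the transformation amounts to subtracting the average log-density; a minimal sketch (function and variable names are our own, and the integral is approximated by the trapezoid rule):

```python
import numpy as np

def clr_transform(q, t):
    """Centralized log-transform (2.1) of a positive density q
    evaluated on a grid t over the domain T."""
    log_q = np.log(q)
    return log_q - np.trapz(log_q, t) / (t[-1] - t[0])  # subtract mean over T

# Example: truncated standard normal on T = [-1, 1]
t = np.linspace(-1, 1, 201)
q = np.exp(-t**2 / 2)
q /= np.trapz(q, t)            # normalize to unit integral
f = clr_transform(q, t)        # psi(q); integrates to approximately zero
```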

To derive low-rank approximations to the densities, we apply the Karhunen–Loève expansion to the transformed trajectories $f=\psi p$ in $L^{2}(\mathcal{T})$, obtaining

f(t)=(\psi p)(t)=\mu(t)+\sum_{k=1}^{\infty}\eta_{k}\varphi_{k}(t),\quad t\in\mathcal{T}, (2.2)

where $\mu(t)=Ef(t)$ is the mean function, $\eta_{k}=\int_{\mathcal{T}}(f(t)-\mu(t))\varphi_{k}(t)\,dt$ is the $k$th component score, and $(\lambda_{k},\varphi_{k})$ is the $k$th eigenvalue–eigenfunction pair of the covariance function $G(s,t)=\operatorname{cov}(f(s)-\mu(s),f(t)-\mu(t))$ satisfying $\lambda_{k}\varphi_{k}(t)=\int_{\mathcal{T}}G(s,t)\varphi_{k}(s)\,ds$ for $t\in\mathcal{T}$ and $k=1,2,\dots$. The eigenfunctions are orthonormal, and the eigenvalues are non-negative and non-increasing, satisfying $\lambda_{1}\geq\lambda_{2}\geq\dots\geq 0$.
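
The expansion can be estimated from discretized trajectories via an eigendecomposition of the sample covariance; a sketch under the assumption of a regular grid (the names and the simple discretization scheme are our own choices):

```python
import numpy as np

def fpca(F, t, K):
    """FPCA of trajectories F (n x len(t)) on a regular grid t: sample
    mean, leading eigenvalue-eigenfunction pairs of the discretized
    covariance operator, and the component scores."""
    dt = t[1] - t[0]
    mu = F.mean(axis=0)
    R = F - mu
    G = R.T @ R / len(F)                  # covariance function on the grid
    lam, V = np.linalg.eigh(G * dt)       # discretized integral operator
    lam, V = lam[::-1][:K], V[:, ::-1][:, :K]
    phi = V / np.sqrt(dt)                 # normalize to unit L^2 norm
    scores = R @ phi * dt                 # eta_ik = <f_i - mu, phi_k>
    return mu, lam, phi, scores
```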

Low-rank approximating density families for $\mathcal{P}$ are then obtained by the back-transformation $\psi^{-1}$ and take the form of exponential families. For $K=1,2,\dots$, the $K$-dimensional approximation $\mathcal{P}_{K}$ to $\mathcal{P}$ is the exponential family

\mathcal{P}_{K}=\left\{p_{\theta}:\theta\in\mathds{R}^{K},\,B_{K}(\theta)<\infty\right\}\quad\text{with} (2.3)
p_{\theta}(t)=\exp\left(\mu(t)+\sum_{k=1}^{K}\theta_{k}\varphi_{k}(t)-B_{K}(\theta)\right),\quad t\in\mathcal{T}, (2.4)

where $B_{K}(\theta)=\log\left(\int_{\mathcal{T}}\exp\left(\mu(t)+\sum_{k=1}^{K}\theta_{k}\varphi_{k}(t)\right)dt\right)$ is the normalizing constant. The family $\mathcal{P}_{K}$ is identifiable, as shown in LABEL:supp-prop:identifiable of the Supplementary Materials. The random density $p$ is thus approximated by its truncated version

p_{K}(t)\propto\exp\left(\mu(t)+\sum_{k\leq K}\eta_{k}\varphi_{k}(t)\right),\quad t\in\mathcal{T}, (2.5)

lying in $\mathcal{P}_{K}$.
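
The normalizing constant and the density itself can be evaluated by numerical integration; a minimal sketch, assuming mu and phi hold $\mu$ and the $K$ eigenfunctions on a grid t:

```python
import numpy as np

def log_partition(theta, mu, phi, t):
    """Normalizing constant B_K(theta) in (2.4) by numerical integration;
    phi has shape (len(t), K)."""
    return np.log(np.trapz(np.exp(mu + phi @ theta), t))

def density(theta, mu, phi, t):
    """Density p_theta in the K-dimensional approximating family (2.4)."""
    return np.exp(mu + phi @ theta - log_partition(theta, mu, phi, t))
```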

The model components in (2.4) enjoy interpretations within an exponential family: $\theta_{k}$ is the natural parameter, which compactly describes the density; the eigenfunctions $\varphi_{k}$ form the sufficient statistic; and the mean $\mu$ determines the baseline measure. Analogous to the case of FPCA, the leading eigenfunctions $\varphi_{k}$, $k=1,\dots,K$, encode the principal modes of variation in the log-densities, and the $\eta_{k}$ are the scores explaining the most variation; $\mathcal{P}_{K}$ provides the most parsimonious $K$-dimensional description of the random (log-)densities. Hence, performing density estimation within $\mathcal{P}_{K}$ effectively borrows information from the typical shapes of the densities displayed in the leading eigenfunctions, i.e., the sufficient statistic.

2.2.2 Fitting Densities within Approximating Families

Given a small sample $X_{0j}$, $j=1,\ldots,N$, from a new subpopulation, an estimate of the underlying density $p_{0}$ is obtained as the best fit within the approximating family $\mathcal{P}_{K}$. To assess goodness of fit, various approaches are available from information theory (e.g., Amari and Nagaoka 2000; Amari 2016); we adopt the information loss as quantified by the Kullback–Leibler (KL) divergence, defined for two positive densities $q_{1}$, $q_{2}$ on $\mathcal{T}$ as

D(q_{1}\;\|\;q_{2})=\int_{\mathcal{T}}q_{1}(t)\log\left(\frac{q_{1}(t)}{q_{2}(t)}\right)dt. (2.6)

The KL divergence $D(q_{1}\;\|\;q_{2})$ from $q_{1}$ to $q_{2}$ quantifies the information loss when one approximates $q_{1}$ by $q_{2}$. To estimate $p_{0}$, we propose to use the best approximating element in $\mathcal{P}_{K}$, namely the one minimizing the KL divergence from the empirical distribution. This leads to the maximum likelihood estimate (MLE) $p_{\mathring{\theta}}\in\mathcal{P}_{K}$, where

\mathring{\theta}=\operatorname*{arg\,max}_{\theta:B_{K}(\theta)<\infty}\left(\theta^{T}\bar{\varphi}_{0}-B_{K}(\theta)\right), (2.7)

and $\bar{\varphi}_{0}\in\mathds{R}^{K}$ is the sufficient statistic constructed from the sample $X_{0j}$, whose $k$th element is $N^{-1}\sum_{j\leq N}\varphi_{k}(X_{0j})$, $k=1,\ldots,K$.
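
For illustration, a sketch of this fit using a generic convex solver (evaluating the eigenfunctions off-grid by linear interpolation is our own device, not prescribed by the method):

```python
import numpy as np
from scipy.optimize import minimize

def mle_theta(x_new, mu, phi, t):
    """MLE (2.7): the negative log-likelihood theta -> B_K(theta) -
    theta' phi_bar is convex, so a quasi-Newton solver suffices.
    phi has shape (len(t), K)."""
    # sample sufficient statistic: average of phi_k at the observations
    phi_bar = np.array([np.interp(x_new, t, phi[:, k]).mean()
                        for k in range(phi.shape[1])])
    def neg_loglik(theta):
        B = np.log(np.trapz(np.exp(mu + phi @ theta), t))
        return B - theta @ phi_bar
    return minimize(neg_loglik, np.zeros(phi.shape[1]), method="BFGS").x
```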

2.2.3 Incorporating Population-level Information

In addition to the typical shapes of the random densities, we borrow information from the distribution of these random densities in the population to reduce variation when fitting a new density. This results in a significant improvement, especially when the sample size in the new subpopulation is small.

The distribution of $p$ is captured by those of the component scores $\eta_{k}$, $k=1,\dots,K$. Incorporating such distributional information into the likelihood function, we propose the maximum a posteriori (MAP) estimate $\mathring{p}_{\text{MAP}}=p_{\mathring{\theta}_{\text{MAP}}}\in\mathcal{P}_{K}$, where the parameter estimate $\mathring{\theta}_{\text{MAP}}$ is defined as

\mathring{\theta}_{\text{MAP}}=\operatorname*{arg\,max}_{\theta:B_{K}(\theta)<\infty}\left(\theta^{T}\bar{\varphi}_{0}-B_{K}(\theta)+\log\pi(\theta)\right), (2.8)

and $\pi$ is the unconditional density of the $\eta_{k}$. Analogous to the PACE procedure (Yao et al. 2005), $\pi$ is constructed as a product of independent normal marginals with means $E\eta_{k}=0$ and variances $\operatorname{Var}\eta_{k}=\lambda_{k}$, $k=1,\dots,K$. Here $\mathring{\theta}_{\text{MAP}}$ is a shrinkage of the MLE $\mathring{\theta}$ towards the population distribution $\pi$.
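
Under this Gaussian working prior, (2.8) differs from (2.7) only by a ridge-type penalty; a minimal sketch reusing the setup of the MLE sketch above (names are our own):

```python
import numpy as np
from scipy.optimize import minimize

def map_theta(phi_bar, mu, phi, t, lam):
    """MAP (2.8): the prior prod_k N(0, lambda_k) adds the penalty
    sum_k theta_k^2 / (2 lambda_k) to the negative objective of (2.7),
    shrinking the fit toward the population mean log-density."""
    def neg_obj(theta):
        B = np.log(np.trapz(np.exp(mu + phi @ theta), t))
        return B - theta @ phi_bar + 0.5 * np.sum(theta**2 / lam)
    return minimize(neg_obj, np.zeros(len(phi_bar)), method="BFGS").x
```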

To accommodate non-normal unconditional distributions of the $\eta_{k}$, we propose a second approach to incorporate the population-level knowledge by working with the moment parameterization. The moment parameter, also referred to as the moment coordinate, of an element $p_{\theta}$ in the exponential family $\mathcal{P}_{K}$ is $\xi=(\xi_{1},\dots,\xi_{K})\in\mathds{R}^{K}$, where $\xi_{k}=\int_{\mathcal{T}}\varphi_{k}(t)p_{\theta}(t)dt$. The moment parameter is in one-to-one correspondence with the natural parameter (see, e.g., Amari 2016), noting

\xi=\partial_{\theta}B_{K}(\theta). (2.9)

For brevity, we use $\xi(\cdot):\theta\mapsto\xi(\theta)$ and $\theta(\cdot):\xi\mapsto\theta(\xi)$ to denote the one-to-one mappings derived from (2.9).
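
Since $\partial_{\theta}B_{K}(\theta)=E_{\theta}\varphi(X)$ and $\partial_{\theta}^{2}B_{K}(\theta)=\operatorname{var}_{\theta}\varphi(X)$, the map $\theta(\cdot)$ can be evaluated numerically by Newton's method; a sketch (an undamped iteration for simplicity; step-size control would add robustness):

```python
import numpy as np

def theta_from_xi(xi, mu, phi, t, steps=25):
    """Invert the moment map (2.9): solve grad B_K(theta) = xi with
    grad = E_theta[phi(X)] and Hessian = var_theta(phi(X)), both
    evaluated by numerical integration on the grid t."""
    theta = np.zeros(len(xi))
    for _ in range(steps):
        p = np.exp(mu + phi @ theta)
        p /= np.trapz(p, t)                           # p_theta on the grid
        grad = np.trapz(phi * p[:, None], t, axis=0)  # E_theta phi
        c = phi - grad                                # centered phi
        hess = np.trapz(c[:, :, None] * c[:, None, :] * p[:, None, None],
                        t, axis=0)                    # var_theta phi
        theta = theta + np.linalg.solve(hess, xi - grad)
    return theta
```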

Motivated by the fact that the MLE of the natural parameter is the method-of-moments estimate under the moment parameterization, we consider a shrinkage estimate based on the moment parametrization. Let $\eta_{0}$ be the natural parameter of the truncated version $p_{0,K}\in\mathcal{P}_{K}$ of $p_{0}$ according to (2.5), and let $\tau_{0}=\xi(\eta_{0})\in\mathds{R}^{K}$ be the corresponding moment coordinate. We propose to estimate the moment parameter $\tau_{0}$ by the best linear unbiased predictor (BLUP) of $\tau_{0}$ given the sample sufficient statistic $\bar{\varphi}_{0}$, defined as

\mathring{\xi}_{\text{BLUP}}=\operatorname{BLUP}(\tau_{0}\mid\bar{\varphi}_{0})\overset{\triangle}{=}\operatorname{cov}(\tau_{0},\bar{\varphi}_{0})\operatorname{var}(\bar{\varphi}_{0})^{-1}(\bar{\varphi}_{0}-E\bar{\varphi}_{0})+E\tau_{0}
=\operatorname{var}(\tau_{0})\operatorname{var}(\bar{\varphi}_{0})^{-1}(\bar{\varphi}_{0}-E\tau_{0})+E\tau_{0}, (2.10)

where the last equality follows from $E(\bar{\varphi}_{0}\mid\tau_{0})=\tau_{0}$ and the law of total expectation. Note that the covariance and expectation here are computed under the joint distribution of $(\tau_{0},\bar{\varphi}_{0})$. We thus obtain a plug-in estimate $\mathring{\theta}_{\text{BLUP}}=\theta(\mathring{\xi}_{\text{BLUP}})$ of the natural parameter.

The MAP (2.8) and BLUP (2.10) estimates not only borrow the typical shapes of the densities via the sufficient statistic $\varphi$, analogous to the MLE (2.7), but also take advantage of the distribution of the random density through the component scores. We refer to the information embedded in the component scores of the log-densities as population-level information. The MAP (2.8) and BLUP (2.10) are, respectively, shrinkages of the MLE (2.7) towards the mean under the natural and the moment coordinates. As will be demonstrated in our numerical investigation later, this extra shrinkage is valuable for reducing information loss, especially given small samples.

2.2.4 Pre-smoothing and Unknown Distribution of $p$

Oftentimes the random densities $p_{1},\dots,p_{n}$ and their distribution are unavailable, and only discrete samples $\{X_{ij}:j=1,\dots,N_{i};\,i=1,\dots,n\}$ are accessible. In this case, we utilize pre-smoothing to construct pilot density estimates from the samples, analogous to the approach taken by Petersen and Müller (2016) and Han et al. (2020). While many density estimators may be applied, for example the kernel density estimate (KDE) and logspline (Kooperberg and Stone 1991), for theoretical considerations we adopt the weighted KDE

\check{p}_{i}(t)=\left(\int_{\mathcal{T}}\sum_{j=1}^{N_{i}}\kappa\left(\frac{x-X_{ij}}{h}\right)w(x,h)\,dx\right)^{-1}\sum_{j=1}^{N_{i}}\kappa\left(\frac{t-X_{ij}}{h}\right)w(t,h),\quad t\in\mathcal{T}, (2.11)

to handle boundary bias, where $\kappa$ is a kernel function and $w(t,h)=1/\int_{(t-b)/h}^{(t-a)/h}\kappa(u)\,du$ is the boundary weight for $\mathcal{T}=[a,b]$. In the pre-smoothing step, the pilot density estimator is applied to each sample to obtain the pre-smoothed densities $\check{p}_{1},\dots,\check{p}_{n}$.
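
A sketch of this estimator with a Gaussian kernel (the kernel choice is ours for illustration; any kernel with an available CDF works the same way):

```python
import numpy as np
from scipy.stats import norm

def weighted_kde(x_obs, t, h, a, b):
    """Boundary-corrected weighted KDE (2.11) on T = [a, b]; the weight
    w(t, h) renormalizes the kernel mass that falls inside [a, b]."""
    w = 1.0 / (norm.cdf((t - a) / h) - norm.cdf((t - b) / h))
    kern = norm.pdf((t[:, None] - x_obs[None, :]) / h).sum(axis=1)
    dens = kern * w
    return dens / np.trapz(dens, t)   # overall normalization in (2.11)
```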

FPCA is then performed on the transformed densities $\check{f}_{i}=\psi\check{p}_{i}$, $i=1,\dots,n$, to obtain the sample mean $\hat{\mu}(t)=n^{-1}\sum_{i=1}^{n}\check{f}_{i}(t)$, the sample covariance function $\hat{G}(s,t)=n^{-1}\sum_{i=1}^{n}(\check{f}_{i}(s)-\hat{\mu}(s))(\check{f}_{i}(t)-\hat{\mu}(t))$, the associated eigenvalues $\hat{\lambda}_{k}$ and eigenfunctions $\hat{\varphi}_{k}$, and the component scores $\hat{\eta}_{i}\in\mathds{R}^{K}$ whose $k$th element is $\hat{\eta}_{ik}=\langle\check{f}_{i}-\hat{\mu},\hat{\varphi}_{k}\rangle$, $k=1,\dots,n-1$. A sample version $\hat{\mathcal{P}}_{K}$ of the approximating families is obtained as

\hat{\mathcal{P}}_{K}=\left\{\hat{p}_{\theta}:\theta\in\mathds{R}^{K},\,\hat{B}_{K}(\theta)<\infty\right\},\quad\text{with} (2.12)
\hat{p}_{\theta}(t)=\exp\left(\hat{\mu}(t)+\sum_{k=1}^{K}\theta_{k}\hat{\varphi}_{k}(t)-\hat{B}_{K}(\theta)\right),\quad t\in\mathcal{T}, (2.13)

where $\hat{B}_{K}(\theta)=\log\left(\int_{\mathcal{T}}\exp\left(\hat{\mu}(t)+\sum_{k\leq K}\theta_{k}\hat{\varphi}_{k}(t)\right)dt\right)$ is the normalizing constant. Inference for a new sample is conducted within $\hat{\mathcal{P}}_{K}$ with the sufficient statistic $\hat{\varphi}_{0,\text{ave}}\in\mathds{R}^{K}$, whose $k$th element is $N^{-1}\sum_{j=1}^{N}\hat{\varphi}_{k}(X_{0j})$; to avoid double overscripts, we use the subscript “ave” to denote the average. The distribution of the random densities is accessed via the empirical distribution of the sample component scores. The fully sample-based versions of the proposed estimators are then defined using the estimated quantities. The MLE and MAP within $\hat{\mathcal{P}}_{K}$ are $\hat{p}_{\hat{\theta}_{\text{MLE}}}$ and $\hat{p}_{\hat{\theta}_{\text{MAP}}}$, respectively, where

\hat{\theta}_{\text{MLE}}=\operatorname*{arg\,max}_{\theta:\hat{B}_{K}(\theta)<\infty}\left(\theta^{T}\hat{\varphi}_{0,\text{ave}}-\hat{B}_{K}(\theta)\right), (2.14)
\hat{\theta}_{\text{MAP}}=\operatorname*{arg\,max}_{\theta:\hat{B}_{K}(\theta)<\infty}\left(\theta^{T}\hat{\varphi}_{0,\text{ave}}-\hat{B}_{K}(\theta)+\log\hat{\pi}(\theta)\right), (2.15)

in parallel with (2.7) and (2.8), where $\hat{\pi}$ is the product of normal densities with mean 0 and variances $s^{2}_{\hat{\eta},k}$, the sample variances of $\{\hat{\eta}_{ik}:i=1,\dots,n\}$, for $k=1,\dots,K$.

For the BLUP, $E\tau_{0}$ and $\operatorname{var}(\tau_{0})$ in (2.10) are estimated by the sample mean $\hat{\tau}_{\text{ave}}=n^{-1}\sum_{i\leq n}\hat{\tau}_{i}$ and the sample covariance matrix $\Sigma_{\hat{\tau}}$ of the moment coordinates $\{\hat{\tau}_{i}=\xi(\hat{\eta}_{i}):i=1,\dots,n\}$ in $\hat{\mathcal{P}}_{K}$. Further decompose $\operatorname{var}(\bar{\varphi}_{0})=E\operatorname{var}(\bar{\varphi}_{0}\mid\tau_{0})+\operatorname{var}(\tau_{0})$, and estimate the first term via integration by

\Sigma_{\hat{\varphi}_{\text{ave}}}=\frac{1}{nN}\sum_{i=1}^{n}\int_{\mathcal{T}}\left(\hat{\varphi}(t)-\hat{\tau}_{i}\right)\left(\hat{\varphi}(t)-\hat{\tau}_{i}\right)^{T}\check{p}_{i}(t)\,dt,

where $\hat{\varphi}:t\mapsto(\hat{\varphi}_{1}(t),\dots,\hat{\varphi}_{K}(t))\in\mathds{R}^{K}$ is the sufficient statistic. Combining these, the plug-in BLUP estimate of the moment parameter within $\hat{\mathcal{P}}_{K}$ is obtained as

\hat{\xi}_{\text{BLUP}}=\Sigma_{\hat{\tau}}\left(\Sigma_{\hat{\varphi}_{\text{ave}}}+\Sigma_{\hat{\tau}}\right)^{-1}(\hat{\varphi}_{0,\text{ave}}-\hat{\tau}_{\text{ave}})+\hat{\tau}_{\text{ave}}, (2.16)

and the associated natural parameter $\hat{\theta}_{\text{BLUP}}=\hat{\theta}(\hat{\xi}_{\text{BLUP}})$ is the solution to $\hat{\xi}_{\text{BLUP}}=\partial_{\theta}\hat{B}_{K}(\theta)$.
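
A minimal sketch of the plug-in computation (2.16); the natural parameter is then recovered by inverting the moment map, e.g., with the Newton sketch given after (2.9) (names are our own):

```python
import numpy as np

def blup_xi(phi0_ave, tau_hat, Sigma_phi_ave):
    """Plug-in BLUP (2.16): shrink the sample sufficient statistic of
    the new subpopulation toward the training mean in the moment
    coordinates. tau_hat is the n x K matrix of training moment
    coordinates; Sigma_phi_ave estimates E var(phi_bar | tau)."""
    tau_ave = tau_hat.mean(axis=0)
    Sigma_tau = np.cov(tau_hat, rowvar=False)
    gain = Sigma_tau @ np.linalg.inv(Sigma_phi_ave + Sigma_tau)
    return gain @ (phi0_ave - tau_ave) + tau_ave
```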

When the random densities $p_{1},\ldots,p_{n}$ are observed in their entirety, a scenario considered in the interest of theory and referred to as the subpopulations being completely observed, it suffices to skip the pre-smoothing step and proceed with the observed densities. In this case, FPCA is performed on the transformed trajectories $f_{i}=\psi p_{i}$, $i=1,\dots,n$, to obtain the sample mean $\tilde{\mu}(t)=n^{-1}\sum_{i\leq n}f_{i}(t)$ and covariance $\tilde{G}(s,t)=n^{-1}\sum_{i\leq n}(f_{i}(s)-\tilde{\mu}(s))(f_{i}(t)-\tilde{\mu}(t))$, the eigenvalues $\tilde{\lambda}_{k}$ and eigenfunctions $\tilde{\varphi}_{k}$, and the sample component scores $\tilde{\eta}_{ik}=\langle f_{i}-\tilde{\mu},\tilde{\varphi}_{k}\rangle$, $k=1,\dots,n-1$. The approximating family $\tilde{\mathcal{P}}_{K}$ with completely observed densities is defined as

\tilde{\mathcal{P}}_{K}=\left\{\tilde{p}_{\theta}:\theta\in\mathds{R}^{K},\,\tilde{B}_{K}(\theta)<\infty\right\},\quad\text{with} (2.17)
\tilde{p}_{\theta}(t)=\exp\left(\tilde{\mu}(t)+\sum_{k=1}^{K}\theta_{k}\tilde{\varphi}_{k}(t)-\tilde{B}_{K}(\theta)\right),\quad t\in\mathcal{T}, (2.18)

for $K=1,2,\dots$, where $\tilde{B}_{K}(\theta)=\log\int_{\mathcal{T}}\exp\left(\tilde{\mu}(t)+\sum_{k\leq K}\theta_{k}\tilde{\varphi}_{k}(t)\right)dt$ is the normalizing constant. The proposed MLE, MAP, and BLUP are then defined analogously to those within $\hat{\mathcal{P}}_{K}$. We use tilde overscripts to denote quantities estimated from fully observed densities.

We refer to the process of constructing approximating exponential families from the discrete observations of $p_{1},\dots,p_{n}$ as the training step, which emphasizes that the $K$-dimensional families $\hat{\mathcal{P}}_{K}$ are flexibly derived from the data despite themselves being parametric families compactly described by a few natural parameters. The random densities $p_{1},\ldots,p_{n}$ are referred to as the training densities associated with the training subpopulations, and the corresponding samples as the training samples. For performance assessment, we call the process of constructing density estimates from new observations within the approximating families the fitting or testing step, and correspondingly define the testing subpopulation, density, and sample.

The number of components $K$ controls the flexibility of $\hat{\mathcal{P}}_{K}$ and is the key tuning parameter trading off bias and variance when fitting a density. We propose to select $K$ using a penalized approach based on the Akaike information criterion (AIC), setting $K=K^{*}$ to minimize $\operatorname{AIC}(K)\overset{\triangle}{=}2K-2\log\hat{p}_{K}$, where $\log\hat{p}_{K}$ denotes the log-likelihood of the fitting sample under the density estimate within $\hat{\mathcal{P}}_{K}$ using either (2.14), (2.15), or (2.16). In practice, we observe that the optimal $K^{*}$ often increases as the fitting sample size increases. Intuitively, the proposed component selection automatically utilizes a more flexible model given a larger sample, in which case the additional variance that comes with the flexibility becomes more affordable.
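
A sketch of this selection rule, treating the fitting routine as a black box (the callable interface is our own device; fit(x, K) is assumed to return a fitted density function within the K-dimensional family):

```python
import numpy as np

def select_K(x_new, fit, K_max):
    """AIC selection of the number of components:
    AIC(K) = 2K - 2 * log-likelihood of the fitting sample."""
    aic = [2 * K - 2 * np.sum(np.log(fit(x_new, K)(x_new)))
           for K in range(1, K_max + 1)]
    return int(np.argmin(aic)) + 1
```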

The proposed procedure is computationally efficient: the proposed MLE, MAP, and BLUP estimates are computed within exponential families, so the optimization tasks in obtaining (2.14)–(2.16) are convex, and the discrete samples are succinctly summarized by their sufficient statistics.

3 Simulation Studies

We study the numerical properties of the proposed methods and of alternative parametric and non-parametric density estimators under various scenarios. Independent training samples $\{X_{ij}:j=1,\ldots,N_{i}\}$ were drawn from densities $p_{i}$, $i=1,\dots,n$, respectively, where the $p_{i}$ are independent copies of a random density $p$. Independent testing samples $\{X^{*}_{lj}:j=1,\ldots,N_{l}^{*}\}$ were then drawn from additional independent copies $p_{l}^{*}$ of $p$ for $l=1,\dots,n^{*}$. Typically, the training sample sizes were larger, e.g., $N_{i}=200$, while the testing samples were smaller, e.g., $N_{l}^{*}=10$. The small testing sample sizes pose a challenge for the density estimators compared. The underlying family $\mathcal{P}$ and the distribution of $p\in\mathcal{P}$ varied between scenarios. The experiment under each scenario was repeated 500 times.

The proposed methods (2.14), (2.15), and (2.16) are referred to as FPCA_MLE, FPCA_MAP, and FPCA_BLUP, respectively. Our proposals were compared against the maximum likelihood estimate within the ground-truth family $\mathcal{P}$ as an idealized parametric approach, referred to as MLE; non-parametric methods including KDE (Sheather and Jones 1991) and logspline (Kooperberg and Stone 1991); and the repeated point processes (RPP) approach proposed by Gervini and Khanal (2019). These methods do not require training.

The mean KL divergence of the density estimates was used to assess the overall estimation performance. For the density estimate $\hat{p}^{*}_{l}$ of the $l$th testing sample, the information loss was evaluated by the KL divergence $D(p^{*}_{l}\;\|\;\hat{p}^{*}_{l})$ defined in (2.6), and we examined the average

\text{MKL}=\frac{1}{n^{*}}\sum_{l=1}^{n^{*}}D(p^{*}_{l}\;\|\;\hat{p}^{*}_{l})

over the $n^{*}$ testing samples to quantify performance. A smaller MKL indicates better estimation performance. Results for an alternative error metric, the mean integrated squared error (MISE), and additional simulation cases are included in the Supplementary Materials.
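
For reference, both quantities are straightforward to compute on a common grid; a minimal sketch (trapezoidal integration and names are our own choices):

```python
import numpy as np

def kl_divergence(p, q, t):
    """KL divergence (2.6) between positive densities p, q on grid t."""
    return np.trapz(p * np.log(p / q), t)

def mean_kl(true_dens, est_dens, t):
    """MKL: average KL divergence over the n* testing samples."""
    return np.mean([kl_divergence(p, q, t)
                    for p, q in zip(true_dens, est_dens)])
```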

3.1 Flexible Exponential Families

We first set the density families to be either the truncated normal or the bimodal distributions. For the truncated normal scenario, each sample was generated from $N(\mu,\sigma^{2})$ truncated to $[-3,3]$ with $\mu\sim\text{Unif}(-2,2)$ and $\sigma\sim\text{Unif}(2,4)$. For the bimodal scenario, each sample was generated according to the density $p(x)\propto\exp((4+\theta)x-(26.5+\theta)x^{2}+47x^{3}-25x^{4})$, $x\in[0,1]$, with $\theta\sim\text{Unif}(0,10)$. In each Monte Carlo experiment, $n=50$ training samples of size $N_{i}=200$ and $n^{*}=100$ testing samples of size $N^{*}=10$ or 50 were created.

The mean KL divergence in Figure 3.1 shows that the proposed FPCA_MLE, whether using KDE as the pre-smoother or given completely observed training densities, produces estimates comparable to the MLE without requiring explicit specification of a parametric family. Among all the non-parametric density estimators, the proposed methods are stable and best-performing, especially when the testing samples are small, in which case the proposed methods outperformed the MLE by borrowing strength across samples, whereas the MLE does not pool information. In particular, for the truncated normal family with $N^{*}=10$, the average information loss of the proposed MAP and BLUP was around 20% less than that of the MLE within the ground-truth family, and more than 40% less than that of KDE.

Figure 3.1: Boxplots of mean KL divergence. RPP, repeated point processes approach (Gervini and Khanal 2019); locfit, local polynomial density estimate (Loader 1996); logspline, adaptive logspline (Kooperberg and Stone 1991). Both discrete training samples of sizes $N_{i}=200$ with KDE pre-smoothing and completely observed densities are considered for our proposed approaches, namely FPCA_MLE, FPCA_MAP, and FPCA_BLUP.

3.2 Gaussian Mixture Family

Contrary to the previous scenarios, where the underlying families were exponential families, we next estimate densities from a family consisting of mixtures of Gaussian distributions, which forms a multi-modal, non-exponential family. This further demonstrates the flexibility of the proposed methods in a “mis-specified” case as well as the data-adaptive selection of the number of components.

The samples were generated from a mixture of three Gaussian distributions with density $p(x)\propto\sum_{l=1}^{3}\theta_{l}\phi((x-\mu_{l})/\sigma_{l})$ truncated to $x\in[-3,3]$, where $\phi$ is the standard normal density and the mixture probabilities $(\theta_{1},\theta_{2},\theta_{3})$ were generated from $\text{Dirichlet}(1/3,1/3,1/3)$. For $l=1,2,3$, we generated the mixture mean $\mu_{l}$ from $\text{Unif}(-5,5)$ and the standard deviation $\sigma_{l}$ from $\text{Unif}(0.5,5)$. Here, instead of performing a direct MLE, which is known to be unstable for mixture distributions, the expectation–maximization (EM) algorithm in the variant of Lee and Scott (2012) is used as the baseline parametric density estimator.

The left panel of Figure 3.2 compares the relative errors of various density estimates to that of the EM algorithm as a baseline reference. The proposed methods enjoyed the least information loss for all but the largest sample size $N^{*}=500$, where the EM method became the best. When the testing sample size was $N^{*}=25$, the average information loss of the proposed MAP and BLUP was 20% less than that of RPP, and all were at least five times better than the EM algorithm. Remarkably, this is achieved by utilizing only the first few components on average, as demonstrated in the right panel of Figure 3.2. This shows that the proposed methods are effective alternatives to the EM algorithm, overcoming local maximum issues by performing convex optimization within approximating exponential families. In addition, using the proposed AIC method for component selection, the proposed estimates utilized more components as the size of the fitting sample increased, suggesting that the AIC selection adapts to the increased complexity of the underlying family as unveiled by a larger sample. Indeed, the larger the sample size, the more representative the sample is of the underlying distribution, which has 8 free parameters in our setup. By using more components, the proposed methods can accommodate the increased complexity shown by the data, such as multi-modality, which may otherwise be hidden when fitting with fewer components.

Figure 3.2: The left panel shows the ratios of the average mean KL divergence of the compared methods over that of the EM algorithm, where a ratio smaller than 1 indicates better performance than EM. The right panel displays the average number of components $K$ selected by the proposed AIC method, with ribbons indicating one standard deviation around the means. Axes in both panels are drawn in $\log_{10}$ scale. Training samples had sizes $N_{i}=50$, and logspline was used as the pre-smoother. RPP, repeated point processes approach (Gervini and Khanal 2019); locfit, local polynomial density estimate (Loader 1996); logspline, adaptive logspline (Kooperberg and Stone 1991); FPCA_MLE, FPCA_MAP, and FPCA_BLUP, the proposed methods in the MLE, MAP, and BLUP variants.

4 Real Data Applications

4.1 MIMIC Data

We consider the age-at-admission distributions of patients with critical conditions in the Medical Information Mart for Intensive Care (MIMIC) data (Johnson et al. 2016), collected from patients admitted to critical care units at a large hospital in Boston between 2001 and 2012. The numbers of patients under different primary diagnoses vary greatly. Ample data points are available to establish a precise non-parametric density estimate for common diseases, as shown in the left panels of Figure 4.1, whereas observations for uncommon conditions are scarce and cannot support flexible non-parametric approaches such as KDE, as displayed in the right panels of Figure 4.1. Estimating the densities in the hard-to-observe subpopulations of patients with uncommon conditions is the emphasis of our analysis.

Figure 4.1: Each panel shows the age-at-admission distribution of either male or female patients with a different primary diagnosis, each defining a subpopulation. Four training subpopulations with ample observations are shown in the left panels, and four testing subpopulations with scarce observations are shown in the right panels. Along with the KDE, the proposed density estimates MLE, MAP, and BLUP are displayed for the testing subpopulations. KDE, kernel density estimate; MLE, MAP, and BLUP, the proposed density estimates.

Gender and diagnosis code (ICD9) combinations were used to define the subpopulations, where we excluded birth-related, unclassified, or unspecified diagnoses. A total of 940 subpopulations with more than 5 observations each were analyzed, among which the 75 subpopulations with at least 75 observations were used for training and the rest for testing. For example, the upper left panel in Figure 4.1 shows a training sample formed by 109 female patients with juvenile type diabetes with ketoacidosis (DMI ketoacd uncontrold), while the bottom right panel shows a testing sample consisting of 27 male patients with abdominal aortic aneurysm rupture (Rupt abd aortic aneurysm). We focused on estimating densities supported on the age range $[14,85]$. KDE was utilized as the pre-smoother, and the bandwidth was set to the median of the bandwidths in the training samples selected by the method of Sheather and Jones (1991).

As illustrated in the right panels of Figure 4.1, the proposed density estimates produced sensible results given small testing samples; they were smoother than the KDE estimates and avoided multiple spurious modes.

The modes of variation in the density functions, obtained by varying each of the first three natural parameters of $\hat{\mathcal{P}}_{K}$, are shown in the left panels of Figure 4.2. The first three modes of variation correspond to shifts in the age distribution towards the old, the elderly, and the middle-aged, respectively. This suggests the existence of latent disease characteristics that affect the age profiles of different conditions.

The effects of shrinkage in the proposed MAP and BLUP estimators are demonstrated in the right panels of Figure 4.2 for the four rare conditions displayed in the right panels of Figure 4.1. Subpopulations with smaller sample sizes were typically shrunk more towards the population mean by the MAP and BLUP methods, which pool population information from the training samples, in which case the estimates differed more notably from the MLE within the approximating family.

To numerically assess the density estimates, we consider the cross-entropy, which also targets the information loss, instead of the KL divergence, since the underlying densities are unknown. Given a density estimator $\hat{p}$, we approximate the cross-entropy $H(p,\hat{p})\overset{\triangle}{=}-E_{p}\log\hat{p}$, the part of $D(p\;\|\;\hat{p})$ that depends on the estimate $\hat{p}$, by the unbiased leave-one-out estimate

H_{\text{LOO}}(p,\hat{p})\overset{\triangle}{=}-\frac{1}{N}\sum_{j\leq N}\log\hat{p}_{-j}(X_{j}), (4.1)

where $\hat{p}_{-j}$ is the density estimate constructed with all but the $j$th observation. A summary of the cross-entropies of the proposed methods and KDE applied on 844 testing samples is reported in Table 1, where a smaller cross-entropy indicates a better estimate. Occasionally KDE resulted in non-finite cross-entropy, leading to a smaller number of samples reported.
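
A sketch of (4.1), treating the density estimator as a black box (the callable interface is our own device; fit_density maps a sample to a fitted density function):

```python
import numpy as np

def loo_cross_entropy(x, fit_density):
    """Leave-one-out cross-entropy (4.1): refit with each observation
    held out and score the estimate on that observation."""
    logs = [np.log(fit_density(np.delete(x, j))(x[j]))
            for j in range(len(x))]
    return -np.mean(logs)
```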

Table 1 shows that the proposed methods produced density estimates that better reflect the sample information than KDE, having smaller mean cross-entropies as well as smaller standard deviations. To quantify the sensitivity with respect to the training subpopulations, we partitioned the training samples into 5 folds, held out one fold at a time in the training step, fitted the densities for the testing samples, and report the range of the five mean cross-entropies as a sensitivity measure. The proposed methods are insensitive to varying training subpopulations, which suggests that the selected training samples are sufficient for our methods to capture the major structure of the underlying family.

Figure 4.2: Left: modes of variation obtained by varying each of the first three natural parameters of the estimated family $\hat{\mathcal{P}}_{K}$, where blue and red stand for lower and higher values, respectively. Right: for the four sparsely observed subpopulations shown in the right panels of Figure 4.1, the MLE parameter estimates (triangles) are shrunk in the natural parameter space (upper panel) and the expectation coordinate space (lower panel) w.r.t. the population distribution of the training component scores (elliptical contours), producing the corresponding MAP (dots, upper right) and BLUP (dots, lower right) estimates.
Table 1: Cross-entropy (4.1) of different estimators applied on the MIMIC data.

size      n*    estimator  mean  sd    sensitivity
(0,10]    423   MLE        4.60  1.62  0.062
          423   MAP        4.13  0.42  0.013
          423   BLUP       4.11  0.36  0.039
          402   KDE        5.56  2.42  -
(10,35]   343   MLE        4.03  0.32  0.015
          343   MAP        3.97  0.25  0.005
          343   BLUP       3.97  0.24  0.017
          341   KDE        4.16  0.37  -
(35,75]   78    MLE        3.93  0.22  0.013
          78    MAP        3.92  0.21  0.004
          78    BLUP       3.92  0.21  0.008
          78    KDE        3.99  0.22  -

The results are stratified by the size of the testing samples, with the number of testing samples involved listed as $n^{*}$. The sensitivity of the proposed methods is the range of the mean cross-entropies across 5 experiments, each holding out one of five folds of the training samples.

4.2 Precipitation Return Level Estimation

We then analyze precipitation data from the Haihe region to illustrate the versatility of the proposed methods in providing good estimates of both the body of the density function and the tail probabilities. We demonstrate that our proposed methods can borrow information from stations with longer records to construct approximating distribution families for the annual maxima of hourly rainfall, which leads to better estimates of densities and return levels for stations with shorter records.

Predicting the likelihood of extreme precipitation is of great importance because of the serious damage such events can cause to the economy and social life. We are in particular interested in heavy rainfall during short time periods. The likelihood of an extreme precipitation event in a certain region can be quantified by the $T$-year return level, defined as the level expected to be exceeded once in $T$ years or, equivalently, the $1-1/T$ quantile of the marginal distribution under stationarity and independence assumptions. The sample quantile is a straightforward estimate for small $T$, referred to as the empirical return level, but this approach requires more than $T$ data points being available. Otherwise, distribution models are necessary for estimating return levels over longer return periods.
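
Given a fitted density on a grid, the return level follows by inverting the numerical CDF; a minimal sketch (names and the cumulative trapezoid scheme are our own):

```python
import numpy as np

def return_level(dens, t, T):
    """T-year return level: the (1 - 1/T) quantile of a fitted
    annual-maximum density dens evaluated on grid t."""
    cdf = np.concatenate(([0.0],
                          np.cumsum((dens[1:] + dens[:-1]) / 2
                                    * np.diff(t))))
    cdf /= cdf[-1]                 # guard against integration rounding
    return np.interp(1.0 - 1.0 / T, cdf, t)
```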

We focus on estimating the 5- to 30-year return levels at various locations in the Haihe River basin, a vital region in northern China that includes mega-cities such as Beijing and Tianjin. Precipitation data from 232 weather stations around Haihe River basin from 1961 to 2012 are available from the National Meteorological Information Center (NMIC) and the China Meteorological Administration (CMA) and are used for this analysis. All the 232 stations investigated have at least 30 years of records.

Our target is the distribution of the annual maximum rainfall per hour, where each station defines a subpopulation. As the distribution of precipitation is heavily right-skewed, the pre-smoothed density estimates could be zero within the working domain, creating an issue when we take the log-density. To address this issue and the positivity constraint, the log-precipitation is used as the working variable. See LABEL:supp-appendix:log.inf of the Supplementary Materials for details.

We compared the return level estimates and the information loss of the proposed methods with standard approaches. To imitate the situation of newer weather stations having shorter records, we randomly selected $n=50$ stations as training sites and retained the full records at these stations for training, and another $n^{*}=100$ testing sites where records from 10 randomly chosen consecutive years were used for fitting. Estimates of the density and the 5- to 30-year return levels were obtained from non-parametric approaches, including our proposed methods and the standard KDE, as well as the parametric maximum likelihood estimates within the Gamma and GEV families. The performance metric for density estimation was the cross-entropy (4.1), which measures the overall fit. For the return level estimates, we used the relative difference between the estimates and the empirical return levels obtained from the full records at the testing stations, which measures the fit in the right tail. The entire process was repeated 10 times, and the overall average cross-entropies and the relative differences of the return levels averaged over the stations are reported in Figure 4.3.

The proposed methods in the MLE, MAP, and BLUP variants are all significantly better than the KDE and GEV approaches in terms of information loss. The estimates using the Gamma family were comparable to ours but underperformed the MAP and BLUP methods, which borrow information from the population. For return level estimation, the proposed MAP and BLUP methods had the least variation compared with the MLE, Gamma, and GEV methods, the last three of which do not pool information. For shorter (5- and 10-year) return levels, the biases of all methods were comparable, but for longer (20- and 30-year) return levels, the biases of our methods were significantly smaller than those of the Gamma and GEV methods. In summary, our proposed methods also have the best performance in estimating return levels. Hence, the proposed methods provide better fits than classical approaches in both the bulk and the tail of the distributions, and lead to better quantification of distributional characteristics of interest, such as return levels, from a small number of discrete observations.

These numerical results come as a positive surprise: we used only 10 years of records at the testing sites and yet obtained good estimates of 20- and even 30-year return levels. The proposed approaches are able to estimate the tail structure from the data rather than relying on assumptions, and they borrow information from the population through parameter shrinkage.

Figure 4.3: Left panel: cross-entropies of different estimators applied to the Haihe rainfall data, computed by (4.1), where each method was fitted with decennial records over 10 repeated experiments. MLE, MAP, and BLUP, the proposed methods; KDE, kernel density estimate; Gamma, density fit within a Gamma distribution; GEV, density fit within a GEV family. Right panel: relative differences of the estimated return levels using decennial records compared with the empirical ones using the full records ($\geq 30$ years) at the testing weather stations. Each solid dot represents the average relative difference at a station averaged over the Monte Carlo experiments, while each hollow circle represents the mean over all stations. Estimates closer to the horizontal zero-reference line are better. The GEV had some extreme over-estimated outliers, pulling its mean errors outside the plotting range.

5 Theoretical Results

We state the theoretical results for the proposed methods, including the identifiability of the approximating families, the consistency of the MLE within those families, and the consistency of the component scores. While the major motivation and justification for the proposed methods is their small-sample performance, as demonstrated in the simulations and real data applications, in this section we also establish large-sample properties of the proposed estimators to justify their broad validity. Subsections 5.1 and 5.2 investigate the large-testing-sample scenario where the testing sample size $N$, the number of training samples $n$, and the training sample sizes $N_{i}$ all diverge, while Subsection 5.3 considers the small-testing-sample case where $N$ is small and kept fixed but $n$ and the $N_{i}$ diverge. We present theoretical properties of the MLE as justification for the proposed methods, since for a large testing sample the BLUP and MAP estimates emphasize the observations rather than the population-level information and will nearly degenerate to the MLE. Proofs and additional discussions are included in the Supplementary Materials.

For theoretical development, densities in the underlying family are required to be smooth and uniformly bounded away from zero and infinity. This condition is standard when analyzing densities as functional data given discrete observations; see, e.g., Petersen and Müller (2016); Han et al. (2020).

  1. (A1)

     All densities $q\in\mathcal{P}$ are continuously differentiable. Moreover, there is a constant $M>1$ such that, for all $q\in\mathcal{P}$, $\lVert q\rVert_{\infty}$, $\lVert 1/q\rVert_{\infty}$, and $\lVert q^{\prime}\rVert_{\infty}$ are bounded above by $M$.

Under (A1), the approximating families $\mathcal{P}_{K}$, $\tilde{\mathcal{P}}_{K}$, and $\hat{\mathcal{P}}_{K}$ as defined in (2.3), (2.17), and (2.12) are identifiable, thanks to the fact that the entries of the sufficient statistic are orthogonal to each other and to the constant function, as shown in LABEL:supp-prop:identifiable of the Supplementary Materials.

5.1 Consistency of the MLE as $N$ Diverges

The proposed MLE within the low-dimensional approximating family $\hat{\mathcal{P}}_{K}$ is shown to be consistent. The MLE is computed from an additional i.i.d. sample $X_{1},\dots,X_{N}$ of size $N$, whose density $p$ is an independent copy of the random density. The approximating family $\hat{\mathcal{P}}_{K}$ is obtained from $n$ discretely observed training subpopulations, each with $N_{i}\geq m$ observations. Theoretical results for this practical scenario are presented here; results for the MLE obtained within $\mathcal{P}_{K}$ and $\tilde{\mathcal{P}}_{K}$ are included in the Supplementary Materials. For brevity, we use $\hat{p}=\hat{p}_{N,K}\in\hat{\mathcal{P}}_{K}$ to denote the MLE density estimate with parameter $\hat{\theta}=\hat{\theta}_{N,K}$ defined in (2.14).

The consistency results rely on the following conditions imposed on the covariance of the log-transformed random density.

  1. (C1)

     For the process $f=\psi p\in L^{2}(\mathcal{T})$, the eigenvalues and eigenfunctions of the integral operator $\mathcal{G}$ associated with the covariance function $G$ satisfy, as $J\rightarrow\infty$,

     $\left(\sum_{k=1}^{J}\|\varphi_{k}^{\prime}\|_{\infty}\right)\left(\sum_{k>J}\lambda_{k}\right)=O(1)$ and $\left(\sum_{k=1}^{J}\|\varphi_{k}^{\prime}\|_{\infty}\right)^{2/3}\left(\sum_{k>J}\lambda_{k}\right)=o(1)$.

  2. (C2)

     For all $k=1,\dots,\operatorname{rank}\mathcal{G}$, the non-zero eigenvalues are distinct and satisfy $\delta_{k}\overset{\triangle}{=}\inf_{l\neq k}\lvert\lambda_{k}-\lambda_{l}\rvert/2>0$; the estimated eigenfunctions are properly aligned so that $\left\langle\varphi_{k},\tilde{\varphi}_{k}\right\rangle\geq 0$ and $\left\langle\varphi_{k},\hat{\varphi}_{k}\right\rangle\geq 0$.

Condition (C1) ensures that the high-frequency eigenfunctions are not overly bumpy relative to the amount of information they carry, as quantified by the eigenvalues, so that the trajectories can be reasonably represented by the first few components. Condition (C2) is standard and simplifies the presentation. We assume $\operatorname{rank}\mathcal{G}=\infty$ in the discussion hereafter unless otherwise stated.

Conditions (A1), (C1), and (C2) together are sufficient for the consistency of the MLE within $\mathcal{P}_{K}$ and $\tilde{\mathcal{P}}_{K}$, as shown in LABEL:supp-thm:kldiv.all of the Supplementary Materials. When discrete training samples and pre-smoothing are involved, additional conditions (D1), (D2), and (S1), in parallel to those in Petersen and Müller (2016), are required as listed in the Appendix. These conditions essentially require that the pilot density estimator performs well uniformly, provided the sample sizes are sufficiently large. Here (D1) is a uniform MISE requirement, (D2) requires the $L^{\infty}$ error to be $O_{p}(a_{m})$, and (S1) is a requirement on the rate of the sample sizes as the number of samples increases, so that one can derive uniform rates of convergence for the pre-smoothed training densities and log-densities under the $L^{\infty}$ norm. Conditions (D1), (D2), and (S1) can be satisfied by the weighted kernel density estimate (KDE) in (2.11) with proper bandwidths and kernels, e.g., $h\asymp m^{-1/3}$ and the Gaussian kernel. See LABEL:supp-sec:presmooth of the Supplementary Materials for details.

5.1.1 Main Consistency Results

We have the following result for the proposed MLE within $\hat{\mathcal{P}}_{K}$. Recall that $n$ is the number of training subpopulations, $m$ is the minimal training sample size in (S1), $N$ is the size of the fitting sample, and the sequences $\{a_{m}\}$ and $\{b_{m}\}$ are those in (D1) and (D2).

Theorem 5.1.

Under (A1), (C1), (C2), (D1), (D2), and (S1), if $n$, $N$, and $K$ increase to infinity in such a way that

A_{K}\sqrt{K/N}\rightarrow 0,\quad KA_{K}(1/\sqrt{n}+b_{m})\rightarrow 0,\quad\sum_{k\leq K}\lambda_{k}^{-1}(1/\sqrt{n}+a_{m})=O(1),
A_{K}\sum_{k\leq K}\delta_{k}^{-1}(1/\sqrt{n}+b_{m})\rightarrow 0,\qquad\sum_{k\leq K}(\delta_{k}\lambda_{k})^{-1}(1/\sqrt{n}+b_{m})=O(1),

where

$$A_{K}=2\max\Big(\lvert\mathcal{T}\rvert^{-1/2},\ \Big(\sum_{k=1}^{K}\lVert\varphi_{k}^{\prime}\rVert_{\infty}\Big)^{1/3}\Big),\quad(5.1)$$

then

$$D(p\,\|\,\hat{p}_{N,K})=O_{p}\left(\Big(\frac{1}{n}+b_{m}^{2}\Big)\Big(K+\sum_{k\leq K}\delta_{k}^{-1}\Big)^{2}+\sum_{k>K}\lambda_{k}+\frac{K}{N}\right).$$
Proof.

See LABEL:supp-thm:kldiv.all in the Supplementary Materials. ∎

Note that the result is highly flexible and applies to general infinite-dimensional families; in particular, we do not assume the underlying $\mathcal{P}$ to be a finite-dimensional exponential family. The three summands in the rate correspond to the training error, a bias term due to the approximating family, and a variance term from fitting with $N$ observations. Varying the number of components $K$ trades off the errors from these three terms, as a more complex model with a larger $K$ has larger training and fitting errors but smaller approximation bias. The quantity $A_{K}$ is closely related to (C1) and to the equivalence of the $L^{\infty}$ and $L^{2}$ norms on the eigenspaces of the covariance operator; see LABEL:supp-prop:eigenspace.lipschitz in the Supplementary Materials.
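As a purely illustrative view of this trade-off, the following snippet evaluates the three terms of the rate under the polynomial-decay setting $\lambda_{k}\asymp k^{-r}$ of Proposition 5.1 below, with the eigengaps $\delta_{k}$ approximated by half the adjacent spacings; the constants and sample sizes are arbitrary.

```python
# Illustrative only: evaluating the three terms in the rate of Theorem 5.1
# under lambda_k = k^{-r} (cf. Proposition 5.1), with the eigengaps delta_k
# approximated by half the adjacent spacings.
import numpy as np

def rate_terms(K, n, N, bm, r=3.0):
    lam = np.arange(1, K + 2, dtype=float) ** (-r)    # lambda_1..lambda_{K+1}
    delta = -np.diff(lam) / 2                         # (lam_k - lam_{k+1}) / 2
    train = (1 / n + bm ** 2) * (K + np.sum(1 / delta)) ** 2
    bias = (K + 1.0) ** (-(r - 1)) / (r - 1)          # ≈ sum_{k > K} lambda_k
    var = K / N
    return train, bias, var

for K in (2, 4, 8, 16):                               # bias falls, others grow
    print(K, rate_terms(K, n=10_000, N=200, bm=1e-3))
```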

An immediate corollary is the convergence of the densities in the $L^{1}$ norm, by Pinsker’s inequality (Wainwright 2019), which states that the KL divergence dominates the squared $L^{1}$ distance up to a constant factor.

Corollary 5.1.

Under the conditions of Theorem 5.1,

$$\lVert\hat{p}_{N,K}-p\rVert_{1}^{2}=O_{p}\left(\Big(\frac{1}{n}+b_{m}^{2}\Big)\Big(K+\sum_{k\leq K}\delta_{k}^{-1}\Big)^{2}+\sum_{k>K}\lambda_{k}+\frac{K}{N}\right).$$
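The inequality underlying the corollary is easy to verify numerically; the snippet below checks $\lVert p-q\rVert_{1}^{2}\leq 2D(p\,\|\,q)$ for two discretized densities, purely as an illustration.

```python
# A quick numerical check of Pinsker's inequality, ||p - q||_1^2 <= 2 D(p||q),
# on two discretized (truncated-Gaussian) densities; purely illustrative.
import numpy as np

grid = np.linspace(0, 1, 1001)
p = np.exp(-0.5 * ((grid - 0.4) / 0.10) ** 2); p /= np.trapz(p, grid)
q = np.exp(-0.5 * ((grid - 0.5) / 0.15) ** 2); q /= np.trapz(q, grid)

l1_sq = np.trapz(np.abs(p - q), grid) ** 2
kl = np.trapz(p * np.log(p / q), grid)
assert l1_sq <= 2 * kl                      # KL dominates the squared L1 norm
print(l1_sq, 2 * kl)
```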

Our general result directly implies the next corollary, in which the underlying family is actually a $K_{0}$-dimensional exponential family for some finite $K_{0}$, so that truncating the log-densities incurs no approximation bias. Note that (C1) is not required, and by convention we set $\lambda_{k}$ and $\varphi_{k}$ to zero for $k>K_{0}$.

Corollary 5.2.

Under (A1), (C2), (D1), (D2), and (S1), if $\mathcal{P}$ is a $K_{0}$-dimensional exponential family for some finite $K_{0}$, then for any fixed $K\geq K_{0}$,

$$D(p\,\|\,\hat{p}_{N,K})=O_{p}\Big(\frac{1}{n}+b_{m}^{2}+\frac{1}{N}\Big),$$

as $n,N\rightarrow\infty$ and $m=m(n)\rightarrow\infty$.

Proof.

See LABEL:supp-coro:kldiv.finiteK in the Supplementary Materials. ∎

5.1.2 Selecting KK

The following proposition provides an example in which the requirements of Theorem 5.1 are satisfied.

Proposition 5.1.

Under (C2), if

  1. (1)

    the eigenfunctions coincide with the polynomial or trigonometric basis so that $\lVert\varphi_{k}^{\prime}\rVert_{\infty}\asymp k$ for large $k$;

  2. (2)

    the eigenvalues have polynomial decay: $\lambda_{k}\asymp k^{-r}$ with $r\geq 3$ as $k\rightarrow\infty$;

  3. (3)

    the minimum size of the $n$ training samples satisfies $m(n)=n^{r_{m}}$ for some $r_{m}>0$;

  4. (4)

    the weighted KDE (2.11) is used as the pre-smoother with $h\asymp m^{-1/3}=n^{-r_{m}/3}$ as $n\rightarrow\infty$; and

  5. (5)

    for any fixed $0<r_{a}<1/6$, the number of components $K$ used for fitting depends on $r$, $n$, $N$, $r_{m}$, and $r_{a}$ through (a hypothetical implementation of this selection rule is sketched after the proof)

    $$K=\hat{K}^{*}\asymp\min\Big(n^{\frac{1}{4(r+1)}},\,n^{\frac{r_{a}r_{m}}{r+1}},\,N^{1/r}\Big)\text{ as }n,N\rightarrow\infty,$$

then the conditions for Theorem 5.1 are satisfied with $a_{m}=n^{-r_{a}r_{m}}$ and $b_{m}=n^{-r_{m}/3}$. As $n,N\rightarrow\infty$,

$$D(p\,\|\,\hat{p}_{N,\hat{K}^{*}})=O_{p}\big(n^{\gamma_{1}}+N^{-1+1/r}\big),$$

where $\gamma_{1}=r_{a}r_{m}(1-r)/(r+1)$ if $r_{m}\leq 1/(4r_{a})$, and $\gamma_{1}=-1/4+1/(2r+2)$ if $r_{m}>1/(4r_{a})$.

Proof.

See LABEL:supp-prop:optimal.K.all in the Supplementary Materials. ∎
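The selection rule in condition (5) admits a direct implementation; the following hypothetical helper, with an illustrative constant c and the name choose_K, is a sketch only.

```python
# A hypothetical helper implementing the selection rule of Proposition 5.1;
# the constant c and the name choose_K are illustrative.
import numpy as np

def choose_K(n, N, r, r_m, r_a, c=1.0):
    """K-hat-star ≍ min(n^{1/(4(r+1))}, n^{r_a r_m / (r+1)}, N^{1/r})."""
    K = c * min(n ** (1 / (4 * (r + 1))),
                n ** (r_a * r_m / (r + 1)),
                N ** (1 / r))
    return max(1, int(np.floor(K)))

# e.g., choose_K(n=5000, N=100, r=3, r_m=1.0, r_a=0.15)
```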

Such $\hat{K}^{*}$ is derived by finding the optimal feasible $K$ for the dominating terms in the convergence rate. Compared with the optimal $K$ without pre-smoothing, given in LABEL:supp-prop:optimal.K.all of the Supplementary Materials, this example suggests that pre-smoothing does not slow down the convergence rate of the proposed MLE when the training subpopulations are densely observed so that $r_{m}$ is large, e.g., $r_{m}>3/2$: indeed, if $r_{m}>1/(4r_{a})$, then $D(p\,\|\,\tilde{p}_{N,\tilde{K}^{*}})\asymp D(p\,\|\,\hat{p}_{N,\hat{K}^{*}})$, where $\tilde{K}^{*}$ is the optimal $K$ for the MLE within $\tilde{\mathcal{P}}_{K}$.

This result is particularly interesting, since it suggests that slight over-smoothing may be preferred when pre-smoothing with the KDE in order to obtain a better convergence rate. Indeed, since $\gamma_{1}<0$ attains its most negative value once $r_{a}r_{m}>1/4$, it is desirable to have a larger $r_{a}$ so that a smaller $r_{m}$ suffices for $r_{a}r_{m}>1/4$, meaning that smaller training samples can be considered “dense”. As required in LABEL:supp-prop:preS.ass.good, the only restrictions on $r_{a}$ are $0<r_{h}+r_{a}<1/2$ and $0<r_{a}<r_{h}$; hence, if we choose a smaller $r_{h}$ just above $1/4$, a larger $r_{a}$ becomes possible, making $r_{a}r_{m}>1/4$ easier to satisfy.

5.2 Consistency of Component Scores

Another interesting question is the convergence of the component scores of the pre-smoothed training trajectories $\check{f}_{i}=\psi\check{p}_{i}$, $i=1,\ldots,n$, which are used to capture the distribution of the random densities in the MAP (2.15) and BLUP (2.16) estimates. In terms of notation, let $\check{f}_{i,K}=\hat{\mu}+\sum_{k\leq K}\check{\eta}_{ik}\hat{\varphi}_{k}$ be the truncated pre-smoothed log-density with $\check{\eta}_{ik}=\langle\check{f}_{i}-\hat{\mu},\hat{\varphi}_{k}\rangle$ for $k=1,\ldots,K$ and $i=1,\dots,n$. The next proposition shows that the estimated component scores $\check{\eta}_{ik}$ converge uniformly to the truth $\eta_{ik}=\langle f_{i}-\mu,\varphi_{k}\rangle$.

Proposition 5.2.

Under (A1), (D1), (D2), (S1), and (C2), for any $K<\infty$, as $n\rightarrow\infty$, we have

$$\sup_{i\leq n}\sup_{k\leq K}\lvert\check{\eta}_{ik}-\eta_{ik}\rvert=O_{p}\Big(a_{m}+(n^{-1/2}+b_{m})\sup_{k\leq K}\delta_{k}^{-1}\Big).$$
Proof.

See LABEL:supp-prop:comp.score.all in the Supplementary Materials. ∎

With the previous proposition, it is easy to see that $\sup_{i\leq n}\sup_{k\leq K}\lvert\check{\eta}_{ik}-\eta_{ik}\rvert=O_{p}(a_{m}+n^{-1/2})$ as $n\rightarrow\infty$ if $K$ is fixed. To obtain a similar result with $K\rightarrow\infty$ as well, we need $n,K\rightarrow\infty$ in such a way that $(n^{-1/2}+b_{m})\sup_{k\leq K}\delta_{k}^{-1}\rightarrow 0$. Under the conditions of Proposition 5.1, we can set $K\asymp n^{\beta}$ for any $\beta<\max(1/2,r_{m}/3)/(1+r)$, and then the sample component scores are uniformly consistent. In particular, $K=\hat{K}^{*}$ satisfies the requirement.
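As a concrete illustration of the quantities in Proposition 5.2, the scores $\check{\eta}_{ik}=\langle\check{f}_{i}-\hat{\mu},\hat{\varphi}_{k}\rangle$ can be computed by numerical quadrature when the pre-smoothed log-densities are held on a grid; a minimal sketch, with illustrative array names, is given below.

```python
# A minimal sketch of the component scores eta_check[i, k] =
# <f_check_i - mu_hat, phi_hat_k>, computed by trapezoidal quadrature;
# array names are illustrative.
import numpy as np

def component_scores(f_check, mu_hat, phi_hat, grid):
    """f_check: (n, G) pre-smoothed log-densities on the grid;
    mu_hat: (G,); phi_hat: (K, G); returns an (n, K) score matrix."""
    centered = f_check - mu_hat                                  # (n, G)
    integrand = centered[:, None, :] * phi_hat[None, :, :]       # (n, K, G)
    return np.trapz(integrand, grid, axis=-1)
```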

5.3 Consistency of MLE in a Small and Fixed Sample

The following result further characterizes the asymptotic behavior of the proposed MLE for a given new fitting sample with a small and fixed sample size $N$, where $K$ is finite and fixed and there is a large number of observations from the training subpopulations.

Proposition 5.3.

Under (A1), (D1), (D2), (S1), and a fixed $K$, for a given sample $X_{1},\dots,X_{N}$ drawn from a density $p\in\mathcal{P}$, assume the MLE $\mathring{p}_{N,K}\in\mathcal{P}_{K}$ exists and is unique with corresponding parameter $\mathring{\theta}\in\mathbb{R}^{K}$. Then the MLE $\hat{p}_{N,K}\in\hat{\mathcal{P}}_{K}$ exists and is unique with probability tending to 1 as $n\rightarrow\infty$, where $n$ is the number of training subpopulations used to construct $\hat{\mathcal{P}}_{K}$. The corresponding parameter $\hat{\theta}_{n}\in\mathbb{R}^{K}$ converges to $\mathring{\theta}$ in probability, with

$$\lVert\hat{\theta}_{n}-\mathring{\theta}\rVert_{2}=O_{p}(1/\sqrt{n}+a_{m}+b_{m}),$$
$$D(\mathring{p}_{N,K}\,\|\,\hat{p}_{N,K})=O_{p}\big(1/\sqrt{n}+a_{m}+b_{m}\big).$$
Proof.

See LABEL:supp-prop:rate.training.thetahat and LABEL:supp-coro:rate.training.kldiv in the Supplementary Materials. ∎

Appendix A Assumptions for Pre-smoothing

When discrete training samples and pre-smoothing are involved, the following conditions, in parallel to (D1), (D2), and (S1) in Petersen and Müller (2016), are also required. Let $q\in\mathcal{P}$ be a fixed density; the conditions here are imposed on the family $\mathcal{P}$, not on the distribution of the random density $p$. Further, let $\check{q}$ be the corresponding pre-smoothed density estimate, e.g., the weighted KDE defined in (2.11).

  1. (D1)

    There exists a positive sequence $b_{N}=o(1)$ as $N\rightarrow\infty$ such that, based on an i.i.d. sample of size $N$ from a density $q\in\mathcal{P}$, the density estimator $\check{q}$ satisfies $\check{q}\geq 0$, $\int_{\mathcal{T}}\check{q}=1$, and

    $$\sup_{q\in\mathcal{P}}E\big(\lVert\check{q}-q\rVert_{2}^{2}\big)=O(b_{N}^{2}).$$
  2. (D2)

    There exist a positive sequence $a_{N}=o(1)$ as $N\rightarrow\infty$ and some $R>0$ such that, based on an i.i.d. sample of size $N$ from a population with density $q\in\mathcal{P}$, the density estimator $\check{q}$ satisfies $\check{q}\geq 0$, $\int_{\mathcal{T}}\check{q}=1$, and

    $$\sup_{q\in\mathcal{P}}P\big(\lVert\check{q}-q\rVert_{\infty}>Ra_{N}\big)\rightarrow 0\ \text{ as }N\rightarrow\infty.$$
  3. (S1)

    Let $\check{q}$ be a density estimator satisfying (D2) and let $N_{i}=N_{i}(n)$, $i=1,\ldots,n$, be the sample sizes in the training subpopulations. There exists a sequence of lower bounds $m(n)\leq\min_{1\leq i\leq n}N_{i}$ such that $m(n)\to\infty$ as $n\rightarrow\infty$ and

    $$n\sup_{q\in\mathcal{P}}P\big(\lVert\check{q}-q\rVert_{\infty}>Ra_{m}\big)\to 0,$$

    where $\check{q}$ is obtained with sample size $N(n)\geq m(n)$.

SUPPLEMENTARY MATERIAL

The Supplementary Materials include additional simulation results and details of the theoretical developments.

References

  • Amari (2016) Amari, S.-i. (2016). Information Geometry and Its Applications, Volume 194 of Applied mathematical sciences. Springer, [Tokyo].
  • Amari and Nagaoka (2000) Amari, S.-i. and H. Nagaoka (2000). Methods of Information Geometry, Volume 191 of Translations of mathematical monographs. American Mathematical Society, Providence, RI; Oxford University Press, Oxford.
  • Bonevski et al. (2014) Bonevski, B., M. Randell, C. Paul, K. Chapman, L. Twyman, J. Bryant, I. Brozek, and C. Hughes (2014). Reaching the hard-to-reach: a systematic review of strategies for improving health and medical research with socially disadvantaged groups. BMC medical research methodology 14(1), 42.
  • Chu et al. (2018) Chu, B., K. Huynh, D. Jacho-Chávez, and O. Kryvtsov (2018). On the evolution of the United Kingdom price distributions. Ann. Appl. Stat. 12(4), 2618–2646.
  • Dai (2020) Dai, X. (2020). Statistical inference on the Hilbert sphere with application to random densities. ArXiv.
  • Delicado (2011) Delicado, P. (2011). Dimensionality reduction when data are density functions. Comput. Statist. Data Anal. 55(1), 401–420.
  • Egozcue et al. (2006) Egozcue, J. J., J. L. Díaz-Barrero, and V. Pawlowsky-Glahn (2006). Hilbert space of probability density functions based on Aitchison geometry. Acta Math. Sin. (Engl. Ser.) 22(4), 1175–1182.
  • Fithian and Wager (2015) Fithian, W. and S. Wager (2015). Semiparametric exponential families for heavy-tailed data. Biometrika 102(2), 486–493.
  • Gervini (2016) Gervini, D. (2016). Independent component models for replicated point processes. Spat. Stat. 18(part B), 474–488.
  • Gervini and Khanal (2019) Gervini, D. and M. Khanal (2019). Exploring patterns of demand in bike sharing systems via replicated point process models. J. R. Stat. Soc. Ser. C. Appl. Stat. 68(3), 585–602.
  • Han et al. (2020) Han, K., H.-G. Müller, and B. U. Park (2020). Additive functional regression for densities as responses. J. Amer. Statist. Assoc. 115(530), 997–1010.
  • Hron et al. (2016) Hron, K., A. Menafoglio, M. Templ, K. Hrůzová, and P. Filzmoser (2016, February). Simplicial principal component analysis for density functions in Bayes spaces. Computational Statistics & Data Analysis 94, 330–350.
  • Huynh et al. (2011) Huynh, K. P., D. T. Jacho-Chávez, R. J. Petrunia, and M. Voia (2011). Functional principal component analysis of density families with categorical and continuous data on Canadian entrant manufacturing firms. J. Amer. Statist. Assoc. 106(495), 858–878.
  • Johnson et al. (2016) Johnson, A. E. W., T. J. Pollard, L. Shen, L.-w. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark (2016, May). MIMIC-III, a freely accessible critical care database. Scientific Data 3(1), 1–9.
  • Kim et al. (2020) Kim, D., Y. K. Lee, and B. U. Park (2020, January). Principal component analysis for Hilbertian functional data. Communications for Statistical Applications and Methods 27(1), 149–161.
  • Kneip and Utikal (2001) Kneip, A. and K. J. Utikal (2001). Inference for density families using functional principal component analysis. J. Amer. Statist. Assoc. 96(454), 519–542.
  • Kooperberg and Stone (1991) Kooperberg, C. and C. J. Stone (1991). A study of logspline density estimation. Comput. Statist. Data Anal. 12(3), 327–347.
  • Lee and Scott (2012) Lee, G. and C. Scott (2012). EM algorithms for multivariate Gaussian mixture models with truncated and censored data. Comput. Statist. Data Anal. 56(9), 2816–2829.
  • Loader (1996) Loader, C. R. (1996). Local likelihood density estimation. Ann. Statist. 24(4), 1602–1618.
  • Panaretos and Zemel (2016) Panaretos, V. M. and Y. Zemel (2016). Amplitude and phase variation of point processes. Ann. Statist. 44(2), 771–812.
  • Petersen and Müller (2016) Petersen, A. and H.-G. Müller (2016). Functional data analysis for density functions by transformation to a Hilbert space. Ann. Statist. 44(1), 183–218.
  • Sheather and Jones (1991) Sheather, S. J. and M. C. Jones (1991). A reliable data-based bandwidth selection method for kernel density estimation. J. Roy. Statist. Soc. Ser. B 53(3), 683–690.
  • Srivastava et al. (2007) Srivastava, A., I. Jermyn, and S. Joshi (2007). Riemannian analysis of probability density functions with applications in vision. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp.  1–8.
  • Tsay (2016) Tsay, R. S. (2016). Some methods for analyzing big dependent data. J. Bus. Econom. Statist. 34(4), 673–688.
  • Wainwright (2019) Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint (First ed.). Cambridge University Press.
  • Wakefield and Elliott (1999) Wakefield, J. and P. Elliott (1999). Issues in the statistical analysis of small area health data. Statistics in Medicine 18(17-18), 2377–2399.
  • Wu et al. (2013) Wu, S., H.-G. Müller, and Z. Zhang (2013). Functional data analysis for point processes with rare events. Statist. Sinica 23(1), 1–23.
  • Yao et al. (2005) Yao, F., H.-G. Müller, and J.-L. Wang (2005). Functional data analysis for sparse longitudinal data. J. Amer. Statist. Assoc. 100(470), 577–590.