

Asymptotic properties of generalized closed-form maximum likelihood estimators

Pedro L. Ramos1, Eduardo Ramos2, Francisco A. Rodrigues2, and Francisco Louzada2
1 Facultad de Matemáticas, Pontificia Universidad Católica de Chile, Santiago, Chile
2 Institute of Mathematics and Computer Science, University of São Paulo, São Carlos, Brazil
Summary

The maximum likelihood estimator (MLE) is pivotal in statistical inference, yet its application is often hindered by the absence of closed-form solutions for many models. This poses challenges in real-time computation scenarios, particularly within embedded systems technology, where numerical methods are impractical. This study introduces a generalized form of the MLE that yields closed-form estimators under certain conditions. We derive the asymptotic properties of the proposed estimator and demonstrate that our approach retains key properties such as invariance under one-to-one transformations, strong consistency, and an asymptotic normal distribution. The effectiveness of the generalized MLE is exemplified through its application to the Gamma, Nakagami, and Beta distributions, showcasing improvements over the traditional MLE. Additionally, we extend this methodology to a bivariate gamma distribution, successfully deriving closed-form estimators. This advancement presents significant implications for real-time statistical analysis across various applications.

Keywords: Closed-form estimators; maximum likelihood estimators; generalized maximum likelihood estimator; generalized estimators.

1 Introduction

Introduced by Ronald Fisher [1], the maximum likelihood method is one of the most well-known and widely used inferential procedures to estimate the unknown parameters of a given distribution. Alternative methods to the maximum likelihood estimator (MLE) have been proposed in the literature, such as those based on statistical moments [13], percentiles [14, 15], the product of spacings [7], or goodness-of-fit measures, to list a few. Although alternative inferential methods are popular nowadays, MLEs are the most widely used due to their flexibility in incorporating additional complexity (such as random effects, covariates, and censoring, among others) and their properties: asymptotic efficiency, consistency, and invariance under one-to-one transformations. These properties are achieved when the MLEs satisfy some regularity conditions [5, 16, 20].

It is now well established from various studies that the MLEs do not return closed-form expressions for many common distributions. In these cases, numerical methods, such as Newton-Raphson or its variants, are usually employed to find the values that maximize the likelihood function. Important variants of the maximum likelihood estimator, such as profile [18], pseudo [11], conditional [2], penalized [3, 10] and marginal likelihoods [8], have been presented to eliminate nuisance parameters and decrease the computational cost. Another important procedure to obtain the MLEs is the expectation-maximization (EM) algorithm [9], which involves unobserved latent variables jointly with unknown parameters. The expectation and maximization steps also involve, in most cases, numerical methods that may have a high computational cost. However, in many situations closed-form estimators are needed to estimate the unknown parameters. For instance, in embedded technology, small components need to compute estimates without resorting to maximization procedures, and real-time applications require an immediate answer.

In this study, we present a generalized approach to the maximum likelihood method, enabling the derivation of closed-form expressions for estimating distribution parameters in numerous scenarios. Our primary objective is to establish the asymptotic normality and strong consistency of our proposed estimator. Furthermore, we demonstrate that these conditions are significantly simplified within a broad family of generalized maximum likelihood equations. The practical implications of our findings are substantial, offering efficient and rapid computational methods for obtaining estimates. Most importantly, our results facilitate the construction of confidence intervals and hypothesis tests, thus broadening their applicability in various fields.

The proposed method is illustrated with the Gamma, Beta, and Nakagami distributions and a bivariate gamma model. In these cases, the standard MLEs do not have closed-form expressions, and numerical methods or approximations are necessary to obtain the estimates. Our approach, in contrast, does not require iterative numerical methods, and, computationally, the work required by our estimators is simpler than that required for the ML estimators. The remainder of this paper is organized as follows. Section 2 presents the new generalized maximum likelihood estimator and its properties. Section 3 considers the application to the Gamma, Nakagami, and Beta distributions. Finally, Section 4 summarizes the study.

2 Generalized Maximum Likelihood Estimator

The method we propose here can be applied to obtain closed-form expressions for distributions with a given density $f(x\,;\,\boldsymbol{\theta})$. In order to formulate the method, let $\Omega$ represent the sample space, let $\mathcal{X}$ represent the space of the data $x$, where $\mathcal{X}$ is equipped with a measure $\mu$, which can be either discrete or continuous, let $\Theta\subset\mathbb{R}^{s}$ be an open set containing the true parameter $\boldsymbol{\theta}_{0}$ to be estimated, and for each $x\in\mathcal{X}$ let $\mathcal{A}_{x}\subset\mathbb{R}^{r}$, $0\leq r\leq s$, be an open set, possibly depending on $x$, containing a fixed parameter $\boldsymbol{\alpha}_{0}$, which represents additional parameters that will be used during the procedure to obtain the estimators.

Now, suppose $X_{1}, X_{2}, \cdots, X_{n}$ are independent and identically distributed (iid) random variables, which can be either discrete or continuous, with a strictly positive density function $f(x\,;\,\boldsymbol{\theta}_{0})$. Then, given a function $g(x\,;\,\boldsymbol{\theta},\boldsymbol{\alpha})$ defined for $x\in\mathcal{X}$, $\boldsymbol{\theta}\in\Theta$ and $\boldsymbol{\alpha}\in\mathcal{A}_{x}$, we define the generalized maximum likelihood equations for $\boldsymbol{\theta}$ over the coordinates $(\theta_{1},\cdots,\theta_{s-r},\boldsymbol{\alpha})$ at $\boldsymbol{\alpha}=\boldsymbol{\alpha}_{0}$ to be the set of equations

\frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\theta_{j}}\log\,g(X_{i}\,;\,\boldsymbol{\theta},\boldsymbol{\alpha}_{0})=\operatorname{E}_{\boldsymbol{\theta}}\left[\frac{\partial}{\partial\theta_{j}}\log\,g(X_{1}\,;\,\boldsymbol{\theta},\boldsymbol{\alpha}_{0})\right],\quad 1\leq j\leq s-r, (2)
\frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\alpha_{j}}\log\,g(X_{i}\,;\,\boldsymbol{\theta},\boldsymbol{\alpha}_{0})=\operatorname{E}_{\boldsymbol{\theta}}\left[\frac{\partial}{\partial\alpha_{j}}\log\,g(X_{1}\,;\,\boldsymbol{\theta},\boldsymbol{\alpha}_{0})\right],\quad 1\leq j\leq r,

as long as these partial derivatives exist and the expected values above are finite.

To see how the generalized likelihood equations generalize the maximum likelihood equations, note that, in case the equation $\int_{\mathcal{X}}f(x\,;\,\boldsymbol{\theta})\,d\mu=1$ can be differentiated under the integral sign, we obtain $\int_{\mathcal{X}}\frac{\partial}{\partial\theta_{j}}f(x\,;\,\boldsymbol{\theta})\,d\mu=0$ for all $j$, in which case, letting $g(x\,;\,\boldsymbol{\theta})=f(x\,;\,\boldsymbol{\theta})$, it follows that the generalized maximum likelihood equations for $\boldsymbol{\theta}$ over the coordinates $(\theta_{1},\cdots,\theta_{s})$ are given by the equations

\frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\theta_{j}}\log\,f(X_{i}\,;\,\boldsymbol{\theta})=0,\quad 1\leq j\leq s,

which coincide with the maximum likelihood equations. This differentiation under the integral sign condition is in fact a natural condition to impose, since it is universally used in order to prove the consistency and asymptotic normality of the maximum likelihood estimator.

From now on, our goal shall be that of giving conditions guaranteeing the existence of solutions of the generalized maximum likelihood equations, as well as conditions under which an obtained solution $\boldsymbol{\hat{\theta}}_{n}(X)$ of the generalized maximum likelihood equations is a consistent estimator for the true parameter $\boldsymbol{\theta}_{0}$ and is asymptotically normal. In order to formulate the result, given a fixed $\boldsymbol{\alpha}_{0}\in\Theta$ we denote

h_{i}(x\,;\,\boldsymbol{\theta})=\frac{\partial}{\partial\beta_{i}}\log\,g\left(x\,;\,\boldsymbol{\theta},\boldsymbol{\alpha}_{0}\right)-\operatorname{E}_{\boldsymbol{\theta}}\left[\frac{\partial}{\partial\beta_{i}}\log\,g\left(X_{1}\,;\,\boldsymbol{\theta},\boldsymbol{\alpha}_{0}\right)\right] (3)

for all $x$, $\boldsymbol{\theta}$ and $i$, where $(\beta_{1},\cdots,\beta_{s})=(\theta_{1},\cdots,\theta_{s-r},\alpha_{1},\cdots,\alpha_{r})$. Moreover, we let $J(\boldsymbol{\theta})=\left(J_{i,j}(\boldsymbol{\theta})\right)\in M_{s}(\mathbb{R})$ and $K(\boldsymbol{\theta})=\left(K_{i,j}(\boldsymbol{\theta})\right)\in M_{s}(\mathbb{R})$ be defined by

J_{i,j}(\boldsymbol{\theta})=\operatorname{E}_{\boldsymbol{\theta}}\left[-\frac{\partial}{\partial\theta_{j}}\,h_{i}(X_{1}\,;\,\boldsymbol{\theta})\right]\mbox{ and} (4)
K_{i,j}(\boldsymbol{\theta})=\operatorname{cov}_{\boldsymbol{\theta}}\left[h_{i}(X_{1}\,;\,\boldsymbol{\theta}),\,h_{j}(X_{1}\,;\,\boldsymbol{\theta})\right],

for $1\leq i\leq s$ and $1\leq j\leq s$. These matrices shall play the role that the Fisher information matrix $I$ plays in the classical maximum likelihood method.

In the following, we say an estimator $\hat{\theta}_{n}(\boldsymbol{X})$ satisfies the modified likelihood equations (2) with probability converging to one strongly if, letting $A_{n}$ denote the subset of $\Omega$ in which $\hat{\theta}_{n}(\boldsymbol{X})$ satisfies (2), we have

\lim_{n\to\infty}P(\cap_{m\geq n}A_{m})=1. (5)

More generally, we say a sequence of events $A_{1}$, $A_{2}$, $\cdots$ happens with probability converging to one strongly if (5) is valid.

In the following we prove a result regarding the existence and strong consistency of solutions of the modified likelihood equations (2) for an arbitrary probability density function $f$.

Theorem 2.1.

Denote $\boldsymbol{X}=\left(X_{1},X_{2},\cdots,X_{n}\right)$, where $X_{1},\cdots,X_{n}$ are iid with density $f(x\,;\,\boldsymbol{\theta}_{0})$, and suppose:

  • (A)

    $J(\boldsymbol{\theta}_{0})$ and $K(\boldsymbol{\theta}_{0})$, as defined in (4), exist and $J(\boldsymbol{\theta}_{0})$ is invertible.

  • (B)

    $h_{j}(x\,;\,\boldsymbol{\theta},\boldsymbol{\alpha}_{0})$ is measurable in $x$, and $\frac{\partial}{\partial\theta_{i}}h_{j}(x\,;\,\boldsymbol{\theta},\boldsymbol{\alpha}_{0})$ exists and is continuous in $\boldsymbol{\theta}$, for all $i$, $j$ and $\boldsymbol{\theta}\in\Theta$, where $h_{j}$ is given in (3).

  • (C)

    There exist measurable functions $M_{ij}(x)$ and an open set $\Theta_{0}$ containing the true parameter $\boldsymbol{\theta}_{0}$ such that $\overline{\Theta}_{0}\subset\Theta$ and, for all $\boldsymbol{\theta}\in\overline{\Theta}_{0}$ and $x\in\mathcal{X}$, we have

    \left|\frac{\partial}{\partial\theta_{i}}h_{j}(x\,;\,\boldsymbol{\theta},\boldsymbol{\alpha}_{0})\right|\leq M_{ij}(x)\mbox{ and }E_{\boldsymbol{\theta}_{0}}\left[M_{ij}(X_{1})\right]<\infty,

    for all $1\leq i\leq s$ and $1\leq j\leq s$.

Then, with probability converging to one, the generalized maximum likelihood equations have a solution. Specifically, there exists $\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{x})=(\hat{\theta}_{1n}(\boldsymbol{x}),\cdots,\hat{\theta}_{sn}(\boldsymbol{x}))$, measurable in $\boldsymbol{x}\in\mathcal{X}^{n}$, such that:

  • I)

    $\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X})$ satisfies the modified likelihood equations (2) with probability converging to one strongly.

  • II)

    $\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X})$ is a strongly consistent estimator for $\boldsymbol{\theta}_{0}$.

  • III)

    \sqrt{n}(\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X})-\boldsymbol{\theta}_{0})^{T}\overset{D}{\to}N_{s}\left(0,(J(\boldsymbol{\theta}_{0})^{-1})^{T}K(\boldsymbol{\theta}_{0})J(\boldsymbol{\theta}_{0})^{-1}\right).

Proof.

The proof is available in the Appendix. ∎

Note that if $r=0$ in the above result, then condition (A) corresponds to requiring the Fisher information matrix $I(\boldsymbol{\theta}_{0})$ to be invertible, since in such case $J(\boldsymbol{\theta}_{0})=I(\boldsymbol{\theta}_{0})$.

As a corollary of the above result, we have in particular the following theorem, which simplifies conditions (A) to (C) above when $g$ is contained in a certain family of measurable functions.

Theorem 2.2.

Denote $\boldsymbol{X}=\left(X_{1},X_{2},\cdots,X_{n}\right)$, where $X_{1},\cdots,X_{n}$ are iid with density $f(x\,;\,\boldsymbol{\theta}_{0})$, and let $g(x\,;\,\boldsymbol{\theta},\boldsymbol{\alpha})$ be defined as

g(x\,;\,\boldsymbol{\theta},\boldsymbol{\alpha})=V(x)\exp\left(\sum_{i=1}^{s}\eta_{i}(\boldsymbol{\theta})T_{i}(x,\boldsymbol{\alpha})+L(\boldsymbol{\theta},\boldsymbol{\alpha})\right)\mbox{ for all }x\in\mathcal{X},\ \boldsymbol{\theta}\in\Theta,\ \boldsymbol{\alpha}\in\mathcal{A}_{x},

where $\eta_{i}$ and $L$ are $C^{3}$ for all $i$, $V$ is measurable and positive, $T_{i}(x,\boldsymbol{\alpha}_{0})$ is measurable in $x$, the partial derivatives $\frac{\partial}{\partial\alpha_{j}}T_{i}(x,\boldsymbol{\alpha}_{0})$ exist for all $i$ and $j$, and suppose:

  • (A)

    $J(\boldsymbol{\theta}_{0})$ and $K(\boldsymbol{\theta}_{0})$, as defined in (4), exist and $J(\boldsymbol{\theta}_{0})$ is invertible.

  • (B)

    $\operatorname{E}_{\boldsymbol{\theta}}\left[T_{i}(X_{1},\boldsymbol{\alpha}_{0})\right]$ and $\operatorname{E}_{\boldsymbol{\theta}}\left[\frac{\partial}{\partial\alpha_{j}}T_{i}(X_{1},\boldsymbol{\alpha}_{0})\right]$ are finite, for all $i$, $j$ and $\boldsymbol{\theta}\in\Theta$.

Then, with probability converging to one as $n\to\infty$, the generalized maximum likelihood equations have a solution. Specifically, there exists $\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{x})=(\hat{\theta}_{1n}(\boldsymbol{x}),\cdots,\hat{\theta}_{sn}(\boldsymbol{x}))$, measurable in $\boldsymbol{x}\in\mathcal{X}^{n}$, such that:

  • I)

    $\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X})$ satisfies the modified likelihood equations (2) with probability converging to one strongly.

  • II)

    $\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X})$ is a strongly consistent estimator for $\boldsymbol{\theta}_{0}$.

  • III)

    \sqrt{n}(\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X})-\boldsymbol{\theta}_{0})^{T}\overset{D}{\to}N_{s}\left(0,(J(\boldsymbol{\theta}_{0})^{-1})^{T}K(\boldsymbol{\theta}_{0})J(\boldsymbol{\theta}_{0})^{-1}\right).

Proof.

The proof is available in the Appendix. ∎

Note that the family of measurable functions imposed above for $g$ is more general than the exponential family of distributions and, besides, $g$ is not even required to be a probability density function. Additionally, note that no restrictions are imposed on $f$ besides being a probability density function. Thus, we consider this result to be important, since it provides an infinite number of possible estimators, due to the infinitely many possible choices for $g$, and, besides, provides easy-to-verify conditions for the obtained estimators to be strongly consistent and asymptotically normal.

Now, in order to define the generalized likelihood equations under a change of variables, given a diffeomorphism $\pi:\Theta\to\Lambda$ and letting $g^{*}(x\,;\,\boldsymbol{\lambda},\boldsymbol{\alpha})=g(x\,;\,\pi^{-1}(\boldsymbol{\lambda}),\boldsymbol{\alpha})$ for all $x$, $\boldsymbol{\lambda}\in\Lambda$ and $\boldsymbol{\alpha}\in\mathcal{A}_{x}$, we let the generalized maximum likelihood equations for $\boldsymbol{\lambda}=\pi(\boldsymbol{\theta})$ at $\boldsymbol{\alpha}=\boldsymbol{\alpha}_{0}$ be defined by the set of equations

\frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\lambda_{j}}\log\,g^{*}(X_{i}\,;\,\boldsymbol{\lambda},\boldsymbol{\alpha}_{0})=\operatorname{E}_{\pi^{-1}(\boldsymbol{\lambda})}\left[\frac{\partial}{\partial\lambda_{j}}\log\,g^{*}(X_{1}\,;\,\boldsymbol{\lambda},\boldsymbol{\alpha}_{0})\right],\quad 1\leq j\leq s-r, (6)
\frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\alpha_{j}}\log\,g^{*}(X_{i}\,;\,\boldsymbol{\lambda},\boldsymbol{\alpha}_{0})=\operatorname{E}_{\pi^{-1}(\boldsymbol{\lambda})}\left[\frac{\partial}{\partial\alpha_{j}}\log\,g^{*}(X_{1}\,;\,\boldsymbol{\lambda},\boldsymbol{\alpha}_{0})\right],\quad 1\leq j\leq r,

as long as these partial derivatives exist and the expected values are finite.

Proposition 2.3 (One-to-one invariance).

Suppose $\Theta=\Theta_{1}\times\Theta_{2}$ and $\Lambda=\Lambda_{1}\times\Lambda_{2}$, where $\Theta_{1},\Lambda_{1}\subset\mathbb{R}^{s-r}$ and $\Theta_{2},\Lambda_{2}\subset\mathbb{R}^{r}$ are open sets, and suppose $\pi:\Theta\to\Lambda$ can be written as

\pi(\boldsymbol{\theta})=(\pi_{1}(\boldsymbol{\theta}_{1}),\pi_{2}(\boldsymbol{\theta}_{2})),\mbox{ for all }\boldsymbol{\theta}=(\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2})\in\Theta_{1}\times\Theta_{2},

where $\pi_{1}:\Theta_{1}\to\Lambda_{1}$ and $\pi_{2}:\Theta_{2}\to\Lambda_{2}$ are diffeomorphisms, and suppose that, for some $n\in\mathbb{N}$, with probability one on $\Omega$, $\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X})$ is a solution of the generalized maximum likelihood equations for $\boldsymbol{\theta}$ at $\boldsymbol{\alpha}=\boldsymbol{\alpha}_{0}$. Then, with probability one on $\Omega$, $\pi(\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X}))$ is a solution of the generalized likelihood equations for $\boldsymbol{\lambda}=\pi(\boldsymbol{\theta})$ at $\boldsymbol{\alpha}=\boldsymbol{\alpha}_{0}$.

Proof.

Since $\pi_{1}$ does not depend on $\boldsymbol{\theta}_{2}$ and $\pi_{2}$ does not depend on $\boldsymbol{\theta}_{1}$, it follows that $\pi^{-1}(\boldsymbol{\lambda})=\left(\pi_{1}^{-1}(\boldsymbol{\lambda}_{1}),\pi_{2}^{-1}(\boldsymbol{\lambda}_{2})\right)$ for all $\boldsymbol{\lambda}=(\boldsymbol{\lambda}_{1},\boldsymbol{\lambda}_{2})\in\Lambda_{1}\times\Lambda_{2}$ and thus, letting $\pi_{1}^{-1}(\boldsymbol{\lambda}_{1})=\left(\pi^{*}_{11}(\boldsymbol{\lambda}_{1}),\cdots,\pi^{*}_{1(s-r)}(\boldsymbol{\lambda}_{1})\right)$ for all $\boldsymbol{\lambda}_{1}\in\Lambda_{1}$, and letting $g^{*}(x\,;\,\boldsymbol{\lambda},\boldsymbol{\alpha})=g(x\,;\,\pi^{-1}(\boldsymbol{\lambda}),\boldsymbol{\alpha})$ for all $x$, from the chain rule it follows that

\frac{\partial}{\partial\lambda_{j}}\log g^{*}(X_{i}\,;\,\boldsymbol{\lambda},\boldsymbol{\alpha}_{0})=\sum_{k=1}^{s-r}\frac{\partial}{\partial\theta_{k}}\log g(X_{i}\,;\,\pi^{-1}\left(\boldsymbol{\lambda}\right),\boldsymbol{\alpha}_{0})\frac{\partial}{\partial\lambda_{j}}\pi^{*}_{1k}(\boldsymbol{\lambda}_{1}), (7)

for all $i$ and $1\leq j\leq s-r$. Moreover, by hypothesis, with probability one on $\Omega$, $\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X})$ satisfies

\frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\theta_{j}}\log\,g(X_{i}\,;\,\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X}),\boldsymbol{\alpha}_{0})=\int_{\mathcal{X}}\left(\frac{\partial}{\partial\theta_{j}}\log\,g(X_{1}\,;\,\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X}),\boldsymbol{\alpha}_{0})\right)f(X_{1}\,;\,\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X}),\boldsymbol{\alpha}_{0})\;d\mu, (8)

for all $i$ and $1\leq j\leq s-r$. Thus, denoting $\boldsymbol{\hat{\lambda}}_{n}(\boldsymbol{X})=\pi(\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X}))$, it follows, combining (7) and (8), that

\frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\lambda_{j}}\log\,g^{*}(X_{i}\,;\,\boldsymbol{\hat{\lambda}}_{n}(\boldsymbol{X}),\boldsymbol{\alpha}_{0})=\sum_{k=1}^{s-r}\left(\frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\theta_{k}}\log g(X_{i}\,;\,\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X}),\boldsymbol{\alpha}_{0})\right)\frac{\partial}{\partial\lambda_{j}}\pi^{*}_{1k}(\boldsymbol{\lambda}_{1}),
=\int_{\mathcal{X}}\left(\sum_{k=1}^{s-r}\left(\frac{\partial}{\partial\theta_{k}}\log g(X_{1}\,;\,\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X}),\boldsymbol{\alpha}_{0})\right)\frac{\partial}{\partial\lambda_{j}}\pi^{*}_{1k}(\boldsymbol{\lambda}_{1})\right)f(X_{1}\,;\,\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X}),\boldsymbol{\alpha}_{0})\;d\mu
=\int_{\mathcal{X}}\left(\frac{\partial}{\partial\lambda_{j}}\log\,g^{*}(X_{1}\,;\,\boldsymbol{\hat{\lambda}}_{n}(\boldsymbol{X}),\boldsymbol{\alpha}_{0})\right)f(X_{1}\,;\,\pi^{-1}(\boldsymbol{\hat{\lambda}}_{n}(\boldsymbol{X})),\boldsymbol{\alpha}_{0})\;d\mu
=\operatorname{E}_{\pi^{-1}(\boldsymbol{\hat{\lambda}}_{n}(\boldsymbol{X}))}\left[\frac{\partial}{\partial\lambda_{j}}\log\,g^{*}(X_{1}\,;\,\boldsymbol{\hat{\lambda}}_{n}(\boldsymbol{X}),\boldsymbol{\alpha}_{0})\right]

with probability one on $\Omega$. That is, $\boldsymbol{\hat{\lambda}}_{n}(\boldsymbol{X})$ satisfies, with probability one on $\Omega$, the first set of equations in (6). Additionally, since by hypothesis $\pi$ does not depend on the variable $\alpha_{j}$ for any given $1\leq j\leq r$, it follows that

\frac{\partial}{\partial\alpha_{j}}\,\log g^{*}(X_{i}\,;\,\boldsymbol{\hat{\lambda}}_{n}(\boldsymbol{X}),\boldsymbol{\alpha}_{0})=\frac{\partial}{\partial\alpha_{j}}\,\log g(X_{i}\,;\,\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X}),\boldsymbol{\alpha}_{0}),

for all $i$ and $1\leq j\leq r$, from which it follows, using the hypothesis just as before, that

\frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\alpha_{j}}\log\,g^{*}(X_{i}\,;\,\boldsymbol{\hat{\lambda}}_{n}(\boldsymbol{X}),\boldsymbol{\alpha}_{0})=\operatorname{E}_{\pi^{-1}(\boldsymbol{\hat{\lambda}}_{n}(\boldsymbol{X}))}\left[\frac{\partial}{\partial\alpha_{j}}\log\,g^{*}(X_{1}\,;\,\boldsymbol{\hat{\lambda}}_{n}(\boldsymbol{X}),\boldsymbol{\alpha}_{0})\right]

for $1\leq j\leq r$, with probability one on $\Omega$, which concludes the proof. ∎

In general, MML estimators will not necessarily be functions of sufficient statistics. Additionally, in our applications, we shall use $g$ as a generalized version of the distribution $f$ in order to obtain the generalized maximum likelihood estimators. As we shall see, due to the high number of new distributions introduced in the past decades, it is not difficult to find such functions $g$ generalizing $f$. In the next section, we present applications of the proposed method.

3 Examples

We illustrate the proposed method by applying it to the Gamma, Nakagami-m, and Beta distributions. The examples involve well-known distributions, so we shall not present their backgrounds. The standard MLEs for the cited distributions are widely discussed in statistical textbooks, where it is shown that no closed-form expressions can be achieved using the ML method.

The Gamma and the Nakagami-m distributions are particular cases of the generalized Gamma distribution, while the Beta distribution is a special case of the generalized Beta distribution. Therefore, we will consider these generalized distributions to obtain the generalized maximum likelihood equations used to derive the closed-form estimators.

As we shall see, in all examples presented here we shall have

\operatorname{E}_{\boldsymbol{\theta}}\left[\frac{\partial}{\partial\theta_{j}}\log\,g(X_{1}\,;\,\boldsymbol{\theta},\boldsymbol{\alpha}_{0})\right]=0\mbox{ and }\operatorname{E}_{\boldsymbol{\theta}}\left[\frac{\partial}{\partial\alpha_{j}}\log\,g(X_{1}\,;\,\boldsymbol{\theta},\boldsymbol{\alpha}_{0})\right]=0 (9)

for all $j$. This should be expected in these examples due to differentiation under the integral sign of the equation $\int_{\mathcal{X}}g(x\,;\,\boldsymbol{\theta},\boldsymbol{\alpha})\;d\mu=1$, since in these examples $f$ is a special case of $g$, and $g$ is a probability distribution. In particular, in these examples the generalized maximum likelihood equations shall be given by

\frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\theta_{j}}\log\,g(X_{i}\,;\,\boldsymbol{\theta},\boldsymbol{\alpha}_{0})=0,\quad 1\leq j\leq s-r, (10)
\frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\alpha_{j}}\log\,g(X_{i}\,;\,\boldsymbol{\theta},\boldsymbol{\alpha}_{0})=0,\quad 1\leq j\leq r,

and moreover $J(\boldsymbol{\theta})$ and $K(\boldsymbol{\theta})$ shall be given by

J_{i,j}(\boldsymbol{\theta})=\operatorname{E}_{\boldsymbol{\theta}}\left[-\frac{\partial^{2}}{\partial\theta_{j}\partial\beta_{i}}\,\log g(X_{1}\,;\,\boldsymbol{\theta},\boldsymbol{\alpha}_{0})\right]\mbox{ and} (11)
K_{i,j}(\boldsymbol{\theta})=\operatorname{cov}_{\boldsymbol{\theta}}\left[\frac{\partial}{\partial\beta_{i}}\log g(X_{1}\,;\,\boldsymbol{\theta},\boldsymbol{\alpha}_{0}),\,\frac{\partial}{\partial\beta_{j}}\log g(X_{1}\,;\,\boldsymbol{\theta},\boldsymbol{\alpha}_{0})\right],

for all $i$ and $j$, where $\boldsymbol{\beta}$ is as in (3). Additionally, since we shall use only well-known distributions $g(x\,;\,\boldsymbol{\theta},\boldsymbol{\alpha})$, whose Fisher information matrix $I(\boldsymbol{\theta},\boldsymbol{\alpha})$ can be computed either by

I_{i,j}(\boldsymbol{\theta},\boldsymbol{\alpha})=\operatorname{E}_{\boldsymbol{\theta}}\left[-\frac{\partial^{2}}{\partial\beta^{*}_{i}\partial\beta^{*}_{j}}\,\log g(X_{1}\,;\,\boldsymbol{\theta},\boldsymbol{\alpha})\right]\mbox{ or} (12)
I_{i,j}(\boldsymbol{\theta},\boldsymbol{\alpha})=\operatorname{cov}_{\boldsymbol{\theta}}\left[\frac{\partial}{\partial\beta^{*}_{i}}\log g(X_{1}\,;\,\boldsymbol{\theta},\boldsymbol{\alpha}),\,\frac{\partial}{\partial\beta^{*}_{j}}\log g(X_{1}\,;\,\boldsymbol{\theta},\boldsymbol{\alpha})\right],

where $\boldsymbol{\beta}^{*}=(\boldsymbol{\alpha},\boldsymbol{\theta})$, it follows that, in these examples, $J(\boldsymbol{\theta})$ and $K(\boldsymbol{\theta})$ are submatrices of $I(\boldsymbol{\theta},\boldsymbol{\alpha}_{0})$.

Example 1: Let us consider that $X_{1}$, $X_{2}$, $\ldots$, $X_{n}$ are iid random variables (RV) following a gamma distribution with probability density function (PDF) given by:

f(x\,;\,\lambda,\phi)=\frac{1}{\Gamma(\phi)}\left(\frac{\phi}{\lambda}\right)^{\phi}x^{\phi-1}\exp\left(-\frac{\phi}{\lambda}x\right)\mbox{ for all }x>0, (13)

where $\phi>0$ is the shape parameter, $\lambda>0$ is the scale parameter and $\Gamma(\alpha)=\int_{0}^{\infty}{e^{-x}x^{\alpha-1}dx}$ is the gamma function.

We can apply the generalized maximum likelihood approach to this distribution by considering the density function $g(x\,;\,\lambda,\phi,\alpha)$ representing the generalized gamma distribution, where $\lambda>0$, $\phi>0$ and $\alpha>0$, given by

g(x\,;\,\lambda,\phi,\alpha)=\frac{\alpha}{\Gamma(\phi)}\left(\frac{\phi}{\lambda}\right)^{\phi}x^{\alpha\phi-1}\exp\left(-\frac{\phi}{\lambda}x^{\alpha}\right)\mbox{ for all }x>0. (14)

In order to formulate the generalized maximum likelihood equations for this distribution we first note that

\operatorname{E}_{\lambda,\phi}\left[X_{1}\right]=\lambda,\ \operatorname{E}_{\lambda,\phi}\left[\log(X_{1})\right]=\psi(\phi)-\log\left(\frac{\phi}{\lambda}\right)\mbox{ and } (15)
\operatorname{E}_{\lambda,\phi}\left[X_{1}\log(X_{1})\right]=\lambda\left(\psi(\phi)+\frac{1}{\phi}-\log\left(\frac{\phi}{\lambda}\right)\right),

which, combined with $\operatorname{E}_{\lambda,\phi}[1]=1$, implies that

\operatorname{E}_{\lambda,\phi}\left[\frac{\partial}{\partial\lambda}\log g(X_{1}\,;\,\lambda,\phi,1)\right]=\frac{\phi}{\lambda}-\frac{\phi}{\lambda^{2}}\operatorname{E}_{\lambda,\phi}\left[X_{1}\right]=0\mbox{ and}
\operatorname{E}_{\lambda,\phi}\left[\frac{\partial}{\partial\alpha}\log g(X_{1}\,;\,\lambda,\phi,1)\right]=1+\phi\operatorname{E}_{\lambda,\phi}\left[\log\left(X_{1}\right)\right]-\frac{\phi}{\lambda}\operatorname{E}_{\lambda,\phi}\left[X_{1}\log\left(X_{1}\right)\right]=0,

that is, (9) is satisfied. Thus, the generalized likelihood equations for $\boldsymbol{\theta}=(\lambda,\phi)$ over the coordinates $(\lambda,\alpha)$ at $\alpha=1$ are given by

\sum_{i=1}^{n}\frac{\partial}{\partial\lambda}\log g(X_{i}\,;\,\lambda,\phi,1)=\frac{n\phi}{\lambda}-\frac{\phi}{\lambda^{2}}\sum_{i=1}^{n}{X_{i}}=0\mbox{ and}
\sum_{i=1}^{n}\frac{\partial}{\partial\alpha}\log g(X_{i}\,;\,\lambda,\phi,1)=n+\phi\left(\sum_{i=1}^{n}\log\left(X_{i}\right)-\frac{1}{\lambda}\sum_{i=1}^{n}X_{i}\log\left(X_{i}\right)\right)=0.

Following [17], as long as the equality $X_{1}=\cdots=X_{n}$ does not hold, we have $\sum_{i=1}^{n}X_{i}\log\left(X_{i}\right)-\frac{1}{n}\sum_{i=1}^{n}X_{i}\sum_{i=1}^{n}\log\left(X_{i}\right)\neq 0$, in which case a direct computation shows that the generalized likelihood equations above have as their only solution

\hat{\lambda}_{n}=\frac{1}{n}\sum_{i=1}^{n}{X_{i}}\mbox{ and }\hat{\phi}_{n}=\frac{\sum_{i=1}^{n}X_{i}}{\sum_{i=1}^{n}X_{i}\log\left(X_{i}\right)-\frac{1}{n}\sum_{i=1}^{n}X_{i}\sum_{i=1}^{n}\log\left(X_{i}\right)}. (16)
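
For illustration, a minimal Python/NumPy sketch of the closed-form estimators in (16) could read as follows (the function name and the use of NumPy are our choices, not part of the method itself):

import numpy as np

def gamma_generalized_mle(x):
    # Closed-form estimators (16): lambda_hat is the sample mean and
    # phi_hat = sum(X_i) / (sum(X_i log X_i) - (1/n) sum(X_i) sum(log X_i)).
    x = np.asarray(x, dtype=float)
    n = x.size
    s = x.sum()
    log_x = np.log(x)
    lam_hat = s / n
    phi_hat = s / (np.sum(x * log_x) - s * log_x.sum() / n)
    return lam_hat, phi_hat

Both estimates are obtained in a single pass over the data, with no iterative optimization.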

On the other hand, the MLEs for $\phi$ and $\lambda$ would be obtained by solving the non-linear system of equations

\lambda=\frac{1}{n}\sum_{i=1}^{n}X_{i}\mbox{ and }\log(\phi)-\psi(\phi)=\log(\lambda)-\frac{1}{n}\sum_{i=1}^{n}{\log\left(X_{i}\right)}, (17)

where $\psi(k)=\frac{\partial}{\partial k}\log\Gamma(k)=\frac{\Gamma^{\prime}(k)}{\Gamma(k)}$ is the digamma function.
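
For comparison, the classical MLE of $\phi$ in (17) requires a numerical root finder; a hedged sketch using a Newton iteration is given below (scipy.special supplies the digamma and trigamma functions; the starting value, based on $\log\phi-\psi(\phi)\approx 1/(2\phi)$, and the tolerance are our choices, not prescribed by the paper):

import numpy as np
from scipy.special import digamma, polygamma

def gamma_mle_phi(x, tol=1e-10, max_iter=100):
    # Newton iteration for log(phi) - psi(phi) = log(mean(x)) - mean(log(x)),
    # i.e. the second equation in (17).
    x = np.asarray(x, dtype=float)
    c = np.log(x.mean()) - np.mean(np.log(x))
    phi = 0.5 / c                      # rough start from log(phi) - psi(phi) ~ 1/(2 phi)
    for _ in range(max_iter):
        step = (np.log(phi) - digamma(phi) - c) / (1.0 / phi - polygamma(1, phi))
        phi -= step
        if abs(step) < tol:
            break
    return phi

This highlights the contrast with (16), which needs no iteration at all.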

We now apply Theorem 2.2 to prove that the obtained estimators are consistent and asymptotically normal.

Proposition 3.1.

$\hat{\phi}_{n}$ and $\hat{\lambda}_{n}$ are strongly consistent estimators for the true parameters $\phi$ and $\lambda$, and asymptotically normal with $\sqrt{n}\left(\hat{\lambda}_{n}-\lambda\right)\overset{D}{\to}N\left(0,\lambda^{2}/\phi\right)$ and $\sqrt{n}\left(\hat{\phi}_{n}-\phi\right)\overset{D}{\to}N\left(0,\phi^{3}\psi^{\prime}(\phi+1)+\phi^{2}\right)$.

Proof.

In order to apply Theorem 2.2 we note that $g(x\,;\,\lambda,\phi,\alpha)$ can be rewritten as

g(x\,;\,\lambda,\phi,\alpha)=\frac{\alpha}{\Gamma(\phi)}\left(\frac{\phi}{\lambda}\right)^{\phi}x^{\alpha\phi-1}\exp\left(-\frac{\phi}{\lambda}x^{\alpha}\right)=
V(x)\exp\left(\eta_{1}(\lambda,\phi)T_{1}(x,\alpha)+\eta_{2}(\lambda,\phi)T_{2}(x,\alpha)+L(\lambda,\phi,\alpha)\right)

for all $x>0$, $\lambda>0$, $\phi>0$ and $\alpha>0$, where

V(x)=\frac{1}{x},\ \eta_{1}(\lambda,\phi)=\phi,\ \eta_{2}(\lambda,\phi)=-\frac{\phi}{\lambda},
T_{1}(x,\alpha)=\alpha\log(x),\ T_{2}(x,\alpha)=x^{\alpha}\mbox{ and }
L(\lambda,\phi,\alpha)=\log\left(\alpha\right)-\log\left(\Gamma(\phi)\right)+\phi\log\left(\frac{\phi}{\lambda}\right).

To check condition (A) of Theorem 2.2, note that, for $\alpha=1$ and using a reparametrization of the Fisher information matrix of the generalized gamma (GG) distribution available in [12], it follows that the Fisher information matrix under our parametrization satisfies

I(\lambda,\phi,\alpha_{0})=\begin{bmatrix}\dfrac{\phi}{\lambda^{2}}&0&\frac{\phi\log\left(\frac{\phi}{\lambda}\right)-\phi\psi(\phi)-1}{\lambda}\\ 0&\frac{\phi\psi^{\prime}(\phi)-1}{\phi}&\frac{1}{\phi}\\ \frac{\phi\log\left(\frac{\phi}{\lambda}\right)-\phi\psi(\phi)-1}{\lambda}&\frac{1}{\phi}&I_{3,3}(\lambda,\phi)\end{bmatrix}, (18)

for $\alpha_{0}=1$, where

I_{3,3}(\lambda,\phi)=\log\left(\frac{\phi}{\lambda}\right)\left(\phi\log\left(\frac{\phi}{\lambda}\right)-2\phi\psi(\phi)-2\right)+\phi\psi^{\prime}(\phi)+2\psi(\phi)+\phi\psi(\phi)^{2}+1.

Therefore, since, as discussed earlier, $J(\lambda,\phi)$ and $K(\lambda,\phi)$ can be computed as submatrices of $I(\lambda,\phi,\alpha_{0})$, we have

J(\lambda,\phi)=\begin{bmatrix}\dfrac{\phi}{\lambda^{2}}&\frac{\phi\log\left(\frac{\phi}{\lambda}\right)-\phi\psi(\phi)-1}{\lambda}\\ 0&\frac{1}{\phi}\end{bmatrix}\mbox{ and }K(\lambda,\phi)=\begin{bmatrix}\dfrac{\phi}{\lambda^{2}}&\frac{\phi\log\left(\frac{\phi}{\lambda}\right)-\phi\psi(\phi)-1}{\lambda}\\ \frac{\phi\log\left(\frac{\phi}{\lambda}\right)-\phi\psi(\phi)-1}{\lambda}&I_{3,3}(\lambda,\phi)\end{bmatrix}, (19)

and thus, since $\det(J(\lambda,\phi))=\frac{1}{\lambda^{2}}\neq 0$, it follows that $J(\lambda,\phi)$ is invertible for all $\phi>0$ and $\lambda>0$ with

J(\lambda,\phi)^{-1}=\begin{bmatrix}\frac{\lambda^{2}}{\phi}&-\lambda\left(\phi\log\left(\frac{\phi}{\lambda}\right)-\phi\psi(\phi)-1\right)\\ 0&\phi\end{bmatrix},

that is, condition (A) is verified. Additionally, after some algebraic computations, one can verify that

(J(\lambda,\phi)^{-1})^{T}K(\lambda,\phi)J(\lambda,\phi)^{-1}=\begin{bmatrix}\frac{\lambda^{2}}{\phi}&0\\ 0&\phi^{3}\psi^{\prime}(\phi+1)+\phi^{2}\end{bmatrix}. (20)

Item (B) of Theorem 2.2 is straightforward to check from the moments in (15). Thus conditions (A) and (B) of Theorem 2.2 are valid and therefore, from Theorem 2.2, we conclude there exists $\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{x})=(\hat{\theta}_{1n}(\boldsymbol{x}),\hat{\theta}_{2n}(\boldsymbol{x}))$ measurable in $\boldsymbol{x}\in\mathcal{X}^{n}$ satisfying items I) to III) of Theorem 2.2.

Now, since the event $X_{1}=\cdots=X_{n}$ has probability zero of occurring for $n\geq 2$, it follows that $(\hat{\lambda}_{n},\hat{\phi}_{n})$ as given in (16) is, with probability one, the only solution of the generalized maximum likelihood equations for $n\geq 2$. This fact, combined with item I) of Theorem 2.2, implies that $\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X})-(\hat{\lambda}_{n},\hat{\phi}_{n})\overset{a.s.}{\to}0$. Thus the proposition follows from items II) and III) of Theorem 2.2 combined with (20). ∎
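
As a practical consequence of Proposition 3.1, Wald-type confidence intervals for $\phi$ follow directly from the closed-form estimate by plugging $\hat{\phi}_{n}$ into the asymptotic variance $\phi^{3}\psi^{\prime}(\phi+1)+\phi^{2}$. A minimal sketch, assuming SciPy is available (the 95% default level and the function name are our choices):

import numpy as np
from scipy.special import polygamma
from scipy.stats import norm

def phi_confidence_interval(x, level=0.95):
    # Estimator (16) for phi and the asymptotic variance from Proposition 3.1.
    x = np.asarray(x, dtype=float)
    n = x.size
    s = x.sum()
    log_x = np.log(x)
    phi_hat = s / (np.sum(x * log_x) - s * log_x.sum() / n)
    var = phi_hat**3 * polygamma(1, phi_hat + 1) + phi_hat**2
    half_width = norm.ppf(0.5 + level / 2) * np.sqrt(var / n)
    return phi_hat - half_width, phi_hat + half_width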

Note that the MLE of $\phi$ differs from the one obtained using our approach, which leads to a closed-form expression. Figure 1 presents the bias and the root mean square error (RMSE) obtained from 100,000 replications assuming $n=10,15,\ldots,100$, $\phi=2$ and $\lambda=1.5$. We present only the results related to $\phi$, since the estimator of $\lambda$ is the same under both approaches. It can be seen from the obtained results that both estimators yield similar (although not identical) results.

Figure 1: Bias and RMSE for $\phi$ for sample sizes of $10,15,\ldots,100$ elements, considering $\phi=2$ and $\lambda=1.5$.

Example 2: Let $X_{1}$, $X_{2}$, $\ldots$, $X_{n}$ be iid random variables following a Nakagami-m distribution with PDF given by

f(x\,;\,\lambda,\phi)=\frac{2}{\Gamma(\phi)}\left(\frac{\phi}{\lambda}\right)^{\phi}x^{2\phi-1}\exp\left(-\frac{\phi}{\lambda}x^{2}\right),

for all $x>0$, where $\phi>0.5$ and $\lambda>0$.

Once again letting $g$ be as in (14), just as in Example 1, following [19], as long as $X_{1}=\cdots=X_{n}$ does not hold, it follows that $\sum_{i=1}^{n}X_{i}^{2}\log\left(X_{i}^{2}\right)-\frac{1}{n}\sum_{i=1}^{n}X_{i}^{2}\sum_{i=1}^{n}\log\left(X_{i}^{2}\right)\neq 0$, in which case the generalized maximum likelihood equations for $(\lambda,\phi)$ over the coordinates $(\lambda,\alpha)$ at $\alpha=2$ have as their only solution

\hat{\lambda}_{n}=\frac{1}{n}\sum_{i=1}^{n}{X_{i}^{2}}\ \ \mbox{ and }\ \ \hat{\phi}_{n}=\cfrac{\sum_{i=1}^{n}X_{i}^{2}}{\sum_{i=1}^{n}X_{i}^{2}\log\left(X_{i}^{2}\right)-\frac{1}{n}\sum_{i=1}^{n}X_{i}^{2}\sum_{i=1}^{n}\log\left(X_{i}^{2}\right)}.
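
In code, these estimators amount to applying the gamma-type formulas to the squared observations; a minimal NumPy sketch (the function name is ours):

import numpy as np

def nakagami_generalized_mle(x):
    # Closed-form estimators above: work with x^2 and reuse the gamma-type formulas.
    x2 = np.asarray(x, dtype=float) ** 2
    n = x2.size
    s = x2.sum()
    log_x2 = np.log(x2)
    lam_hat = s / n
    phi_hat = s / (np.sum(x2 * log_x2) - s * log_x2.sum() / n)
    return lam_hat, phi_hat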

These estimators have expressions similar to those of the MMLEs of the Gamma distribution. Once again, we note these estimators are strongly consistent and asymptotically normal:

Proposition 3.2.

$\hat{\phi}_{n}$ and $\hat{\lambda}_{n}$ are strongly consistent estimators for the true parameters $\phi$ and $\lambda$, and asymptotically normal with $\sqrt{n}\left(\hat{\lambda}_{n}-\lambda\right)\overset{D}{\to}N\left(0,\lambda^{2}/\phi\right)$ and $\sqrt{n}\left(\hat{\phi}_{n}-\phi\right)\overset{D}{\to}N\left(0,\phi^{3}\psi^{\prime}(\phi+1)+\phi^{2}\right)$.

Proof.

The arguments and computations involved are completely analogous to those of Proposition 3.1. ∎

Here, we also compare the proposed estimators with the standard MLE. In Figure 2 we present the bias and RMSE obtained from 100,000 replications assuming $n=10,15,\ldots,100$, $\phi=4$ and $\lambda=10$. Again, we present only the results related to $\phi$. It can be seen from the obtained results that both estimators returned very close estimates.

Figure 2: Bias and RMSE for $\phi$ for sample sizes of $10,15,\ldots,100$ elements, considering $\phi=4$ and $\lambda=10$.

Note that the approach given above can be used for other particular cases. For instance, the Wilson-Hilferty distribution is obtained when $\alpha=3$; hence, we can obtain closed-form estimators for the cited distribution as well. It is essential to mention that, in the above examples, we do not claim that the GG distribution is the unique distribution that can be used to obtain closed-form estimators for the Gamma and Nakagami distributions. Different choices for $g$ may lead to different closed-form estimators.

We now apply the proposed approach to a generalized version of the beta distribution, which returns closed-form estimators for both parameters.

Example 3: Let us assume that the chosen beta distribution has the PDF given by

f(x\,;\,\alpha,\beta)=\frac{x^{\alpha-1}(1-x)^{\beta-1}}{\operatorname{B}(\alpha,\beta)},\quad 0<x<1, (21)

where $\operatorname{B}(\alpha,\beta)=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$ is the beta function, $\alpha>0$ and $\beta>0$.

We can apply the generalized maximum likelihood approach to this distribution by considering the function $g(x\,;\,\alpha,\beta,a,c)$ representing the generalized beta distribution, where $\alpha>2$ and $\beta>2$, given by:

g(x\,;\,\alpha,\beta,a,c)=\frac{\left(x-a\right)^{\alpha-1}\left(c-x\right)^{\beta-1}}{(c-a)^{\alpha+\beta-1}\operatorname{B}(\alpha,\beta)}\mbox{ for all }x\in(0,1)\mbox{ and }0\leq a<x<c\leq 1.

Once again, in order to formulate the generalized maximum likelihood equations for $\boldsymbol{\theta}=(\alpha,\beta)$ over the coordinates $(a,c)$ at $(a,c)=(0,1)$, we note that

\operatorname{E}_{\alpha,\beta}\left[\frac{1}{X_{1}}\right]=\frac{\alpha+\beta-1}{\alpha-1}\mbox{ and }\operatorname{E}_{\alpha,\beta}\left[\frac{1}{1-X_{1}}\right]=\frac{\alpha+\beta-1}{\beta-1}, (22)

from which it follows that

\operatorname{E}_{\alpha,\beta}\left[\frac{\partial}{\partial a}\log g(X_{1}\,;\,\alpha,\beta,0,1)\right]=-(\alpha-1)\operatorname{E}_{\alpha,\beta}\left[\frac{1}{X_{1}}\right]+(\alpha+\beta-1)\operatorname{E}_{\alpha,\beta}\left[1\right]=0\mbox{ and}
\operatorname{E}_{\alpha,\beta}\left[\frac{\partial}{\partial c}\log g(X_{1}\,;\,\alpha,\beta,0,1)\right]=(\beta-1)\operatorname{E}_{\alpha,\beta}\left[\frac{1}{1-X_{1}}\right]-(\alpha+\beta-1)\operatorname{E}_{\alpha,\beta}\left[1\right]=0,

that is, (9) is satisfied. Thus the generalized likelihood equations for $\boldsymbol{\theta}=(\alpha,\beta)$ over the coordinates $(a,c)$ at $(a,c)=(0,1)$ are given by

\sum_{i=1}^{n}\frac{\partial}{\partial a}\log g(X_{i}\,;\,\alpha,\beta,0,1)=-(\alpha-1)\sum_{i=1}^{n}{\frac{1}{X_{i}}}\,+n(\alpha+\beta-1)=0,\mbox{ and }
\sum_{i=1}^{n}\frac{\partial}{\partial c}\log g(X_{i}\,;\,\alpha,\beta,0,1)=(\beta-1)\sum_{i=1}^{n}{\frac{1}{1-X_{i}}}\,-n(\alpha+\beta-1)=0.

Note that, from the harmonic mean-arithmetic mean inequality, as long as the equality $X_{1}=\ldots=X_{n}$ does not hold, we have $\sum_{i=1}^{n}\frac{1-X_{i}}{X_{i}}-\frac{n^{2}}{\sum_{i=1}^{n}\frac{X_{i}}{1-X_{i}}}>0$ and $\sum_{i=1}^{n}\frac{X_{i}}{1-X_{i}}-\frac{n^{2}}{\sum_{i=1}^{n}\frac{1-X_{i}}{X_{i}}}>0$, in which case, after some algebraic manipulations, it is seen that the only solution of the above system of linear equations is given by

\hat{\alpha}_{n}=\left(\sum_{i=1}^{n}\frac{1}{X_{i}}\right)\left(\sum_{i=1}^{n}\frac{1-X_{i}}{X_{i}}-\frac{n^{2}}{\sum_{i=1}^{n}\frac{X_{i}}{1-X_{i}}}\right)^{-1}, (23)
\hat{\beta}_{n}=\left(\sum_{i=1}^{n}\frac{1}{1-X_{i}}\right)\left(\sum_{i=1}^{n}\frac{X_{i}}{1-X_{i}}-\frac{n^{2}}{\sum_{i=1}^{n}\frac{1-X_{i}}{X_{i}}}\right)^{-1}. (24)
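
A minimal NumPy sketch evaluating (23) and (24) directly (the function name and structure are ours):

import numpy as np

def beta_generalized_mle(x):
    # Closed-form estimators (23)-(24); assumes 0 < x_i < 1 for every observation.
    x = np.asarray(x, dtype=float)
    n = x.size
    s1 = np.sum((1.0 - x) / x)        # sum of (1 - X_i)/X_i
    s2 = np.sum(x / (1.0 - x))        # sum of X_i/(1 - X_i)
    alpha_hat = np.sum(1.0 / x) / (s1 - n**2 / s2)
    beta_hat = np.sum(1.0 / (1.0 - x)) / (s2 - n**2 / s1)
    return alpha_hat, beta_hat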

In the following, we apply Theorem 2.2 to prove that these estimators are consistent and asymptotically normal.

Proposition 3.3.

$\hat{\alpha}_{n}$ and $\hat{\beta}_{n}$ are strongly consistent estimators for the true parameters $\alpha$ and $\beta$, and asymptotically normal with $\sqrt{n}\left(\hat{\alpha}_{n}-\alpha\right)\overset{D}{\to}N\left(0,Q(\alpha,\beta)\right)$ and $\sqrt{n}\left(\hat{\beta}_{n}-\beta\right)\overset{D}{\to}N\left(0,Q(\beta,\alpha)\right)$, where

Q(y,z)=\frac{y(y-1)^{2}(4yz^{2}-6z^{2}-10yz+5y+16z-10)}{(y-2)(z-2)(y+z-1)}\mbox{ for all }y>2\mbox{ and }z>2.
Proof.

In order to apply Theorem 2.2 we note that $g(x\,;\,\alpha,\beta,a,c)$ can be written as

g(x\,;\,\alpha,\beta,a,c)=V(x)\exp\left[\eta_{1}(\alpha,\beta)\log(x-a)+\eta_{2}(\alpha,\beta)\log(c-x)+L(\alpha,\beta,a,c)\right]

for all $x\in(0,1)$, $\alpha>2$, $\beta>2$ and $(a,c)\in\mathcal{A}_{x}$, with $\mathcal{A}_{x}$ representing the restriction $0\leq a<x<c\leq 1$, where

V(x)=1,\ \eta_{1}(\alpha,\beta)=\alpha-1,\ \eta_{2}(\alpha,\beta)=\beta-1\mbox{ and }
L(\alpha,\beta,a,c)=-(\alpha+\beta-1)\log(c-a)-\log(\operatorname{B}(\alpha,\beta)).

In order to check condition (A) of Theorem 2.2, note that for $a=0$ and $c=1$, following the computation of the Fisher information matrix for $g$ given in [4], we have

J(\alpha,\beta)=\begin{bmatrix}\frac{\beta}{(\alpha-1)}&-1\\ 1&-\frac{\alpha}{(\beta-1)}\end{bmatrix}\mbox{ and }K(\alpha,\beta)=\begin{bmatrix}\frac{\beta(\alpha+\beta-1)}{(\alpha-2)}&\alpha+\beta-1\\ \alpha+\beta-1&\frac{\alpha(\alpha+\beta-1)}{(\beta-2)}\end{bmatrix}. (25)

Thus, since $\alpha+\beta-1>0$, it is easy to see that $J(\alpha,\beta)$ is invertible with

(J(\alpha,\beta))^{-1}=\begin{bmatrix}\frac{\alpha(\alpha-1)}{\alpha+\beta-1}&-\frac{(\alpha-1)(\beta-1)}{\alpha+\beta-1}\\ \frac{(\alpha-1)(\beta-1)}{\alpha+\beta-1}&-\frac{\beta(\alpha-1)}{\alpha+\beta-1}\end{bmatrix}.

Therefore we conclude condition (A) is satisfied, and after some algebraic computations one may find that

(J(\alpha,\beta)^{-1})^{T}K(\alpha,\beta)J(\alpha,\beta)^{-1}=\begin{bmatrix}Q(\alpha,\beta)&Q_{1}(\alpha,\beta)\\ Q_{1}(\alpha,\beta)&Q(\beta,\alpha)\end{bmatrix}, (26)

where $Q(y,z)$ is as in the proposition and $Q_{1}(y,z)$ is a rational function in $y$ and $z$.

Once again, item (B) is straightforward to check from (22). Thus, conditions (A) and (B) of Theorem 2.2 are valid, and therefore, following the same arguments as in the proof of Proposition 3.1, the proposition follows from the conclusion of Theorem 2.2 combined with (26). ∎

Figure 3 provides the bias and RMSE obtained from 100,000 replications assuming $n=10,15,\ldots,100$, $\alpha=3$ and $\beta=2.5$. Here we considered the proposed estimators and compared them with the standard MLEs, which do not have closed-form expressions.

Figure 3: Bias and RMSE for $\alpha$ and $\beta$ for sample sizes of $10,15,\ldots,100$ elements, considering $\alpha=3$ and $\beta=2.5$.

Unlike the Gamma and Nakagami cases, we observed that the closed-form estimators have an additional bias. Although they are obtained from a different distribution, for many parameter values they returned similar results. A major drawback of the estimators (23) and (24) is that the properties that ensure consistency and asymptotic normality do not hold when the values of $\alpha$ and $\beta$ are smaller than $2$.

Example 4: Let us consider that $(X_{1},Y_{1})$, $(X_{2},Y_{2})$, $\ldots$, $(X_{n},Y_{n})$ are iid random vectors following a bivariate gamma distribution with probability density function (PDF) given by:

f(x,y\,;\,\beta,\alpha_{1},\alpha_{2})=\frac{1}{\beta^{\alpha^{*}_{2}}\Gamma(\alpha_{1})\Gamma(\alpha_{2})}x^{\alpha_{1}-1}(y-x)^{\alpha_{2}-1}e^{-\frac{y}{\beta}},\mbox{ where }0<x<y<\infty, (27)

where $\alpha_{1}$, $\alpha_{2}$ and $\beta$ are positive, $\alpha_{2}\neq 1$ and $\alpha_{2}^{*}=\alpha_{1}+\alpha_{2}$.

We can apply the generalized maximum likelihood approach to this distribution by considering the density function $g(x,y\,;\,\beta,\alpha_{1},\alpha_{2},\gamma_{1},\gamma_{2})$ representing a generalized version of this bivariate gamma distribution, given by

g(x,y\,;\,\beta,\alpha_{1},\alpha_{2},\gamma_{1},\gamma_{2})=\frac{\gamma_{1}\gamma_{2}}{\beta^{\alpha^{*}_{2}}\Gamma(\alpha_{1})\Gamma(\alpha_{2})}x^{\alpha_{1}\gamma_{1}-1}\left(y^{\gamma_{2}}-x^{\gamma_{1}}\right)^{\alpha_{2}-1}e^{-\frac{y^{\gamma_{2}}}{\beta}}y^{\gamma_{2}-1},

where $\beta$, $\alpha_{1}$, $\alpha_{2}$ are positive, $0<x<y$, and $(\gamma_{1},\gamma_{2})\in\mathcal{A}_{x,y}$, where $\mathcal{A}_{x,y}\subset(0,\infty)^{2}$ represents the restriction $x^{\gamma_{1}}<y^{\gamma_{2}}$.

In order to formulate the generalized maximum likelihood equations for this distribution at $(\gamma_{1},\gamma_{2})=(1,1)$, let $\bar{Z}_{1}$, $\bar{Z}_{2}$, $\bar{Z}_{3}$, $\bar{Z}_{4}$, $\bar{Z}_{5}$ and $\bar{Z}_{6}$ be the sample means of $Y$, $\log X$, $\log Y$, $Y\log Y$, $\frac{X\log X}{Y-X}$ and $\frac{Y\log Y}{Y-X}$, respectively, where $Y=(Y_{1},\cdots,Y_{n})$ and $X=(X_{1},\cdots,X_{n})$, and define $\bar{z}_{i}$ analogously from the observed data for $1\leq i\leq 6$. From [21] we have

\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{1}\right]=\alpha_{2}^{*}\beta,\ \operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{2}\right]=\psi(\alpha_{1})+\log\beta,\ \operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{3}\right]=\psi(\alpha_{2}^{*})+\log\beta, (28)
\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{4}\right]=\alpha_{2}^{*}\beta\left[\psi(\alpha_{2}^{*})+\log\beta+\frac{1}{\alpha_{2}^{*}}\right],
\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{5}\right]=\frac{\alpha_{1}}{\alpha_{2}-1}\left[\psi(\alpha_{1})+\log\beta+\frac{1}{\alpha_{1}}\right],\mbox{ and}
\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{6}\right]=\frac{\alpha_{2}^{*}-1}{\alpha_{2}-1}\left[\psi(\alpha_{2}^{*})+\log\beta\right],

from which it follows that

\operatorname{E}_{\boldsymbol{\theta}}\left[\frac{\partial}{\partial\beta}\log g(X_{1},Y_{1}\,;\,\boldsymbol{\theta},1,1)\right]=-\frac{(\alpha_{1}+\alpha_{2})}{\beta}+\frac{1}{\beta^{2}}\operatorname{E}_{\boldsymbol{\theta}}[\bar{Z}_{1}]=0,
\operatorname{E}_{\boldsymbol{\theta}}\left[\frac{\partial}{\partial\gamma_{1}}\log g(X_{1},Y_{1}\,;\,\boldsymbol{\theta},1,1)\right]=\alpha_{1}\operatorname{E}_{\boldsymbol{\theta}}[\bar{Z}_{2}]-(\alpha_{2}-1)\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{5}\right]+1=0,
\operatorname{E}_{\boldsymbol{\theta}}\left[\frac{\partial}{\partial\gamma_{2}}\log g(X_{1},Y_{1}\,;\,\boldsymbol{\theta},1,1)\right]=(\alpha_{2}-1)\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{6}\right]-\frac{1}{\beta}\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{4}\right]+1+\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{3}\right]=0,

that is, (9) is satisfied. Thus, the generalized likelihood equations for $\boldsymbol{\theta}=(\beta,\alpha_{1},\alpha_{2})$ over the coordinates $(\beta,\gamma_{1},\gamma_{2})$ at $(\gamma_{1},\gamma_{2})=(1,1)$ are given by

\frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\beta}\log g(x_{i},y_{i}\,;\,\boldsymbol{\theta},1,1)=-\frac{\alpha_{1}+\alpha_{2}}{\beta}+\frac{1}{\beta^{2}}\bar{z}_{1}=0,
\frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\gamma_{1}}\log g(x_{i},y_{i}\,;\,\boldsymbol{\theta},1,1)=\alpha_{1}\bar{z}_{2}-(\alpha_{2}-1)\bar{z}_{5}+1=0,
\frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\gamma_{2}}\log g(x_{i},y_{i}\,;\,\boldsymbol{\theta},1,1)=(\alpha_{2}-1)\bar{z}_{6}-\frac{1}{\beta}\bar{z}_{4}+1+\bar{z}_{3}=0.

Multiplying the first equation above by $\beta$, we obtain a linear system of equations in $\alpha_{1}$, $\alpha_{2}$ and $\frac{1}{\beta}$, from which, using Cramer's rule, it follows that it has the unique solution

\hat{\alpha}_{2}=-\frac{B(\boldsymbol{\bar{z}})}{A(\boldsymbol{\bar{z}})},\ \hat{\alpha}_{1}=\frac{1}{\bar{z}_{2}}\left[(\hat{\alpha}_{2}-1)\bar{z}_{5}-1\right],\ \hat{\beta}=\frac{\bar{z}_{1}}{(\hat{\alpha}_{1}+\hat{\alpha}_{2})},

as long as $A(\boldsymbol{\bar{z}})\neq 0$, where

A(\boldsymbol{\bar{z}})=\bar{z}_{6}\bar{z}_{1}\bar{z}_{2}-\bar{z}_{5}\bar{z}_{4}-\bar{z}_{2}\bar{z}_{4}\mbox{ and }B(\boldsymbol{\bar{z}})=-(\bar{z}_{5}+1)\bar{z}_{4}+(1+\bar{z}_{3}+\bar{z}_{6})\bar{z}_{1}\bar{z}_{2}

for $\boldsymbol{\bar{z}}=\left(\bar{z}_{1},\cdots,\bar{z}_{6}\right)$.
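
A minimal NumPy sketch that evaluates these closed-form estimators exactly as stated above, with the sample means $\bar{z}_{1},\cdots,\bar{z}_{6}$ computed from the paired data (the function name is ours):

import numpy as np

def bivariate_gamma_generalized_mle(x, y):
    # Sample means z1,...,z6 of Y, log X, log Y, Y log Y, X log X/(Y-X), Y log Y/(Y-X).
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    z1 = np.mean(y)
    z2 = np.mean(np.log(x))
    z3 = np.mean(np.log(y))
    z4 = np.mean(y * np.log(y))
    z5 = np.mean(x * np.log(x) / (y - x))
    z6 = np.mean(y * np.log(y) / (y - x))
    # A and B as defined in the text; the estimators follow the stated formulas.
    A = z6 * z1 * z2 - z5 * z4 - z2 * z4
    B = -(z5 + 1.0) * z4 + (1.0 + z3 + z6) * z1 * z2
    alpha2_hat = -B / A
    alpha1_hat = ((alpha2_hat - 1.0) * z5 - 1.0) / z2
    beta_hat = z1 / (alpha1_hat + alpha2_hat)
    return alpha1_hat, alpha2_hat, beta_hat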

We now apply Theorem 2.2 to prove that the obtained estimators are strongly consistent and asymptotically normal.

Proposition 3.4.

$\hat{\alpha}_{1}$, $\hat{\alpha}_{2}$ and $\hat{\beta}$ are strongly consistent and asymptotically normal estimators for the true parameters, as long as $\alpha_{2}\neq 1$ and

(\alpha_{2}^{*}-1)\psi(\alpha_{1})+\alpha_{2}^{*}\psi(\alpha_{2}^{*})+(2\alpha_{2}^{*}-1)\log(\beta)+1\neq 0.
Proof.

In order to apply Theorem 2.2 we note that $g(x,y\,;\,\beta,\alpha_{1},\alpha_{2},\gamma_{1},\gamma_{2})$ can be rewritten as

g(x,y\,;\,\beta,\alpha_{1},\alpha_{2},\gamma_{1},\gamma_{2})=\frac{\gamma_{1}\gamma_{2}}{\beta^{\alpha^{*}_{2}}\Gamma(\alpha_{1})\Gamma(\alpha_{2})}x^{\alpha_{1}\gamma_{1}-1}\left(y^{\gamma_{2}}-x^{\gamma_{1}}\right)^{\alpha_{2}-1}e^{-\frac{y^{\gamma_{2}}}{\beta}}y^{\gamma_{2}-1}=
V(x)\exp\left(\eta_{1}(\boldsymbol{\theta})T_{1}(x,\gamma_{1},\gamma_{2})+\eta_{2}(\boldsymbol{\theta})T_{2}(x,\gamma_{1},\gamma_{2})+\eta_{3}(\boldsymbol{\theta})T_{3}(x,\gamma_{1},\gamma_{2})+L(\boldsymbol{\theta},\gamma_{1},\gamma_{2})\right)

for all $0<x<y$, positive $\alpha_{1}$, $\alpha_{2}$, $\beta$, and $(\gamma_{1},\gamma_{2})\in\mathcal{A}_{x,y}$, where $\eta_{i}(\boldsymbol{\theta})$ and $T_{i}(x,\gamma_{1},\gamma_{2})$ satisfy the conditions of Theorem 2.2.

To check condition (A) of Theorem 2.2, note that, for $(\gamma_{1},\gamma_{2})=(1,1)$ and using the relations (28), we have

J(\alpha_{1},\alpha_{2},\beta)=\begin{bmatrix}\frac{1}{\beta}&\frac{1}{\beta}&\frac{1}{\beta^{3}}\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{1}\right]\\ -\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{2}\right]&\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{5}\right]&0\\ 0&-\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{6}\right]&-\frac{1}{\beta^{2}}\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{4}\right]\end{bmatrix}. (29)

Thus it follows that

\operatorname{det}J(\alpha_{1},\alpha_{2},\beta)=\frac{1}{\beta^{3}}\left(\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{6}\right]\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{1}\right]\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{2}\right]-\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{5}\right]\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{4}\right]-\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{2}\right]\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{4}\right]\right)
=\frac{1}{\beta^{2}}\alpha_{2}^{*}(\psi(\alpha_{1})+\log\beta)\left(\frac{\alpha_{2}^{*}-1}{\alpha_{2}-1}\right)(\psi(\alpha_{2}^{*})+\log\beta)
-\frac{1}{\beta^{2}}\left(\frac{\alpha_{2}^{*}-1}{\alpha_{2}-1}(\psi(\alpha_{1})+\log\beta)+\frac{1}{\alpha_{2}-1}\right)\alpha_{2}^{*}\left(\psi(\alpha_{2}^{*})+\log\beta+\frac{1}{\alpha_{2}^{*}}\right)
=-\frac{1}{\beta^{2}}\left(\frac{\alpha_{2}^{*}-1}{\alpha_{2}-1}(\psi(\alpha_{1})+\log\beta)+\frac{\alpha_{2}^{*}}{\alpha_{2}-1}\left(\psi(\alpha_{2}^{*})+\log\beta\right)+\frac{1}{\alpha_{2}-1}\right)
=-\frac{1}{\beta^{2}(\alpha_{2}-1)}\left((\alpha_{2}^{*}-1)\psi(\alpha_{1})+\alpha_{2}^{*}\psi(\alpha_{2}^{*})+(2\alpha_{2}^{*}-1)\log(\beta)+1\right)\neq 0,

that is, condition (A) is verified.

Item (B) of Theorem 2.2 is straightforward to check from the relations (28). Thus conditions (A) and (B) of Theorem 2.2 are valid and therefore, from Theorem 2.2, we conclude that there exists \boldsymbol{\hat{\theta}}_{n}(\boldsymbol{x})=(\hat{\theta}_{1n}(\boldsymbol{x}),\hat{\theta}_{2n}(\boldsymbol{x}),\hat{\theta}_{3n}(\boldsymbol{x})), measurable in \boldsymbol{x}\in\mathcal{X}^{n}, satisfying items I) to III) of Theorem 2.2.

Now, from the strong law of large numbers, as n\to\infty we have

\left(\bar{Z}_{1},\bar{Z}_{2},\cdots,\bar{Z}_{6}\right)\overset{a.s.}{\to}\left(\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{1}\right],\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{2}\right],\cdots,\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{6}\right]\right)

and thus it follows from the continuous mapping theorem that

A(\boldsymbol{\bar{Z}})\overset{a.s.}{\to}\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{6}\right]\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{1}\right]\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{2}\right]-\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{5}\right]\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{4}\right]-\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{2}\right]\operatorname{E}_{\boldsymbol{\theta}}\left[\bar{Z}_{4}\right]
=-\frac{\beta}{(\alpha_{2}-1)}\left((\alpha_{2}^{*}-1)\psi(\alpha_{1})+\alpha_{2}^{*}\psi(\alpha_{2}^{*})+(2\alpha_{2}^{*}-1)\log(\beta)+1\right)\neq 0.

In particular, by the alternative characterization of strong convergence, with probability converging to one we have A(\boldsymbol{\bar{Z}})\neq 0, in which case the modified likelihood equations have (\hat{\alpha}_{1},\hat{\alpha}_{2},\hat{\beta}) as their unique solution. This fact, combined with item I) of Theorem 2.2, implies that \boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X})-(\hat{\alpha}_{1},\hat{\alpha}_{2},\hat{\beta})\overset{a.s.}{\to}0. Thus the proposition follows, once again, from items II) and III) of Theorem 2.2. ∎
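As a practical aside, the non-degeneracy condition of Proposition 3.4 is straightforward to check numerically for given parameter values. The sketch below evaluates its left-hand side with the digamma function; \alpha_{2}^{*} is treated as a separate input, since its definition is given earlier in the text, and the parameter values shown are illustrative only.

```python
import numpy as np
from scipy.special import digamma

def condition_lhs(alpha1, alpha2_star, beta):
    """Left-hand side of the non-degeneracy condition in Proposition 3.4;
    the proposition requires this quantity to be nonzero (and alpha2 != 1)."""
    return ((alpha2_star - 1.0) * digamma(alpha1)
            + alpha2_star * digamma(alpha2_star)
            + (2.0 * alpha2_star - 1.0) * np.log(beta)
            + 1.0)

# Illustrative parameter values only.
print(condition_lhs(alpha1=2.0, alpha2_star=3.5, beta=1.5))
```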

4 Final Remarks

We have shown that the proposed generalized version of the maximum likelihood estimator provides a valuable alternative for obtaining closed-form expressions when the standard MLE approach fails. The proposed approach can also be used with discrete distributions: the results remain valid, and the obtained estimators are still strongly consistent, invariant, and asymptotically normally distributed. Due to the flexibility of the likelihood function, additional complexity, such as censoring, long-term survival, covariates, and random effects, can be incorporated into the distribution and the inferential procedure.

The method introduced in this study benefits particularly from the use of generalized versions of the baseline distribution. This not only gives renewed impetus to the many generalized distributions proposed over recent decades but also underscores their practical relevance. Moreover, since different generalized distributions lead to different estimators, comparing the resulting estimators is natural; such comparisons help identify the most effective estimator under a given performance metric. On a different note, our findings show that the generalized form need not itself be a distribution, which broadens the investigation beyond generalized density functions and allows a more expansive exploration of potential solutions.

As shown in Examples 1 and 2, the estimators' behavior in terms of bias and RMSE is similar to that of the MLE for the Gamma and Nakagami distributions. Therefore, corrective bias approaches can also be used to remove the bias of the generalized estimators. For the Beta distribution, the comparison revealed different behavior for the proposed estimators: we observed that, for some small parameter values, the results may not be consistent. This example illustrates what happens when, for certain parameter values, the Fisher information of the generalized distribution has singularity problems. Finally, we discussed an approach to obtain closed-form estimators for a bivariate model, which provides insights that can be applied to other multivariate models.

This observation lays the groundwork for further exploration, especially in the realm of real-time statistical estimation. It underscores the need for new estimators for distributions with intricate parameter spaces, tailoring them for rapid computation. This aspect is particularly vital for integration with machine learning methodologies, such as tree-based algorithms, where swift and efficient computational techniques are essential. Our study adds a new dimension to the ongoing discourse in statistical estimation, pivoting towards solutions that are not only theoretically sound but also practically viable in dealing with complex data sets. In an era where data complexity and volume are escalating, our approach heralds a promising direction for developing more agile and adaptable statistical tools, crucial for real-time analysis and decision-making in dynamic environments.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Acknowledgements

Eduardo Ramos acknowledges financial support from São Paulo State Research Foundation (FAPESP Proc. 2019/27636-9). Francisco Rodrigues acknowledges financial support from CNPq (grant number 309266/2019-0). Francisco Louzada is supported by the Brazilian agencies CNPq (grant number 301976/2017-1) and FAPESP (grant number 2013/07375-0).

Appendix

In order to prove Theorem 2.2 we shall need the technical lemma that follows.

In the following, given \boldsymbol{x}\in\mathbb{R}^{m} we let \left\|\boldsymbol{x}\right\|_{2}=\sqrt{\sum_{i=1}^{m}x_{i}^{2}}, and given a matrix A\in M_{m}(\mathbb{R}) we let \left\|A\right\|_{2} denote the usual spectral norm, defined by \left\|A\right\|_{2}=\sup_{\left\|x\right\|_{2}=1}\left\|Ax^{T}\right\|_{2}. Moreover, given a differentiable function F:\Theta\to\mathbb{R}^{m}, for \Theta\subset\mathbb{R}^{m} open, we denote \frac{\partial}{\partial\theta_{j}}F(\boldsymbol{\theta})=\left(\frac{\partial}{\partial\theta_{j}}F_{1}(\boldsymbol{\theta}),\cdots,\frac{\partial}{\partial\theta_{j}}F_{m}(\boldsymbol{\theta})\right), and we denote by \frac{\partial}{\partial\boldsymbol{\theta}}F(\boldsymbol{\theta}) the Jacobian of F at \boldsymbol{\theta}\in\Theta, that is, \frac{\partial}{\partial\boldsymbol{\theta}}F(\boldsymbol{\theta})=\left(\frac{\partial}{\partial\theta_{j}}F_{i}(\boldsymbol{\theta})\right)\in M_{m}(\mathbb{R}) for all \boldsymbol{\theta}\in\Theta.

Lemma .1.

Let \Theta\subset\mathbb{R}^{m} be open, let \boldsymbol{\theta}_{0}\in\Theta, let F:\Theta\to\mathbb{R}^{m} be C^{1}, let J\in M_{m}(\mathbb{R}) be invertible, denote \lambda=\frac{1}{2}\left\|J^{-1}\right\|_{2}^{-1} and r=\frac{1}{\lambda}\left\|F(\boldsymbol{\theta}_{0})\right\|_{2}, and suppose that:

\bar{B}(\boldsymbol{\theta}_{0},r)\subset\Theta\mbox{ and }\left\|\frac{\partial}{\partial\boldsymbol{\theta}}F(\boldsymbol{\theta})-J\right\|_{2}\leq\lambda\mbox{ for all }\boldsymbol{\theta}\in\bar{B}(\boldsymbol{\theta}_{0},r).

Then there exists \boldsymbol{\theta}^{*}\in\bar{B}(\boldsymbol{\theta}_{0},r) such that F(\boldsymbol{\theta}^{*})=0.

Proof.

The proof follows from a simple application of the Brouwer fixed point theorem.

Letting L:\bar{B}(\boldsymbol{\theta}_{0},r)\to\mathbb{R}^{m} be defined by L(\boldsymbol{\theta})=\boldsymbol{\theta}-J^{-1}F(\boldsymbol{\theta}) for all \boldsymbol{\theta}\in\bar{B}(\boldsymbol{\theta}_{0},r), we shall prove that L(\bar{B}(\boldsymbol{\theta}_{0},r))\subset\bar{B}(\boldsymbol{\theta}_{0},r). Indeed, from the chain rule it follows that L is differentiable in \bar{B}(\boldsymbol{\theta}_{0},r) with

L^{\prime}(\boldsymbol{\theta})=I-J^{-1}\frac{\partial}{\partial\boldsymbol{\theta}}F(\boldsymbol{\theta})=J^{-1}\left(J-\frac{\partial}{\partial\boldsymbol{\theta}}F(\boldsymbol{\theta})\right)\mbox{ for all }\boldsymbol{\theta}\in\bar{B}(\boldsymbol{\theta}_{0},r).

Thus, for all \boldsymbol{\theta}\in\bar{B}(\boldsymbol{\theta}_{0},r) we have

\left\|\frac{\partial}{\partial\boldsymbol{\theta}}L(\boldsymbol{\theta})\right\|_{2}=\left\|J^{-1}\left(J-\frac{\partial}{\partial\boldsymbol{\theta}}F(\boldsymbol{\theta})\right)\right\|_{2}\leq\left\|J^{-1}\right\|_{2}\left\|J-\frac{\partial}{\partial\boldsymbol{\theta}}F(\boldsymbol{\theta})\right\|_{2}\leq\frac{1}{2},

and thus from the mean value inequality we have

\left\|L(\boldsymbol{\theta})-L(\boldsymbol{\theta}_{0})\right\|_{2}\leq\frac{1}{2}\left\|\boldsymbol{\theta}-\boldsymbol{\theta}_{0}\right\|_{2}\leq\frac{r}{2}\mbox{ for all }\boldsymbol{\theta}\in\bar{B}(\boldsymbol{\theta}_{0},r). (30)

Moreover, note that

\left\|L(\boldsymbol{\theta}_{0})-\boldsymbol{\theta}_{0}\right\|_{2}=\left\|J^{-1}F(\boldsymbol{\theta}_{0})\right\|_{2}\leq\left\|J^{-1}\right\|_{2}\left\|F(\boldsymbol{\theta}_{0})\right\|_{2}=\frac{1}{2\lambda}\left\|F(\boldsymbol{\theta}_{0})\right\|_{2}=\frac{r}{2}. (31)

Thus, given \boldsymbol{\theta}\in\bar{B}(\boldsymbol{\theta}_{0},r), from inequalities (30) and (31) and the triangle inequality we have

\left\|L(\boldsymbol{\theta})-\boldsymbol{\theta}_{0}\right\|_{2}\leq\left\|L(\boldsymbol{\theta})-L(\boldsymbol{\theta}_{0})\right\|_{2}+\left\|L(\boldsymbol{\theta}_{0})-\boldsymbol{\theta}_{0}\right\|_{2}\leq\frac{r}{2}+\frac{r}{2}=r,

that is, L(\boldsymbol{\theta})\in\bar{B}(\boldsymbol{\theta}_{0},r) for all \boldsymbol{\theta}\in\bar{B}(\boldsymbol{\theta}_{0},r), which proves that L(\bar{B}(\boldsymbol{\theta}_{0},r))\subset\bar{B}(\boldsymbol{\theta}_{0},r). Thus, since L:\bar{B}(\boldsymbol{\theta}_{0},r)\to\bar{B}(\boldsymbol{\theta}_{0},r) is continuous, from the Brouwer fixed point theorem we conclude that L has at least one fixed point \boldsymbol{\theta}^{*} in \bar{B}(\boldsymbol{\theta}_{0},r), and thus

L(\boldsymbol{\theta}^{*})=\boldsymbol{\theta}^{*}\Rightarrow J^{-1}F(\boldsymbol{\theta}^{*})=0\Rightarrow F(\boldsymbol{\theta}^{*})=0,

which concludes the proof. ∎
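To build intuition for how the lemma is used, the sketch below runs the map L(\boldsymbol{\theta})=\boldsymbol{\theta}-J^{-1}F(\boldsymbol{\theta}) on a toy two-dimensional system chosen so that the hypotheses of Lemma .1 hold with J equal to the identity (the Jacobian of F stays within distance 0.1 of the identity everywhere). The system is an illustrative assumption, not taken from the paper.

```python
import numpy as np

# Toy system F(theta) = 0 whose Jacobian is everywhere within 0.1 of the identity,
# so the hypotheses of Lemma .1 hold with J = I and lambda = 1/2.
def F(theta):
    x, y = theta
    return np.array([x + 0.1 * np.sin(y) - 1.0, y + 0.1 * np.sin(x) - 2.0])

theta0 = np.array([1.0, 2.0])
J = np.eye(2)                              # fixed matrix approximating the Jacobian
Jinv = np.linalg.inv(J)
lam = 0.5 / np.linalg.norm(Jinv, 2)        # lambda = (1/2) ||J^{-1}||_2^{-1}
r = np.linalg.norm(F(theta0), 2) / lam     # radius of the ball in Lemma .1

# Iterating L(theta) = theta - J^{-1} F(theta) locates a fixed point of L,
# i.e. a zero of F; the lemma guarantees one exists inside the closed ball.
theta = theta0.copy()
for _ in range(100):
    theta = theta - Jinv @ F(theta)

print("r =", r)
print("zero found:", theta)
print("inside the ball:", np.linalg.norm(theta - theta0, 2) <= r)
print("F at the zero:", F(theta))
```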

Additionally we shall need the following lemma, regarding elementary properties of the spectral norm:

Lemma .2.

Given A\in M_{n}(\mathbb{R}), the following items hold:

  • i)

    \left\|A\right\|_{2}\leq\sum_{i=1}^{n}\left\|b_{i}\right\|_{2}, where b_{i}=(a_{i1},\cdots,a_{in}) for 1\leq i\leq n.

  • ii)

    If B\in M_{n}(\mathbb{R}) is invertible and \left\|A-B\right\|_{2}<\left\|B^{-1}\right\|_{2}^{-1}, then A is invertible as well.

Proof.

To prove item i), applying the Cauchy–Schwarz inequality we have, for all x\in\mathbb{R}^{n}, that

\left\|Ax^{T}\right\|_{2}=\sqrt{\sum_{i=1}^{n}\langle b_{i},x\rangle^{2}}\leq\sqrt{\sum_{i=1}^{n}\left\|b_{i}\right\|_{2}^{2}\left\|x\right\|_{2}^{2}}=\left(\sqrt{\sum_{i=1}^{n}\left\|b_{i}\right\|_{2}^{2}}\right)\left\|x\right\|_{2},

which proves that \left\|A\right\|_{2}\leq\sqrt{\sum_{i=1}^{n}\left\|b_{i}\right\|_{2}^{2}} by the definition of the spectral norm, and thus the result follows directly from the inequality \sqrt{\sum_{i=1}^{n}\left\|b_{i}\right\|_{2}^{2}}\leq\sum_{i=1}^{n}\left\|b_{i}\right\|_{2}.

To prove item ii), note that, under the hypothesis, letting C=B^{-1}A it follows that

\left\|C-I\right\|_{2}=\left\|B^{-1}(A-B)\right\|_{2}\leq\left\|B^{-1}\right\|_{2}\left\|A-B\right\|_{2}<1,

which implies that C is invertible, and thus A=BC must be invertible as well, since it is a product of invertible square matrices. ∎
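Both items of Lemma .2 are easy to sanity-check numerically; the snippet below does so for arbitrary matrices (the matrices and random seed are illustrative only).

```python
import numpy as np

rng = np.random.default_rng(0)

# Item i): the spectral norm is bounded by the sum of the Euclidean norms of the rows.
A = rng.normal(size=(4, 4))
row_norm_sum = np.linalg.norm(A, axis=1).sum()
print(np.linalg.norm(A, 2) <= row_norm_sum)          # expected: True

# Item ii): if ||A - B||_2 < ||B^{-1}||_2^{-1}, then A is invertible.
B = np.diag([2.0, 3.0, 4.0, 5.0])
margin = 1.0 / np.linalg.norm(np.linalg.inv(B), 2)   # equals 2.0 for this B
E = rng.normal(size=(4, 4))
E *= 0.5 * margin / np.linalg.norm(E, 2)             # rescale so that ||E||_2 < margin
A_perturbed = B + E
print(np.linalg.norm(A_perturbed - B, 2) < margin)   # True by construction
print(abs(np.linalg.det(A_perturbed)) > 0.0)         # A_perturbed is invertible
```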

Using the above results we are now ready to prove Theorem 2.1.

Proof.

Existence of solutions:

Letting h_{j} be as in (3), that is,

h_{j}(x\,;\,\boldsymbol{\theta})=\frac{\partial}{\partial\beta_{j}}\log\,g\left(x\,;\,\boldsymbol{\theta},\boldsymbol{\alpha}_{0}\right)-\operatorname{E}_{\boldsymbol{\theta}}\left[\frac{\partial}{\partial\beta_{j}}\log\,g\left(x\,;\,\boldsymbol{\theta},\boldsymbol{\alpha}_{0}\right)\right]

for all x\in\mathcal{X}, where (\beta_{1},\cdots,\beta_{s})=(\theta_{1},\cdots,\theta_{s-r},\alpha_{1},\cdots,\alpha_{r}), and letting F_{n}:\Theta\times\mathcal{X}^{n}\to\mathbb{R}^{s} be defined by F_{n}=\left(F_{n1},\cdots,F_{ns}\right), where

F_{nj}(\boldsymbol{\theta},\boldsymbol{x})=-\frac{1}{n}\sum_{i=1}^{n}h_{j}\left(x_{i}\,;\,\boldsymbol{\theta}\right),

for all \boldsymbol{\theta}\in\Theta, \boldsymbol{x}=(x_{1},\cdots,x_{n})\in\mathcal{X}^{n}, and 1\leq j\leq s. Note, due to the strong law of large numbers and from \operatorname{E}_{\boldsymbol{\theta}_{0}}\left[h_{j}(X_{i}(w)\,;\,\boldsymbol{\theta}_{0})\right]=0 for all i\in\mathbb{N}, that

F_{m}(\boldsymbol{\theta}_{0},\boldsymbol{X}(w))\overset{a.s.}{\to}0

and thus, from the alternative definition of strong convergence it follows that

\lim_{n\to\infty}\operatorname{Pr}\left\{\cup_{m=n}^{\infty}\left\{\left\|F_{m}(\boldsymbol{\theta}_{0},\boldsymbol{X}(w))\right\|_{2}>\epsilon\right\}\right\}=0 (32)

for all \epsilon>0. Now, let J:\Theta\to M_{s}(\mathbb{R}) be defined by J(\boldsymbol{\theta})=\left(J_{i,j}(\boldsymbol{\theta})\right)\in M_{s}(\mathbb{R}), where

J_{i,j}(\boldsymbol{\theta})=\operatorname{E}_{\boldsymbol{\theta}_{0}}\left[-\frac{\partial}{\partial\theta_{j}}h_{i}(X_{1}\,;\,\boldsymbol{\theta})\right].

Condition (C) states that

\left|\frac{\partial}{\partial\theta_{i}}h_{j}(X_{1}\,;\,\boldsymbol{\theta})\right|\leq M_{ij}(X_{1})\mbox{ and }E_{\boldsymbol{\theta}_{0}}\left[M_{ij}(X_{1})\right]<\infty, (33)

for all i and j. In particular, from (33) and the dominated convergence theorem, it follows that J(\boldsymbol{\theta}) is continuous at \boldsymbol{\theta}_{0}. Moreover, denoting J_{i}(\boldsymbol{\theta})=\left(J_{i,1}(\boldsymbol{\theta}),\cdots,J_{i,s}(\boldsymbol{\theta})\right)\in\mathbb{R}^{s} for all i, from (33) and the uniform strong law of large numbers it follows that

\sup_{\boldsymbol{\theta}\in\overline{\Theta}_{0}}\left\|\frac{\partial}{\partial\theta_{i}}F_{n}(\boldsymbol{\theta},\boldsymbol{X}(w))-J_{i}(\boldsymbol{\theta})\right\|_{2}\overset{a.s.}{\to}0

for all i, and thus, once again due to the alternative definition of strong convergence, we have

\lim_{n\to\infty}\operatorname{Pr}\left\{\cup_{m=n}^{\infty}\left\{\sup_{\boldsymbol{\theta}\in\overline{\Theta}_{0}}\left\|\frac{\partial}{\partial\theta_{i}}F_{m}(\boldsymbol{\theta},\boldsymbol{X}(w))-J_{i}(\boldsymbol{\theta})\right\|_{2}>\epsilon\right\}\right\}=0 (34)

for all \epsilon>0 and i. Now, given m>0 such that \bar{B}\left(\boldsymbol{\theta}_{0},\frac{1}{m}\right)\subset\bar{\Theta}_{0} and \frac{1}{m}<\frac{\lambda}{2}, where \lambda=\frac{1}{2}\left\|J(\boldsymbol{\theta}_{0})^{-1}\right\|_{2}^{-1}, combining (32) and (34) it follows that there exist N_{m}>0 and a set \Omega_{m} of probability at least 1-\frac{1}{m} such that

\left\|F_{n}(\boldsymbol{\theta}_{0},\boldsymbol{X}(w))\right\|_{2}<\frac{1}{m}\mbox{ and }\sup_{\boldsymbol{\theta}\in\overline{\Theta}_{0}}\left\|\frac{\partial}{\partial\theta_{i}}F_{n}(\boldsymbol{\theta},\boldsymbol{X}(w))-J_{i}(\boldsymbol{\theta})\right\|_{2}<\frac{1}{sm} (35)

for all n\geq N_{m}, i and w\in\Omega_{m}. Combining the second inequality of (35) with item i) of Lemma .2, it follows that

\sup_{\boldsymbol{\theta}\in\overline{\Theta}_{0}}\left\|\frac{\partial}{\partial\boldsymbol{\theta}}F_{n}(\boldsymbol{\theta},\boldsymbol{X}(w))-J(\boldsymbol{\theta})\right\|_{2}<\frac{1}{m}.

Now, since J is continuous at \boldsymbol{\theta}_{0}, there exists an open set \Theta_{m}\subset\bar{B}(\boldsymbol{\theta}_{0},\frac{1}{m})\subset\bar{\Theta}_{0} containing \boldsymbol{\theta}_{0} such that

\left\|J(\boldsymbol{\theta})-J(\boldsymbol{\theta}_{0})\right\|_{2}<\frac{1}{m}\mbox{ for all }\boldsymbol{\theta}\in\Theta_{m}. (36)

Combining the above inequalities with the triangle inequality we conclude that

\left\|F_{n}(\boldsymbol{\theta}_{0},\boldsymbol{X}(w))\right\|_{2}<\frac{1}{m}\mbox{ and }\sup_{\boldsymbol{\theta}\in\Theta_{m}}\left\|\frac{\partial}{\partial\boldsymbol{\theta}}F_{n}(\boldsymbol{\theta},\boldsymbol{X}(w))-J(\boldsymbol{\theta}_{0})\right\|_{2}<\frac{2}{m}<\lambda (37)

for all n\geq N_{m} and w\in\Omega_{m}. Thus, from Lemma .1 it follows that for each w\in\Omega_{m} there exists \boldsymbol{\bar{\theta}}(w)\in\Theta_{m}\subset\bar{B}\left(\boldsymbol{\theta}_{0},\frac{1}{m}\right) such that

F_{n}(\boldsymbol{\bar{\theta}}(w),\boldsymbol{X}(w))=0\mbox{ for all }w\in\Omega_{m}\mbox{ and }n\geq N_{m}, (38)

which, in particular, proves that the generalized maximum likelihood equations have at least one solution with probability converging to one as n\to\infty.

Construction of a measurable estimator:

We shall now construct the estimator \boldsymbol{\hat{\theta}}_{n}. Note that if (37) and (38) are valid for some N_{m}>0, then they remain valid for any N^{*}_{m}\geq N_{m} as well. Thus, without loss of generality we can suppose N_{1}<N_{2}<N_{3}<\cdots. Now, given n<N_{1} we define

\boldsymbol{\hat{\theta}}_{n}(x)=0\mbox{ for all }x\in\mathcal{X}^{n}.

On the other hand, to define \boldsymbol{\hat{\theta}}_{n}(x) for n\geq N_{1}, let m_{n} be the unique integer for which N_{m_{n}}\leq n<N_{m_{n}+1} is satisfied. Since N_{1}<N_{2}<N_{3}<\cdots, it follows that m_{n} is well defined and m_{n}\to\infty as n\to\infty. Now, note that F_{n}(\boldsymbol{\theta},\boldsymbol{x}) is continuous in \boldsymbol{\theta} for all \boldsymbol{x}\in\mathcal{X}^{n} and measurable in \boldsymbol{x} for all \boldsymbol{\theta}\in\Theta. Thus, F_{n} is a Carathéodory function for all n\geq N_{1}. Therefore, letting \phi:\mathcal{X}^{n}\to\bar{B}\left(\boldsymbol{\theta}_{0},\frac{1}{m_{n}}\right) be the multivalued map defined by

\boldsymbol{\theta}\in\phi(\boldsymbol{x})\mbox{ if and only if }F_{n}(\boldsymbol{\theta},\boldsymbol{x})=0\mbox{ and }\left\|\boldsymbol{\theta}-\boldsymbol{\theta}_{0}\right\|_{2}\leq\frac{1}{m_{n}}, (39)

since F_{n} is Carathéodory and \bar{B}\left(\boldsymbol{\theta}_{0},\frac{1}{m_{n}}\right) is compact, it follows from the theory of measurable maps that \phi is a measurable map (see [6], Corollary 18.8, p. 596). Now construct a second multivalued map \phi^{*}:\mathcal{X}^{n}\to\bar{B}\left(\boldsymbol{\theta}_{0},\frac{1}{m_{n}}\right) defined by

\phi^{*}(x)=\phi(x)\mbox{ if }\phi(x)\neq\emptyset\mbox{ and }\phi^{*}(x)=\{\boldsymbol{\theta}_{0}\}\mbox{ otherwise}.

From the measurability of \phi it is clear that \phi^{*} is measurable as well, and since \phi^{*}(x) is always non-empty, we can apply the measurable selection theorem (see [6], Theorem 18.7, p. 603) to obtain a measurable function \boldsymbol{\hat{\theta}}_{n}(x) satisfying

\boldsymbol{\hat{\theta}}_{n}(x)\in\phi^{*}(x)\mbox{ for all }x\in\mathcal{X}^{n},

which concludes the construction of our estimator \boldsymbol{\hat{\theta}}_{n}(x).

By construction, \boldsymbol{\hat{\theta}}_{n}(x) satisfies F_{n}(\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{x}),\boldsymbol{x})=0 at every point \boldsymbol{x}\in\mathcal{X}^{n} at which the equation F_{n}(\boldsymbol{\theta},\boldsymbol{x})=0 has at least one solution \boldsymbol{\theta} in \bar{B}\left(\boldsymbol{\theta}_{0},\frac{1}{m_{n}}\right). Thus, \boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X}) satisfies F_{n}(\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X}(w)),\boldsymbol{X}(w))=0 at every point w\in\Omega at which F_{n}(\boldsymbol{\theta},\boldsymbol{X}(w))=0 has at least one solution \boldsymbol{\theta} in \bar{B}\left(\boldsymbol{\theta}_{0},\frac{1}{m_{n}}\right), and since n\geq N_{m_{n}}, from what we proved earlier it follows that this happens with probability greater than or equal to 1-\frac{1}{m_{n}}.

Thus, since 1-\frac{1}{m_{n}}\to 1 as n\to\infty, it follows that, with probability converging to one as n\to\infty, \hat{\boldsymbol{\theta}}_{n}(\boldsymbol{X}) satisfies F_{n}\left(\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X}),\boldsymbol{X}(w)\right)=0 for all n\geq N_{m_{n}}, which proves item I).

Now, by construction \boldsymbol{\hat{\theta}}_{n}(\boldsymbol{x})\in\bar{B}\left(\boldsymbol{\theta}_{0},\frac{1}{m_{n}}\right) for all n\geq N_{1} and \boldsymbol{x}\in\mathcal{X}^{n}, and since \frac{1}{m_{n}}\to 0 as n\to\infty, it follows that \boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X}(w))\to\boldsymbol{\theta}_{0} as n\to\infty for all w\in\Omega, which, in particular, proves item II).

Asymptotic normality:

From the mean value theorem, for each fixed 1\leq j\leq s and w\in\Omega there must exist \boldsymbol{y}_{jn}(w)\in\mathbb{R}^{s} contained in the segment connecting \boldsymbol{\theta}_{0} to \boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X}(w)) such that

F_{nj}(\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X}(w)),\boldsymbol{X}(w))=F_{nj}(\boldsymbol{\theta}_{0},\boldsymbol{X}(w))+\sum_{i=1}^{s}\frac{\partial}{\partial\theta_{i}}F_{nj}(\boldsymbol{y}_{jn}(w),\boldsymbol{X}(w))(\hat{\theta}_{ni}(\boldsymbol{X}(w))-\theta_{0i}). (40)

On the other hand, letting

H_{n}(\boldsymbol{y},w)=F_{nj}(\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X}(w)),\boldsymbol{X}(w))-F_{nj}(\boldsymbol{\theta}_{0},\boldsymbol{X}(w))-\sum_{i=1}^{s}\frac{\partial}{\partial\theta_{i}}F_{nj}(\boldsymbol{y},\boldsymbol{X}(w))(\hat{\theta}_{ni}(\boldsymbol{X}(w))-\theta_{0i})

for all \boldsymbol{y}\in\Theta_{0}, it follows from hypothesis (B) that H_{n}(\boldsymbol{y},w) is continuous in \boldsymbol{y} and measurable in w, and thus is a Carathéodory function, from which it follows, once again due to the theory of measurable maps and the measurable selection theorem, that such \boldsymbol{y}_{jn}(w) can be chosen to be measurable in w.

Now, letting A_{n}(w)\in M_{s}(\mathbb{R}) be defined by A_{n}(w)=(a_{ij,n}(w)), where a_{ij,n}(w)=\frac{\partial}{\partial\theta_{i}}F_{nj}(\boldsymbol{y}_{jn}(w),\boldsymbol{X}(w)) for all 1\leq i\leq s, 1\leq j\leq s and w\in\Omega, since by construction \hat{\boldsymbol{\theta}}_{n}(\boldsymbol{X})\in\Theta_{m_{n}} for all n\geq N_{1} and since \boldsymbol{y}_{jn}(w) is contained in the segment connecting \boldsymbol{\theta}_{0} to \boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X}), it follows that \boldsymbol{y}_{jn}(w)\in\Theta_{m_{n}} for all n\geq N_{1} as well. Thus, once again combining the second inequality from (37) with item i) of Lemma .2, it follows that

\left\|A_{n}(w)-J(\boldsymbol{\theta}_{0})\right\|_{2}<\frac{2}{m_{n}}<\lambda<\left\|J(\boldsymbol{\theta}_{0})^{-1}\right\|_{2}^{-1}\mbox{ for all }n\geq N_{1}. (41)

Thus, in particular, from Lemma .2 it follows that A_{n}(w) is invertible for all n\geq N_{1} and w\in\Omega.

On the other hand, since by construction F_{n}(\boldsymbol{\hat{\theta}}_{n}(\boldsymbol{X}(w)),\boldsymbol{X}(w))=0 for all w\in\Omega_{n}, it follows from (40) that

A_{n}(w)^{T}(\hat{\boldsymbol{\theta}}_{n}(\boldsymbol{X}(w))-\boldsymbol{\theta}_{0})^{T}=-F_{n}(\boldsymbol{\theta}_{0},\boldsymbol{X}(w))\mbox{ for all }w\in\Omega_{n}. (42)

Now, let \boldsymbol{\theta}^{*}_{n}(w) be defined as

\boldsymbol{\theta}_{n}^{*}(w)^{T}=\boldsymbol{\theta}_{0}^{T}-(A_{n}(w)^{-1})^{T}F_{n}(\boldsymbol{\theta}_{0},\boldsymbol{X}(w))\mbox{ for all }w\in\Omega\mbox{ and }n\geq N_{1}. (43)

From (42) it follows that \hat{\boldsymbol{\theta}}_{n}(\boldsymbol{X}(w))=\boldsymbol{\theta}_{n}^{*}(w) for all w\in\Omega_{n}, and thus \hat{\boldsymbol{\theta}}_{n}(\boldsymbol{X})-\boldsymbol{\theta}_{n}^{*}\overset{a.s.}{\to}0.

Since \frac{2}{m_{n}}\to 0 as n\to\infty, it follows from (41) that A_{n}(w)\overset{a.s.}{\to}J(\boldsymbol{\theta}_{0}), and thus, from the invertibility of the matrices involved, it follows for n\geq N_{1} that

(A_{n}(w))^{-1}\overset{a.s.}{\to}J(\boldsymbol{\theta}_{0})^{-1} (44)

as well. Additionally, from the central limit theorem we know that

\sqrt{n}F_{n}(\boldsymbol{\theta}_{0},\boldsymbol{X})\overset{D}{\to}N_{s}(0,K(\boldsymbol{\theta}_{0})), (45)

which, combined with (44), (43) and Slutsky's theorem, implies that

\sqrt{n}(\boldsymbol{\theta}_{n}^{*}(w)-\boldsymbol{\theta}_{0})^{T}\overset{D}{\to}(J(\boldsymbol{\theta}_{0})^{-1})^{T}N_{s}(0,K(\boldsymbol{\theta}_{0}))=N_{s}\left(0,(J(\boldsymbol{\theta}_{0})^{-1})^{T}K(\boldsymbol{\theta}_{0})J(\boldsymbol{\theta}_{0})^{-1}\right),

which concludes the proof, since we have already proved that \hat{\boldsymbol{\theta}}_{n}(\boldsymbol{X})-\boldsymbol{\theta}_{n}^{*}\overset{a.s.}{\to}0. ∎
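In applications, the limiting covariance (J(\boldsymbol{\theta}_{0})^{-1})^{T}K(\boldsymbol{\theta}_{0})J(\boldsymbol{\theta}_{0})^{-1} would typically be estimated by plugging in consistent estimates of J and K. A hedged sketch of this plug-in computation is given below; \hat{J} and \hat{K} are supplied as inputs, since how they are obtained depends on the model at hand, and the matrices shown are placeholders only.

```python
import numpy as np

def sandwich_covariance(J_hat, K_hat, n):
    """Plug-in version of the limiting covariance derived above:
    sqrt(n)(theta_hat_n - theta_0) -> N(0, (J^{-1})^T K J^{-1}).
    J_hat and K_hat are assumed to be consistent estimates of J(theta_0) and
    K(theta_0); dividing by n approximates the covariance of theta_hat_n itself."""
    J_inv = np.linalg.inv(J_hat)
    return (J_inv.T @ K_hat @ J_inv) / n

# Placeholder matrices for illustration only.
J_hat = np.array([[2.0, 0.3], [0.1, 1.5]])
K_hat = np.array([[1.0, 0.2], [0.2, 0.8]])
print(sandwich_covariance(J_hat, K_hat, n=500))
```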

As a corollary of Theorem 2.1, we now prove Theorem 2.2.

Proof.

Item (A) of Theorem 2.1 is the same as that of Theorem 2.2, and thus this item is satisfied.

Now, from hypothesis we see that

h_{j}(x\,;\,\boldsymbol{\theta})=\sum_{k=1}^{s}\frac{\partial}{\partial\theta_{j}}\eta_{k}(\boldsymbol{\theta})\left(T_{k}(x,\boldsymbol{\alpha}_{0})-\operatorname{E}_{\boldsymbol{\theta}}\left[T_{k}(X_{1},\boldsymbol{\alpha}_{0})\right]\right)+\frac{\partial}{\partial\theta_{j}}\log L(\boldsymbol{\theta},\boldsymbol{\alpha}_{0}),\mbox{ for }1\leq j\leq s-r,
h_{s-r+j}(x\,;\,\boldsymbol{\theta})=\sum_{k=1}^{s}\eta_{k}(\boldsymbol{\theta})\left(\frac{\partial}{\partial\alpha_{j}}T_{k}(x,\boldsymbol{\alpha}_{0})-\operatorname{E}_{\boldsymbol{\theta}}\left[\frac{\partial}{\partial\alpha_{j}}T_{k}(X_{1},\boldsymbol{\alpha}_{0})\right]\right)+\frac{\partial}{\partial\alpha_{j}}\log L(\boldsymbol{\theta},\boldsymbol{\alpha}_{0}),\mbox{ for }1\leq j\leq r.

From these relations and the hypothesis, it is easy to see that h_{j}(x\,;\,\boldsymbol{\theta},\boldsymbol{\alpha}_{0}) is measurable in x and \frac{\partial}{\partial\theta_{i}}h_{j}(x\,;\,\boldsymbol{\theta},\boldsymbol{\alpha}_{0}) is well defined and continuous in \boldsymbol{\theta}, for all j and \boldsymbol{\theta}\in\Theta, that is, item (B) of Theorem 2.1 is also satisfied.

Finally, letting

M_{i,j}(x)=\sum_{k=1}^{s}\sup_{\boldsymbol{\theta}\in\overline{\Theta}_{0}}\left|\frac{\partial^{2}}{\partial\theta_{i}\partial\theta_{j}}\eta_{k}(\boldsymbol{\theta})\right|\left(\left|T_{k}(x,\boldsymbol{\alpha}_{0})\right|+\left|\operatorname{E}_{\boldsymbol{\theta}}\left[T_{k}(X_{1},\boldsymbol{\alpha}_{0})\right]\right|\right)+\sup_{\boldsymbol{\theta}\in\overline{\Theta}_{0}}\left|\frac{\partial^{2}}{\partial\theta_{i}\partial\theta_{j}}\log L(\boldsymbol{\theta},\boldsymbol{\alpha}_{0})\right|

for 1\leq i\leq s and 1\leq j\leq s-r, and letting

M_{i,s-r+j}(x)=\sum_{k=1}^{s}\sup_{\boldsymbol{\theta}\in\overline{\Theta}_{0}}\left|\frac{\partial}{\partial\theta_{i}}\eta_{k}(\boldsymbol{\theta})\right|\left(\left|\frac{\partial}{\partial\alpha_{j}}T_{k}(x,\boldsymbol{\alpha}_{0})\right|+\left|\operatorname{E}_{\boldsymbol{\theta}}\left[\frac{\partial}{\partial\alpha_{j}}T_{k}(X_{1},\boldsymbol{\alpha}_{0})\right]\right|\right)+\sup_{\boldsymbol{\theta}\in\overline{\Theta}_{0}}\left|\frac{\partial^{2}}{\partial\theta_{i}\partial\alpha_{j}}\log L(\boldsymbol{\theta},\boldsymbol{\alpha}_{0})\right|

for 1\leq i\leq s and 1\leq j\leq r, one can check directly that

\left|\frac{\partial}{\partial\theta_{i}}h_{j}(x\,;\,\boldsymbol{\theta},\boldsymbol{\alpha}_{0})\right|\leq M_{ij}(x)\mbox{ for }1\leq i\leq s\mbox{ and }1\leq j\leq s.

Additionally, since \operatorname{E}_{\boldsymbol{\theta}}\left[T_{i}(X_{1},\boldsymbol{\alpha}_{0})\right] and \operatorname{E}_{\boldsymbol{\theta}}\left[\frac{\partial}{\partial\alpha_{j}}T_{i}(X_{1},\boldsymbol{\alpha}_{0})\right] are finite for all \boldsymbol{\theta}\in\Theta, it follows that

E_{\boldsymbol{\theta}_{0}}\left[M_{ij}(X_{1})\right]<\infty\mbox{ for all }1\leq i\leq s\mbox{ and }1\leq j\leq s,

which proves that item (C) of Theorem 2.1 is also satisfied. Thus the conclusions of Theorem 2.1 apply, which concludes the proof. ∎

References

  • Aldrich et al. [1997] Aldrich, J. et al. (1997). R. A. Fisher and the making of maximum likelihood 1912–1922. Statistical Science 12(3), 162–176.
  • Andersen [1970] Andersen, E. B. (1970). Asymptotic properties of conditional maximum-likelihood estimators. Journal of the Royal Statistical Society: Series B (Methodological) 32(2), 283–301.
  • Anderson and Blair [1982] Anderson, J. and V. Blair (1982). Penalized maximum likelihood estimation in logistic regression and discrimination. Biometrika 69(1), 123–136.
  • Aryal and Nadarajah [2004] Aryal, G. and S. Nadarajah (2004). Information matrix for beta distributions. Serdica Mathematical Journal 30(4), 513–526.
  • Bierens [2004] Bierens, H. J. (2004). Introduction to the mathematical and statistical foundations of econometrics. Cambridge University Press.
  • Aliprantis and Border [2006] Aliprantis, C. D. and K. C. Border (2006). Infinite Dimensional Analysis: A Hitchhiker’s Guide (3rd ed.). Springer.
  • Cheng and Amin [1983] Cheng, R. and N. Amin (1983). Estimating parameters in continuous univariate distributions with a shifted origin. Journal of the Royal Statistical Society. Series B (Methodological), 394–403.
  • Cox [1975] Cox, D. R. (1975). Partial likelihood. Biometrika 62(2), 269–276.
  • Dempster et al. [1977] Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39(1), 1–22.
  • Firth [1993] Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika 80(1), 27–38.
  • Gourieroux et al. [1984] Gourieroux, C., A. Monfort, and A. Trognon (1984). Pseudo maximum likelihood methods: Theory. Econometrica: journal of the Econometric Society, 681–700.
  • Hager and Bain [1970] Hager, H. W. and L. J. Bain (1970). Inferential procedures for the generalized gamma distribution. Journal of the American Statistical Association 65(332), 1601–1609.
  • Hosking [1990] Hosking, J. R. (1990). L-moments: Analysis and estimation of distributions using linear combinations of order statistics. Journal of the Royal Statistical Society: Series B (Methodological) 52(1), 105–124.
  • Kao [1958] Kao, J. H. (1958). Computer methods for estimating Weibull parameters in reliability studies. IRE Transactions on Reliability and Quality Control, 15–22.
  • Kao [1959] Kao, J. H. (1959). A graphical estimation of mixed Weibull parameters in life-testing of electron tubes. Technometrics 1(4), 389–407.
  • Lehmann and Casella [2006] Lehmann, E. L. and G. Casella (2006). Theory of point estimation. Springer Science & Business Media.
  • Louzada et al. [2019] Louzada, F., P. L. Ramos, and E. Ramos (2019). A note on bias of closed-form estimators for the gamma distribution derived from likelihood equations. The American Statistician 73(2), 195–199.
  • Murphy and Van der Vaart [2000] Murphy, S. A. and A. W. Van der Vaart (2000). On profile likelihood. Journal of the American Statistical Association 95(450), 449–465.
  • Ramos et al. [2020] Ramos, P. L., F. Louzada, and E. Ramos (2020). Bias reduction in the closed-form maximum likelihood estimator for the Nakagami-m fading parameter. IEEE Wireless Communications Letters 9(10), 1692–1695.
  • Redner et al. [1981] Redner, R. et al. (1981). Note on the consistency of the maximum likelihood estimate for nonidentifiable distributions. Annals of Statistics 9(1), 225–228.
  • Zhao et al. [2022] Zhao, J., Y.-H. Jang, and H.-M. Kim (2022). Closed-form and bias-corrected estimators for the bivariate gamma distribution. Journal of Multivariate Analysis 191, 105009.