¹¹institutetext: S. M. Fischer ²²institutetext: Department of Mathematical and Statistical Sciences, 632 Central Academic Building, University of Alberta, Edmonton, AB, T6G 2G1. ²²email: [email protected] ³³institutetext: M. A. Lewis ⁴⁴institutetext: Department of Mathematical and Statistical Sciences, 632 Central Academic Building, University of Alberta, Edmonton, AB, T6G 2G1.
Department of Biological Sciences, CW 405, Biological Sciences Building, University of Alberta, Edmonton, AB, T6G 2E9.

A Robust and Efficient Algorithm to Find Profile Likelihood Confidence Intervals

Samuel M. Fischer Mark A. Lewis

Abstract

Profile likelihood confidence intervals are a robust alternative to Wald’s method if the asymptotic properties of the maximum likelihood estimator are not met. However, the constrained optimization problem defining profile likelihood confidence intervals can be difficult to solve in these situations, because the likelihood function may exhibit unfavorable properties. As a result, existing methods may be inefficient and yield misleading results. In this paper, we address this problem by computing profile likelihood confidence intervals via a trust-region approach, where steps computed based on local approximations are constrained to regions where these approximations are sufficiently precise. As our algorithm also accounts for numerical issues arising if the likelihood function is strongly non-linear or parameters are not estimable, the method is applicable in many scenarios where earlier approaches are shown to be unreliable. To demonstrate its potential in applications, we apply our algorithm to benchmark problems and compare it with 6 existing approaches to compute profile likelihood confidence intervals. Our algorithm consistently achieved higher success rates than any competitor while also being among the quickest methods. As our algorithm can be applied to compute both confidence intervals of parameters and model predictions, it is useful in a wide range of scenarios.

Keywords:

computer algorithm, constrained optimization, parameter estimation, estimability, identifiability

Declarations

Funding.

SMF is thankful for the funding received from the Canadian Aquatic Invasive Species Network and the Natural Sciences and Engineering Research Council of Canada (NSERC); MAL gratefully acknowledges an NSERC Discovery Grant and Canada Research Chair.

Competing interests.

The authors declare no competing interests.

Availability of data and material.

Not applicable.

Code availability.

A Python implementation of the described algorithm and the test procedures can be retrieved from the python package index as package ci-rvm (see pypi.org/project/ci-rvm).

Authors’ contributions.

Samuel M. Fischer and Mark A. Lewis jointly conceived the project; Samuel M. Fischer conceived the algorithm, conducted the mathematical analysis, implemented the algorithm, and wrote the manuscript. Mark A. Lewis revised the manuscript.

Acknowledgements.

The authors would like to give thanks to the members of the Lewis Research Group at the University of Alberta for helpful feedback and discussions.

1 Introduction

1.1 Profile likelihood confidence intervals

Confidence intervals are an important tool for statistical inference, used not only to assess the range of predictions that are supported by a model and data but also to detect potential estimability issues (Raue et al., 2009). These estimability issues occur if the available data do not suffice to infer a statistical quantity on the desired confidence level, and the corresponding confidence intervals are infinite (Raue et al., 2009). Due to the broad range of applications, confidence intervals are an integral part of statistical model analysis and widely used across disciplines.

Often, confidence intervals are constructed via Wald’s method, which exploits the asymptotic normality of the maximum likelihood estimator (MLE). Though Wald’s method is accurate in “benign” use cases, the approach can be imprecise or fail if not enough data are available to reach the asymptotic properties of the MLE. This will be the case, in particular, if the MLE is not unique, i.e. parameters are not identifiable, or if the likelihood is very sensitive to parameter changes beyond some threshold, e.g. in dynamical systems undergoing bifurcations. Therefore, other methods, such as profile likelihood techniques (Cox and Snell, 1989), are favorable in many use cases.

Both Wald-type and profile likelihood confidence intervals are constructed by inverting the likelihood likelihood ratio test. That is, the confidence interval for a parameter $\theta_{0}$ encompasses all values $\bar{\theta}_{0}$ that might suit as acceptable null hypotheses if the parameter were to be fixed; i.e. $H_{0}:\theta_{0}=\bar{\theta}_{0}$ could not be rejected versus the alternative $H_{1}:\theta_{0}\neq\bar{\theta}_{0}$ . As the likelihood ratio statistic is, under regularity conditions, approximately $\chi^{2}$ distributed under the null hypothesis, the confidence interval is given by

\displaystyle\mathit{I}

\displaystyle=

\displaystyle\left[\bar{\theta}_{0}\,\Big{|}\,2\left(\underset{\boldsymbol{\theta}\in\Theta}{\max\,}\ell\mathopen{\left(\boldsymbol{\theta}\right)}\mathclose{}\mathclose{}-\underset{\boldsymbol{\theta}\in\Theta\,:\,\theta_{0}=\bar{\theta}_{0}}{\max\,}\ell\mathopen{\left(\boldsymbol{\theta}\right)}\mathclose{}\mathclose{}\right)\leq\chi_{1,1-\alpha}^{2}\right],

(1)

whereby $\Theta$ is the parameter space, $\ell$ denotes the log-likelihood function, $\alpha$ is the desired confidence level, and $\chi_{k,1-\alpha}^{2}$ is the $\left(1-\alpha\right)$ th quantile of the $\chi^{2}$ distribution with $k$ degrees of freedom.

The function that maps $\bar{\theta}_{0}$ to the constrained maximum

\displaystyle\ell_{\mathrm{PL}}\mathopen{\left(\bar{\theta}_{0}\right)}\mathclose{}\mathclose{}

\displaystyle:=

\displaystyle\underset{\boldsymbol{\theta}\in\Theta\,:\,\theta_{0}=\bar{\theta}_{0}}{\max\,}\ell\mathopen{\left(\boldsymbol{\theta}\right)}\mathclose{}\mathclose{}

(2)

is called the profile log-likelihood. While Wald’s method approximates $\ell$ and $\ell_{\mathrm{PL}}$ as quadratic functions, profile likelihood confidence intervals are constructed by exact computation of the profile log-likelihood $\ell_{\mathrm{PL}}$ . This makes this method more accurate but also computationally challenging.

1.2 Existing approaches

Conceptually, the task of identifying the end points $\theta_{0}^{\min}$ and $\theta_{0}^{\max}$ of the confidence interval $I$ is equivalent to finding the maximal (or minimal) value for $\theta_{0}$ with

\ell_{\mathrm{PL}}\mathopen{\left(\theta_{0}\right)}\mathclose{}\mathclose{}\leavevmode\nobreak\ =\leavevmode\nobreak\ \ell^{*}\leavevmode\nobreak\ :=\leavevmode\nobreak\ \ell\mathopen{\left(\hat{\boldsymbol{\theta}}\right)}\mathclose{}\mathclose{}-\frac{1}{2}\chi_{1,1-\alpha}^{2},

(3)

Here, $\hat{\boldsymbol{\theta}}$ denotes the MLE; the value $\ell^{*}$ follows from rearranging the terms in the inequality characterizing $I$ (see equation (1)).

There are two major perspectives to address this problem. It could either be understood as a one-dimensional root finding problem on $\ell_{\mathrm{PL}}$ or as the constrained maximization (or minimization) problem

\displaystyle\theta_{0}^{\max}

\displaystyle=

\displaystyle\underset{\boldsymbol{\theta}\in\Theta\,:\,\ell\mathopen{\left(\boldsymbol{\theta}\right)}\mathclose{}\mathclose{}\geq\ell^{*}}{\max\,}\theta_{0}

(4)

( $\theta_{0}^{\min}$ analog). Approaches developed from either perspective face the challenge of balancing robustness against efficiency.

The root finding perspective (Cook and Weisberg, 1990; DiCiccio and Tibshirani, 1991; Stryhn and Christensen, 2003; Moerbeek et al., 2004; Ren and Xia, 2019) is robust if small steps are taken and solutions of the maximization problem (2) are good initial guesses for the maximizations in later steps. Nonetheless, the step size should be variable if parameters might be not estimable and the confidence intervals large. At the same time, care must be taken with large steps, as solving (2) can be difficult if the initial guesses are poor, and algorithms may fail to converge. Therefore, conservative step choices are often advisable even though they may decrease the overall efficiency of the approaches.

The constrained maximization perspective (Neale and Miller, 1997; Wu and Neale, 2012) has the advantage that efficient solvers for such problems are readily implemented in many optimization packages. If the likelihood function is “well behaved”, these methods converge very quickly. However, in practical problems, the likelihood function may have local extrema, e.g. due to lack of data, or steep “cliffs” that may hinder these algorithms from converging to a feasible solution. Furthermore, general algorithms are typically not optimized for problems like (4), in which the target function is simple and the major challenge is in ensuring that the constraint is met. Therefore, an approach would be desirable that is specifically tailored to solve the constrained maximization (4) in a robust and efficient manner.

A first step in this direction is the algorithm by Venzon and Moolgavkar (1988), which solves (4) by repeated quadratic approximations of the likelihood surface. As the method is of Newton-Raphson type, it is very efficient as long as the local approximations are accurate. Therefore, the algorithm is fast if the asymptotic normality of the MLE is achieved approximately. Otherwise, the algorithm relies heavily on good initial guesses. Though methods to determine accurate initial guesses exist (Gimenez et al., 2005), the algorithm by Venzon and Moolgavkar (1988) (below abbreviated as VM) can get stuck in local extrema or fail to converge if the likelihood surface has unfavorable properties (see e.g. Ren and Xia, 2019). Moreover, the algorithm will break down if parameters are not identifiable. Thus, VM cannot be applied in important use cases of profile likelihood confidence intervals.

1.3 Our contributions

In this paper, we address the issues of VM by introducing an algorithm extending the ideas of Venzon and Moolgavkar (1988). Our algorithm, which we will call Robust Venzon-Moolgavkar Algorithm (RVM) below, combines the original procedure with a trust region approach (Conn et al., 2000). That is, the algorithm never steps outside of the region in which the likelihood approximation is sufficiently precise. Furthermore, RVM accounts for unidentifiable parameters, local minima and maxima, and sharp changes in the likelihood surface. Though RVM may not outcompete traditional approaches in problems with well-behaved likelihood functions or in the absence of estimability issues, we argue that RVM is a valuable alternative in the (common) cases that the likelihood function is hard to optimize and the model involves parameters that are not estimable.

Another well-known limitation of the approach by Venzon and Moolgavkar (1988) is that it is not directly applicable to construct confidence intervals for functions of parameters. Often the main research interest is not in identifying specific model parameters but in obtaining model predictions, which can be expressed as a function of the parameters. In addition to presenting a robust algorithm to find confidence intervals for model parameters, we show how RVM (and the original VM) can also be applied to determine confidence intervals for functions of parameters.

This paper is structured as follows: in the first section, we start by outlining the main ideas behind RVM before we provide details of the applied procedures. Furthermore, we briefly describe how the algorithm can be used to determine confidence intervals of functions of parameters. In the second section, we apply RVM and alternative algorithms to benchmark problems with simulated data. Thereby, we review the implemented alternative algorithms before we present the results. We conclude this paper with a discussion of the benchmark results and the benefits and limitations of RVM in comparison to earlier methods.

All code used in this study, including a Python implementation of RVM, can be retrieved from the python package index as package ci-rvm (see pypi.org/project/ci-rvm).

2 Algorithm

2.1 Basic ideas

Suppose we consider a model with an $n$ -dimensional parameter vector $\boldsymbol{\theta}:=\left(\theta_{0},\dots,\theta_{n-1}\right)$ and a twice continuously differentiable log-likelihood function $\ell$ . Assume without loss of generality that we seek to construct a level- $\alpha$ confidence interval for the parameter $\theta_{0}$ , and let $\widetilde{\boldsymbol{\theta}}:=\left(\theta_{1},\dots,\theta_{n-1}\right)^{\top}$ be the vector of all remaining parameters, called nuisance parameters. For convenience, we may write $\ell=\ell\mathopen{\left(\boldsymbol{\theta}\right)}\mathclose{}\mathclose{}$ as a function of the complete parameter vector or $\ell=\ell\left(\theta_{0},\widetilde{\boldsymbol{\theta}}\right)$ as a function of the parameter of interest and the nuisance parameters.

The algorithm RVM introduced in this paper searches the right end point $\theta_{0}^{\max}$ (equation (4)) of the confidence interval $\mathinner{I}$ . The left end point can be identified with the same approach if a modified model is considered in which $\ell$ is flipped in $\theta_{0}$ . As RVM builds on the method by Venzon and Moolgavkar (1988), we start by recapitulating their algorithm VM below.

Let $\boldsymbol{\theta}^{*}\in\Theta$ be the parameter vector at which the parameter of interest is maximal, $\theta_{0}^{*}=\theta_{0}^{\max}$ , and $\ell\mathopen{\left(\boldsymbol{\theta}^{*}\right)}\mathclose{}\mathclose{}\geq\ell^{*}$ . Venzon and Moolgavkar (1988) note that $\boldsymbol{\theta}^{*}$ satisfies the following necessary conditions:

1.

$\ell\mathopen{\left(\boldsymbol{\theta}^{*}\right)}\mathclose{}\mathclose{}=\ell^{*}$ and
2.

$\ell$ is in a local maximum with respect to the nuisance parameters, which implies $\nicefrac{{\partial\ell}}{{\partial\widetilde{\boldsymbol{\theta}}}}\mathopen{\left(\boldsymbol{\theta}^{*}\right)}\mathclose{}\mathclose{}=0$ .

The algorithm VM searches for $\boldsymbol{\theta}^{*}$ by minimizing both the log-likelihood distance to the threshold $\left|\ell(\boldsymbol{\theta})-\ell^{*}\right|$ and the gradient of the nuisance parameters $\nicefrac{{\partial\ell}}{{\partial\widetilde{\boldsymbol{\theta}}}}$ . To this end, the algorithm repeatedly approximates the log-likelihood surface $\ell$ with second order Taylor expansions $\hat{\ell}$ . If $\boldsymbol{\theta}^{(i)}$ is the parameter vector in the $i^{\text{th}}$ iteration of the algorithm, expanding $\ell$ around $\boldsymbol{\theta}^{(i)}$ yields

$\displaystyle\hat{\ell}(\boldsymbol{\theta})$	$\displaystyle:=$	$\displaystyle\ell\mathopen{\left(\boldsymbol{\theta}^{(i)}\right)}\mathclose{}\mathclose{}+\boldsymbol{g}^{\top}\left(\boldsymbol{\theta}-\boldsymbol{\theta}^{(i)}\right)$	(5)
		$\displaystyle+\frac{1}{2}\left(\boldsymbol{\theta}-\boldsymbol{\theta}^{(i)}\right)^{\top}\underline{\boldsymbol{\mathrm{H}}}\left(\boldsymbol{\theta}-\boldsymbol{\theta}^{(i)}\right)$
	$\displaystyle=$	$\displaystyle\bar{\ell}+\widetilde{\boldsymbol{g}}^{\top}\widetilde{\boldsymbol{\delta}}+g_{0}\delta_{0}+\frac{1}{2}\widetilde{\boldsymbol{\delta}}^{\top}\widetilde{\underline{\boldsymbol{\mathrm{H}}}}\widetilde{\boldsymbol{\delta}}+\delta_{0}\widetilde{\boldsymbol{\mathrm{H}}}_{0}^{\top}\widetilde{\boldsymbol{\delta}}+\frac{1}{2}\delta_{0}\mathrm{H}_{00}\delta_{0}\leavevmode\nobreak\ \leavevmode\nobreak\$
	$\displaystyle{=:}$	$\displaystyle\hat{\ell}^{\delta}\left(\delta_{0},\widetilde{\boldsymbol{\delta}}\right).$

Here, $\boldsymbol{\delta}:=\boldsymbol{\theta}-\boldsymbol{\theta}^{(i)}$ , $\bar{\ell}:=\ell\mathopen{\left(\boldsymbol{\theta}^{(i)}\right)}\mathclose{}\mathclose{}$ ; $\boldsymbol{g}:=\frac{\partial\ell}{\partial\boldsymbol{\theta}}\mathopen{\left(\boldsymbol{\theta}^{(i)}\right)}\mathclose{}\mathclose{}$ is the gradient and $\underline{\boldsymbol{\mathrm{H}}}:=\nicefrac{{\partial^{2}\ell}}{{\partial\boldsymbol{\theta}^{2}}}\mathopen{\left(\boldsymbol{\theta}^{(i)}\right)}\mathclose{}\mathclose{}$ the Hessian matrix of $\ell$ at $\boldsymbol{\theta}^{(i)}$ . Analogously to notation used above, we split $\boldsymbol{\delta}$ into its first entry $\delta_{0}$ and the remainder $\widetilde{\boldsymbol{\delta}}$ , $\boldsymbol{g}$ into $g_{0}$ and $\widetilde{\boldsymbol{g}}$ , and write $\boldsymbol{\mathrm{\mathrm{H}}}_{0}$ for the first column of $\underline{\boldsymbol{\mathrm{H}}}$ , $\widetilde{\underline{\boldsymbol{\mathrm{H}}}}$ for $\underline{\boldsymbol{\mathrm{H}}}$ without its first column and row, and split $\boldsymbol{\mathrm{\mathrm{H}}}_{0}$ into $\mathrm{H}_{00}$ and $\widetilde{\boldsymbol{\mathrm{\mathrm{H}}}}_{0}$ .

In each iteration, VM seeks $\delta_{0}^{*}$ and $\widetilde{\boldsymbol{\delta}^{*}}$ that satisfy conditions 1 and 2. Applying condition 2 to the approximation $\hat{\ell}^{\delta}$ (equation (5)) yields

\displaystyle\widetilde{\boldsymbol{\delta}^{*}}

\displaystyle=

\displaystyle-\widetilde{\underline{\boldsymbol{\mathrm{H}}}}^{-1}\left(\widetilde{\boldsymbol{\mathrm{\mathrm{H}}}}_{0}\delta_{0}+\widetilde{\boldsymbol{g}}\right).

(6)

Inserting (5) and (6) into condition 1 gives us

	$\displaystyle\ell^{*}$	$\displaystyle=$	$\displaystyle\frac{1}{2}\left(\mathrm{H}_{00}-\widetilde{\boldsymbol{\mathrm{\mathrm{H}}}}_{0}^{\top}\widetilde{\underline{\boldsymbol{\mathrm{H}}}}^{-1}\widetilde{\boldsymbol{\mathrm{\mathrm{H}}}}_{0}\right)\delta_{0}^{*2}$		(7)
			$\displaystyle+\left(g_{0}-\widetilde{\boldsymbol{g}}^{\top}\widetilde{\underline{\boldsymbol{\mathrm{H}}}}^{-1}\widetilde{\boldsymbol{\mathrm{\mathrm{H}}}}_{0}\right)\delta_{0}^{*}+\bar{\ell}-\frac{1}{2}\widetilde{\boldsymbol{g}}^{\top}\widetilde{\underline{\boldsymbol{\mathrm{H}}}}^{-1}\widetilde{\boldsymbol{g}},$		(7)

which can be solved for $\delta_{0}^{*}$ if $\underline{\boldsymbol{\mathrm{H}}}$ is negative definite. If equation (7) has multiple solutions, Venzon and Moolgavkar (1988) choose the one that minimizes $\boldsymbol{\delta}$ according to some norm. Our algorithm RVM applies a different procedure and chooses the root that minimizes the distance to $\theta_{0}^{\max}$ without stepping into a region in which the approximation (5) is inaccurate. In section 2.5, we provide further details and discuss the case in which equation (7) has no real solutions.

After each iteration, $\boldsymbol{\theta}$ is updated according to the above results:

\displaystyle\boldsymbol{\theta}^{(i+1)}

\displaystyle=

\displaystyle\boldsymbol{\theta}^{(i)}+\boldsymbol{\delta}^{*}.

(8)

If $\ell\mathopen{\left(\boldsymbol{\theta}^{(i+1)}\right)}\mathclose{}\mathclose{}\approx\ell^{*}$ and $\nicefrac{{\partial\ell}}{{\partial\tilde{\boldsymbol{\theta}}}}\mathopen{\left(\boldsymbol{\theta}^{(i+1)}\right)}\mathclose{}\mathclose{}\approx 0$ up to the desired precision, the search is terminated and $\boldsymbol{\theta}^{(i+1)}$ is returned.

Refer to caption — Figure 1: Flow chart for RVM. The procedure is repeated until the termination criterion is met and the result is returned.

The need to extend the original algorithm VM outlined above comes from the following issues: (1) The quadratic approximation $\hat{\ell}$ may be imprecise far from the approximation point. In extreme cases, updating $\boldsymbol{\theta}$ as suggested could take us farther away from the target $\boldsymbol{\theta}^{*}$ rather than closer to it. (2) The approximation $\hat{\ell}$ may be constant in some directions or be not bounded above. In these cases, we may not be able to identify unique solutions for $\delta_{0}$ and $\widetilde{\boldsymbol{\delta}}$ , and the gradient criterion in condition 2 may not characterize a maximum but a saddle point or a minimum. (3) The limited precision of numerical operations can result in discontinuities corrupting the results of VM and hinder the algorithm from terminating.

To circumvent these problems, we introduce a number of extensions to VM. First, we address the limited precision of the Taylor approximation $\hat{\ell}$ with a trust region approach (Conn et al., 2000). That is, we constrain our search for $\boldsymbol{\delta}^{*}$ to a region in which the approximation $\hat{\ell}$ is sufficiently accurate. Second, we choose some parameters freely if $\hat{\ell}$ is constant in some directions and solve constrained maximization problems if $\hat{\ell}$ is not bounded above. In particular, we detect cases in which $\ell_{\mathrm{PL}}$ approaches an asymptote above $\ell^{*}$ , which means that $\theta_{0}$ is not estimable. Lastly, we introduce a method to identify and jump over discontinuities as appropriate. An overview of the algorithm is depicted as flow chart in Figure 1. Below, we describe each of our extensions in detail.

2.2 The trust region

In practice, the quadratic approximation (5) may not be good enough to reach a point close to $\boldsymbol{\theta}^{*}$ within one step. In fact, since $\ell$ may be very “non-quadratic”, we might obtain a parameter vector for which $\ell$ and $\nicefrac{{\partial\ell}}{{\partial\tilde{\boldsymbol{\theta}}}}$ are farther from $\ell^{*}$ and $\boldsymbol{0}$ than in the previous iteration. Therefore, we accept changes in $\theta$ only if the approximation is sufficiently accurate in the new point.

In each iteration $i$ , we compute the new parameter vector, compare the values of $\hat{\ell}$ and $\ell$ at the obtained point $\boldsymbol{\theta}^{(i)}+\boldsymbol{\delta}^{*}$ , and accept the step if, and only if, $\hat{\ell}$ and $\ell$ are close together with respect to a given distance measure. If $\bar{\ell}$ is near the target $\ell^{*}$ , we may also check the precision of the gradient approximation $\nicefrac{{\partial\hat{\ell}}}{{\partial\tilde{\boldsymbol{\theta}}}}$ to enforce timely convergence of the algorithm.

If we reject a step, we decrease the value $\delta_{0}^{*}$ obtained before, reduce the maximal admissible length $r$ of the nuisance parameter vector and solve the constrained maximization problem

\displaystyle\widetilde{\boldsymbol{\delta}^{*}}

\displaystyle=

\displaystyle\underset{\widetilde{\boldsymbol{\delta}}\,:\,\left|\widetilde{\boldsymbol{\delta}}\right|\leq r}{\max\,}\hat{\ell}^{\delta}\left(\delta_{0},\widetilde{\boldsymbol{\delta}}\right).

(9)

As the quadratic subproblem (9) appears in classical trust-region algorithms, efficient solvers are available (Conn et al., 2000) and implemented in optimization software, such as in the Python package Scipy (Jones et al., 2001).

We check the accuracy of the approximation at the resulting point $\boldsymbol{\theta}^{(i)}+\boldsymbol{\delta}^{*}$ , decrease the search radius if necessary, and continue with this procedure until the approximation is sufficiently precise. The metric and the tolerance applied to measure the approximation’s precision may depend on how far the current log-likelihood $\bar{\ell}$ is from the target $\ell^{*}$ . We suggest suitable precision measures in section 2.8.

Since it is often computationally expensive to compute the Hessian $\underline{\boldsymbol{\mathrm{H}}}$ , we desire to take as large steps $\delta_{0}$ as possible. However, it is also inefficient to adjust the search radius very often to find the maximal admissible $\delta_{0}^{*}$ . Therefore, RVM first attempts to make the unconstrained step given by equation (5). If this step is rejected, RVM determines the search radius with a log-scale binary search between the radius of the unconstrained step and the search radius accepted in the previous iteration. If even the latter radius does not lead to a sufficiently precise result, we update $\delta_{0}^{*}$ and $r$ by factors $\beta_{0},\beta_{1}\in\left(0,1\right)$ so that $\delta_{0}^{*}\leftarrow\beta_{0}\delta_{0}^{*}$ and $r\leftarrow\beta_{1}r$ .

2.3 Linearly dependent parameters

The right hand side of equation (6) is defined only if the nuisance Hessian $\widetilde{\underline{\boldsymbol{\mathrm{H}}}}$ is invertible. If $\widetilde{\underline{\boldsymbol{\mathrm{H}}}}$ is singular, the maximum with respect to the nuisance parameters is not uniquely defined or does not exist at all. We will consider the second case in the next section and focus on the first case here.

There are multiple options to compute a psudo-inverse of a singular matrix to solve underspecified linear equation systems (Rao, 1967). A commonly used approach is the Moore-Penrose inverse (Penrose, 1955), which yields a solution with minimal norm (Rao, 1967). This is a desirable property for our purposes, as the quadratic approximation is generally most precise close to the approximation point. The Moore-Penrose inverse can be computed efficiently with singular value decompositions (Golub and Kahan, 1965), which have also been applied to determine the number of identifiable parameters in a model (Eubank and Webster, 1985; Viallefont et al., 1998).

Whether or not a matrix is singular is often difficult to know precisely due to numerical inaccuracies. The Moore-Penrose inverse is therefore highly sensitive to a threshold parameter determining when the considered matrix is deemed singular. As the Hessian matrix is typically computed with numerical methods subject to error, it is often beneficial to choose a high value for this threshold parameter to increase the robustness of the method. Too large threshold values, however, can slow down or even hinder convergence of the algorithm.

An alternative method to account for singular Hessian matrices is to hold linearly dependent parameters constant until the remaining parameters form a non-singular system. In tests, this approach appeared to be more robust than applying the Moore-Penrose inverse. Therefore, we used this method in our implementation. We provide details on this method as well as test results in Supplementary Appendix A. Note that we write $\widetilde{\underline{\boldsymbol{\mathrm{H}}}}^{-1}$ for this generalized inverse below.

To determine whether the approximate system has any solution when $\widetilde{\underline{\boldsymbol{\mathrm{H}}}}$ is singular, we test whether $\widetilde{\boldsymbol{\delta}^{*}}$ computed according to equations (6) and (7) indeed satisfies the necessary conditions for a maximum in the nuisance parameters. That is, we check whether

0\leavevmode\nobreak\ \approx\leavevmode\nobreak\ \frac{\partial}{\partial\widetilde{\boldsymbol{\delta}}}\hat{\ell}^{\boldsymbol{\delta}}\leavevmode\nobreak\ =\leavevmode\nobreak\ \widetilde{\underline{\boldsymbol{\mathrm{H}}}}\widetilde{\boldsymbol{\delta}^{*}}+\widetilde{\boldsymbol{\mathrm{\mathrm{H}}}}_{0}\delta_{0}^{*}+\widetilde{\boldsymbol{g}}

(10)

holds up to a certain tolerance. If this is not the case, $\hat{\ell}$ is unbounded, and we proceed as outlined in the next section.

2.4 Solving unbounded subproblems

In each iteration, we seek the nuisance parameters $\tilde{\theta}$ that maximize $\ell$ for the computed value of $\theta_{0}$ . Since the log-likelihood function $\ell$ is bounded above, such a maximum must exist in theory. However, the approximate log-likelihood $\hat{\ell}$ could be unbounded at times, which would imply that the approximation is imprecise for large steps. Since we cannot identify a global maximum of $\hat{\ell}$ if it is unbounded, we instead seek the point maximizing $\hat{\ell}$ in the range where $\hat{\ell}$ is sufficiently accurate.

To test whether $\hat{\ell}$ is unbounded in the nuisance parameters, we first check whether $\widetilde{\underline{\boldsymbol{\mathrm{H}}}}$ is negative semi-definite. If $\widetilde{\underline{\boldsymbol{\mathrm{H}}}}$ is invertible, this test can be conducted by applying a Cholesky decomposition on $-\widetilde{\underline{\boldsymbol{\mathrm{H}}}}$ , which succeeds if and only if $\widetilde{\underline{\boldsymbol{\mathrm{H}}}}$ is negative definite. If $\widetilde{\underline{\boldsymbol{\mathrm{H}}}}$ is singular, we use an eigenvalue decomposition. If all eigenvalues are below a small threshold, $\widetilde{\underline{\boldsymbol{\mathrm{H}}}}$ is negative semi-definite. To confirm that $\hat{\ell}$ is bounded, we also test whether equation (10) holds approximately if $\widetilde{\underline{\boldsymbol{\mathrm{H}}}}$ is singular (see section 2.3).

If either of these tests fails, $\hat{\ell}$ is unbounded. In this case, we set $\delta_{0}^{*}\leftarrow r_{0}$ , $r\leftarrow r_{1}$ , for some parameters $r_{0},r_{1}>0$ and solve the maximization problem (9). The parameters $r_{0}$ and $r_{1}$ can be adjusted and saved for future iterations to efficiently identify the maximal admissible step. That is, we may increase (or reduce) $\delta_{0}^{*}$ and $r$ as long as (or until) $\hat{\ell}$ is sufficiently precise. Thereby, we adjust the ratio of $\delta_{0}^{*}$ and $r$ so that the likelihood increases: $\hat{\ell}^{\delta}\mathopen{\left(\delta_{0}^{*},\widetilde{\boldsymbol{\delta}^{*}}\right)}\mathclose{}\mathclose{}>\bar{\ell}$ .

2.5 Step choice for the parameter of interest

Whenever $\hat{\ell}$ has a unique maximum in the nuisance parameters, we compute $\delta_{0}^{*}$ by solving equation (7). This equation can have one, two, or no roots. To discuss how $\delta_{0}^{*}$ should be chosen in either of these cases, we introduce some helpful notation. First, we write $\hat{\ell}_{\mathrm{PL}}\mathopen{\left(\theta_{0}\right)}\mathclose{}\mathclose{}:=\underset{\tilde{\theta}}{\max\,}\hat{\ell}\mathopen{\left(\theta_{0},\tilde{\theta}\right)}\mathclose{}\mathclose{}$ for the profile log-likelihood function of the quadratic approximation. Furthermore, we write in accordance with previous notation

\hat{\ell}_{\mathrm{PL}}^{\delta}\mathopen{\left(\delta_{0}\right)}\mathclose{}\mathclose{}\leavevmode\nobreak\ :=\leavevmode\nobreak\ \hat{\ell}_{\mathrm{PL}}\mathopen{\left(\theta_{0}^{(i)}+\delta_{0}\right)}\mathclose{}\mathclose{}\leavevmode\nobreak\ =\leavevmode\nobreak\ a\delta_{0}^{2}+p\delta_{0}+q+\ell^{*}

(11)

with $a:=\frac{1}{2}\left(\mathrm{H}_{00}-\widetilde{\boldsymbol{\mathrm{\mathrm{H}}}}_{0}\widetilde{\underline{\boldsymbol{\mathrm{H}}}}^{-1}\widetilde{\boldsymbol{\mathrm{\mathrm{H}}}}_{0}\right)$ , $p:=g_{0}-\widetilde{g}^{\top}\widetilde{\underline{\boldsymbol{\mathrm{H}}}}^{-1}\widetilde{\boldsymbol{\mathrm{\mathrm{H}}}}_{0}$ , and $q:=\bar{\ell}-\frac{1}{2}\widetilde{g}^{\top}\widetilde{\underline{\boldsymbol{\mathrm{H}}}}^{-1}\widetilde{g}-\ell^{*}$ (see equation (7)).

Our choices of $\delta_{0}^{*}$ attempt to increase $\theta_{0}$ as much as possible while staying in a region in which the approximation $\hat{\ell}$ is reasonably accurate. The specific step choice depends on the slope of the profile likelihood $\hat{\ell}_{\mathrm{PL}}^{\delta}$ and on whether we have already exceeded $\theta_{0}^{\max}$ according to our approximation, i.e. $\hat{\ell}_{\mathrm{PL}}^{\delta}\mathopen{\left(0\right)}\mathclose{}\mathclose{}<\ell^{*}$ . Below, we will first assume that $\hat{\ell}_{\mathrm{PL}}^{\delta}\mathopen{\left(0\right)}\mathclose{}\mathclose{}>\ell^{*}$ and discuss the opposite case later.

2.5.1 Case 1: decreasing profile likelihood

If the profile likelihood decreases at the approximation point, i.e. $p<0$ , we select the smallest positive root:

\displaystyle\delta_{0}^{*}

\displaystyle=

\displaystyle\begin{cases}-\frac{q}{p}&\text{if }a=0\\ -\frac{1}{2a}\left(p+\sqrt{p^{2}-4aq}\right)&\text{else.}\end{cases}

(12)

Choosing $\delta_{0}^{*}>0$ ensures that the distance to the end point $\theta_{0}^{\max}$ decreases in this iteration. Choosing the smaller positive root increases our trust in the accuracy of the approximation and prevents potential convergence issues (see Figure 2a).

If $\hat{\ell}_{\mathrm{PL}}^{\delta}$ has a local minimum above the threshold $\ell^{*}$ , equation (11) does not have a solution, and we may attempt to decrease the distance between $\hat{\ell}_{\mathrm{PL}}^{\delta}$ and $\ell^{*}$ instead. This procedure, however, may let RVM converge to a local minimum in $\hat{\ell}_{\mathrm{PL}}^{\delta}$ rather than to a point with $\hat{\ell}_{\mathrm{PL}}^{\delta}=\ell^{*}$ . Therefore, we “jump” over the extreme point by doubling the value of $\delta_{0}^{*}$ . That is, we choose

\displaystyle\delta_{0}^{*}

\displaystyle=

\displaystyle-\frac{p}{a}

(13)

if $p^{2}<4aq$ (see Figure 2b).

2.5.2 Case 2: increasing profile likelihood

If the profile likelihood increases at the approximation point, i.e. $p>0$ , equation (11) has a positive root if and only if $\hat{\ell}_{\mathrm{PL}}$ is concave down; $a<0$ . We choose this root whenever it exists:

\displaystyle\delta_{0}^{*}

\displaystyle=

\displaystyle-\frac{1}{2a}\left(p+\sqrt{p^{2}-4aq}\right).

(14)

However, if $\hat{\ell}_{\mathrm{PL}}$ grows unboundedly, equation (11) does not have a positive root. In this case, we change the threshold value $\ell^{*}$ temporarily to a value $\ell^{*\prime}$ chosen so that equation (11) has a solution with the updated threshold (see Figure 2c). For example, we may set

\displaystyle\ell^{*\prime}

\displaystyle:=

\displaystyle\max\left\{\hat{\ell}_{\mathrm{PL}}^{\delta}\mathopen{\left(0\right)}\mathclose{}\mathclose{}+1,\frac{\bar{\ell}+\ell\mathopen{\left(\hat{\boldsymbol{\theta}}\right)}\mathclose{}\mathclose{}}{2}\right\}.

This choice ensures that a solution exists while at the same time reaching local likelihood maxima quickly. After resetting the threshold, we proceed as usual.

To memorize that we changed the threshold value $\ell^{*}$ , we set a flag $\mathtt{maximizing:=True}$ . In future iterations $j>i$ , we set the threshold $\ell^{*}$ back to its initial value $\ell_{0}^{*}$ and $\mathtt{maximizing:=False}$ as soon as $\ell\mathopen{\left(\boldsymbol{\theta}^{(j)}\right)}\mathclose{}\mathclose{}<\ell_{0}^{*}$ or $\hat{\ell}_{\mathrm{PL}}$ is concave down at the approximation point $\boldsymbol{\theta}^{(j)}$ .

2.5.3 Case 3: constant profile likelihood

If the profile likelihood has a local extremum at the approximation point, i.e. $p=0$ , $a\neq 0$ , we proceed as in cases 1 and 2: if $a>0$ , we proceed as if $\hat{\ell}_{\mathrm{PL}}$ were increasing, and if $a<0$ , we proceed as if $\hat{\ell}_{\mathrm{PL}}$ were decreasing. However, the approximate profile likelihood could also be constant, $a=p=0$ . In this case, we attempt to make a very large step to check whether we can push $\theta_{0}$ arbitrarily far. In section 2.6, we discuss this procedure in greater detail.

2.5.4 Profile likelihood below the threshold

If the profile likelihood at the approximation point is below the threshold, $\hat{\ell}_{\mathrm{PL}}^{\delta}\mathopen{\left(0\right)}\mathclose{}\mathclose{}<\ell^{*}$ , we always choose the smallest possible step:

\displaystyle\delta_{0}^{*}

\displaystyle=

\displaystyle\begin{cases}-\frac{1}{2a}\left(p+\sqrt{p^{2}-4aq}\right)&\text{if }a\neq 0,\,p<0\\ -\frac{q}{p}&\text{if }a=0,\,p\neq 0\\ -\frac{1}{2a}\left(p-\sqrt{p^{2}-4aq}\right)&\text{if }a\neq 0,\,p>0.\end{cases}

(15)

This shall bring us to the admissible parameter region as quickly as possible.

As RVM rarely steps far beyond the admissible region in practice, equation (15) usually suffices to define $\delta_{0}^{*}$ . Nonetheless, if we find that $\hat{\ell}_{PL}^{\boldsymbol{\delta}}$ has a local maximum below the threshold, i.e. $p^{2}<4qa$ , we may instead maximize $\hat{\ell}_{PL}^{\delta}$ as far as possible:

\displaystyle\delta_{0}^{*}

\displaystyle=

\displaystyle-\frac{p}{2a}.

(16)

If we have already reached a local maximum ( $p\approx 0$ ), we cannot make a sensible choice for $\delta_{0}$ . In this case, we may recall the iteration $k:=\underset{j\,:\,\ell(\theta^{(j)})\geq\ell^{*}}{\mathrm{argmax}\,}\theta_{0}^{(j)}$ , in which the largest admissible $\theta_{0}$ value with $\ell\mathopen{\left(\theta^{(k)}\right)}\mathclose{}\mathclose{}\geq\ell^{*}$ has been found so far, and conduct a binary search between $\theta^{(i)}$ and $\theta^{(k)}$ until we find a point $\theta^{(i+1)}$ with $\ell\mathopen{\left(\theta^{(i+1)}\right)}\mathclose{}\mathclose{}\geq\ell^{*}$ .

2.6 Identifying inestimable parameters

In some practical scenarios, the profile log-likelihood $\ell_{\mathrm{PL}}$ will never fall below the threshold $\ell^{*}$ , which means that the considered parameter is not estimable. In these cases, RVM may not converge. However, often it is possible to identify inestimable parameters by introducing a step size limit $\delta_{0}^{\max}$ . If the computed step exceeds the maximal step size, $\delta_{0}^{*}>\delta_{0}^{\max}$ and the current function value exceeds the threshold value, i.e. $\bar{\ell}\geq\ell^{*}$ , we set $\delta_{0}^{*}:=\delta_{0}^{\max}$ and compute the corresponding nuisance parameters. If the resulting log-likelihood $\ell\mathopen{\left(\theta^{(i)}+\boldsymbol{\delta}^{*}\right)}\mathclose{}\mathclose{}$ is not below the threshold $\ell^{*}$ , we let the algorithm terminate, raising a warning that the parameter $\theta_{0}$ is not estimable. If $\ell\mathopen{\left(\theta^{(i)}+\boldsymbol{\delta}^{*}\right)}\mathclose{}\mathclose{}<\ell^{*}$ , however, we cannot draw this conclusion and decrease the step size until the approximation is sufficiently close to the original function.

The criterion suggested above may not always suffice to identify inestimable parameters. For example, if the profile likelihood is constant but the nuisance parameters maximizing the likelihood change non-linearly, RVM may not halt. For this reason, and also to prevent unexpected convergence issues, it is advisable to introduce an iteration limit to the algorithm. If the iteration limit is exceeded, potential estimability issues issues may be investigated further.

2.7 Discontinuities

RVM is based on quadratic approximations and requires therefore that $\ell$ is differentiable twice. Nonetheless, discontinuities can occur due to numerical imprecision even if the likelihood function is continuous in theory. Though we may still be able to compute the gradient $g$ and the Hessian $\underline{\boldsymbol{\mathrm{H}}}$ in these cases, the resulting quadratic approximation will be inaccurate even if we take very small steps. Therefore, these discontinuities could hinder the algorithm from terminating.

To identify discontinuities, we define a minimal step size $\epsilon$ , which may depend on the gradient $\boldsymbol{g}$ . If we reject a step with small length $\left|\boldsymbol{\delta}^{*}\right|\leq\epsilon$ , we may conclude that $\ell$ is discontinuous at the current approximation point $\boldsymbol{\theta}^{(i)}$ . To determine the set $D$ of parameters responsible for the issue, we decompose $\boldsymbol{\delta}^{*}$ into its components. We initialize $D\leftarrow\emptyset$ and consider, with the $j^{\text{th}}$ unit vector $\boldsymbol{e}_{j}$ , the step $\boldsymbol{\delta}^{*\prime}:=\sum_{j\leq k,\,j\neq D}\boldsymbol{e}_{j}\delta_{j}^{*}$ until $\hat{\ell}^{\delta}\mathopen{\left(\boldsymbol{\delta}^{*\prime}\right)}\mathclose{}\mathclose{}\not\approx\ell^{\delta}\mathopen{\left(\boldsymbol{\delta}^{*\prime}\right)}\mathclose{}\mathclose{}$ for some $k<n$ . When we identify such a component, we add it to the set $D$ and continue the procedure.

If we find that $\ell$ is discontinuous in $\theta_{0}$ , we check whether the current nuisance parameters maximize the likelihood, i.e. $\ell$ is bounded above and $\widetilde{\boldsymbol{g}}$ is approximately $\boldsymbol{0}$ . If the nuisance parameters are not optimal, we hold $\theta_{0}$ constant and maximize $\ell$ with respect to the nuisance parameters. Otherwise, we conclude that the profile likelihood function has a jump discontinuity. In this case, our action depends on the current log-likelihood value $\bar{\ell}$ , the value of $\ell$ at the other end of the discontinuity, and the threshold $\ell^{*}$ .

•

If $\ell\mathopen{\left(\boldsymbol{\theta}^{(i)}+\boldsymbol{e}_{0}\delta_{0}^{*}\right)}\mathclose{}\mathclose{}\geq\ell^{*}$ or $\ell\mathopen{\left(\boldsymbol{\theta}^{(i)}\right)}\mathclose{}\mathclose{}<\ell\mathopen{\left(\boldsymbol{\theta}^{(i)}+\boldsymbol{e}_{0}\delta_{0}^{*}\right)}\mathclose{}\mathclose{}$ , we accept the step regardless of the undesirably large error.
•

If $\ell\mathopen{\left(\boldsymbol{\theta}^{(i)}+\boldsymbol{e}_{0}\delta_{0}^{*}\right)}\mathclose{}\mathclose{}<\ell^{*}$ and $\ell\mathopen{\left(\boldsymbol{\theta}^{(i)}\right)}\mathclose{}\mathclose{}\geq\ell^{*}$ , we terminate and return $\theta_{0}^{(i)}$ as the bound of the confidence interval.
•

Otherwise, we cannot make a sensible step and try to get back into the admissible region by conducting the binary search procedure we have described in section 2.5.4.

If $\ell$ is discontinuous in variables other than $\theta_{0}$ , we hold the variables constant whose change decreases the likelihood and repeat the iteration with a reduced system. After a given number of iterations, we release these parameters again, as $\theta$ may have left the point of discontinuity.

Since we may require that not only $\hat{\ell}$ but also its gradient are well approximated, a robust implementation of RVM should also handle potential gradient discontinuities. The nuisance parameters causing the issues can be identified analogously to the procedure outlined above. All components in which the gradient changes its sign from positive to negative should be held constant, as the likelihood appears to be in a local maximum in these components. The step in the remaining components may be accepted regardless of the large error.

2.8 Suitable parameters and distance measures

The efficiency of RVM depends highly on the distance measures and parameters applied when assessing the accuracy of the approximation and updating the search radius of the constrained optimization problems. If the precision measures are overly conservative, then many steps will be needed to find $\boldsymbol{\theta}^{*}$ . If the precision measure is too liberal, in turn, RVM may take detrimental steps and might not even converge.

We suggest the following procedure: (1) we always accept forward steps with $\delta_{0}^{*}\geq 0$ if the true likelihood is larger than the approximate likelihood, $\ell^{\delta}\mathopen{\left(\boldsymbol{\delta}^{*}\right)}\mathclose{}\mathclose{}\geq\hat{\ell}^{\delta}\mathopen{\left(\boldsymbol{\delta}^{*}\right)}\mathclose{}\mathclose{}$ . (2) If the approximate likelihood function is unbounded, we require that the likelihood increases $\ell^{\delta}\mathopen{\left(\boldsymbol{\delta}^{*}\right)}\mathclose{}\mathclose{}\geq\bar{\ell}$ . This requirement helps RVM to return quickly to a region in which the approximation is bounded. However, if the step size falls below the threshold used to detect discontinuities, we may relax this constraint so that less time must be spent to detect potential discontinuities. (3) If we are outside the admissible region, i.e. $\bar{\ell}<\ell^{*}$ , we enforce that we get closer to the target likelihood: $\left|\ell^{\delta}\mathopen{\left(\boldsymbol{\delta}^{*}\right)}\mathclose{}\mathclose{}-\ell^{*}\right|<\left|\bar{\ell}-\ell^{*}\right|$ . This reduces potential convergence issues. (4) We require that

\displaystyle\frac{\left|\hat{\ell}^{\delta}\mathopen{\left(\boldsymbol{\delta}^{*}\right)}\mathclose{}\mathclose{}-\ell^{\delta}\mathopen{\left(\boldsymbol{\delta}^{*}\right)}\mathclose{}\mathclose{}\right|}{\left|\bar{\ell}-\ell^{*}\right|}

\displaystyle\leq

\displaystyle\gamma

(17)

for a constant $\gamma$ . That is, the required precision depends on how close we are to the target. This facilitates fast convergence of the algorithm. The constant $\gamma\in\left(0,1\right)$ controls how strict the precision requirement is. In tests, $\gamma=\frac{1}{2}$ appeared to be a good choice. (5) If we are close to the target, $\ell^{\delta}\mathopen{\left(\boldsymbol{\delta}^{*}\right)}\mathclose{}\mathclose{}\approx\ell^{*}$ , we also require that the gradient estimate is precise:

\displaystyle\frac{\left|\frac{\partial\hat{\ell}^{\delta}}{\partial\tilde{\theta}}\mathopen{\left(\boldsymbol{\delta}^{*}\right)}\mathclose{}\mathclose{}-\frac{\partial\ell^{\delta}}{\partial\tilde{\theta}}\mathopen{\left(\boldsymbol{\delta}^{*}\right)}\mathclose{}\mathclose{}\right|}{\left|\boldsymbol{g}\right|}

\displaystyle\leq

\displaystyle\gamma.

(18)

This constraint helps us to get closer to a maximum in the nuisance parameters. Here, we use the $\mathcal{L}_{2}$ norm.

When we reject a step because the approximation is not sufficiently accurate, we adjust $\delta_{0}^{*}$ and solve the constrained maximization problem (9) requiring $\left|\widetilde{\boldsymbol{\delta}}\right|\leq r$ . To ensure that the resulting step does not push the log-likelihood below the target $\ell^{*}$ , the radius $r$ should not be decreased more strongly than $\delta_{0}^{*}$ . In tests, adjusting $r$ by a factor $\beta_{1}:=1.5$ whenever $\delta_{0}^{*}$ is adjusted by factor $\beta_{0}:=2$ appeared to be a good choice.

2.9 Confidence intervals for functions of parameters

Often, modelers are interested in confidence intervals for functions $f\mathopen{\left(\boldsymbol{\theta}\right)}\mathclose{}\mathclose{}$ of the parameters. A limitation of VM and VMR is that such confidence intervals cannot be computed directly with these algorithms. However, this problem can be solved approximately by considering a slightly changed likelihood function. We aim to find

\displaystyle\phi^{\max}

\displaystyle=

\displaystyle\underset{\boldsymbol{\theta}\in\Theta\,:\,\ell\mathopen{\left(\boldsymbol{\theta}\right)}\mathclose{}\mathclose{}\geq\ell^{*}}{\max\,}f\mathopen{\left(\boldsymbol{\theta}\right)}\mathclose{}\mathclose{}

(19)

or the respective minimum. Define

\displaystyle\check{\ell}\mathopen{\left(\phi,\boldsymbol{\theta}\right)}\mathclose{}\mathclose{}

\displaystyle:=

\displaystyle\ell\mathopen{\left(\boldsymbol{\theta}\right)}\mathclose{}\mathclose{}-\frac{1}{2}\left(\frac{f\mathopen{\left(\boldsymbol{\theta}\right)}\mathclose{}\mathclose{}-\phi}{\varepsilon}\right)^{2}\chi_{1,1-\alpha}^{2},

(20)

with a small constant $\varepsilon$ . Consider the altered maximization problem

\displaystyle\check{\phi}^{\max}

\displaystyle=

\displaystyle\underset{\boldsymbol{\theta}\in\Theta\,:\,\check{\ell}\mathopen{\left(\phi,\boldsymbol{\theta}\right)}\mathclose{}\mathclose{}\geq\ell^{*}}{\max\,}\phi,

(21)

which can be solved with VM or RVM.

We argue that a solution to (21) is an approximate solution to (19), whereby the error is bounded by $\varepsilon$ . Let $\mathopen{\left(\phi^{\max},\boldsymbol{\theta}^{*}\right)}\mathclose{}\mathclose{}$ be a solution to problem (19) and $\mathopen{\left(\check{\phi}^{\max},\check{\boldsymbol{\theta}}^{*}\right)}\mathclose{}\mathclose{}$ a solution to problem (21). Since $\phi^{\max}=f\mathopen{\left(\boldsymbol{\theta}^{*}\right)}\mathclose{}\mathclose{}$ , it is $\check{\ell}\mathopen{\left(\phi^{\max},\boldsymbol{\theta}^{*}\right)}\mathclose{}\mathclose{}=\ell\mathopen{\left(\boldsymbol{\theta}^{*}\right)}\mathclose{}\mathclose{}\geq\ell^{*}$ . Therefore, $\mathopen{\left(\phi^{\max},\boldsymbol{\theta}^{*}\right)}\mathclose{}\mathclose{}$ is also a feasible solution to (19), and it follows that $\check{\phi}^{\max}\geq\phi^{\max}$ . At the same time, $\check{\ell}\mathopen{\left(\phi,\boldsymbol{\theta}\right)}\mathclose{}\mathclose{}\leq\ell\mathopen{\left(\boldsymbol{\theta}\right)}\mathclose{}\mathclose{}$ , which implies that $f\mathopen{\left(\check{\boldsymbol{\theta}}^{*}\right)}\mathclose{}\mathclose{}\leq f\mathopen{\left(\boldsymbol{\theta}^{*}\right)}\mathclose{}\mathclose{}$ , since $\boldsymbol{\theta}^{*}$ maximizes $f$ over a domain larger than the feasibility domain of (21). In conclusion, $f\mathopen{\left(\check{\boldsymbol{\theta}}^{*}\right)}\mathclose{}\mathclose{}\leq f\mathopen{\left(\boldsymbol{\theta}^{*}\right)}\mathclose{}\mathclose{}=\phi^{\max}\leq\check{\phi}^{\max}$ . Lastly,

	$\displaystyle\ell^{*}$	$\displaystyle=$	$\displaystyle\ell\mathopen{\left(\hat{\boldsymbol{\theta}}\right)}\mathclose{}\mathclose{}-\frac{1}{2}\chi_{1,1-\alpha}^{2}\leavevmode\nobreak\ \leq\leavevmode\nobreak\ \check{\ell}\mathopen{\left(\check{\phi}^{\max},\check{\boldsymbol{\theta}}^{*}\right)}\mathclose{}\mathclose{}$		(22)
		$\displaystyle=$	$\displaystyle\ell\mathopen{\left(\check{\boldsymbol{\theta}}^{}\right)}\mathclose{}\mathclose{}-\frac{1}{2}\left(\frac{f\mathopen{\left(\check{\boldsymbol{\theta}}^{}\right)}\mathclose{}\mathclose{}-\check{\phi}^{\max}}{\varepsilon}\right)^{2}\chi_{1,1-\alpha}^{2}.$		(22)

Simplifying (22) yields $\left|f\mathopen{\left(\check{\boldsymbol{\theta}}^{*}\right)}\mathclose{}\mathclose{}-\check{\phi}^{\max}\right|\leq\varepsilon$ . Thus, $\left|\phi^{\max}-\check{\phi}^{\max}\right|\leq\varepsilon$ .

Though it is possible to bound the error by an arbitrarily small constant $\varepsilon$ in theory, care must be taken if the function $f\mathopen{\left(\boldsymbol{\theta}\right)}\mathclose{}\mathclose{}$ is not well-behaved, i.e. strongly nonlinear. In theses cases, overly small values for $\varepsilon$ may slow down convergence.

Note that the suggested procedure may seem to resemble the approach of Neale and Miller (1997), who also account for constraints by adding the squared error to the target function. However, unlike Neale and Miller (1997), the approach suggested above bounds the error in the confidence interval bound, not the error of the constraint. Furthermore, we do not square the log-likelihood function, which would worsen nonlinearities and could thus make optimization difficult. Therefore, our approach is less error-prone than the method by Neale and Miller (1997).

3 Tests

To compare the presented algorithm to existing methods, we applied RVM, the classic VM, and five other algorithms to benchmark problems and compared the robustness and performance of the approaches. Below we review the implemented methods. Then we introduce the benchmark problems, before we finally present the benchmark results.

3.1 Methods implemented for comparison

Besides RVM and VM, we implemented three methods that repeatedly evaluate the profile likelihood function and two methods that search for the confidence intervals directly. We implemented all methods in the programming language Python version 3.7 and made use of different optimization routines implemented or wrapped in the scientific computing library Scipy (Jones et al., 2001).

First, we implemented a grid search for the confidence bounds. The approach uses repeated Lagrangian constrained optimizations and may resemble the method by DiCiccio and Tibshirani (1991); however, rather than implementing the algorithm by DiCiccio and Tibshirani (1991), we applied the constrained optimization algorithm by Lalee et al. (1998), which is a trust-region approach and may thus be more robust than the method by DiCiccio and Tibshirani (1991). Furthermore, the algorithm by Lalee et al. (1998) was readily implemented in Scipy.

We conducted the grid search with a naive step size of $0.2$ , which we repeatedly reduced by factor $2$ close to the threshold log-likelihood $\ell^{*}$ until the desired precision was achieved. To account for unidentifiable parameters, we attempted one large step ( $1000$ units) if the algorithm did not terminate in the given iteration limit. We considered a parameter as unbounded if this step yielded a log-likelihood above the target value $\ell^{*}$ .

Second, we implemented a quadratic bisection method for root finding on $\ell_{\mathrm{PL}}$ (cf. Ren and Xia, 2019). Initially we chose a step size of $1$ . Afterwards, we computed the step of $\theta_{0}$ based on a quadratic interpolation between the MLE $\hat{\theta}_{0}$ , the maximal value of $\theta_{0}$ for which we found $\ell_{\mathrm{PL}}\mathopen{\left(\theta_{0}\right)}\mathclose{}\mathclose{}>\ell^{*}$ and the smallest identified value of $\theta_{0}$ with $\ell_{\mathrm{PL}}\mathopen{\left(\theta_{0}\right)}\mathclose{}\mathclose{}<\ell^{*}$ . Until a point $\theta_{0}$ with $\ell_{\mathrm{PL}}\mathopen{\left(\theta_{0}\right)}\mathclose{}\mathclose{}<\ell^{*}$ was identified, we interpolated $\ell_{\mathrm{PL}}$ between $\hat{\theta}_{0}$ and the two largest evaluated values $\theta_{0}$ . When only two points were available or the approximation of $\ell_{\mathrm{PL}}$ did not assume the target value, we introduced the additional constraint $\nicefrac{{d\ell_{\mathrm{PL}}}}{{d\theta_{0}}}=0$ . Using a quadratic rather than a linear interpolation for bisection has the advantage that the algorithm converges faster if the profile log-likelihood function is convex or quadratic. To evaluate $\ell_{\mathrm{PL}}$ , we applied sequential least squares programming (Kraft, 1988), which is the default method for constrained optimization in Scipy.

Third, we implemented a binary search with an initial step of $1$ . Until a value $\theta_{0}$ with $\ell_{\mathrm{PL}}\mathopen{\left(\theta_{0}\right)}\mathclose{}\mathclose{}<\ell^{*}$ was found, we increased $\theta_{0}$ by factor $10$ . This preserves the logarithmic runtime of the algorithm if the problem has a solution. To broaden the range of tested internal optimization routines, we used a different method to evaluate $\ell_{\mathrm{PL}}$ than in the bisection method: we fixed $\theta_{0}$ at the desired value and performed an unconstrained optimization on the nuisance parameters. Here, we used the quasi-Newton method by Broyden, Fletcher, Goldfarb, and Shanno (BFGS; see Nocedal and Wright, 2006, pp. 136).

To test methods that search for the confidence interval end points directly, we solved problem (4) with sequential least squares programming (Kraft, 1988). Furthermore, we implemented the approximate method by Neale and Miller (1997). They transform the constrained maximization problem (9) to an unconstrained problem by considering the sum of the parameter of interest $\theta_{0}$ and the squared error between the target $\ell^{*}$ and the log-likelihood. Minimization of this target function yields a point in which the target log-likelihood is reached approximately and the parameter of interest is minimal. Again, we used the method BFGS for minimization (see above).

Finally, we implemented Wald’s method to assess the need to apply any profile likelihood method.

3.2 Benchmark problem

To investigate the performances of the implemented methods, we applied the algorithms to a benchmark problem with variable parameter number and data set size. We considered a logistic regression problem with $n$ count data covariates $c_{ij}$ , $j\in\left\{1,\dots,n\right\}$ for each data point $i\in\left\{1,\dots,N\right\}$ . We assumed that the impact of the covariates levels off at high values and considered therefore the transformed covariates $c_{ij}^{\alpha_{j}}$ with $\alpha\in\left(0,1\right)$ . This is not only reasonable in many real world problems but also makes likelihood maximization a computationally challenging problem if not enough data are available to achieve asymptotic normality of the MLE. Hence, this scenario gives insights into the performance of the implemented methods in challenging realistic problems. The benchmark model’s probability mass function for a data point $X_{i}$ was thus given by

\displaystyle\mathbb{P}\mathopen{\left(X_{i}=1\right)}\mathclose{}

\displaystyle=

\displaystyle\left(1+\exp\mathopen{\left(-\beta_{0}-\underset{j}{\sum}\beta_{j}c_{ij}^{\alpha_{j}}\right)}\mathclose{}\mathclose{}\right)^{-1}

(23)

and $\mathbb{P}\mathopen{\left(X_{i}=0\right)}\mathclose{}=1-\mathbb{P}\mathopen{\left(X_{i}=1\right)}\mathclose{}$ .

We drew the covariate values randomly from a negative binomial distribution with mean $5$ and variance $10$ . The negative binomial distribution is commonly used to model count data (Gardner et al., 1995) and thus suited to represent count covariates. To simulate the common case that covariates are correlated, we furthermore drew the value for every other covariate from a binomial distribution with the respective preceding covariate as count parameter. That is, for uneven $j$ ,

\displaystyle c_{i,j+1}

\displaystyle\sim

\displaystyle\mathrm{Binomial}\mathopen{\left(c_{i,j},p\right)}\mathclose{}\mathclose{},

with $p=0.2$ in our simulations. To avoid numerical problems arising when covariates with value $0$ are raised to the power $0$ , we added a small positive perturbation to the count values. That way, we achieved that $0^{0}$ was defined to be $1$ . We chose the parameters $\alpha_{j}$ and $\beta_{j}$ so that the data were balanced, i.e. the frequency of $0$ s and $1$ s was approximately even. Refer to Supplementary Appendix B for the parameter values we used.

3.3 Test procedure

To test the algorithms in a broad range of scenarios and assess how their performance is impacted by model characteristics, we considered a model with $1$ covariate ( $3$ parameters), a model with $5$ covariates ( $11$ parameters), and a generalized linear model (GLM) with $10$ covariates, in which the powers $\alpha_{j}$ were set to $1$ ( $11$ parameters). Furthermore, we varied the sizes of the simulated data sets, ranging between $N=500$ and $N=10000$ for the models with transformed covariates and $N=50$ and $N=1000$ for the GLM. In Figure 3, we depict the impact of $N$ on the shape of the likelihood function and thus the difficulty of the problem.

For each considered set of parameters, we generated $200$ realizations of covariates and training data from the model described in the previous section. We determined the maximum likelihood estimator by maximizing the log-likelihood with the method BFGS and refined the estimate with an exact trust region optimizer (Conn et al., 2000). Then, we applied each of the implemented algorithms to each data set and determined the algorithms’ success rates and efficiencies.

As the likelihood functions of the tested models decrease drastically at $\alpha_{j}=0$ , potentially causing some algorithms to fail, we constrained the $\alpha_{j}$ to non-negative values. Tho that end, we considered transformed parameters $\alpha_{j}^{\prime}:=\ln\mathopen{\left(\exp\mathopen{\left(\alpha_{j}\right)}\mathclose{}\mathclose{}-1\right)}\mathclose{}\mathclose{}$ . Such transformations are reasonable whenever the parameter range is naturally constrained from a modeling perspective. Nonetheless, we evaluated the results of the tested algorithms based on the back-transformed parameters $\alpha_{j}$ .

We measured the algorithms’ success based on their ability to solve problem (4) rather than their capability to determine the true confidence intervals for the parameters. Though profile likelihood confidence intervals are usually highly accurate, they rely on the limiting distribution of the likelihood ratio statistic. Therefore, algorithms could fail to solve optimization problem (4) but, by coincidence, return a result close to the true confidence interval bound and vice versa. To exclude such effects and circumvent the high computational effort required to determine highly precise confidence intervals with sampling methods, we determined the “true” confidence interval bound by choosing the widest confidence interval bound obtained by either of the tested methods provided it was admissible, i.e. $\ell\mathopen{\left(\theta^{\max}\right)}\mathclose{}\mathclose{}\geq\ell^{*}$ up to a permissible error of $0.001$ .

We considered an algorithm successful if (1) the returned result was within a $\pm 5\%$ range of the true confidence interval bound or had an error below $0.001$ , and (2) the algorithm reported convergence. That is, to be deemed successful, an algorithm had to both return the correct result and also claim that it found the correct solution. The latter constraint ensures that if none of the algorithms converges successfully, even the one with the best result is not considered successful.

As many of the tested methods rely on general optimizers without specific routines to identify situations with divergent solutions, we considered parameters with confidence interval bounds exceeding $\left[-1000,1000\right]$ in the transformed parameter space as unbounded. Consequently, all algorithms returning a larger confidence interval were considered successful.

We limited the runtime of all methods except the pre-implemented optimizers by introducing a step limit of $200$ . If convergence was not reached within this number of steps, the algorithms were viewed unsuccessful except for the case with inestimable parameters.

To test whether some methods tend return misleading results, we determined the mean absolute error between the returned and the true confidence interval bounds when algorithms reported success. As this quantity can be dominated by outliers, we also determined the mean of all errors below $10$ and the frequency of errors beyond $10$ .

We measured the computational speed of the different methods by recording the number of function evaluations required until termination. This provides us with precise benchmark results independent of hardware and implementation details. To display a potential trade-off between robustness (success rate) and speed (number of function evaluations), we did not consider cases in which convergence was not reached. That way, internal stopping criteria did not affect the results.

The specific advantage of some optimization algorithms is in not requiring knowledge of the Hessian matrix. As computing the Hessian is necessary for RVM and may reduce the algorithm’s performance compared to other methods, we included the number of function evaluations required to determine the Hessian and the gradient in the recorded count of function evaluations. We computed gradients and Hessian matrices with a complex step method (Lai et al., 2005) implemented in the Python package numdifftools (Brodtkorb and D’Errico, 2019).

3.4 Results

To get an impression of how RVM acts in practice, we plotted the trajectory of RVM along with ancillary function evaluations in Figure 3. It is visible that the algorithm stays on the “ridge” of the likelihood surface even if the admissible region is strongly curved. This makes RVM efficient.

In fact, for all considered quality measures, RVM yielded good and often the best results compared to the alternative methods (see Figure 4). In all considered scenarios, RVM was the algorithm with the highest success rate, which never fell below $90\%$ (second best: binary search, $52\%$ ). In scenarios with small data sets, the success rate of RVM was up to $37$ percent points higher than any other method. At the same time, RVM was among the fastest algorithms. In scenarios with large data sets, RVM often converged within three iterations. Furthermore, RVM was quick in the $3$ parameter model, in which the Hessian matrix is easy to compute. In the scenario with transformed covariates and $11$ parameters, RVM required about three times as many likelihood evaluations as the fastest algorithm but had a more than $56\%$ higher success rate. The error in the results returned by RVM was consistently low compared to other methods. The proportion of large errors was always below $1\%$ , and the mean error excluding these outliers never exceeded $0.05$ .

The algorithms that require repeated evaluations of the profile likelihood function performed second best in terms of the success rate. Except for the GLM with $50$ data points, the binary search, the grid search, and the bisection method consistently had success rates above $70\%$ , whereby the success rate increased with the size of the considered data set. However, these algorithms also required more function evaluations than other methods. In fact, the grid search was more than $5$ times slower than any other algorithm. The binary search was slightly less efficient than the bisection method, which exploits the approximately quadratic shape of the profile likelihood function if many data are available. In scenarios with large data sets, the bisection method was among the most efficient algorithms. The errors of the three root finding methods decreased the more data became available to fit the models. However, while the binary search had a consistently low error, both the grid search and the bisection method were more prone to large errors than all other tested methods.

The algorithms developed from the constrained maximization perspective (the method by Neale and Miller and direct constrained maximization) had success rates ranging between $45\%$ and $85\%$ in problems with transformed covariates. In the GLM scenario, the success rate was smaller in with $50$ data points and higher with more data. The constrained maximization procedure was slightly more successful than the method by Neale and Miller (1997). Both methods required relatively few function evaluations, whereby direct constrained maximization performed better. Both methods were less prone to large errors than the grid search and the bisection method. However, the outlier-reduced error was on average more than twice as large than with any other method except RVM (Neale and Miller: $0.16$ , constrained maximum $0.09$ , RVM: $0.07$ ).

The success of the algorithm VM depended highly on the properties of the likelihood function. In scenarios with few data and transformed covariates, VM had very low success rates (as low as $10\%$ ). When more data were added, VM became as successful as the method by Neale and Miller and direct constrained maximization. Thereby, VM was highly efficient whenever results were obtained successfully. Similar to the success rate, the mean error of VM decreased strongly as more data were considered.

Wald’s method had very low success rates and large errors except for the GLM with large data sets. In the models with transformed covariates, Wald’s method never had a success rate above $17\%$ .

4 Discussion

We presented an algorithm that determines the end points of profile likelihood confidence intervals both of parameters and functions of parameters with high robustness and efficiency. We tested the algorithm in scenarios varying in parameter number, size of the data set, and complexity of the likelihood function. In the tests, our algorithm RVM was more robust than any other considered method. At the same time, RVM was among the fastest algorithms in most scenarios. This is remarkable, because there is typically a trade-off between robustness and computational speed of optimization algorithms. RVM achieves this result by exploiting the approximately quadratic form of the log-likelihood surface in “benign” cases while maintaining high robustness with the trust-region approach. Consequently, RVM naturally extends the algorithm VM (Venzon and Moolgavkar, 1988), which appeared to be highly efficient but lacking robustness in our tests.

Surprisingly, RVM turned out to be even more robust than methods based on repeated evaluations of the profile likelihood. For the bisection method and the binary search, this may be due to failures of internal optimization routines, as initial guesses far from the solution can hinder accurate convergence. The grid search method, in turn, was often aborted due to the limited step size, which precluded the method from identifying confidence bounds farther than $40$ units away from the respective MLE. This, however, does not explain the comparatively high error in the results of the grid search, as only successful runs were considered. We therefore hypothesize that internal optimization issues were responsible for some failures.

As expected, the algorithms that searched for the confidence interval end points directly were more efficient but less robust than algorithms that repeatedly evaluate the profile likelihood. Remarkably, a “standard” algorithm for constrained optimization performed slightly better than an unconstrained optimizer operating on the modified target function suggested by Neale and Miller (1997). This indicates that the approximation introduced by Neale and Miller (1997) might not be necessary and even of disadvantage.

All methods implemented in this study (except RVM and VM) rely on general optimizers. Consequently, the performance of these methods depends on the chosen optimizers both in terms of computational speed and robustness. Careful adjustment of optimization parameters might make some of the implemented algorithms more efficient and thus more competitive in benchmark tests. Though we attempted to reduce potential bias by applying a variety of different methods, an exhaustive test of optimization routines was beyond the scope of this study. Nonetheless, the consistently good performance of RVM throughout our benchmark tests suggests that RVM is a good choice in many applications.

Though RVM performed well in our tests, there are instances in which the algorithm is not applicable or sufficiently efficient. This are scenarios in which (1) the log-likelihood cannot be computed directly, (2) the Hessian matrix of the log-likelihood function is hard to compute, (3) the dimension of the parameter space is very large, or (4) there are multiple points in the parameter space in which problem (4) is solved locally. Below, we briefly discuss each of these limitations.

(1) In hierarchical models, the likelihood function may not be known. As RVM needs to evaluate the log-likelihood, its gradient, and its Hessian matrix, the algorithm is not applicable in these instances. Consequently, sampling based methods, such as parametric bootstrap (Efron, 1981), Monte Carlo methods (Buckland, 1984), or data cloning (Ponciano et al., 2009) may then be the only feasible method to determine confidence intervals.

(2) Especially in problems with a large parameter space, it is computationally expensive to compute the Hessian matrix with finite difference methods, as the number of function calls increases in quadratic order with the length of the parameter vector. Though alternative differentiation methods, such as analytical or automatic differentiation (Griewank, 1989), are often applicable, there may be some instances in which finite difference methods are the only feasible alternative. In these scenarios, optimization routines that do not require knowledge of the Hessian matrix may be faster than RVM. Note, however, that the higher computational speed may come with decreased robustness, and sampling based methods might be the only remaining option if application of RVM is infeasible.

(3) If the parameter space has a very high dimension (exceeding $1000$ ), internal routines, such as inversion of the Hessian matrix, may become the dominant factor determining the speed of RVM. Though it may be possible in future to make RVM more efficient, sampling based methods or algorithms that do not use the Hessian matrix may be better suited in these scenarios.

(4) RVM as well as all other methods implemented in this study are local optimization algorithms. Therefore, the algorithms may converge to wrong results if maximization problem (4) has multiple local solutions. This is in particular the case if the confidence set $\left\{\theta_{0}:\ell_{\mathrm{PL}}\mathopen{\left(\theta_{0}\right)}\mathclose{}\mathclose{}\geq\ell^{*}\right\}$ is not connected and thus no interval. RVM reduces the issue of local extreme points by choosing steps carefully and ensuring that the point of convergence is indeed a maximum. This contrasts with VM, which could converge to the wrong confidence interval end point (e.g. maximum instead of minimum) if the initial guesses are not chosen with care. Nonetheless, stochastic optimization routines, such as genetic algorithms (Akrami et al., 2010), and sampling methods may be better suited if a local search is insufficient.

Despite these caveats, RVM is applicable to a broad class of systems. Especially when inestimable parameters are present, commonly used methods such as VM or grid search techniques can break down or be highly inefficient. Furthermore, optimization failures are commonly observed if not enough data are available to reach the asymptotic properties of the MLE (Ren and Xia, 2019). RVM is a particularly valuable tool in these instances.

5 Conclusion

We developed and presented an algorithm to determine profile likelihood confidence intervals. In contrast to many earlier methods, our algorithm is robust in scenarios in which lack of data or a complicated likelihood function make it difficult to find the bounds of profile likelihood confidence intervals. In particular, our methods is applicable in instances in which parameters are not estimable and in cases in which the likelihood function has strong nonlinearities. At the same time, our method efficiently exploits the asymptotic properties of the maximum likelihood estimator if enough data are available.

We tested our method on benchmark problems with different difficulty. Throughout our simulations, our method was the most robust while also being among the fastest algorithms. We therefore believe that RVM can be helpful to researchers and modelers across disciplines.

References

Akrami et al. (2010) Akrami Y, Scott P, Edsjö J, Conrad J, Bergström L (2010) A profile likelihood analysis of the constrained MSSM with genetic algorithms. Journal of High Energy Physics 2010(4):57, DOI 10.1007/JHEP04(2010)057, URL http://link.springer.com/10.1007/JHEP04(2010)057
Brodtkorb and D’Errico (2019) Brodtkorb PA, D’Errico J (2019) numdifftools 0.9.39. URL https://github.com/pbrod/numdifftools, retrieved from https://github.com/pbrod/numdifftools
Buckland (1984) Buckland ST (1984) Monte Carlo confidence intervals. Biometrics 40(3):811, DOI 10.2307/2530926, URL https://www.jstor.org/stable/2530926?origin=crossref
Conn et al. (2000) Conn AR, Gould NIM, Toint PL (2000) Trust-region methods. MPS-SIAM series on optimization, Society for Industrial and Applied Mathematics, Philadelphia, PA
Cook and Weisberg (1990) Cook RD, Weisberg S (1990) Confidence curves in nonlinear regression. Journal of the American Statistical Association 85(410):544–551, DOI 10.1080/01621459.1990.10476233, URL http://www.tandfonline.com/doi/abs/10.1080/01621459.1990.10476233
Cox and Snell (1989) Cox DR, Snell EJ (1989) Analysis of binary data, 2nd edn. No. 32 in Monographs on Statistics and Applied Probability, Routledge, Boca Raton, FL, DOI 10.1201/9781315137391, URL https://www.taylorfrancis.com/books/9781315137391
DiCiccio and Tibshirani (1991) DiCiccio TJ, Tibshirani R (1991) On the implementation of profile likelihood methods. Tech. rep., University of Toronto, Department of Statistics, URL http://www.utstat.utoronto.ca/WSFiles/technicalreports/9107.pdf
Efron (1981) Efron B (1981) Nonparametric standard errors and confidence intervals. Canadian Journal of Statistics 9(2):139–158, DOI 10.2307/3314608, URL http://doi.wiley.com/10.2307/3314608
Eubank and Webster (1985) Eubank RL, Webster JT (1985) The singular-value decomposition as a tool for solving estimability problems. The American Statistician 39(1):64, DOI 10.2307/2683912, URL https://www.jstor.org/stable/2683912?origin=crossref
Gardner et al. (1995) Gardner W, Mulvey EP, Shaw EC (1995) Regression analyses of counts and rates: Poisson, overdispersed Poisson, and negative binomial models. Psychological Bulletin 118(3):392–404, DOI 10.1037/0033-2909.118.3.392, URL http://doi.apa.org/getdoi.cfm?doi=10.1037/0033-2909.118.3.392
Gimenez et al. (2005) Gimenez O, Choquet R, Lamor L, Scofield P, Fletcher D, Lebreton JD, Pradel R (2005) Efficient profile-likelihood confidence intervals for capture-recapture models. Journal of Agricultural, Biological, and Environmental Statistics 10(2):184–196, DOI 10.1198/108571105X46462, URL http://link.springer.com/10.1198/108571105X46462
Golub and Kahan (1965) Golub G, Kahan W (1965) Calculating the singular values and pseudo-inverse of a matrix. Journal of the Society for Industrial and Applied Mathematics Series B Numerical Analysis 2(2):205–224, DOI 10.1137/0702016, URL http://epubs.siam.org/doi/10.1137/0702016
Griewank (1989) Griewank A (1989) On automatic differentiation. Mathematical Programming: recent developments and applications 6(6):83–107, URL https://www.researchgate.net/profile/Andreas_Griewank/publication/2703247_On_Automatic_Differentiation/links/0c96052529013aed9e000000/On-Automatic-Differentiation.pdf
Jones et al. (2001) Jones E, Oliphant T, Peterson P (2001) SciPy: open source scientific tools for Python. URL https://scipy.org/, retrieved from https://scipy.org/
Kraft (1988) Kraft D (1988) A software package for sequential quadratic programming. Tech. Rep. DFVLR-FB 88-28, DLR German Aerospace Center – Institute for Flight Mechanics, Köln, Germany
Lai et al. (2005) Lai KL, Crassidis J, Cheng Y, Kim J (2005) New complex-step derivative approximations with application to second-order kalman filtering. In: AIAA Guidance, Navigation, and Control Conference and Exhibit, American Institute of Aeronautics and Astronautics, San Francisco, California, DOI 10.2514/6.2005-5944, URL http://arc.aiaa.org/doi/10.2514/6.2005-5944
Lalee et al. (1998) Lalee M, Nocedal J, Plantenga T (1998) On the implementation of an algorithm for large-scale equality constrained optimization. SIAM Journal on Optimization 8(3):682–706, DOI 10.1137/S1052623493262993, URL http://epubs.siam.org/doi/10.1137/S1052623493262993
Moerbeek et al. (2004) Moerbeek M, Piersma AH, Slob W (2004) A comparison of three methods for calculating confidence intervals for the benchmark dose. Risk Analysis 24(1):31–40, DOI 10.1111/j.0272-4332.2004.00409.x, URL https://onlinelibrary.wiley.com/doi/abs/10.1111/j.0272-4332.2004.00409.x
Neale and Miller (1997) Neale MC, Miller MB (1997) The use of likelihood-based confidence intervals in genetic models. Behavior Genetics 27(2):113–120, DOI 10.1023/A:1025681223921, URL http://link.springer.com/10.1023/A:1025681223921
Nocedal and Wright (2006) Nocedal J, Wright SJ (2006) Numerical optimization, 2nd edn. Springer series in operations research, Springer, New York
Penrose (1955) Penrose R (1955) A generalized inverse for matrices. Mathematical Proceedings of the Cambridge Philosophical Society 51(3):406–413, DOI 10.1017/S0305004100030401, URL https://www.cambridge.org/core/product/identifier/S0305004100030401/type/journal_article
Ponciano et al. (2009) Ponciano JM, Taper ML, Dennis B, Lele SR (2009) Hierarchical models in ecology: confidence intervals, hypothesis testing, and model selection using data cloning. Ecology 90(2):356–362, DOI 10.1890/08-0967.1, URL http://doi.wiley.com/10.1890/08-0967.1
Rao (1967) Rao CR (1967) Calculus of generalized inverse of matrices. Part 1: general theory. Sankhya Ser A 29:317–342, URL https://www.jstor.org/stable/25049483
Raue et al. (2009) Raue A, Kreutz C, Maiwald T, Bachmann J, Schilling M, Klingmüller U, Timmer J (2009) Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood. Bioinformatics 25(15):1923–1929, DOI 10.1093/bioinformatics/btp358, URL https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btp358
Ren and Xia (2019) Ren X, Xia J (2019) An algorithm for computing profile likelihood based pointwise confidence intervals for nonlinear dose-response models. PLOS ONE 14(1):e0210953, DOI 10.1371/journal.pone.0210953, URL http://dx.plos.org/10.1371/journal.pone.0210953
Stryhn and Christensen (2003) Stryhn H, Christensen J (2003) Confidence intervals by the profile likelihood method, with applications in veterinary epidemiology. Vina del Mar, Chile, URL https://pdfs.semanticscholar.org/716e/5366865e409a430bce73d26c8770e6bbbeab.pdf
Venzon and Moolgavkar (1988) Venzon DJ, Moolgavkar SH (1988) A method for computing profile-likelihood-based confidence intervals. Applied Statistics 37(1):87, DOI 10.2307/2347496, URL http://www.jstor.org/stable/10.2307/2347496?origin=crossref
Viallefont et al. (1998) Viallefont A, Lebreton JD, Reboulet AM, Gory G (1998) Parameter identifiability and model selection in capture-recapture models: a numerical approach. Biometrical Journal: Journal of Mathematical Methods in Biosciences 40(3):313–325, DOI 10.1002/(SICI)1521-4036(199807)40:3¡313::AID-BIMJ313¿3.0.CO;2-2
Wu and Neale (2012) Wu H, Neale MC (2012) Adjusted confidence intervals for a bounded parameter. Behavior Genetics 42(6):886–898, DOI 10.1007/s10519-012-9560-z, URL http://link.springer.com/10.1007/s10519-012-9560-z

	A	B	C
Success rate
Mean error
Function evaluations
	$3$ parameters, transformed covariates	$11$ parameters, transformed covariates	$11$ parameters, GLM