Distributional regression models for Extended Generalized Pareto distributions

Noémie Le Carrer¹¹1Corresponding author: [email protected] Lab-STICC, IMT Atlantique Brest, France Carlo Gaetan Dipartimento di Scienze Ambientali, Informatica e Statistica
Università Ca’ Foscari Venezia, Venice, Italy

Abstract

The Extended Generalized Pareto Distribution (EGPD) (Naveau et al. 2016) is a family of distribution that has been introduced to model the full range of a positive random variable but with the lower and the upper tails distributed according the peaks-over-threshold methodology. The aim of this article is to augment the scope of application of EGPD allowing the analyst to incorporate the effect of covariates on the model. In particular we introduce a specification where the parameters of EGPD can be modeled as additive functions of the covariates, e.g. space or time. As a related product we provide an add-on code written in R that it is flexible enough to implement the EGPD in a generic way, allowing to introduce new parametric forms. We show the potential of our add-on on the modeling of hourly rainfalls over the North-West region of France and discuss modeling strategies.

Keywords: Extreme events, Statistical modeling, Statistical software, Rainfall simulation, Geostatistics

1 Introduction

The Peak Over Threshold (POT) method, described in details by Coles (2001, Chapter 4), has been used in many fields to identify extremal events such as loads, wave heights, floods, wind velocities, etc. This method provides a model for independent exceedances over a high threshold. According the Pickands-Balkema-de Haan theorem (Balkema and De Haan, 1974; Pickands III, 1975) for a threshold $u$ sufficiently high, the unnormalized conditional tail distribution of a random variable $Y$ can be approximated as

\Pr(Y-u\leq y|Y>u)\approx H_{\xi}(y/\psi_{u}),\qquad\text{for}\,y\geq 0\quad\text{and}\quad\psi_{u}>0

where

H_{\xi}(z)=\begin{cases}1-{(1+\xi y)}_{+}^{-1/\xi}&\text{for $\xi\neq 0$}\\ 1-\exp\{-y\}&\text{for $\xi=0$}\end{cases}

(1)

and $a_{+}=\max(a,0)$ .

The limiting distribution (1) is referred to as the Generalized Pareto distribution (GPD) and the parameters $\xi$ and $\psi_{u}$ are respectively the shape and the scale parameters. The scale parameter depend on the threshold $u$ , but in remainder of the paper the subscript $u$ is dropped in the notation.

Modeling using GPD has become very popular since the paper of Davison and Smith (1990) where estimation and model-checking procedures for univariate and regression data are developed. Such procedure have then been extended by allowing the GPD parameters to be represented as smooth functions of covariates (Chavez-Demoulin and Davison, 2005).

Since then several R packages have been developed to fit GPDs and contributed to the Comprehensive R Archive Network (CRAN) in the field of extreme value analysis (R Core Team, 2022). Among them, ismev (Heffernan et al., 2018), providing tools for the extreme value analyses presented in Coles (2001), evd (Stephenson, 2002), evir (Pfaff and McNeil, 2018), extRemes (Gilleland and Katz, 2016) and mev (Belzile et al., 2020), have offered various approaches for fitting univariate and multivariate GPDs. Gilleland et al. (2013) and, more recently, Belzile et al. (2022) review them while the Dutang and Jaunatre (2022)’s CRAN Task View provides an up-to-date list of the R packages in this field. In particular, some of them offer regression-based models for extreme distributions, where GPD parameters are allowed to vary with covariates. Beyond the linear forms introduced in the packages ismev and evd, a few R packages allow to fit GPDs with parameters following additive forms i.e. when the parameters or their transformations can be modelled as additive smooth functions of the covariates. In the sequel we term this form a Generalized Additive Model (GAM) (Hastie and Tibshirani, 1990; Wood, 2017). Thus VGAM (Yee and Stephenson, 2007), gamlss (Rigby and Stasinopoulos, 2005) and evgam (Youngman, 2020) provide this functionality for GPDs, as reviewed among other alternatives in Youngman (2020).

It is clear that GPD needs the determination of a threshold which is neither too high. The choice of an appropriate threshold is challenging because the stability plots and the mean residual life (Coles, 2001, Chapter 4) plot do not give an obvious value of a suitable threshold. The problem is tackled in a regression setting by Eastoe and Tawn (2009). However the question of the threshold choice and then simulation of the whole spectrum of the random variable considered remains problematic.

Forming mixtures of Exponential distributions (Wilks, 1998; Keller et al., 2015) or conditional Gamma distributions (Kenabatho et al., 2012; Chen et al., 2014) remains a competitive alternative e.g. for heavy-tailed precipitation modeling and simulation. Approaches of this type are implemented in the package GSM, that uses a Bayesian approach for estimating a mixture of Gamma distributions in which the mixing occurs over the shape parameter and the procedure only requires to specify a prior distribution for a single parameter. The package evmix fits an extreme value mixture model with normal behavior for the bulk distribution up to the threshold and conditional GPD above the threshold. Options for profile likelihood estimation for threshold and fixed threshold approach are provided. Finally, distributionsrd provides the double Pareto-lognormal distribution.

Although interesting, in practice these packages do not allow to fit nonlinear models as flexible as the GAM options allowed for the GPD alone. Besides, the mixture approaches tend to quickly inflate the number of parameters and the inference of the latter may be problematical operationally (Naveau et al., 2016).

Papastathopoulos and Tawn (2013) opened another route in the field of heavy-tail distribution modeling by replacing the uniform draws in the inverse cumulative distribution function (CDF) transform to generate simulations of the GPD, by draws from a richer distribution e.g. type Beta distribution. Using this approach, Naveau et al. (2016) presented a concise parametric formulation (later extended to a semi-parametric formulation (Tencaliec et al., 2020)) for simulating the whole range of a positive random variable (namely precipitations), that makes use of GPD to model not only the high quantiles but the smallest ones as well and bypass the issue of threshold selection. The main idea rooted from composing a cumulative distribution function with the GPD (see section 2 for more details). Very recently, a similar idea was taken up by Stein (2021a, b).

In the sequel, we refer to the extension of GPD in the sense of Naveau et al. (2016) as Extended Generalized Pareto distribution (EGPD). An implementation of EGPD with four parametric families, together with the fitting procedure, can be found in the R package mev.

The aim of this work is to extend the applicability of the EGPD in the presence of covariates that can modify the parameters of the distribution. The extension takes place in the same spirit as GAM. We also want to present an add-on to the gamlss package that allows to make inference on the parameter.

We have chosen the gamlss package because it provides univariate distributional regression models, where parameters of the assumed distribution for the response can be modeled as additive functions of the covariates. It is flexible enough to allow the encoding of novel families of distribution, as long as their density, cumulative, quantile and random generative function and, optionally, first and second (cross-) derivatives of the likelihood can be computed. We seize this potential to implement the parametric EGPD family in a generic way, allowing to introduce new parametric forms, up to 4 parameters so far (due to limitations of gamlss), satisfying the requirement of the EGPD family.

The remaining of this article will first introduce the EGPD and the distribution regression model (Section 2), followed by the key commands to use the EGPD within the gamlss package and our add-on in Section 3. Section 4 implements them to perform an analysis of hourly positive rainfalls on the North-west of France and discuss specific advantages and limitations of distributional regression modeling and its implementation in R, while Section 5 concludes the paper.

2 A distributional regression model

2.1 Extended-GPD

The EGPD (Naveau et al., 2016) accounts simultaneously and in a parsimonious way for both extreme and non extreme values of a random variable $Y$ . Its cumulative distribution function (CDF) reads:

F(y)=G\{H_{\xi}(y/\psi)\}\>,\>\>\text{for all}\quad y>0.

(2)

The function $G(\cdot)$ is a continuous CDF on the unit interval. It is constrained by several assumptions: 1) we need to ensure that the upper-tail behavior of $F$ follows a GPD; 2) the low values of $Y$ , seen as the upper tail of $-Y$ when $y\rightarrow 0$ , follows a GPD as well.

This leads to the following consequences for any candidate CDF $G(\cdot)$ :

a.

the upper-tail behavior of $1-F(y)$ is equivalent to the original GPD tail used to build $F(y)$ ;
b.

when $y\rightarrow 0$ , $\displaystyle F(y)\sim\frac{c}{\psi^{s}}y^{s}$ where $\displaystyle c=\lim_{y\rightarrow 0}\frac{G(y)}{y^{s}}$ is positive and finite for some real $s$ .

Four parametric families, $G(\cdot,\kappa)$ , $\kappa=(\kappa_{1},\ldots,\kappa_{k})^{T}$ , that satisfies the above-mentioned constraints are proposed in Naveau et al. (2016).

1.

$G(u;\kappa)=u^{\kappa_{1}}$ , $\kappa_{1}>0$
2.

$G(u;\kappa)=\kappa_{3}u^{\kappa_{1}}+(1-\kappa_{3})u^{\kappa_{2}}$ , $\kappa_{2}\geq\kappa_{1}>0$ and $\kappa_{3}\in[0,1]$
3.

$G(u;\kappa)=1-Q_{\kappa_{1}}\{(1-u)^{\kappa_{1}}\}$ , $\kappa_{1}>0$ where $Q_{\kappa_{1}}$ is the CDF of a Beta random variable with parameters $1/\kappa_{1}$ and $2$ , that is:

$Q_{\kappa_{1}}(u)=\frac{1+\kappa_{1}}{\kappa_{1}}u^{1/\kappa_{1}}\Big{(}1-\frac{u}{1+\kappa_{1}}\Big{)}$
4.

$G(u;\kappa)=[1-Q_{\kappa_{1}}\{(1-u)^{\kappa_{1}}\}]^{\kappa_{2}/2}$ , $\kappa_{1},\kappa_{2}>0$

In model 1, $\xi$ controls the rate of upper tail decay, $\kappa_{1}$ the shape of the lower tail and $\psi$ is a scale parameter. $\kappa_{1}=1$ allows us to recover the GPD model, while varying it gives flexibility in the description of the lower tail.

Model 2, as a mixture of power laws, allows to increase the flexibility provided by model 1. Therein, $\kappa_{1}$ controls the lower tail behavior while $\kappa_{2}$ modifies the shape of the density in its central part.

Model 3 is connected to the work of Falk et al. (2010). $\xi$ keeps controlling the upper extreme tail, $\kappa_{1}$ the central part of the distribution in a threshold tuning way. However, this model imposes a behavior type $\chi^{2}$ for the lower tail (null density at $0$ ) instead of allowing the data to constrain it.

Hence Model 4, which adds an extra parameter $\kappa_{2}$ to circumvent this limitation. In this last model, $\kappa_{2},\kappa_{1},\xi$ control respectively the lower, moderate and upper parts of the distribution. By construction, the lower and upper tails are GPD of shape parameters $\kappa_{2}$ and $\xi$ respectively.

In the sequel we denote the vector of parameters $(\xi,\psi,\kappa^{T})^{T}$ with $\theta=(\theta_{1},...,\theta_{d})^{T}$ and the density of a parametrized EGPD with $EGPD(y;\theta)$ . To simulate from Equation (2), the user randomly draws a uniform variable $U$ in $[0,1]$ and then applies the quantile function $Y=F^{-1}(U)$ , given for $p\in[0,1]$ by:

y_{p}=F^{-1}(p)=\begin{cases}\frac{\psi}{\xi}[\{1-G^{-1}(p)\}^{-\xi}-1]&\text{for $\xi\neq 0$}\\ -\psi\log{\{1-G^{-1}(p)\}}&\text{for $\xi=0$}\end{cases}

(3)

2.2 Additive model specification

In practice, the parameters of the distribution of $Y$ may depend on some covariates $\boldsymbol{x}=(x_{1},\ldots,x_{m})^{T}$ , i.e.

Y|\boldsymbol{x}\sim EGPD(\cdot,\theta(\boldsymbol{x}))

(4)

where $\theta(\boldsymbol{x})=(\theta_{1}(\boldsymbol{x}),\ldots,\theta_{d}(\boldsymbol{x}))^{T}$ . The previous specification is an instance of a distributional regression model (Stasinopoulos et al., 2018).

For relating the different distributional parameters $(\theta_{1}(\boldsymbol{x}),\ldots,\theta_{d}(\boldsymbol{x}))$ to the covariates, we rely on additive predictors of the form

\eta_{i}(\boldsymbol{x})=f_{i1}(\boldsymbol{x})+\cdots+f_{iJ_{i}}(\boldsymbol{x})

(5)

where $f_{i1}(\cdot),\ldots,f_{iJ_{i}}(\cdot)$ are smooth functions of the covariates $\boldsymbol{x}$ .

The representation (5) is particularly flexible and allows to capture complex dependence patterns between distribution parameters and covariates $\boldsymbol{x}$ . For instance a semiparametric model with two covariates $(x_{1},x_{2})$ can be written as

\eta_{i}(\boldsymbol{x})=\beta_{i1}+\beta_{i2}x_{1}+f_{i2}(x_{2})

Another example is when we want to model space-time data observed at site $(s_{1},s_{2})$ and at time $t$ . In this case $\boldsymbol{x}=(s_{1},s_{2},t)^{T}$

\eta_{i}(\boldsymbol{x})=f_{i1}(t)+f_{i2}(s_{1},s_{2})

In analogy to Generalized Linear Model the predictors are linked to the distributional parameters via known monotonic and twice differentiable link functions $h_{i}(\cdot)$ .

In the case of model 1, i.e. $G(u,\kappa)=u^{\kappa}_{1}$ , common link functions are

\xi(\boldsymbol{x})=\eta_{\xi}(\boldsymbol{x})\mbox{ (i.e. the identity function)},\quad\psi(\boldsymbol{x})=\exp(\eta_{\psi}(\boldsymbol{x})),\quad\kappa_{1}(\boldsymbol{x})=\exp(\eta_{\kappa_{1}}(\boldsymbol{x})).

\theta_{i}(\boldsymbol{x})=h_{i}(\eta_{i}(\boldsymbol{x})),\qquad i=1,\ldots,d

(6)

The functions $f_{ij}$ in (5) are approximated in terms of basis function expansions

f_{ij}(\boldsymbol{x})=\sum_{k=1}^{K_{ij}}\beta_{ij,k}B_{k}(\boldsymbol{x}),

(7)

where $B_{k}(\boldsymbol{x})$ are the basis functions and $\beta_{ij,k}$ denote the corresponding basis coefficients. These basis can be of different types, thoroughly presented e.g. in Stasinopoulos et al. (2017) or Wood (2017).

The basis function expansions can be written as

f_{ij}(\boldsymbol{x})=\boldsymbol{z}_{ij}(\boldsymbol{x})^{T}\boldsymbol{\beta}_{ij}

where $\boldsymbol{z}_{ij}(\boldsymbol{x})$ is still a vector of transformed covariates that depends on the basis functions. and $\boldsymbol{\beta}_{ij}=(\beta_{ij,1},\ldots,\beta_{ij,K_{ij}})^{T}$ is a parameter vector to be estimated.

To ensure regularization of the functions $f_{ij}(\boldsymbol{x})$ so-called penalty terms are added to the objective log-likelihood function. Usually, the penalty for each function $f_{ij}(\boldsymbol{x})$ are quadratic penalty $\lambda\boldsymbol{\beta}_{ij}^{T}\boldsymbol{G}_{ij}(\boldsymbol{\lambda}_{ij})\boldsymbol{\beta}_{ij}$ where $\boldsymbol{G}_{ij}(\boldsymbol{\lambda}_{ij})$ is a known semi-definite matrix and the vector $\boldsymbol{\lambda}_{ij}$ regulates the amount of smoothing needed for the fit. A special case is when $\boldsymbol{G}_{ij}(\boldsymbol{\lambda}_{ij})=\lambda_{ij}\boldsymbol{G}_{ij}$ . Therefore the type and properties of the smoothing functions are controlled by the vectors $\boldsymbol{z}_{ij}(\boldsymbol{x})$ and the matrices $\boldsymbol{G}_{ij}(\boldsymbol{\lambda}_{ij})$ .

The penalized log-likelihood function for the latter models reads:

l_{p}=l-\frac{1}{2}\sum_{i=1}^{d}\sum_{j=1}^{J_{i}}\boldsymbol{\beta}_{ij}^{T}\boldsymbol{G}_{ij}(\boldsymbol{\lambda}_{ij})\boldsymbol{\beta}_{ij}

(8)

where $l$ represents the log-likelihood function for the model (4).

3 Distributional regression model for EGPD within the R package gamlss

In this section we introduce an add-on script in R that implements the EGPD regression model described in the previous section. The script, available in github.com/noemielc/egpd4gamlss, is flexible enough as it only requires the specification of a parametric CDF $G(\cdot)$ .

The Generalized Additive Model for Location, Scale and Shape (GAMLSS) is an example of distributional regression model and the companion package in R gamlss (Stasinopoulos et al., 2017) allows a straightforward fitting of multiple built-in and well-known family distributions with parameters that vary flexibly with $\boldsymbol{x}$ as in (5)-(6).

An interesting feature is the possibility to extend the family of distributions described in Rigby et al. (2019). The only limitation is the dimension of the parameter vector $\theta$ , $d=4$ , where the four parameters are denoted with $\mu$ , $\sigma$ , $\nu$ and $\tau$ in gamlss.

Due to the restriction on the number of parameters, our code allows the implementation of parametric distributions for $G$ with only 2 parameters, namely $\kappa=(\kappa_{1},\kappa_{2})$ . In the following, the parameters $(\xi,\psi,\kappa_{1},\kappa_{2})$ of the EGPD distribution are identified with $(\mu,\sigma,\nu,\tau)$ .

The generic implementation of a EGPD is provided in the file GenericEGPDg.R. The MakeEGPD function assembles a new EGPD family distribution exploiting an user-built $G(\cdot)$ as a function of $u$ and (at most) two parameters $\nu$ and $\tau$ . The required first, second and cross derivatives of the density in (4), necessary for the fitting procedure in gamlss, are obtained by the symbolic derivation using the R package Deriv. Moreover the inverse function $G^{-1}(u)$ is computed numerically.

In the following, we exemplify the use of MakeEGPD by assembling Model 1, $G(u;\kappa)=u^{\kappa_{1}},\kappa_{1}>0$ .

⬇

1library(gamlss)

2source ("GenericEGPDg.R")

3EGPD1Family <- MakeEGPD (function (z,nu) z^nu, Gname = "Model1")

4EGPDModel1 <- EGPD1Family()

The last command is necessary and allows to create the object EGPDModel1 (name concatenating EGPD and Gname) from the EGPD1Family function.

For given values $\mu_{0},\sigma_{0}$ and $\nu_{0}$ , the R functions of density, cumulative distribution function, quantile function and random generation are ready to use.

⬇

1dEGPDModel1(x,mu=mu0,sigma=sigma0,nu=nu0)

2pEGPDModel1(x,mu=mu0,sigma=sigma0,nu=nu0)

3qEGPDModel1(u,mu=mu0,sigma=sigma0,nu=nu0)

4rEGPDModel1(n,mu=mu0,sigma=sigma0,nu=nu0)

The reserved names for the distribution parameters in gamlss are $\mu$ mu ( $\mu$ ), sigma ( $\sigma$ ), nu ( $\nu$ ), and tau ( $\tau$ ). To change the link functions $\eta_{i}=g_{i}(\theta_{i})$ by means of which each EGPD parameter $\theta_{i}$ (here $\mu,\sigma$ or $\nu$ ) is indirectly optimised, or the default starting values for the parameters to fit, or to print these links, we use the commands:

⬇

1EGPD1Family(mu.link="identity",mu.init=1, # log , inverse, own , ...

2 sigma.link="log",sigma.init=3, nu.link="log",nu.init=0.5)

3show.link(EGPD1Family)

It is possible to create an original link function ”own”, by gathering the expressions of the link function $\eta_{i}$ , its inverse, the derivative of the latter w.r.t. parameter $\theta_{i}$ and its area of validity. For instance to create a log-link function shifted to the right for $\mu$ :

⬇

1# the link function defining the predictor eta as a function of the current distribution parameter (referred to as mu), i.e. eta = g(mu)

2own.linkfun <- function (mu) {

3 .shift = 0.0001

4 log (mu - .shift)

6# the inverse of the link function as a function of the predictor eta, i.e.mu = g-1 (eta)

7own.linkinv <- function (eta) {

8 shift = 0.0001

9 thresh <- - log (.Machine$double.eps)

10 eta <- pmin (thresh, pmax(eta, -thresh))

11 exp (eta) + shift

12}

13# the first derivative of the inverse link with respect to eta

14own.mu.eta <- function (eta) {

15 shift = 0.0001

16 pmax (exp (eta), .Machine$double.eps)

17}

18# the range in which values of eta are defined.

19own.valideta <- function (eta) TRUE

20EGPD1Family(mu.link="own")

For more details on link function definition, we refer to Stasinopoulos et al. (2017, pag. 179).

4 Application to a hourly rainfall dataset in France

4.1 Data

We illustrate our add-on to package gamlss with the analysis of the hourly precipitations recorded over the north-west area of France from 2016 to 2018 both included. This database, containing precipitations as well as other meteorological variables whose values are registered every 6 minutes over the 3 years of interest for each of the 274 stations of the network (see Figure 1), is freely accessible (see Larvor et al., 2020, fro more details).

Let us note $Y(\boldsymbol{x})$ the random variable for strictly positive hourly precipitations (dry and wet events are traditionally treated separately when it comes to rainfall simulation (Ailliot et al., 2015)), with covariates $\boldsymbol{x}=(x_{l},x_{L},x_{t})$ , where $x_{l}$ and $x_{L}$ represents the position in space (longitude, latitude) and $x_{t}$ a cyclic time index, namely the day or the month of the year. The work of Naveau et al. (2016) and our tests show that EGPD fitting is improved by censoring the data, so in practice we remove values $Y(\boldsymbol{x})<0.5$ (mm). Other thresholds ( $\leq 0.2$ and $\leq 0.1$ mm) have been tested but $Y(\boldsymbol{x})<0.5$ (mm) gives the best results, in particular for the upper tails. Besides, as noted in Evin et al. (2018); Naveau et al. (2016), negative shape parameters $\xi$ are not expected for rainfall distributions at these time scales, so we use a link function constraining the parameter $\xi$ to positive values. Finally, as performed in Naveau et al. (2016), we only keep one every three hourly records of precipitation so as to ensure independency between samples.

4.2 Model fitting

We fit different specifications of EGPD-based models, where EGPD takes form 1, 3 or 4, and compare them with equivalently specified Gamma-based models. The Gamma (GA) distribution is a classical choice that works well for modeling the bulk of precipitations at a given site (Katz, 1977; Vrac and Naveau, 2007; Vlček and Huth, 2009). However non-negligible deviations are observed when one is interested in capturing extreme rainfalls (Katz et al., 2002), typically underestimated due to the lightness of the GA tail. For a marginal model M=EGPD1,EGPD3,EGPD4,GA, the different specifications read:

M.0 where

\eta_{a}(\boldsymbol{x})=\beta_{0},\qquad a=\mu,\sigma,\nu,\tau,

i.e. the parameters $\theta(\boldsymbol{x})$ do not depend on the covariates ;

M.t where

\eta_{a}(\boldsymbol{x})=\beta_{0}+s_{t}(x_{t}),\qquad a=\mu,\sigma,\nu,\tau,

uses cyclic cubic splines $s_{t}$ to model the nonlinear dependency in time while considering no variation in space ;

M.tnomu where

\eta_{a}(\boldsymbol{x})=\beta_{0}+s_{t}(x_{t}),\qquad a=\sigma,\nu,\tau,

does the same as before but considers as constant the parameter $\mu$ ;

M.st where

\eta_{a}(\boldsymbol{x})=\beta_{0}+TP(x_{l},x_{L})+s_{t}(x_{t}),\qquad a=\mu,\sigma,\nu,\tau,

uses smooth surface fitting (thin-plate splines) to model the dependency in space as well as cyclic cubic splines for the dependency in time ;

M.st2mu where

\eta_{a}(\boldsymbol{x})=\beta_{0}+TP(x_{l},x_{L})+s_{t}(x_{t}),\qquad a=\mu,\sigma,

does the same as before but considers constant in space the parameters that are not $\mu$ or $\sigma$ ;

M.st2nomu where

\eta_{a}(\boldsymbol{x})=\beta_{0}+TP(x_{l},x_{L})+s_{t}(x_{t}),\qquad a=\sigma,\nu,

does the same as before but considers constant in space the parameters $\mu$ and $\tau$ ;

M.st3nomu where

\eta_{a}(\boldsymbol{x})=\beta_{0}+TP(x_{l},x_{L})+s_{t}(x_{t}),\qquad a=\sigma,\nu,\tau,

does the same as before but applies to M=EGPD4 only and considers parameter $\mu$ as a constant over space.

These variations within each model class will allow us to assess whether the smooth regression of distribution parameters over covariates is justified or not. Below we reproduce the code to fit megpd1.0, megpd1.tnomu and mga.st on the training set (training), that contains 60 % of the stations, as reported on Figure 1. The dataset consisting of the remaining stations form the validation set (validation). The gamlss() function reads:

⬇

1# Notations: lat, lon and cyc represent, respectively, covariates xl, xL and xt, while y is given by precip

2# Set algorithm control parameters:

3con <- gamlss.control (n.cyc = 200, mu.step = 0.01, sigma.step = 0.01, nu.step = 0.01,tau.step = 0.01, autostep=TRUE)

4megpd1.0 <- gamlss(precip ~ 1,

5 data = training,

6 family = EGPD1Family(mu.link = "own"),

7 control = con,

8 method=CG())

9megpd1.tnomu <- gamlss(precip ~ 1,

10 sigma.fo =~ pbc(cyc),

11 nu.fo =~ pbc(cyc),

12 data = training,

13 family = EGPD1Family(mu.link = "own"),

14 control = con,

15 method=CG())

17# Load library to interface with the mgcv package and get ga() for multidimensional tensor spline smoothers:

18library(gamlss.add)

19fmla_ga <- list(~s(lon,lat,bs=’tp’,k=30)+s(cyc,bs="cc",k = 50),

20 ~s(lon,lat,bs=’tp’,k=30)+s(cyc,bs="cc",k = 50))

21GA.st <- gamlss(precip ~ga(~s(lon,lat,bs=’tp’,k=30)+s(cyc,bs="cc",k = 50), method="REML"),

22 ~ga(~s(lon,lat,bs=’tp’,k=30)+s(cyc,bs="cc",k = 50),

23 method="REML"),

24 ~ga(~s(lon,lat,bs=’tp’,k=30)+s(cyc,bs="cc",k = 50),

25 method="REML"),

26 data = training,

27 family = GA,

28 control = con,

29 method=CG())

4.3 Model assessment

In order to evaluate the goodness of fit on the training set, we use summary statistics: Global Deviance (GD), Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC). The GAIC(modelName,k=a) function allows to compute GD (minus twice the fitted log-likelihood, $a=0$ ), the AIC ( $a=2$ ) and the BIC ( $a=\log(\mbox{number of observations})$ ) for each fitted model modelName and to compare them as performed on Figure 2. Besides, Table 1 indicates the effective degree of freedom (df) computed for each model. The function call is reproduced below for the models of class EGPD1.

⬇

1GAIC_BIC_m1 <-GAIC(megpd1.0,megpd1.t,megpd1.tnomu,megpd1.st,megpd1.st2mu,megpd1.st2nomu,k=log(length(training$precip))

Refer to caption — Figure 1: French stations from the MeteoNet public dataset used for our experiments, split into training and validation sets.

Model	df
megpd4.st	242.0
megpd4.st3nomu	199.3
megpd4.st2nomu	138.3
megpd4.st2mu	122.5
megpd4.t	72.0
megpd4.tnomu	38.0
megpd4.0	4
megpd1.st	183.0
megpd1.st2nomu	139.3
megpd1.st2mu	121.3
megpd1.t	54.0
megpd1.tnomu	37.0
megpd1.0	3
megpd3.st	181.6
megpd3.st2nomu	137.3
megpd3.st2mu	122.9
megpd3.t	54.0
megpd3.tnomu	37.0
megpd3.0	3
mga.st	120.2
mga.stnomu	78.0
mga.t	36.0
mga.0	2

Table 1: Effective degree of freedom (df) for the fitted models.

From Figure 2, we can see that, as could be expected, for each classes of models (where EGPD1-based models represent one class, EGPD3-based ones another, etc) the GD ranks first models with the full space-time description M.st. All other variations are rather close in fitting performance, at the exception of the constant-parameters one M.0, that is largely behind. This shows the clear interest of allowing a smooth variation of distribution parameters as a function of the chosen covariates. However, when it comes to modeling efficiency, where the trade-off between number of parameters and performances is taken into account (through the AIC and BIC), the full space-time model M.st becomes much less desirable, falling just above and occasionally behind the constant parameter models M.0. This is all the more through that we consider the BIC indice, that penalizes more the number of parameters than the AIC, and a model with a large number of parameters (EGPD4 versus GA). The best trade-off, according to the BIC, are the time-only full model M.t for the GA class and the time-only without the shape parameter $\mu$ otherwise. This may also be an indication that the seasonal variation of distribution parameters is more important than their spatial variation over the area of interest.

The ranking of model variations within a same model class allows us to draw conclusions w.r.t. the advantages or challenges of each class. Thus, for EGPD4 the GD ranks model variations following decreasing complexity: megpd4.st followed by megpd4.st3nomu, etc. We can deduce that the estimation of $\mu$ is not that sensitive when it comes to the fitting set (at least with enough data). On the contrary, megpd1.st2nomu is preferred to megpd1.st and megpd1.st2mu which can be interpreted as $\mu$ being a rather sensitive parameter to estimate for EGPD1. Considering it constant in space consequently improves the global fitting.

Performances on the validation set are obtained by means of the getTGD() and TGD() functions, providing the validation global deviance (GD-v), that is the GD evaluated using predictive values for the parameters at the validation sample.

⬇

1 gg.megpd4.st <- getTGD(megpd4.st,newdata=validation)

2 gg.megpd4.0 <- getTGD(megpd4.0,newdata=validation)

3 ...

4 TGD(gg.megpd4.0,gg.megpd4.t,gg.megpd4.tnomu,gg.megpd4.st,gg.megpd4.st2mu,gg.megpd4.st2nomu,gg.megpd4.st3nomu)

Thus, when it comes to extrapolating parameters for new covariate values, the GD-v ranking displayed in Figure 2 reveals that models considering only the dependence in time (M.t, M.tnomu) are systematically preferred, followed by space-time models not considering $\mu$ (M.st2nomu), before the full space-time models (M.st). This supports the fact that the rainfall distributions are relatively close over the studied geographical area, compared to their time evolution that shows stronger variations. Consequently, given the limited size of the dataset, keeping low the number of parameters to regress on covariates (by considering the seasonality only) and in particular avoiding in this category the shape parameter $\mu$ , makes estimations more robust. This agrees with observations from Evin et al. (2018), who noted that fitting the shape parameter is the most challenging part of (extended)-GPD modeling. They concluded that mono-site fitting is not robust, leading e.g. to negative shape parameters that do not match observations on rainfall intensities. To circumvent this issue, the authors use constant parameters within regions of influence defined by homogeneity tests.

4.3.1 Model verification and comparison

Now that it is admitted that regressing distributional parameters over time and possibly space is justified for the case study at hand, we want to assess how good is the modeling and compare model classes at hand. A way to verify whether the fitted models actually represents well the empirical distribution is to draw so-called P-P plots. A simple result in probability states if $Y(\mathbf{x})\sim F(\cdot;\theta(\mathbf{x})$ then $U(\mathbf{x}):=F(Y(\mathbf{x});\theta(\mathbf{x}))$ is uniformly distributed on the unit interval.

Therefore we can consider the residuals $u(\mathbf{x})=F(y(\mathbf{x}),\widehat{\theta}(\mathbf{x}))$ where $\widehat{\theta}(\mathbf{x})$ are the parameters of the fitted model and, after having ordered them, compare them with $\Big{(}\frac{1}{n+1},...,\frac{n}{n+1}\Big{)}$ . Here $n$ is the number of observations $y(\mathbf{x})$ at hand. The closer points are to the diagonal, the better the model fit. Results are broken down at each season for model variations M.st2nomu on the verification set in Figure 3. They show that at all seasons, EGPD4 outperforms the other models. This is all the more true that we consider not too small rainfalls, namely hourly amounts $x\geq 1$ mm. Below, we keep wiggly artefacts of the discrete sampling scale of observations. Indeed, rainfall amounts are sampled with a precision of $0.2$ mm, which makes the distribution of small values more discrete than continuous. Behind, come models GA, EGPD3 and finally EGPD1. EGPD3 is often better than GA for small hourly rain amounts (below around $2$ mm) however it may be outperformed for larger ones, especially in winter. EGPD1 performs as EGPD3 for intermediate and large amounts of rain, but is significantly behind all other models for small amounts.

4.3.2 Visualisation of the fitted functions

Figure 4 represents the fitted functions for the variations megpd4.0 (constant parameters), megpd4.t (smooth time dependence only), and megpd4.st (full smooth space time dependence) of the EGPD4 model in two a priori rather different stations when it comes to the rainfall regime: the north-west coast of Brittany ( $x_{L}=48.7$ $x_{l}=-4.33$ ) and the south-east countryside of our area of interest ( $x_{L}=46.7$ $x_{l}=1.25$ ).

The functions predict() or predictAll() are used to provide fitted values of the functions of the covariates. Standard errors can be extracted for fitted values by means of the argument se.fit=TRUE. However this option is not supported by gamlss for parameter estimates on new covariate values at the moment of writing this paper.

⬇

1# Fitted parameter values:

2muFit <- predict(megpd4.st,what="mu",type="response")

4# Estimates for new covariate values, gathered in the database "validation":

5all <- predictAll(megpd4.st,data=training,newdata=validation)

6muVal <- all$mu

7sigmaVal <- all$sigma

8nuVal <- all$nu

10muFit <- predict(megpd4.st,what="mu",type="response",data=training,se.fit=TRUE)

11StandardErrorMu <- muFit$se.fit

We observe rather similar temporal behaviors of the parameters estimated in both stations, highlighting the relative spatial homogeneity of distributions compared to their seasonal dependence. The standard errors decrease with the simplicity of the model ; they are generally negligible for for parameters $\sigma$ and $\tau$ but can become significant for parameters $\nu$ and $\mu$ when we are dealing with space-time models. Beyond that, the estimations for the three different versions of the EGPD4-based model are consistent. We note (second station) that a shift between megpd4.t and megpd4.st’s estimates of parameter $\sigma$ is compensated by an opposite shift between the associated estimates of parameter $\tau$ . From our observations and experience on EGPD fitting, a given EGPD-based distribution is matched by several local optima (i.e. parameter combinations). Along this ”Pareto front”, $\nu$ is increasing for EGPD1 when $\sigma$ decreases. Similarly $\tau$ is increasing for EGPD4 when $\sigma$ decreases.

Finally, Figure 5 represents the spatial distribution of the median of the EGPD4 parameters estimated from megpd4.st at three different times of the year: december, april and august. We indeed observe spatial variations of the parameter values, in addition to the monthly changes already discussed. $\mu$ the most spatially homogeneous parameter, which goes in line with previous observations that favored spatially constant $\mu$ -models over non-constant ones. We also note that the $\tau$ map follows an opposite trend w.r.t. $\sigma$ ’s map: the larger $\tau$ , the smaller $\sigma$ , which follows the general idea of parameter Pareto front developed earlier.

4.3.3 Goodness-of-fit w.r.t. the upper tail

We finish the presentation of the results by comparing in Figure 6 the EGPD4 model to a GA fit, for all stations of the validation set and for each season. We want to check the added value of the EGPD4 model over the GA when it comes to extremes, as well as to assess the best modeling option for the EGPD4 shape parameter in order to ensure a correct representation of extremes. The series of P-P plots presented on Figure 6 focus on the upper tail of the distribution (above the quantiles 0.99 and 0.995 respectively) and allow us to do that. We first note that whatever the season, EGPD4 performs significantly better than GA when it comes to extremes. Most often, the constant parameter model megpd4.0 is enough to outperform the GA, and sometimes is clearly the best option as in spring and autumn. In winter and summer, it is not enough to capture correctly the upper tail of the empirical distribution. Making EGPD4 parameters dependent on time only is sufficient to obtain correct upper tail representations, whether $\mu$ is considered constant (megpd4.tnomu) or not (megpd4.t). The difference between the last two models is very small however using the constant shape parameter option can slightly improve results, especially for larger quantiles (0.995 and above versus 0.99). Adding spatial dependence to the temporal one improves even more the accuracy of the EGPD4 upper tail modeling. Making spatially dependent only two parameters usually performs better. However $\mu$ needs to be included in these two parameters (megpd4.st2 versus megpd4.st2nomu). This highlights the fact that a smooth space-time estimation of the shape parameter $\mu$ is important to capture correctly the non-stationary behavior of the upper tail of the rainfall distributions. However, due to limited data and numerical sensitivity of that estimation, it may be preferable to only limit the smooth spatial dependence to two parameters instead of four, including $\mu$ , in order to make it more robust for extrapolation.

5 Conclusions

This article introduced an add-on to the R-package gamlss, to model non-stationary fields whose margins are EGPD distributed. A number of R packages target the modeling of heavy-tailed distributions. Yet, although some allow distributional regression models, for which distribution parameters can be modeled as additive functions of covariates, none of these packages address the extended-GPD distribution nor another form of unique and closed-form formulation for heavy-tailed distributions that models small, intermediate and large values in a synthetic manner. This add-on implements the EGPD family in a generic way in the gamlss package, allowing to test any new parametric form, up to 4 parameters so far, satisfying the requirement of the EGPD family. Applications include space-time modeling of rainfall marginal distributions. The example developed in this paper shows 1) that modeling hourly rainfall amounts is significantly improved when distribution parameters are non-stationary in space and time; and that 2) the EGPD class clearly improves the modeling of intermediate values and upper tail compared to a classical Gamma option. This illustrates the benefits of our add-on with respect to existing GAM forms for modeling non-stationary margins (typically the Gamma model for our application). We compare three parametric EGPD models and discuss the modeling strategies to best capture non stationary upper tails with limited datasets. This case study also reveals the added-value of EGPD4 (and to a less extent EGPD3) over EGPD1 to correctly model hourly rainfall distributions in a space-time context, while previous works Naveau et al. (2016); Evin et al. (2018); Bertolacci et al. (2018) restricted themselves to EGPD1 as the best option.

Software and data availability

The add-on to the R-package gamlss (Stasinopoulos et al., 2022) presented here for the generic implementation of EGPD consists in the script GenericEGPDg.R, available at https://github.com/noemielc/egpd4gamlss. On this same webpage, supplementary material consisting in a reproducible tutorial with code is presented.

The data used in this paper is freely accessible on the MeteoNet French database (Larvor et al., 2020): https://meteofrance.github.io/meteonet/english/data/summary/

Acknowledgement

Scientific activity was performed as part of the research program Venezia2021, coordinated by CORILA, with the contribution of the Provveditorato for the Public Works of Veneto, Trentino Alto Adige and Friuli Venezia Giulia (Italy). The authors thank Nicolas Berthier for helpful R suggestions.

References

Ailliot et al. (2015) Ailliot, P., Allard, D., Monbet, V., Naveau, P., 2015. Stochastic weather generators: an overview of weather type models. Journal de la Société Française de Statistique 156, 101–113.
Balkema and De Haan (1974) Balkema, A.A., De Haan, L., 1974. Residual life time at great age. The Annals of Probability 2, 792–804.
Belzile et al. (2020) Belzile, L., Wadsworth, J.L., Northrop, P.J., Grimshaw, S.D., Huser, R., 2020. mev: Multivariate Extreme Value Distributions. URL: https://CRAN.R-project.org/package=mev. r package version 1.13.1.
Belzile et al. (2022) Belzile, L.R., Dutang, C., Northrop, P.J., Opitz, T., 2022. A modeler’s guide to extreme value software. URL: https://arxiv.org/abs/2205.07714.
Bertolacci et al. (2018) Bertolacci, M., Cripps, E., Rosen, O., Cripps, S., 2018. A comparison of methods for modeling marginal non-zero daily rainfall across the Australian continent. arXiv preprint arXiv:1804.08807 .
Chavez-Demoulin and Davison (2005) Chavez-Demoulin, V., Davison, A.C., 2005. Generalized additive modelling of sample extremes. Journal of the Royal Statistical Society: Series C (Applied Statistics) 54, 207–222.
Chen et al. (2014) Chen, J., Brissette, F.P., Zhang, X.J., 2014. A multi-site stochastic weather generator for daily precipitation and temperature. Transactions of the ASABE 57, 1375–1391.
Coles (2001) Coles, S., 2001. An Introduction to Statistical Modeling of Extreme Values. Springer, New York.
Davison and Smith (1990) Davison, A.C., Smith, R.L., 1990. Models for exceedances over high thresholds. Journal of the Royal Statistical Society: Series B (Methodological) 52, 393–425.
Dutang and Jaunatre (2022) Dutang, C., Jaunatre, K., 2022. CRAN task view: extreme value analysis. Technical Report. Version 2022-02-08, URL https://CRAN. R-project. org/view= ExtremeValue.
Eastoe and Tawn (2009) Eastoe, E., Tawn, J., 2009. Modelling non-stationary extremes with application to surface-level ozone. Journal of the Royal Statistical Society: Series C (Applied Statistics) 58, 25–45.
Evin et al. (2018) Evin, G., Favre, A.C., Hingray, B., 2018. Stochastic generation of multi-site daily precipitation focusing on extreme events. Hydrology and Earth System Sciences 22, 655–672.
Falk et al. (2010) Falk, M., Hüsler, J., Reiss, R.D., 2010. Laws of Small Numbers: Extremes and Rare Events. Springer, Basel.
Gilleland and Katz (2016) Gilleland, E., Katz, R.W., 2016. extRemes 2.0: an extreme value analysis package in R. Journal of Statistical Software 72, 1–39.
Gilleland et al. (2013) Gilleland, E., Ribatet, M., Stephenson, A.G., 2013. A software review for extreme value analysis. Extremes 16, 103–119.
Hastie and Tibshirani (1990) Hastie, T.J., Tibshirani, R.J., 1990. Generalized Additive Models. Chapman & Hall/CRC.
Heffernan et al. (2018) Heffernan, J.E., Stephenson, A.G., Gilleland, E., 2018. ismev: An introduction to statistical modeling of extreme values. URL: https://CRAN.R-project.org/package=ismev. r package version 1.42.
Katz (1977) Katz, R.W., 1977. Precipitation as a chain-dependent process. Journal of Applied Meteorology 16, 671–676.
Katz et al. (2002) Katz, R.W., Parlange, M.B., Naveau, P., 2002. Statistics of extremes in hydrology. Advances in Water Resources 25, 1287–1304.
Keller et al. (2015) Keller, D.E., Fischer, A.M., Frei, C., Liniger, M.A., Appenzeller, C., Knutti, R., 2015. Implementation and validation of a wilks-type multi-site daily precipitation generator over a typical alpine river catchment. Hydrology and Earth System Sciences 19, 2163–2177.
Kenabatho et al. (2012) Kenabatho, P.K., McIntyre, N., Chandler, R., Wheater, H., 2012. Stochastic simulation of rainfall in the semi-arid Limpopo basin, Botswana. International Journal of Climatology 32, 1113–1127.
Larvor et al. (2020) Larvor, G., Berthomier, L., Chabot, V., Le Pape, B., Pradel, B., Perez, L., 2020. MeteoNet, an open reference weather dataset by METEO FRANCE. https://meteofrance.github.io/meteonet/english/data/summary/. Accessed: 2022-02-16.
Naveau et al. (2016) Naveau, P., Huser, R., Ribereau, P., Hannart, A., 2016. Modeling jointly low, moderate, and heavy rainfall intensities without a threshold selection. Water Resources Research 52, 2753–2769.
Papastathopoulos and Tawn (2013) Papastathopoulos, I., Tawn, J.A., 2013. Extended generalised Pareto models for tail estimation. Journal of Statistical Planning and Inference 143, 131–143.
Pfaff and McNeil (2018) Pfaff, B., McNeil, A., 2018. evir: Extreme Values in R. URL: https://CRAN.R-project.org/package=evir. r package version 1.7-4.
Pickands III (1975) Pickands III, J., 1975. Statistical inference using extreme order statistics. The Annals of Statistics , 119–131.
R Core Team (2022) R Core Team, 2022. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria. URL: https://www.R-project.org/.
Rigby and Stasinopoulos (2005) Rigby, R.A., Stasinopoulos, D.M., 2005. Generalized additive models for location, scale and shape. Journal of the Royal Statistical Society: Series C (Applied Statistics) 54, 507–554.
Rigby et al. (2019) Rigby, R.A., Stasinopoulos, M.D., Heller, G.Z., De Bastiani, F., 2019. Distributions for modeling location, scale, and shape: Using GAMLSS in R. CRC press.
Stasinopoulos et al. (2022) Stasinopoulos, M., Rigby, B., Voudouris, V., Akantziliotou, C., Enea, M., Kiose, D., 2022. gamlss: Generalised additive models for location scale and shape. https://cran.r-project.org/web/packages/gamlss/index.html. Accessed: 2022-07-12.
Stasinopoulos et al. (2018) Stasinopoulos, M.D., Rigby, R.A., Bastiani, F.D., 2018. GAMLSS: A distributional regression approach. Statistical Modelling 18, 248–273.
Stasinopoulos et al. (2017) Stasinopoulos, M.D., Rigby, R.A., Heller, G.Z., Voudouris, V., De Bastiani, F., 2017. Flexible regression and smoothing: using GAMLSS in R. CRC Press.
Stein (2021a) Stein, M.L., 2021a. A parametric model for distributions with flexible behavior in both tails. Environmetrics 32, e2658.
Stein (2021b) Stein, M.L., 2021b. Parametric models for distributions when interest is in extremes with an application to daily temperature. Extremes 24, 293–323.
Stephenson (2002) Stephenson, A.G., 2002. evd: Extreme value distributions. R news 2, 31–32.
Tencaliec et al. (2020) Tencaliec, P., Favre, A.C., Naveau, P., Prieur, C., Nicolet, G., 2020. Flexible semiparametric generalized pareto modeling of the entire range of rainfall amount. Environmetrics 31, e2582.
Vlček and Huth (2009) Vlček, O., Huth, R., 2009. Is daily precipitation Gamma-distributed ?: Adverse effects of an incorrect use of the Kolmogorov–Smirnov test. Atmospheric Research 93, 759–766.
Vrac and Naveau (2007) Vrac, M., Naveau, P., 2007. Stochastic downscaling of precipitation: From dry events to heavy rainfalls. Water Resources Research 43.
Wilks (1998) Wilks, D., 1998. Multisite generalization of a daily stochastic precipitation generation model. journal of Hydrology 210, 178–191.
Wood (2017) Wood, S.N., 2017. Generalized Additive Models: An Introduction with R. CRC press.
Yee and Stephenson (2007) Yee, T.W., Stephenson, A.G., 2007. Vector generalized linear and additive extreme value models. Extremes 10, 1–19.
Youngman (2020) Youngman, B.D., 2020. evgam: An R package for Generalized Additive extreme value models. URL: https://arxiv.org/abs/2003.04067.