This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

RankSEG: A Consistent Ranking-based Framework for Segmentation

\nameBen Dai \email[email protected]
\addrDepartment of Statistics
The Chinese University of Hong Kong
Hong Kong SAR \AND\nameChunlin Li \email[email protected]
\addrSchool of Statistics
University of Minnesota
MN 55455 USA
Abstract

Segmentation has emerged as a fundamental field of computer vision and natural language processing, which assigns a label to every pixel/feature to extract regions of interest from an image/text. To evaluate the performance of segmentation, the Dice and IoU metrics are used to measure the degree of overlap between the ground truth and the predicted segmentation. In this paper, we establish a theoretical foundation of segmentation with respect to the Dice/IoU metrics, including the Bayes rule and Dice-/IoU-calibration, analogous to classification-calibration or Fisher consistency in classification. We prove that the existing thresholding-based framework with most operating losses are not consistent with respect to the Dice/IoU metrics, and thus may lead to a suboptimal solution. To address this pitfall, we propose a novel consistent ranking-based framework, namely RankDice/RankIoU, inspired by plug-in rules of the Bayes segmentation rule. Three numerical algorithms with GPU parallel execution are developed to implement the proposed framework in large-scale and high-dimensional segmentation. We study statistical properties of the proposed framework. We show it is Dice-/IoU-calibrated, and its excess risk bounds and the rate of convergence are also provided. The numerical effectiveness of RankDice/mRankDice is demonstrated in various simulated examples and Fine-annotated CityScapes, Pascal VOC and Kvasir-SEG datasets with state-of-the-art deep learning architectures. Python module and source code are available on Github at https://github.com/statmlben/rankseg.

Keywords: Segmentation, Bayes rule, ranking, Dice-calibrated, excess risk bounds, Poisson-binomial distribution, normal approximation, GPU computing

1 Introduction

Segmentation is one of the key tasks in the field of computer vision and natural language processing, which groups together similar pixels/features of an input that belong to the same class (Ronneberger et al., 2015; Badrinarayanan et al., 2017). It has become an essential part of image and text understanding with applications in autonomous vehicles (Assidiq et al., 2008), medical image diagnostics (Wang et al., 2018), face/fingerprint recognition (Xin et al., 2018), and video action localization (Shou et al., 2017).

The primary aim of segmentation is to label each foreground feature/pixel of an input with a corresponding class. Specifically, for a feature vector or an image 𝐗d\mathbf{\bm{X}}\in\mathbb{R}^{d}, a segmentation function 𝜹:d{0,1}d\boldsymbol{\delta}:\mathbb{R}^{d}\to\{0,1\}^{d} yields a predicted segmentation 𝜹(𝐗)=(δ1(𝐗),,δd(𝐗))\boldsymbol{\delta}(\mathbf{\bm{X}})=(\delta_{1}(\mathbf{\bm{X}}),\cdots,\delta_{d}(\mathbf{\bm{X}}))^{\intercal}, where δj(𝐗)\delta_{j}(\mathbf{\bm{X}}) represents the predicted segmentation for the jj-th feature XjX_{j}, and I(𝜹(𝐗))={j:δj(𝐗)=1; for j=1,,d}I(\boldsymbol{\delta}(\mathbf{\bm{X}}))=\{j:\delta_{j}(\mathbf{\bm{X}})=1;\text{ for }j=1,\cdots,d\} is the index set of the segmented features of 𝐗\mathbf{\bm{X}} provided by 𝜹\boldsymbol{\delta}. Correspondingly, 𝐘{0,1}d\mathbf{\bm{Y}}\in\{0,1\}^{d} is a feature-wise label of a ground truth segmentation, where Yj=1Y_{j}=1 indicates that the jj-th feature/pixel XjX_{j} is segmented, and I(𝐘)={j:𝐘j=1; for j=1,,d}I(\mathbf{\bm{Y}})=\big{\{}j:\mathbf{\bm{Y}}_{j}=1;\text{ for }j=1,\cdots,d\big{\}} is the index set of the ground-truth features.

To access the performance for a segmentation function 𝜹\boldsymbol{\delta}, the Dice and IoU metrics are introduced and widely used in the literature (Milletari et al., 2016), both of which measure the overlap between the ground truth and the predicted segmentation:

Diceγ(𝜹)\displaystyle{{\rm Dice}}_{\gamma}(\boldsymbol{\delta}) =𝔼(2|I(𝐘)I(𝜹(𝐗))|+γ|I(𝐘)|+|I(𝜹(𝐗))|+γ)=𝔼(2𝐘𝜹(𝐗)+γ𝐘1+𝜹(𝐗)1+γ),\displaystyle=\mathbb{E}\Big{(}\frac{2\big{|}I(\mathbf{\bm{Y}})\cap I(\boldsymbol{\delta}(\mathbf{\bm{X}}))\big{|}+\gamma}{|I(\mathbf{\bm{Y}})|+|I(\boldsymbol{\delta}(\mathbf{\bm{X}}))|+\gamma}\Big{)}=\mathbb{E}\Big{(}\frac{2\mathbf{\bm{Y}}^{\intercal}\boldsymbol{\delta}(\mathbf{\bm{X}})+\gamma}{\|\mathbf{\bm{Y}}\|_{1}+\|\boldsymbol{\delta}(\mathbf{\bm{X}})\|_{1}+\gamma}\Big{)},
IoUγ(𝜹)\displaystyle{{\rm IoU}}_{\gamma}(\boldsymbol{\delta}) =𝔼(|I(𝐘)I(𝜹(𝐗))|+γ|I(𝐘)I(𝜹(𝐗))|+γ)=𝔼(𝐘𝜹(𝐗)+γ𝐘1+𝜹(𝐗)1𝐘𝜹(𝐗)+γ),\displaystyle=\mathbb{E}\Big{(}\frac{\big{|}I(\mathbf{\bm{Y}})\cap I({\boldsymbol{\delta}}(\mathbf{\bm{X}}))\big{|}+\gamma}{\big{|}I(\mathbf{\bm{Y}})\cup I({\boldsymbol{\delta}}(\mathbf{\bm{X}}))\big{|}+\gamma}\Big{)}=\mathbb{E}\Big{(}\frac{\mathbf{\bm{Y}}^{\intercal}\boldsymbol{\delta}(\mathbf{\bm{X}})+\gamma}{\|\mathbf{\bm{Y}}\|_{1}+\|\boldsymbol{\delta}(\mathbf{\bm{X}})\|_{1}-\mathbf{\bm{Y}}^{\intercal}\boldsymbol{\delta}(\mathbf{\bm{X}})+\gamma}\Big{)}, (1)

where |||\cdot| is the cardinality of a set, and γ0\gamma\geq 0 is a smoothing parameter. When γ=0\gamma=0, Diceγ(𝜹)=𝔼(2TP/(2TP+FP+FN)){\rm Dice}_{\gamma}(\boldsymbol{\delta})=\mathbb{E}\big{(}2\text{TP}/(2\text{TP}+\text{FP}+\text{FN})\big{)}, IoUγ(𝜹)=𝔼(TP/(TP+FP+FN)){\rm IoU}_{\gamma}(\boldsymbol{\delta})=\mathbb{E}\big{(}\text{TP}/(\text{TP}+\text{FP}+\text{FN})\big{)} where TP is the true positive, FP is the false positive, and FN is the false negative. Both metrics are similar and will be treated in a unified fashion; however, as to be seen in the sequel, searching for the optimal segmentation function with respect to the IoU metric may require extra computation than its Dice counterpart. Thus, for clarity of presentation, we first focus on the Dice metric and postpone the discussion on the relationships between the Dice and IoU metrics in Section 4.2.

The recent success of fully convolutional networks has enabled significant progress in segmentation. In literature, the mainstream of recent works devoted to designing and developing neural network architectures under different segmentation scenarios, including FCN (Long et al., 2015), U-Net (Ronneberger et al., 2015), DeepLab (Chen et al., 2018), and PSPNet (Zhao et al., 2017). Despite their remarkable performance, most existing models primarily focus on predicting segmentation using a classification framework, without considering the inherent disparities between classification and segmentation (as discussed in Section 1.1). We find this framework leading to an inconsistent solution and suboptimal performance with respect to the Dice/IoU metrics, and we address this pitfall by developing a novel consistent ranking-based framework, namely RankSEG (RankDice to the Dice metric and RankIoU to the IoU metric), to improve the segmentation performance.

1.1 Existing methods

Most existing segmentation methods are developed under a threshold-based framework with two types of loss functions.

Refer to caption
Figure 1: The existing and the proposed (RankDice) frameworks for segmentation. The upper panel is the existing threshold-based segmentation framework, and the lower panel is the proposed RankDice framework.

As indicated in Figure 1, the existing threshold-based segmentation framework, inspired by binary classification, provides a predicted segmentation via a two-stage procedure: (i) estimating a decision function or a probability function based on a loss; (ii) predicting feature-wise labels by thresholding the estimated decision function or probabilities. Specifically, given a training dataset (𝐱i,𝐲i)i=1n(\mathbf{\bm{x}}_{i},\mathbf{\bm{y}}_{i})_{i=1}^{n}, the prediction provided by the threshold-based framework for a new instance 𝐱\mathbf{\bm{x}} can be formulated as:

𝐪^=argmin𝐪𝒬1ni=1nl(𝐲i,𝐪(𝐱i))+λ𝐪2,𝜹^(𝐱)=𝟏(𝐪^(𝐱)0.5),\widehat{\mathbf{\bm{q}}}=\mathop{\rm argmin}_{\mathbf{\bm{q}}\in\mathcal{Q}}\frac{1}{n}\sum_{i=1}^{n}l\big{(}\mathbf{\bm{y}}_{i},\mathbf{\bm{q}}(\mathbf{\bm{x}}_{i})\big{)}+\lambda\|\mathbf{\bm{q}}\|^{2},\qquad\widehat{\boldsymbol{\delta}}(\mathbf{\bm{x}})=\mathbf{\bm{1}}(\widehat{\mathbf{\bm{q}}}(\mathbf{\bm{x}})\geq 0.5), (2)

where l(,)l(\cdot,\cdot) is an operating loss, 𝐪:d[0,1]d\mathbf{\bm{q}}:\mathbb{R}^{d}\to[0,1]^{d} is a candidate probability function with qjq_{j} being the candidate probability of the jj-th pixel, 𝒬\mathcal{Q} is a class of candidate probability functions, 𝐪\|\mathbf{\bm{q}}\| is a regularization term, λ>0\lambda>0 is a tuning parameter to balance the overfitting and underfitting, and 𝟏()\mathbf{\bm{1}}(\cdot) is an indicator function. For ease of presentation, qj(𝐱)q_{j}(\mathbf{\bm{x}}) is specified as a probability function and a predicted segmentation is produced by thresholding at 0.5, yet it can be equally extended to a general decision function. For example, we may formulate qj(𝐱)q_{j}(\mathbf{\bm{x}}) as a decision function with range in \mathbb{R}, and the prediction is produced by thresholding at 0, analogous to SVMs in classification (Cortes and Vapnik, 1995). Next, under the framework (2), two different types of operating loss functions are considered, namely the classification-based losses and the Dice-approximating losses.

Classification-based losses completely characterize segmentation as a classification problem, with examples such as the cross-entropy (CE; Cox (1958)) and the focal loss (Focal; Lin et al. (2017)):

(CE)\displaystyle(\text{CE}) lCE(𝐲,𝐪(𝐱))\displaystyle l_{\text{CE}}\big{(}\mathbf{\bm{y}},\mathbf{\bm{q}}(\mathbf{\bm{x}})\big{)} =j=1d(yjlog(𝐪j(𝐱))+(1yj)log(1𝐪j(𝐱))),\displaystyle=-\sum_{j=1}^{d}\Big{(}y_{j}\log\big{(}\mathbf{\bm{q}}_{j}(\mathbf{\bm{x}})\big{)}+(1-y_{j})\log\big{(}1-\mathbf{\bm{q}}_{j}(\mathbf{\bm{x}})\big{)}\Big{)}, (3)
(Focal)\displaystyle(\text{Focal}) lfocal(𝐲,𝐪(𝐱))\displaystyle l_{\text{focal}}\big{(}\mathbf{\bm{y}},\mathbf{\bm{q}}(\mathbf{\bm{x}})\big{)} =j=1d(yj(1𝐪j(𝐱))ϑlog(𝐪j(𝐱))+(1yj)𝐪jϑ(𝐱)log(1𝐪j(𝐱))),\displaystyle=-\sum_{j=1}^{d}\Big{(}y_{j}(1-\mathbf{\bm{q}}_{j}(\mathbf{\bm{x}}))^{\vartheta}\log\big{(}\mathbf{\bm{q}}_{j}(\mathbf{\bm{x}})\big{)}+(1-y_{j})\mathbf{\bm{q}}^{\vartheta}_{j}(\mathbf{\bm{x}})\log\big{(}1-\mathbf{\bm{q}}_{j}(\mathbf{\bm{x}})\big{)}\Big{)},

where ϑ0\vartheta\geq 0 is a hyperparameter for the focal loss (Lin et al., 2017). Other margin-based losses such as the hinge loss, in principle, can be included as classification-based losses with a decision function ranged in \mathbb{R} thresholding at 0, although they are less frequently used in a multiclass problem (Tewari and Bartlett, 2007). Therefore, we focus on the probability-based classification loss in the sequel.

Dice-approximating losses aim to approximate the Dice/IoU metric and conduct a direct optimization. Typical examples are the soft-Dice (Sudre et al., 2017) and the Lovasz-softmax loss (Berman et al., 2018):

(Soft-Dice)\displaystyle(\text{Soft-Dice}) lSoftD(𝐲,𝐪(𝐱))\displaystyle l_{\text{SoftD}}\big{(}\mathbf{\bm{y}},\mathbf{\bm{q}}(\mathbf{\bm{x}})\big{)} =12𝐲𝐪(𝐱)𝐲1+𝐪(𝐱)1,\displaystyle=1-\frac{2\mathbf{\bm{y}}^{\intercal}\mathbf{\bm{q}}(\mathbf{\bm{x}})}{\|\mathbf{\bm{y}}\|_{1}+\|\mathbf{\bm{q}}(\mathbf{\bm{x}})\|_{1}},
(Lovasz-softmax)\displaystyle(\text{Lovasz-softmax}) ll-softmax(𝐲,𝐪(𝐱))\displaystyle l_{\text{l-softmax}}\big{(}\mathbf{\bm{y}},\mathbf{\bm{q}}(\mathbf{\bm{x}})\big{)} =V¯(𝐲(1𝐪(𝐱))+(1𝐲)𝐪(𝐱)),\displaystyle=\mkern 1.5mu\overline{\mkern-1.5muV\mkern-1.5mu}\mkern 1.5mu\big{(}\mathbf{\bm{y}}\circ(1-\mathbf{\bm{q}}(\mathbf{\bm{x}}))+(1-\mathbf{\bm{y}})\circ\mathbf{\bm{q}}(\mathbf{\bm{x}})\big{)},

where V¯()\mkern 1.5mu\overline{\mkern-1.5muV\mkern-1.5mu}\mkern 1.5mu(\cdot) is the Lovasz extension of the mis-IoU error (Berman et al., 2018), and \circ is the element-wise product. Specifically, the soft-Dice loss replaces the binary segmentation indicator 𝜹(𝐱){0,1}d\boldsymbol{\delta}(\mathbf{\bm{x}})\in\{0,1\}^{d} in the Dice metric by a candidate probability function 𝐪(𝐱)[0,1]d\mathbf{\bm{q}}(\mathbf{\bm{x}})\in[0,1]^{d} to make the computation feasible. The Lovasz-softmax directly takes a convex extension of IoU based on a softmax transformation. Moreover, other losses including the Tversky loss (Salehi et al., 2017), the Lovasz-hinge loss (Berman et al., 2018), and the log-Cosh Dice loss (Jadon, 2021) can be also categorized as Dice-approximating losses.

The threshold-based framework (2) with a classification-based loss or a Dice-approximating loss is a commonly used approach for segmentation. Although encouraging performance is delivered, we show that the minimizer from (2) (based on the cross-entropy and the focal loss) is inconsistent (or suboptimal) to the Dice metric, see Lemma 9. For Lovasz convex losses, it is still unclear if they are able to yield an optimal segmentation (convex closure is usually not enough to ensure the consistency (Bartlett et al., 2006)). Moreover, in practice, only a small number of pixel predictions are taken into account in one stochastic gradient step. Therefore, the Lovasz-softmax loss cannot directly optimize the segmentation metric (see Section 3.1 in Berman et al. (2018)).

Another line of work (Bao and Sugiyama, 2020; Nordström et al., 2020; Lipton et al., 2014) has been centered on a linear-fractional approximation of the Dice and IoU metrics which are not decomposable per instance. In particular, Bao and Sugiyama (2020) and Nordström et al. (2020) proposed to approximate the utility by tractable surrogate functions with a sample-splitting procedure and showed that their methods are consistent in optimizing the target utility. Yet, splitting the sample may undermine the efficiency. Lipton et al. (2014) indicated that the optimal threshold maximizing the F1F_{1} metric is equal to half of the optimal F1F_{1} value. However, their definitions of Dice and IoU metrics may overlook small instances, which is undesirable in many applications; see Appendix A for technical discussion.

To summarize, the current frameworks with existing losses may either yield a suboptimal solution or suffer from an inappropriate target metric, demanding efforts to further improve the performance, robustness and sustainability of the existing segmentation framework.

1.2 Our contribution

In this paper, we propose a novel Dice-calibrated ranking-based segmentation framework, namely RankDice, to address the inconsistency of the existing framework. RankDice is primarily inspired by the Bayes rule of Dice-segmentation. We summarize our major contribution as follows:

  1. 1.

    To our best knowledge, the proposed ranking-based segmentation framework RankDice, is the first consistent segmentation framework with respect to the Dice metric (Dice-calibrated).

  2. 2.

    Three numerical algorithms with GPU parallel execution are developed to implement the proposed framework in large-scale and high-dimensional segmentation.

  3. 3.

    We establish a theoretical foundation of segmentation with respect to the Dice metric, such as the Bayes rule and Dice-calibration. Moreover, we present Dice-calibrated consistency (Lemma 10) and a convergence rate of the excess risk (Theorem 11) for the proposed RankDice framework, and indicate inconsistent results for the existing methods (Lemma 9).

  4. 4.

    Our experiments in three simulated examples and three real datasets (CityScapes dataset, Pascal VOC 2021 dataset, and Kvasir-SEG dataset) suggest that the improvement of RankDice over the existing framework is significant for various loss functions and network architectures (see Tables 9-6).

It is worth noting that the results are equally applicable to the proposed RankIoU framework in terms of the IoU metric.

2 RankDice

It is reasonable to assume that all information on a feature-wise label is solely based on input features, that is, YiYj|𝐗Y_{i}\perp Y_{j}|\mathbf{\bm{X}} for any iji\neq j. In Appendix B.3, we provide a probabilistic perspective to suggest the necessity of this assumption in segmentation tasks. Without loss of generality, we further assume that pj(𝐗):=(Yj=1|𝐗)p_{j}(\mathbf{\bm{X}}):=\mathbb{P}(Y_{j}=1|\mathbf{\bm{X}}) are distinct for j=1,,dj=1,\cdots,d with probability one.

2.1 Bayes segmentation rule

To begin with, we call segmentation with respect to the Dice metric as “Dice-segmentation”. Then, we discuss Dice-segmentation at the population level, and present its Bayes (optimal) segmentation rule in Theorem 1 akin to the Bayes classifier for classification.

Theorem 1 (The Bayes rule for Dice-segmentation)

A segmentation rule 𝛅\boldsymbol{\delta}^{*} is a global maximizer of Diceγ(𝛅){\rm Dice}_{\gamma}(\boldsymbol{\delta}) if and only if it satisfies that

δj(𝐱)={1if pj(𝐱) ranks top τ(𝐱),0otherwise.\delta_{j}^{*}(\mathbf{\bm{x}})=\begin{cases}1&\text{if }p_{j}(\mathbf{\bm{x}})\text{ ranks top }\tau^{*}(\mathbf{\bm{x}}),\\ 0&\text{otherwise}.\end{cases}

The optimal volume (the optimum total number of segmented features) τ(𝐱)\tau^{*}(\mathbf{\bm{x}}) is given as

τ(𝐱)=argmaxτ{0,1,,d}(jJτ(𝐱)l=0d12τ+l+γ+1pj(𝐱)(Γj(𝐱)=l)+l=0dγτ+l+γ(Γ(𝐱)=l)),\tau^{*}(\mathbf{\bm{x}})=\mathop{\rm argmax}_{\tau\in\{0,1,\cdots,d\}}\ \Big{(}\sum_{j\in J_{\tau}(\mathbf{\bm{x}})}\sum_{l=0}^{d-1}\frac{2}{\tau+l+\gamma+1}p_{j}(\mathbf{\bm{x}})\mathbb{P}\big{(}\Gamma_{\sm j}(\mathbf{\bm{x}})=l\big{)}+\sum_{l=0}^{d}\frac{\gamma}{\tau+l+\gamma}\mathbb{P}\big{(}\Gamma(\mathbf{\bm{x}})=l\big{)}\Big{)}, (4)

where Jτ(𝐱)={j:j=1d𝟏(pj(𝐱)pj(𝐱))τ}J_{\tau}(\mathbf{\bm{x}})=\big{\{}j:\sum_{j^{\prime}=1}^{d}\mathbf{\bm{1}}\big{(}p_{j^{\prime}}(\mathbf{\bm{x}})\geq p_{j}(\mathbf{\bm{x}})\big{)}\leq\tau\big{\}} is the index set of the τ\tau-largest conditional probabilities with J0(𝐱)=J_{0}(\mathbf{\bm{x}})=\emptyset, Γ(𝐱)=j=1dBj(𝐱)\Gamma(\mathbf{\bm{x}})=\sum_{j=1}^{d}{B}_{j}(\mathbf{\bm{x}}), and Γj(𝐱)=jjBj(𝐱){\Gamma}_{\sm j}(\mathbf{\bm{x}})=\sum_{j^{\prime}\neq j}{B}_{j^{\prime}}(\mathbf{\bm{x}}) are Poisson-binomial random variables, and Bj(𝐱){B}_{j}(\mathbf{\bm{x}}) is a Bernoulli random variable with the success probability pj(𝐱)p_{j}(\mathbf{\bm{x}}). See the definition of the Poisson-binomial distribution in Appendix B.2.

Two remarkable observations emerge from Theorem 1. First, the Bayes segmentation operator can be obtained via a two-stage procedure: (i) ranking the conditional probability pj(𝐱)p_{j}(\mathbf{\bm{x}}), and (ii) searching for the optimal volume of the segmented features τ(𝐱)\tau(\mathbf{\bm{x}}). Second, both the Bayes segmentation rule 𝜹(𝐱)\boldsymbol{\delta}^{*}(\mathbf{\bm{x}}) and the optimal volume function τ(𝐱)\tau^{*}(\mathbf{\bm{x}}) are achievable when the conditional probability 𝐩(𝐱)=(p1(𝐱),,pd(𝐱))\mathbf{\bm{p}}(\mathbf{\bm{x}})=(p_{1}(\mathbf{\bm{x}}),\cdots,p_{d}(\mathbf{\bm{x}}))^{\intercal} is well-estimated. Therefore, our proposed framework RankDice is directly inspired by a general plug-in rule of the Bayes segmentation rule.

Moreover, Lemma 9 (and examples provided in its proof) indicates that the segmentation rule produced by existing frameworks, such as the threshold-based framework, can significantly differ from the optimum in Theorem 1. In fact, Lemma 9 further proves that the cross-entropy loss, the focal loss, and even a general classification-calibrated loss (Zhang, 2004; Bartlett et al., 2006), are not Dice-calibrated. See the definition of Dice-calibrated (Definition 8), and the negative results for existing frameworks (Lemma 9) in Section 4.1. Besides, the Bayes rule for IoU-segmentation is presented in Lemma 13.

Remark 2 (Suboptimal of a fixed-thresholding framework)

The Bayes rule of Dice-segmentation can be also regarded as adaptive thresholding of conditional probabilities. Specifically, for each input 𝐱\mathbf{\bm{x}}, the optimal segmentation rule can be rewritten as:

δj(𝐱)=𝟏(pj(𝐱)pjτ(𝐱)(𝐱)),where pjτ(𝐱)(𝐱) is the top-τ(𝐱) largest probability over 𝐩(𝐱).\delta_{j}^{*}(\mathbf{\bm{x}})=\mathbf{\bm{1}}\big{(}p_{j}(\mathbf{\bm{x}})\geq p_{j_{\tau^{*}(\mathbf{\bm{x}})}}(\mathbf{\bm{x}})\big{)},\ \text{where }p_{j_{\tau^{*}(\mathbf{\bm{x}})}}(\mathbf{\bm{x}})\text{ is the top-}\tau^{*}(\mathbf{\bm{x}})\text{ largest probability over }\mathbf{\bm{p}}(\mathbf{\bm{x}}).

Alternatively, Theorem 1 indicates that the Bayes rule for Dice-segmentation is unlikely to be obtained by a fixed thresholding framework, since the optimal threshold pjτ(𝐱)(𝐱)p_{j_{\tau^{*}(\mathbf{\bm{x}})}}(\mathbf{\bm{x}}) varies greatly across different inputs. Therefore, (tuning a threshold on) a fixed thresholding-based framework leads to a suboptimal solution.

Remark 2 also explains the heterogeneity of optimal thresholds in various datasets indicated in Bice et al. (2021), and the suboptimality of a fixed-thresholding framework is also empirically supported by Table 10 and Figure 7 for simulated examples, and Table 7 and Figure 4 for real datasets. Furthermore, the fact that fixed-thresholding is suboptimal for Dice-segmentation should be compared with the existing results in classification. For binary classification, the optimal threshold maximizing the F1F_{1} metric is equal to half of the optimal F1F_{1} value, which is fixed (Lipton et al., 2014). This disparity stems from a different definition of the Dice metric (or F1F_{1}) for binary classification, where F1F_{1}-score (for binary classification) can be regard as a linear fractional utility of Dice defined in (1); see more discussion in Appendix A.

2.2 Proposed framework

Suppose a training dataset (𝐱i,𝐲i)i=1n(\mathbf{\bm{x}}_{i},\mathbf{\bm{y}}_{i})_{i=1}^{n} is given, where 𝐱id\mathbf{\bm{x}}_{i}\in\mathbb{R}^{d} and 𝐲i{0,1}d\mathbf{\bm{y}}_{i}\in\{0,1\}^{d} are the input features and the true label for the ii-th instance. Inspired by Theorem 1, we develop a ranking-based framework RankDice for Dice-segmentation (Steps 1-3).

Step 1 (Conditional probability estimation): Estimate the conditional probability based on logistic regression (the cross-entropy loss):

𝐪^(𝐱)=argmin𝐪𝒬i=1nj=1d(yijlog(qj(𝐱i))+(1yij)log(1qj(𝐱i)))+λ𝐪2,\widehat{\mathbf{\bm{q}}}(\mathbf{\bm{x}})=\mathop{\rm argmin}_{\mathbf{\bm{q}}\in\mathcal{Q}}-\sum_{i=1}^{n}\sum_{j=1}^{d}\Big{(}y_{ij}\log\big{(}q_{j}(\mathbf{\bm{x}}_{i})\big{)}+(1-y_{ij})\log\big{(}1-q_{j}(\mathbf{\bm{x}}_{i})\big{)}\Big{)}+\lambda\|\mathbf{\bm{q}}\|^{2}, (5)

where 𝒬\mathcal{Q} is a class of candidate probability functions, 𝐪\|\mathbf{\bm{q}}\| is a regularization for a candidate function, and λ>0\lambda>0 is a hyperparameter to balance the loss and regularization. For example, 𝐪𝒬\mathbf{\bm{q}}\in\mathcal{Q} is usually a deep convolutional neural network for image segmentation, and 𝐪\|\mathbf{\bm{q}}\| can be a matrix norm of weight matrices in the network.

Step 2 (Ranking): Given a new instance 𝐱\mathbf{\bm{x}}, rank its estimated conditional probabilities decreasingly, and denote the corresponding indices as j1,,jdj_{1},\cdots,j_{d}, that is, q^j1(𝐱)q^j2(𝐱)q^jd(𝐱)\widehat{{q}}_{j_{1}}(\mathbf{\bm{x}})\geq\widehat{{q}}_{j_{2}}(\mathbf{\bm{x}})\geq\cdots\geq\widehat{{q}}_{j_{d}}(\mathbf{\bm{x}}).

Step 3 (Volume estimation): From (4), we estimate the volume τ^(𝐱)\widehat{\tau}(\mathbf{\bm{x}}) by replacing the true conditional probability 𝐩(𝐱)\mathbf{\bm{p}}(\mathbf{\bm{x}}) by the estimated one 𝐪^(𝐱)\widehat{\mathbf{\bm{q}}}(\mathbf{\bm{x}}):

τ^(𝐱)\displaystyle\widehat{\tau}(\mathbf{\bm{x}}) =argmaxτ{0,,d}s=1τl=0d12τ+l+γ+1q^js(𝐱)(Γ^js(𝐱)=l)+l=0dγτ+l+γ(Γ^(𝐱)=l),\displaystyle=\mathop{\rm argmax}_{\tau\in\{0,\cdots,d\}}\sum_{s=1}^{\tau}\sum_{l=0}^{d-1}\frac{2}{\tau+l+\gamma+1}\widehat{{q}}_{j_{s}}(\mathbf{\bm{x}})\mathbb{P}\big{(}\widehat{\Gamma}_{\sm j_{s}}(\mathbf{\bm{x}})=l\big{)}+\sum_{l=0}^{d}\frac{\gamma}{\tau+l+\gamma}\mathbb{P}\big{(}\widehat{\Gamma}(\mathbf{\bm{x}})=l\big{)}, (6)

where Γ^(𝐱)=j=1dB^j(𝐱)\widehat{\Gamma}(\mathbf{\bm{x}})=\sum_{j=1}^{d}\widehat{B}_{j}(\mathbf{\bm{x}}) and Γ^js(𝐱)=jjsB^j(𝐱)\widehat{\Gamma}_{\sm j_{s}}(\mathbf{\bm{x}})=\sum_{j\neq j_{s}}\widehat{B}_{j}(\mathbf{\bm{x}}) are Poisson-binomial random variables, and B^j(𝐱)\widehat{B}_{j}(\mathbf{\bm{x}}) are independent Bernoulli random variables with success probabilities q^j(𝐱)\widehat{q}_{j}(\mathbf{\bm{x}}); for j=1,,dj=1,\cdots,d.

Finally, the predicted segmentation 𝜹^(𝐱)=(δ^1(𝐱),,δ^d(𝐱))\widehat{\boldsymbol{\delta}}(\mathbf{\bm{x}})=(\widehat{\delta}_{1}(\mathbf{\bm{x}}),\cdots,\widehat{\delta}_{d}(\mathbf{\bm{x}}))^{\intercal} is given by selecting the indices of the top-τ^(𝐱)\widehat{\tau}(\mathbf{\bm{x}}) conditional probabilities:

δ^j(𝐱)=1, if j{j1,,jτ^(𝐱)};δ^j(𝐱)=0, otherwise.\widehat{\delta}_{j}(\mathbf{\bm{x}})=1,\text{ if }j\in\{j_{1},\cdots,j_{\widehat{\tau}(\mathbf{\bm{x}})}\};\quad\widehat{\delta}_{j}(\mathbf{\bm{x}})=0,\text{ otherwise.} (7)

The proposed RankDice framework (Steps 1-3) provides a feasible solution to the Bayes segmentation rule in terms of the Dice metric. Note that Step 1 is a standard conditional probability estimation, and Step 2 simply ranks the estimated conditional probabilities. Next, we focus on developing a scalable computing scheme for Step 3.

2.3 Scalable computing schemes

This section develops scalable computing schemes for volume estimation in (6). Note that (6) can be rewritten as:

τ^(𝐱)\displaystyle\widehat{\tau}(\mathbf{\bm{x}}) =argmaxτ{0,,d}π¯τ(𝐱);π¯τ(𝐱)=ω¯τ(𝐱)+ν¯τ(𝐱),\displaystyle=\mathop{\rm argmax}_{\tau\in\{0,\cdots,d\}}\mkern 1.5mu\overline{\mkern-1.5mu\pi\mkern-1.5mu}\mkern 1.5mu_{\tau}(\mathbf{\bm{x}});\qquad\mkern 1.5mu\overline{\mkern-1.5mu\pi\mkern-1.5mu}\mkern 1.5mu_{\tau}(\mathbf{\bm{x}})=\mkern 1.5mu\overline{\mkern-1.5mu\omega\mkern-1.5mu}\mkern 1.5mu_{\tau}(\mathbf{\bm{x}})+\mkern 1.5mu\overline{\mkern-1.5mu\nu\mkern-1.5mu}\mkern 1.5mu_{\tau}(\mathbf{\bm{x}}),
ω¯τ(𝐱)\displaystyle\mkern 1.5mu\overline{\mkern-1.5mu\omega\mkern-1.5mu}\mkern 1.5mu_{\tau}(\mathbf{\bm{x}}) =l=0d12ωτ,l(𝐱)τ+l+γ+1,ωτ,l(𝐱)=s=1τq^js(𝐱)(Γ^js(𝐱)=l),ν¯τ(𝐱)=l=0dγ(Γ^(𝐱)=l)τ+l+γ.\displaystyle=\sum_{l=0}^{d-1}\frac{2\omega_{\tau,l}(\mathbf{\bm{x}})}{\tau+l+\gamma+1},\quad\omega_{\tau,l}(\mathbf{\bm{x}})=\sum_{s=1}^{\tau}\widehat{q}_{j_{s}}(\mathbf{\bm{x}})\mathbb{P}(\widehat{\Gamma}_{\sm j_{s}}(\mathbf{\bm{x}})=l),\quad\mkern 1.5mu\overline{\mkern-1.5mu\nu\mkern-1.5mu}\mkern 1.5mu_{\tau}(\mathbf{\bm{x}})=\sum_{l=0}^{d}\frac{\gamma\mathbb{P}(\widehat{\Gamma}(\mathbf{\bm{x}})=l)}{\tau+l+\gamma}. (8)

The computational complexity of solving (2.3) is intimately related to the dimension of input features. Therefore, we develop numerical algorithms for low- and high-dimensional segmentation separately.

2.3.1 Exact algorithms for low-dimensional segmentation

Exact algorithm based on FFT. When the dimension is low (d500d\leq 500), we consider an exact algorithm to evaluate ω¯τ\mkern 1.5mu\overline{\mkern-1.5mu\omega\mkern-1.5mu}\mkern 1.5mu_{\tau} and ν¯τ\mkern 1.5mu\overline{\mkern-1.5mu\nu\mkern-1.5mu}\mkern 1.5mu_{\tau}. According to the definition of ωτ,l\omega_{\tau,l}, it can be computed by the following recursive formula (τ=1,,d\tau=1,\cdots,d):

𝝎τ(𝐱)=𝝎τ1(𝐱)+q^jτ(𝐱)((Γ^jτ(𝐱)=0),,(Γ^jτ(𝐱)=d1)),𝝎0(𝐱)=𝟎,\boldsymbol{\omega}_{\tau}(\mathbf{\bm{x}})=\boldsymbol{\omega}_{\tau-1}(\mathbf{\bm{x}})+\widehat{q}_{j_{\tau}}(\mathbf{\bm{x}})\big{(}\mathbb{P}(\widehat{\Gamma}_{-j_{\tau}}(\mathbf{\bm{x}})=0),\cdots,\mathbb{P}(\widehat{\Gamma}_{-j_{\tau}}(\mathbf{\bm{x}})=d-1)\big{)}^{\intercal},\quad\boldsymbol{\omega}_{0}(\mathbf{\bm{x}})=\mathbf{\bm{0}}, (9)

where 𝝎τ(𝐱)=(ωτ,0(𝐱),,ωτ,d1(𝐱))\boldsymbol{\omega}_{\tau}(\mathbf{\bm{x}})=(\omega_{\tau,0}(\mathbf{\bm{x}}),\cdots,\omega_{\tau,d-1}(\mathbf{\bm{x}}))^{\intercal}. On this ground, it suffices to evaluate (Γ^(𝐱)=l)\mathbb{P}(\widehat{\Gamma}(\mathbf{\bm{x}})=l) and (Γ^jτ(𝐱)=l)\mathbb{P}(\widehat{\Gamma}_{\sm j_{\tau}}(\mathbf{\bm{x}})=l), which are the probability mass functions of Poisson-binomial random variables Γ^(𝐱)\widehat{\Gamma}(\mathbf{\bm{x}}) and Γ^jτ(𝐱)\widehat{\Gamma}_{\sm j_{\tau}}(\mathbf{\bm{x}}), respectively. As indicated in Hong (2013), they can be efficiently evaluated by a fast Fourier transformation (FFT). Based on the numerical results in Hong (2013), the computing time for FFT evaluation with d500d\leq 500 is generally negligible (less than ten milliseconds). Moreover, it is worth noting that our algorithm in (9) needs not store the entire auxiliary matrix (𝝎1(𝐱),,𝝎d(𝐱))(\boldsymbol{\omega}_{1}(\mathbf{\bm{x}}),\cdots,\boldsymbol{\omega}_{d}(\mathbf{\bm{x}})), since the τ\tau-th row 𝝎τ\boldsymbol{\omega}_{\tau} can be computed from the previous row 𝝎τ1\boldsymbol{\omega}_{\tau-1}. Hence, only O(d)O(d) storage is required in (9). The detailed algorithm is summarized in Algorithm 1 with approx=False.

2.3.2 Approximation algorithms for high-dimensional segmentation

For high-dimensional segmentation, especially for image segmentation, it is challenging to solve (6) by a grid searching over τ{0,,d}\tau\in\{0,\cdots,d\}. To address this issue, we use shrinkage and approximation techniques to reduce the computational complexity of (6). First, Lemma 3 is developed to shrink the searching range of τ\tau.

Lemma 3 (Shrinkage)

If s=1τq^js(𝐱)(τ+γ+d)q^jτ+1(𝐱)\sum_{s=1}^{\tau}\widehat{q}_{j_{s}}(\mathbf{\bm{x}})\geq(\tau+\gamma+d)\widehat{q}_{j_{\tau+1}}(\mathbf{\bm{x}}), then π¯τ(𝐱)π¯τ(𝐱)\mkern 1.5mu\overline{\mkern-1.5mu\pi\mkern-1.5mu}\mkern 1.5mu_{\tau}(\mathbf{\bm{x}})\geq\mkern 1.5mu\overline{\mkern-1.5mu\pi\mkern-1.5mu}\mkern 1.5mu_{\tau^{\prime}}(\mathbf{\bm{x}}) for all τ>τ\tau^{\prime}>\tau.

Lemma 3 can be viewed as an early stopping of the grid searching, which draws from the property of Poisson binomial distribution (c.f. Lemma 17). Accordingly, we can shrink the grid search in (6) from {0,,d}\{0,\cdots,d\} to {0,,d0(𝐱)}\{0,\cdots,d_{0}(\mathbf{\bm{x}})\}, which significantly reduces the computational complexity. Specifically,

τ^(𝐱)=argmaxτ{0,,d0(𝐱)}π¯τ(𝐱);d0(𝐱)=min{τ=1,,d:s=1τq^js(𝐱)/q^jτ+1(𝐱)τ+γ+d}.\widehat{\tau}(\mathbf{\bm{x}})=\mathop{\rm argmax}_{\tau\in\{0,\cdots,d_{0}(\mathbf{\bm{x}})\}}\mkern 1.5mu\overline{\mkern-1.5mu\pi\mkern-1.5mu}\mkern 1.5mu_{\tau}(\mathbf{\bm{x}});\quad d_{0}(\mathbf{\bm{x}})=\min\big{\{}\tau=1,\cdots,d:\sum_{s=1}^{\tau}\widehat{q}_{j_{s}}(\mathbf{\bm{x}})/\widehat{q}_{j_{\tau+1}}(\mathbf{\bm{x}})\geq\tau+\gamma+d\big{\}}. (10)

In many applications, d0(𝐱)d_{0}(\mathbf{\bm{x}}) is upper bounded by a small integer, that is, d0(𝐱)<dUdd_{0}(\mathbf{\bm{x}})<d_{U}\ll d, called the well-separated segmentation. For example, there exists a small number of features/pixels whose probabilities (close to 1) are much larger than the others.

Truncated refined normal approximation (T-RNA). Note that the cumulative distribution function (CDF), and thus the probability mass function, of a Poisson-binomial random variable can be efficiently evaluated by a refined normal approximation when the dimension is large (Hong, 2013; Neammanee, 2005). For instance,

(Γ^(𝐱)l)\displaystyle\mathbb{P}(\widehat{\Gamma}(\mathbf{\bm{x}})\leq l) ~(Γ^(𝐱)l)Ψ(σ^1(l+1/2μ^(𝐱))),\displaystyle\ \leftarrow\ \widetilde{\mathbb{P}}(\widehat{\Gamma}(\mathbf{\bm{x}})\leq l)\coloneqq\Psi\big{(}\widehat{\sigma}^{-1}(l+1/2-\widehat{\mu}(\mathbf{\bm{x}}))\big{)},
(Γ^j(𝐱)l)\displaystyle\mathbb{P}(\widehat{\Gamma}_{\sm j}(\mathbf{\bm{x}})\leq l) ~(Γ^j(𝐱)l)Ψj(σ^j1(l+1/2μ^j(𝐱))),\displaystyle\ \leftarrow\ \widetilde{\mathbb{P}}(\widehat{\Gamma}_{\sm j}(\mathbf{\bm{x}})\leq l)\coloneqq\Psi_{\sm j}\big{(}\widehat{\sigma}_{\sm j}^{-1}(l+1/2-\widehat{\mu}_{\sm j}(\mathbf{\bm{x}}))\big{)}, (11)

where Ψ(u)=Φ(u)+η^(𝐱)(1u2)ϕ(u)/6\Psi(u)=\Phi(u)+\widehat{\eta}(\mathbf{\bm{x}})(1-u^{2})\phi(u)/6 and Ψj(u)=Φ(u)+η^j(𝐱)(1u2)ϕ(u)/6\Psi_{\sm j}(u)=\Phi(u)+\widehat{\eta}_{\sm j}(\mathbf{\bm{x}})(1-u^{2})\phi(u)/6 are two skew-corrected refined normal CDFs, Φ()\Phi(\cdot) is the CDF of the standard normal distribution, and (μ^(𝐱),μ^j(𝐱))(\widehat{\mu}(\mathbf{\bm{x}}),\widehat{\mu}_{\sm j}(\mathbf{\bm{x}})), (σ^2(𝐱),σ^j2(𝐱))(\widehat{\sigma}^{2}(\mathbf{\bm{x}}),\widehat{\sigma}_{\sm j}^{2}(\mathbf{\bm{x}})), (η^(𝐱),η^j(𝐱))(\widehat{\eta}(\mathbf{\bm{x}}),\widehat{\eta}_{\sm j}(\mathbf{\bm{x}})) are mean, variance and skewness of the Poisson-binomial random variables Γ^(𝐱)\widehat{\Gamma}(\mathbf{\bm{x}}) and Γ^j(𝐱)\widehat{\Gamma}_{\sm j}(\mathbf{\bm{x}}), respectively. See the definitions of variance and skewness of the Poisson-binomial distribution in Appendix B.2.

On this ground, it is unnecessary to compute all (Γ^j(𝐱)=l)\mathbb{P}(\widehat{\Gamma}_{\sm j}(\mathbf{\bm{x}})=l) and (Γ^(𝐱)=l)\mathbb{P}(\widehat{\Gamma}(\mathbf{\bm{x}})=l) for l=1,,dl=1,\cdots,d, since they are negligibly close to zero when ll is too small or too large. In other words, many (Γ^j(𝐱)=l)\mathbb{P}(\widehat{\Gamma}_{\sm j}(\mathbf{\bm{x}})=l) and (Γ^(𝐱)=l)\mathbb{P}(\widehat{\Gamma}(\mathbf{\bm{x}})=l) can be ignored when evaluating ω¯τ\mkern 1.5mu\overline{\mkern-1.5mu\omega\mkern-1.5mu}\mkern 1.5mu_{\tau} and ν¯τ\mkern 1.5mu\overline{\mkern-1.5mu\nu\mkern-1.5mu}\mkern 1.5mu_{\tau}. Therefore, according to the refined normal approximation in (2.3.2), ω¯τ\mkern 1.5mu\overline{\mkern-1.5mu\omega\mkern-1.5mu}\mkern 1.5mu_{\tau} and ν¯τ\mkern 1.5mu\overline{\mkern-1.5mu\nu\mkern-1.5mu}\mkern 1.5mu_{\tau} can be approximated by only taking a partial sum over a subset of l=0,,dl=0,\cdots,d:

ω~τ(𝐱)\displaystyle\widetilde{\omega}_{\tau}(\mathbf{\bm{x}}) =l(ϵ)2ω~τ,l(𝐱)τ+l+γ+1,ν~τ(𝐱)=l(ϵ)γ~(Γ^(𝐱)=l)τ+l+γ,ω~τ,l(𝐱)=s=1τq^js(𝐱)~(Γ^js(𝐱)=l),\displaystyle=\sum_{l\in\mathcal{L}(\epsilon)}\frac{2\widetilde{\omega}_{\tau,l}(\mathbf{\bm{x}})}{\tau+l+\gamma+1},\quad\widetilde{\nu}_{\tau}(\mathbf{\bm{x}})=\sum_{l\in\mathcal{L}(\epsilon)}\frac{\gamma\widetilde{\mathbb{P}}(\widehat{\Gamma}(\mathbf{\bm{x}})=l)}{\tau+l+\gamma},\quad\widetilde{\omega}_{\tau,l}(\mathbf{\bm{x}})=\sum_{s=1}^{\tau}\widehat{q}_{j_{s}}(\mathbf{\bm{x}})\widetilde{\mathbb{P}}(\widehat{\Gamma}_{\sm j_{s}}(\mathbf{\bm{x}})=l),
(ϵ)\displaystyle\mathcal{L}(\epsilon) ={σ^(𝐱)Ψ1(ϵ)+μ^(𝐱)32,,σ^(𝐱)Ψ1(1ϵ)+μ^(𝐱)12}{0,,d},\displaystyle=\Big{\{}\lfloor\widehat{\sigma}(\mathbf{\bm{x}})\Psi^{-1}(\epsilon)+\widehat{\mu}(\mathbf{\bm{x}})-\frac{3}{2}\rfloor,\cdots,\lfloor\widehat{\sigma}(\mathbf{\bm{x}})\Psi^{-1}(1-\epsilon)+\widehat{\mu}(\mathbf{\bm{x}})-\frac{1}{2}\rfloor\Big{\}}\bigcap\{0,\cdots,d\}, (12)

where ~(Γ^(𝐱)=l)~(Γ^(𝐱)l)~(Γ^(𝐱)l1)\widetilde{\mathbb{P}}(\widehat{\Gamma}(\mathbf{\bm{x}})=l)\coloneqq\widetilde{\mathbb{P}}(\widehat{\Gamma}(\mathbf{\bm{x}})\leq l)-\widetilde{\mathbb{P}}(\widehat{\Gamma}(\mathbf{\bm{x}})\leq l-1), ~(Γ^j(𝐱)=l)~(Γ^j(𝐱)l)~(Γ^j(𝐱)l1)\widetilde{\mathbb{P}}(\widehat{\Gamma}_{\sm j}(\mathbf{\bm{x}})=l)\coloneqq\widetilde{\mathbb{P}}(\widehat{\Gamma}_{\sm j}(\mathbf{\bm{x}})\leq l)-\widetilde{\mathbb{P}}(\widehat{\Gamma}_{\sm j}(\mathbf{\bm{x}})\leq l-1), Ψ1()\Psi^{-1}(\cdot) is the quantile function of the refined normal distribution, and ϵ\epsilon is a custom tolerance error. The detailed algorithm is summarized in Algorithm 1 with approx=‘T-RNA’. Moreover, the following lemma quantifies the approximation error of the proposed approximate algorithm.

Lemma 4

For (ω¯τ,ν¯τ)(\mkern 1.5mu\overline{\mkern-1.5mu\omega\mkern-1.5mu}\mkern 1.5mu_{\tau},\mkern 1.5mu\overline{\mkern-1.5mu\nu\mkern-1.5mu}\mkern 1.5mu_{\tau}) and (ω~τ,ν~τ)(\widetilde{\omega}_{\tau},\widetilde{\nu}_{\tau}) defined in (2.3) and (2.3.2) respectively, if σ^2(𝐱)25\widehat{\sigma}^{2}(\mathbf{\bm{x}})\geq 25, then for any τ{0,,d}\tau\in\{0,\cdots,d\}, we have

|ω~τω¯τ|\displaystyle|\widetilde{\omega}_{\tau}-\mkern 1.5mu\overline{\mkern-1.5mu\omega\mkern-1.5mu}\mkern 1.5mu_{\tau}| 4ττ+γ+1(ϵ+C0σ^2(𝐱))+C0min(μ^(𝐱),τ)σ^2(𝐱)1/4(log(1+d)+1),\displaystyle\leq\frac{4\tau}{\tau+\gamma+1}(\epsilon+\frac{C_{0}}{\widehat{\sigma}^{2}(\mathbf{\bm{x}})})+\frac{C_{0}\min(\widehat{\mu}(\mathbf{\bm{x}}),\tau)}{\widehat{\sigma}^{2}(\mathbf{\bm{x}})-1/4}\Big{(}\log\big{(}1+d\big{)}+1\Big{)},
|ν~τν¯τ|\displaystyle|\widetilde{\nu}_{\tau}-\mkern 1.5mu\overline{\mkern-1.5mu\nu\mkern-1.5mu}\mkern 1.5mu_{\tau}| 2γτ+γ(ϵ+C0σ^2(𝐱))+C0γσ^2(𝐱)(log(1+d)+1),\displaystyle\leq\frac{2\gamma}{\tau+\gamma}(\epsilon+\frac{C_{0}}{\widehat{\sigma}^{2}(\mathbf{\bm{x}})})+\frac{C_{0}\gamma}{\widehat{\sigma}^{2}(\mathbf{\bm{x}})}\Big{(}\log\big{(}1+d\big{)}+1\Big{)},

where C0=0.1618C_{0}=0.1618 if σ^2(𝐱)100\widehat{\sigma}^{2}(\mathbf{\bm{x}})\geq 100, and C0=0.3056C_{0}=0.3056 if σ^2(𝐱)25\widehat{\sigma}^{2}(\mathbf{\bm{x}})\geq 25. Moreover, when dd\to\infty, we have

|π~τπ¯τ|(4ττ+γ+1+2γτ+γ)ϵ+O(min(μ^(𝐱),τ)log(d)σ^2(𝐱)).|\widetilde{\pi}_{\tau}-\mkern 1.5mu\overline{\mkern-1.5mu\pi\mkern-1.5mu}\mkern 1.5mu_{\tau}|\leq\big{(}\frac{4\tau}{\tau+\gamma+1}+\frac{2\gamma}{\tau+\gamma}\big{)}\epsilon+O(\frac{\min(\widehat{\mu}(\mathbf{\bm{x}}),\tau)\log(d)}{\widehat{\sigma}^{2}(\mathbf{\bm{x}})}).

Here we define 0/000/0\coloneqq 0 for γ/(τ+γ)\gamma/(\tau+\gamma) when τ=γ=0\tau=\gamma=0.

Note that σ^2(𝐱)=j=1dq^j(𝐱)(1q^j(𝐱))\widehat{\sigma}^{2}(\mathbf{\bm{x}})=\sum_{j=1}^{d}\widehat{q}_{j}(\mathbf{\bm{x}})(1-\widehat{q}_{j}(\mathbf{\bm{x}})) is the variance of Γ^(𝐱)\widehat{\Gamma}(\mathbf{\bm{x}}), which often tends to infinity as dd\to\infty. In image segmentation, the dimension dd varies from 1024 (32x32) to 262144 (512x512). Therefore, the error bound is practical for high-dimensional segmentation. Moreover, ϵ\epsilon is the tolerance error to balance the approximation error and computation complexity. For instance, when ϵ\epsilon becomes smaller, the approximation error decreases, the computation complexity increases with an enlarged (ϵ)\mathcal{L}(\epsilon). Typically, Ψ1(1ϵ)Ψ1(ϵ)7.438\Psi^{-1}(1-\epsilon)-\Psi^{-1}(\epsilon)\leq 7.438 when ϵ=1e4\epsilon=1\mathrm{e}{-4}. Based on the approximation formula (2.3.2), the worst-case computational complexity is reduced to O(dσ^(𝐱))O(d\widehat{\sigma}(\mathbf{\bm{x}})) for general segmentation and O(σ^(𝐱))O(\widehat{\sigma}(\mathbf{\bm{x}})) for well-separated segmentation (d0(𝐱)dUd_{0}(\mathbf{\bm{x}})\leq d_{U}). The computational complexity is summarized in Table 1 (based on ϵ=1e4\epsilon=1\mathrm{e}{-4}).

Input : Training set: (𝐱i,𝐲i)i=1n(\mathbf{\bm{x}}_{i},\mathbf{\bm{y}}_{i})_{i=1}^{n}; A new testing instance: 𝐱\mathbf{\bm{x}}; Approx method: approx
Output : The predicted segmentation for the testing instance 𝜹^(𝐱)\widehat{\boldsymbol{\delta}}(\mathbf{\bm{x}})
1 Conditional prob est. Estimate conditional prob 𝐪^\widehat{\mathbf{\bm{q}}} via (5) based on training set (𝐱i,𝐲i)i=1n(\mathbf{\bm{x}}_{i},\mathbf{\bm{y}}_{i})_{i=1}^{n};
2 Ranking. Rank estimated conditional probabilities for 𝐱\mathbf{\bm{x}}, and denote the indices in sorted order as j1,,jdj_{1},\cdots,j_{d}, that is, q^j1(𝐱)q^j2(𝐱)q^jd(𝐱)\widehat{q}_{j_{1}}(\mathbf{\bm{x}})\geq\widehat{q}_{j_{2}}(\mathbf{\bm{x}})\geq\cdots\geq\widehat{q}_{j_{d}}(\mathbf{\bm{x}});
3 Volume estimation.
4 Compute and store the cumulative sum of sorted estimated probabilities 𝐬^(𝐱)\widehat{\mathbf{\bm{s}}}(\mathbf{\bm{x}});
5 Compute d0(𝐱)d_{0}(\mathbf{\bm{x}}) based on 𝐬^(𝐱)\widehat{\mathbf{\bm{s}}}(\mathbf{\bm{x}}) via (10);
6 if approx is None then
7      ={0,,d}\mathcal{L}=\{0,\cdots,d\};
8else
9      =(ϵ)={lL,,lU}\mathcal{L}=\mathcal{L}(\epsilon)=\{l_{L},\cdots,l_{U}\} based on (2.3.2);
10 end if
11Compute and store (Γ^(𝐱)=l)\mathbb{P}(\widehat{\Gamma}(\mathbf{\bm{x}})=l) for ll\in\mathcal{L};
12 if approx is BA then
13       Compute and store 𝝎~1:d0(𝐱)b\widetilde{\mathbf{\bm{\omega}}}_{1:d_{0}(\mathbf{\bm{x}})}^{b} and 𝝂~1:d0(𝐱)b\widetilde{\mathbf{\bm{\nu}}}_{1:d_{0}(\mathbf{\bm{x}})}^{b} based on (2.3.2);
14      
15else
16       Initialize 𝝎old=𝟎\mathbf{\bm{\omega}}_{old}=\mathbf{\bm{0}};
17       for τ=1,,d0(𝐱)\tau=1,\cdots,d_{0}(\mathbf{\bm{x}}) do
18             Update 𝝎new\mathbf{\bm{\omega}}_{new} as
𝝎new,l𝝎old,l+q^jτ(𝐱)(Γ^jτ(𝐱)=l), for l,𝝎old𝝎new,\mathbf{\bm{\omega}}_{new,l}\leftarrow\mathbf{\bm{\omega}}_{old,l}+\widehat{q}_{j_{\tau}}(\mathbf{\bm{x}})\mathbb{P}(\widehat{\Gamma}_{-j_{\tau}}(\mathbf{\bm{x}})=l),\text{ for }l\in\mathcal{L},\quad\mathbf{\bm{\omega}}_{old}\leftarrow\mathbf{\bm{\omega}}_{new},
where Γ^j(𝐱)\widehat{\Gamma}_{\sm j}(\mathbf{\bm{x}}) is Poisson binomial r.v. with the success probabilities 𝐪^j(𝐱)\widehat{\mathbf{\bm{q}}}_{\sm j}(\mathbf{\bm{x}});
19             Compute and store π¯τ=ω¯τ+ν¯τ=l1τ+l+1ωnew,l+lγτ+l+γ(Γ^(𝐱)=l)\mkern 1.5mu\overline{\mkern-1.5mu\pi\mkern-1.5mu}\mkern 1.5mu_{\tau}=\mkern 1.5mu\overline{\mkern-1.5mu\omega\mkern-1.5mu}\mkern 1.5mu_{\tau}+\mkern 1.5mu\overline{\mkern-1.5mu\nu\mkern-1.5mu}\mkern 1.5mu_{\tau}=\sum_{l\in\mathcal{L}}\frac{1}{\tau+l+1}\omega_{new,l}+\sum_{l\in\mathcal{L}}\frac{\gamma}{\tau+l+\gamma}\mathbb{P}(\widehat{\Gamma}(\mathbf{\bm{x}})=l);
20            
21       end for
22      
23 end if
24Estimate τ^(𝐱)=argmaxτ=0,,d0(𝐱)π¯τ\widehat{\tau}(\mathbf{\bm{x}})=\mathop{\rm argmax}_{\tau=0,\cdots,d_{0}(\mathbf{\bm{x}})}\mkern 1.5mu\overline{\mkern-1.5mu\pi\mkern-1.5mu}\mkern 1.5mu_{\tau} via (2.3);
Prediction. The predicted segmentation is provided as:
δ^j(𝐱)=1, if j{j1,,jτ^(𝐱)};δ^j(𝐱)=0, otherwise.\widehat{\delta}_{j}(\mathbf{\bm{x}})=1,\text{ if }j\in\{j_{1},\cdots,j_{\widehat{\tau}(\mathbf{\bm{x}})}\};\quad\widehat{\delta}_{j}(\mathbf{\bm{x}})=0,\text{ otherwise.}
return The predicted segmentation 𝛅^(𝐱)=(δ^1(𝐱),,δ^d(𝐱))\widehat{\boldsymbol{\delta}}(\mathbf{\bm{x}})=(\widehat{\delta}_{1}(\mathbf{\bm{x}}),\cdots,\widehat{\delta}_{d}(\mathbf{\bm{x}}))^{\intercal}.
Algorithm 1 Computing schemes for the proposed RankDice framework.

Although T-RNA significantly reduces the computational complexity, yet this algorithm uses recursive updates, making it difficult in parallel computing on GPUs. For example, due to the memory issues, recursive function calls are restricted in the CUDA (Compute Unified Device Architecture) kernels (Nickolls et al., 2008). Next, we propose the blind approximation algorithm to exploit GPU-computing for the proposed RankDice.

Blind approximation (BA). In high-dimensional segmentation, the difference in distributions between Γ^(𝐱)\widehat{\Gamma}(\mathbf{\bm{x}}) and Γ^j(𝐱)\widehat{\Gamma}_{\sm j}(\mathbf{\bm{x}}) is negligible. Inspired by this fact, we propose a novel approximation algorithm, namely the blind approximation, to simultaneously evaluate the score functions with a small error tolerance. Specifically, based on the refined normal approximation, we further simplify the evaluation formulas by replacing ~(Γ^js(𝐱)=l)\widetilde{\mathbb{P}}(\widehat{\Gamma}_{\sm j_{s}}(\mathbf{\bm{x}})=l) as ~(Γ^(𝐱)=l)\widetilde{\mathbb{P}}(\widehat{\Gamma}(\mathbf{\bm{x}})=l):

ω~τb(𝐱)\displaystyle\widetilde{\omega}^{b}_{\tau}(\mathbf{\bm{x}}) =2s=1τq^js(𝐱)l(ϵ)~(Γ^(𝐱)=l)τ+l+γ+1,ν~τb(𝐱)=ν~τ(𝐱)=l(ϵ)γ~(Γ^(𝐱)=l)τ+l+γ,\displaystyle=2\sum_{s=1}^{\tau}\widehat{q}_{j_{s}}(\mathbf{\bm{x}})\sum_{l\in\mathcal{L}(\epsilon)}\frac{\widetilde{\mathbb{P}}(\widehat{\Gamma}(\mathbf{\bm{x}})=l)}{\tau+l+\gamma+1},\quad\widetilde{\nu}^{b}_{\tau}(\mathbf{\bm{x}})=\widetilde{\nu}_{\tau}(\mathbf{\bm{x}})=\sum_{l\in\mathcal{L}(\epsilon)}\frac{\gamma\widetilde{\mathbb{P}}(\widehat{\Gamma}(\mathbf{\bm{x}})=l)}{\tau+l+\gamma},

where (ϵ)={lL,,lU}\mathcal{L}(\epsilon)=\{l_{L},\cdots,l_{U}\} is defined in (2.3.2). Then, for any 1d0d1\leq d_{0}\leq d, 𝝎~1:Tb=(ω~1b,,ω~Tb)\widetilde{\boldsymbol{\omega}}_{1:T}^{b}=(\widetilde{\omega}_{1}^{b},\cdots,\widetilde{\omega}_{T}^{b})^{\intercal} and 𝝂~1:Tb=(ν~1b,,ν~Tb)\widetilde{\boldsymbol{\nu}}_{1:T}^{b}=(\widetilde{\nu}_{1}^{b},\cdots,\widetilde{\nu}_{T}^{b})^{\intercal} can be simultaneously computed over τ=0,,d0\tau=0,\cdots,d_{0} based on cross-correlation (flipped convolution (Tahmasebi et al., 2012)):

𝝎~1:d0b\displaystyle\widetilde{\boldsymbol{\omega}}_{1:d_{0}}^{b} =2𝐬^1:d0(𝐱)((~(Γ^(𝐱)=lL),,~(Γ^(𝐱)=lU))(1ll+γ+1,,1lU+γ+d0+1)),\displaystyle=2\widehat{\mathbf{\bm{s}}}_{1:d_{0}}(\mathbf{\bm{x}})\circ\Big{(}\big{(}\widetilde{\mathbb{P}}(\widehat{\Gamma}(\mathbf{\bm{x}})=l_{L}),\cdots,\widetilde{\mathbb{P}}(\widehat{\Gamma}(\mathbf{\bm{x}})=l_{U})\big{)}^{\intercal}\mathbf{\bm{\star}}\big{(}\frac{1}{l_{l}+\gamma+1},\cdots,\frac{1}{l_{U}+\gamma+d_{0}+1}\big{)}\Big{)},
𝝂~1:d0b\displaystyle\widetilde{\boldsymbol{\nu}}_{1:d_{0}}^{b} =γ((~(Γ^(𝐱)=lL),,~(Γ^(𝐱)=lU))(1ll+γ,,1lU+γ+d0)),\displaystyle=\gamma\Big{(}\big{(}\widetilde{\mathbb{P}}(\widehat{\Gamma}(\mathbf{\bm{x}})=l_{L}),\cdots,\widetilde{\mathbb{P}}(\widehat{\Gamma}(\mathbf{\bm{x}})=l_{U})\big{)}^{\intercal}\mathbf{\bm{\star}}\big{(}\frac{1}{l_{l}+\gamma},\cdots,\frac{1}{l_{U}+\gamma+d_{0}}\big{)}\Big{)}, (13)

where 𝐬^(𝐱)=(s^1(𝐱),,s^d(𝐱))\widehat{\mathbf{\bm{s}}}(\mathbf{\bm{x}})=(\widehat{s}_{1}(\mathbf{\bm{x}}),\cdots,\widehat{s}_{d}(\mathbf{\bm{x}}))^{\intercal} and s^K(𝐱)=k=1Kq^jk(𝐱)\widehat{s}_{K}(\mathbf{\bm{x}})=\sum_{k=1}^{K}\widehat{q}_{j_{k}}(\mathbf{\bm{x}}) is the cumulative sum of sorted estimated probabilities, \circ is element-wise product of two vectors, and \mathbf{\bm{\star}} is the cross-correlation operator (flipped convolution) of two vectors (Tahmasebi et al., 2012). Notably, the cross-correlation can be efficiently implemented by a fast Fourier transform (Bracewell and Bracewell, 1986) with time complexity O((d0+σ^(𝐱))log(d0+σ^(𝐱)))O((d_{0}+\widehat{\sigma}(\mathbf{\bm{x}}))\log(d_{0}+\widehat{\sigma}(\mathbf{\bm{x}}))). Besides, the overall time complexity with CUDA implementation on GPUs can be further reduced by parallel computing. The detailed algorithm is summarized in Algorithm 1 with approx=‘BA’. Next, Lemma 5 shows the approximation error of the proposed blind approximation algorithm.

Lemma 5

For (ω¯τ,ν¯τ)(\mkern 1.5mu\overline{\mkern-1.5mu\omega\mkern-1.5mu}\mkern 1.5mu_{\tau},\mkern 1.5mu\overline{\mkern-1.5mu\nu\mkern-1.5mu}\mkern 1.5mu_{\tau}) and (ω~τb,ν~τb)(\widetilde{\omega}^{b}_{\tau},\widetilde{\nu}^{b}_{\tau}) defined in (2.3) and (2.3.2) respectively, if σ^2(𝐱)25\widehat{\sigma}^{2}(\mathbf{\bm{x}})\geq 25, then for any τ{0,,d}\tau\in\{0,\cdots,d\}, and any γ0\gamma\geq 0, we have

|ω~τbω¯τ|\displaystyle\big{|}\widetilde{\omega}^{b}_{\tau}-\mkern 1.5mu\overline{\mkern-1.5mu\omega\mkern-1.5mu}\mkern 1.5mu_{\tau}\big{|} 4ττ+γ+1(ϵ+C0σ^2(𝐱))+C0min(μ^(𝐱),τ)σ^2(𝐱)1/4(log(1+d)+1)\displaystyle\leq\frac{4\tau}{\tau+\gamma+1}(\epsilon+\frac{C_{0}}{\widehat{\sigma}^{2}(\mathbf{\bm{x}})})+\frac{C_{0}\min(\widehat{\mu}(\mathbf{\bm{x}}),\tau)}{\widehat{\sigma}^{2}(\mathbf{\bm{x}})-1/4}\Big{(}\log\big{(}1+d\big{)}+1\Big{)}
+142π(1/(2e)σ^2(𝐱)1/4+4σ^2(𝐱)1/4+4m^3(𝐱)(σ^2(𝐱)1/4)3/2)(log(1+d)+1),\displaystyle\quad+\frac{1}{4\sqrt{2\pi}}\Big{(}\frac{1/(2\sqrt{e})}{\widehat{\sigma}^{2}(\mathbf{\bm{x}})-1/4}+\frac{4}{\sqrt{\widehat{\sigma}^{2}(\mathbf{\bm{x}})-1/4}}+\frac{4\widehat{m}_{3}(\mathbf{\bm{x}})}{(\widehat{\sigma}^{2}(\mathbf{\bm{x}})-1/4)^{3/2}}\Big{)}\Big{(}\log\big{(}1+d\big{)}+1\Big{)},

where m^3(𝐱)=j=1dq^j(𝐱)(1q^j(𝐱))(12q^j(𝐱))\widehat{m}_{3}(\mathbf{\bm{x}})=\sum_{j=1}^{d}\widehat{q}_{j}(\mathbf{\bm{x}})(1-\widehat{q}_{j}(\mathbf{\bm{x}}))(1-2\widehat{q}_{j}(\mathbf{\bm{x}})), C0=0.1618C_{0}=0.1618 if σ^2(𝐱)100\widehat{\sigma}^{2}(\mathbf{\bm{x}})\geq 100, and C0=0.3056C_{0}=0.3056 if σ^2(𝐱)25\widehat{\sigma}^{2}(\mathbf{\bm{x}})\geq 25. Specifically, when dd\to\infty, we have

|π~τπ¯τ|(4ττ+γ+1+2γτ+γ)ϵ+O((min(μ^(𝐱),τ)σ^2(𝐱)+1σ^(𝐱))log(d)).|\widetilde{\pi}_{\tau}-\mkern 1.5mu\overline{\mkern-1.5mu\pi\mkern-1.5mu}\mkern 1.5mu_{\tau}|\leq\big{(}\frac{4\tau}{\tau+\gamma+1}+\frac{2\gamma}{\tau+\gamma}\big{)}\epsilon+O\Big{(}\big{(}\frac{\min(\widehat{\mu}(\mathbf{\bm{x}}),\tau)}{\widehat{\sigma}^{2}(\mathbf{\bm{x}})}+\frac{1}{\widehat{\sigma}(\mathbf{\bm{x}})}\big{)}\log(d)\Big{)}.

In contrast to the truncated refined normal approximation, the blind approximation algorithm significantly reduces the time complexity and enables GPU parallel execution. On the other hand, the blind approximation leads to an additional σ^1(𝐱)\widehat{\sigma}^{-1}(\mathbf{\bm{x}}) in Lemma 5 compared with that of T-RNA in Lemma 4, yet the error bound is still acceptable in practice.

Taken together, we summarize the foregoing computational schemes in Algorithm 1, and their inference (after obtaining conditional probabilities) computational complexity (worst-case) is indicated in Table 1. For the threshold-based framework, thresholding the estimated probabilities takes O(d)O(d) time complexity. For the proposed method, Step 2 takes O(dlog(d))O(d\log(d)) based on the merge sort (Ajtai et al., 1983), and Step 3 takes O(d0(𝐱)σ^(𝐱))O(d_{0}(\mathbf{\bm{x}})\widehat{\sigma}(\mathbf{\bm{x}})) based on T-RNA in (10) and (2.3.2), and takes O(d0(𝐱)log(d0(𝐱)))O\big{(}d_{0}(\mathbf{\bm{x}})\log(d_{0}(\mathbf{\bm{x}}))\big{)} based on BA in (2.3.2).

SEG framework Time Time (well-separated) Memory
Threshold-based SEG O(d)O(d) O(d)O(d) O(d)O(d)
RankDice (our) O(dlog(d)+dd0(𝐱))O(d\log(d)+dd_{0}(\mathbf{\bm{x}})) O(dlog(d))O(d\log(d)) O(d)O(d)
RankDice-RNA (our) O(dlog(d)+σ^(𝐱)d0(𝐱))O(d\log(d)+\widehat{\sigma}(\mathbf{\bm{x}})d_{0}(\mathbf{\bm{x}})) O(dlog(d))O(d\log(d)) O(d)O(d)
RankDice-BA (our) O(d0(𝐱)log(d0(𝐱)))O(d_{0}(\mathbf{\bm{x}})\log(d_{0}(\mathbf{\bm{x}}))) O(dlog(d))O(d\log(d)) O(d)O(d)
Table 1: Inference (prediction) computational complexity for the segmentation frameworks. Here “Time” denotes the time complexity, “Memory” denotes the amount of storage needed including probability outcomes, “Time (well-separated)” denotes the time complexity on well-separated segmentation (d0(𝐱)dUd_{0}(\mathbf{\bm{x}})\leq d_{U}), and d0(𝐱)dd_{0}(\mathbf{\bm{x}})\leq d is the reduced dimension defined in (10), σ^(𝐱)\widehat{\sigma}(\mathbf{\bm{x}}) is the standard deviation of Γ^(𝐱)\widehat{\Gamma}(\mathbf{\bm{x}}) with at most an order of O(d)O(\sqrt{d}). Note that the time complexity of RankDice-BA can be further reduced by GPU implementation. See more detailed discussion in Section 2.3.

3 mDice-segmentation and mRankDice

In this section, we discuss the extension of the proposed RankDice framework to segmentation with multiclass/multilabel outcomes evaluated by the mDice metric. Overall, multiclass/multilabel segmentation differs from Dice-segmentation in a number of important aspects. First, a new evaluation metric called mDice is introduced. Second, multiclass/multilabel outcomes provide two different ways of probabilistic modeling. Third, whether (or not) to allow for overlapping segmentation results among different classes leads to different problem setups. These aspects will be discussed in detail in the following subsections.

3.1 The mDice metric

For multiclass/multilabel segmentation, (𝐗,𝐘1,,𝐘K)(\mathbf{\bm{X}},\mathbf{\bm{Y}}_{\cdot 1},\cdots,\mathbf{\bm{Y}}_{\cdot K}) is available, where 𝐗d\mathbf{\bm{X}}\in\mathbb{R}^{d} represents a feature vector, KK is the total number of classes of interest, 𝐘k{0,1}d\mathbf{\bm{Y}}_{\cdot k}\in\{0,1\}^{d} is the true feature-wise segmentation label for the kk-th class, where Yjk=1Y_{jk}=1 indicates that the jj-th feature XjX_{j} is segmented under the kk-th class, and I(𝐘k)={j:Yjk=1; for j=1,,d}I(\mathbf{\bm{Y}}_{\cdot k})=\big{\{}j:Y_{jk}=1;\text{ for }j=1,\cdots,d\big{\}} is the class-specific index set for all segmented features.

On this ground, a class-specific segmentation operator 𝚫k:d{0,1}d(k=1,,K)\boldsymbol{\Delta}_{k}:\mathbb{R}^{d}\to\{0,1\}^{d}(k=1,\cdots,K) is introduced, where Δjk(𝐱){0,1}\Delta_{jk}(\mathbf{\bm{x}})\in\{0,1\} is the predicted segmentation for the class kk of the jj-th feature, and I(𝚫k(𝐗))={j:Δjk(𝐗)=1; for j=1,,d}I(\boldsymbol{\Delta}_{k}(\mathbf{\bm{X}}))=\{j:\Delta_{jk}(\mathbf{\bm{X}})=1;\text{ for }j=1,\cdots,d\} is the class-specific index set of features for the predicted segmentation. Then, given a segmentation operator 𝚫=(𝚫1,,𝚫K)\boldsymbol{\Delta}=(\boldsymbol{\Delta}_{1},\cdots,\boldsymbol{\Delta}_{K}), the multi-Dice (mDice) metric is defined as:

mDiceγ(𝚫)\displaystyle{\rm mDice}_{\gamma}(\boldsymbol{\Delta}) =k=1KαkDiceγ,k(𝚫k)\displaystyle=\sum_{k=1}^{K}\alpha_{k}{{\rm Dice}}_{\gamma,k}(\boldsymbol{\Delta}_{k})
=k=1Kαk𝔼(2|I(𝐘k)I(𝚫k(𝐗))|+γ|I(𝐘k)|+|I(𝚫k(𝐗))|+γ)=k=1Kαk𝔼(2𝐘k𝚫k(𝐗)+γ𝐘k1+𝚫k(𝐗)1+γ),\displaystyle=\sum_{k=1}^{K}\alpha_{k}\mathbb{E}\Big{(}\frac{2\big{|}I(\mathbf{\bm{Y}}_{\cdot k})\cap I(\boldsymbol{\Delta}_{k}(\mathbf{\bm{X}}))\big{|}+\gamma}{|I(\mathbf{\bm{Y}}_{\cdot k})|+|I(\boldsymbol{\Delta}_{k}(\mathbf{\bm{X}}))|+\gamma}\Big{)}=\sum_{k=1}^{K}\alpha_{k}\mathbb{E}\Big{(}\frac{2\mathbf{\bm{Y}}_{\cdot k}^{\intercal}\boldsymbol{\Delta}_{k}(\mathbf{\bm{X}})+\gamma}{\|\mathbf{\bm{Y}}_{\cdot k}\|_{1}+\|\boldsymbol{\Delta}_{k}(\mathbf{\bm{X}})\|_{1}+\gamma}\Big{)}, (14)

where Diceγ,k(){{\rm Dice}}_{\gamma,k}(\cdot) is the Dice metric under the kk-th class, 𝜶=(α1,,αK)𝟎K\boldsymbol{\alpha}=(\alpha_{1},\cdots,\alpha_{K})^{\intercal}\geq\mathbf{\bm{0}}_{K} is a weight vector for classes with 𝜶1=1\|\boldsymbol{\alpha}\|_{1}=1. For example, αk=1/K\alpha_{k}=1/K yields that each class has the same contribution to segmentation performance. More generally, 𝜶𝟎K\boldsymbol{\alpha}\geq\mathbf{\bm{0}}_{K} can be a custom weight vector indicating the relative importance of segmentation classes. In practice, given a new instance (𝐱,𝐲)(\mathbf{\bm{x}},\mathbf{\bm{y}}), the weight is an average over classes excluding those are not present and not predicted, that is,

αk=0, if 𝐲k1=𝚫k(𝐱)1=0;αk=1k=1K𝟏(𝐲k1+𝚫k(𝐱)1>0), otherwise.\alpha_{k}=0,\text{ if }\|\mathbf{\bm{y}}_{\cdot k}\|_{1}=\|\boldsymbol{\Delta}_{k}(\mathbf{\bm{x}})\|_{1}=0;\qquad\alpha_{k}=\frac{1}{\sum_{k=1}^{K}\mathbf{\bm{1}}(\|\mathbf{\bm{y}}_{\cdot k}\|_{1}+\|\boldsymbol{\Delta}_{k}(\mathbf{\bm{x}})\|_{1}>0)},\text{ otherwise}. (15)

Following our convention, we shall call multiclass/multilabel segmentation with respect to the mDice metric as “mDice-segmentation”. As indicated in Figure 2, unlike Dice-segmentation, mDice-segmentation provides more flexibility in probabilistic modeling (multiclass or multilabel) and the decision-making in prediction (taking argmax or thresholding at 0.5), resulting in different operating losses and the construction of predictive functions.

3.2 Multilabel/multiclass outcomes

In this section, we describe two probabilistic models (multilabel or multiclass) of 𝐘j|𝐗\mathbf{\bm{Y}}_{j}|\mathbf{\bm{X}}, where 𝐘j=(Yj1,,YjK)\mathbf{\bm{Y}}_{j}=(Y_{j1},\cdots,Y_{jK})^{\intercal} is the true label for the jj-th feature.

For multilabel modeling, we assume each feature could be assigned to multiple classes, that is, 𝐘j1{0,,K}\|\mathbf{\bm{Y}}_{j}\|_{1}\in\{0,\cdots,K\}, for j=1,,dj=1,\cdots,d. As a result, the index sets of segmented features may overlap among classes. In this case, we formulate and estimate qjk(𝐱)q_{jk}(\mathbf{\bm{x}}) under a multilabel probabilistic model. For example, for deep learning models, we use the sigmoid function as the output layer activation function of a neural network with the binary cross-entropy loss.

For multiclass modeling, each feature is assigned to one and only one class, that is, 𝐘j1=1\|\mathbf{\bm{Y}}_{j}\|_{1}=1 for j=1,,dj=1,\cdots,d. An instance is the panoptic annotation in image segmentation. In this case, we formulate and estimate 𝐪j(𝐱)=(qj1(𝐱),,qjK(𝐱))\mathbf{\bm{q}}_{j}(\mathbf{\bm{x}})=(q_{j1}(\mathbf{\bm{x}}),\cdots,q_{jK}(\mathbf{\bm{x}}))^{\intercal} under a multiclass model with additional sum-to-one constraints 𝐪j(𝐱)1=1\|\mathbf{\bm{q}}_{j}(\mathbf{\bm{x}})\|_{1}=1 for j=1,,dj=1,\cdots,d and 𝐱d\mathbf{\bm{x}}\in\mathbb{R}^{d}. Specifically, for deep learning models, we use the softmax function as the output layer activation function of a neural network, which automatically enforces the sum-to-one constraints. Correspondingly, a multiclass loss is used as an operating loss, including the multiclass cross-entropy. Note that the Dice-approximating losses, such as the soft-Dice loss, can be adopted into the multilabel/multiclass modeling.

In literature, once the estimation is done, the predicted segmentation is produced by taking argmax or thresholding at 0.5 on the estimated probabilities. Indeed, argmax and thresholding are inspired by the decision-making in multiclass and multilabel classification, respectively. In segmentation, it is possible to attempt ad-hoc combinations, such as multiclass modeling + thresholding, and multilabel modeling + argmax. The major purpose of argmax is to provide non-overlapping prediction (e.g., the panoptic prediction in image segmentation). We discuss overlapping/non-overlapping segmentation in the next section.

Refer to caption
Figure 2: The existing and the proposed frameworks under multiclass/multilabel modeling. The upper panel is threshold-/argmax-based segmentation, and the lower panel is the proposed mRankDice framework.

3.3 Overlapping or non-overlapping mDice-segmentation

Specifically, whether (or not) to allow for overlapping results among distinct classes leads to different decision-making procedures, namely overlapping/non-overlapping mDice-segmentation:

(Overlapping)argmax𝚫mDiceγ(𝚫),(Non-overlapping)argmax𝚫mDiceγ(𝚫),k=1K𝚫k()=𝟏d.\displaystyle\text{(Overlapping)}\ \ \mathop{\rm argmax}_{\boldsymbol{\Delta}}\ {\rm mDice}_{\gamma}(\boldsymbol{\Delta}),\quad\text{(Non-overlapping)}\ \ \mathop{\rm argmax}_{\boldsymbol{\Delta}}\ {\rm mDice}_{\gamma}(\boldsymbol{\Delta}),\quad\sum_{k=1}^{K}\boldsymbol{\Delta}_{k}(\cdot)=\mathbf{\bm{1}}_{d}. (16)

In the overlapping setting, there is no restriction on a segmentation operator, thus the predicted segmentation for different classes may overlap. On the other hand, the overlapping is ruled out in non-overlapping formulation due the additional sum-to-one constraint. Lemma 6 presents the Bayes rule for overlapping segmentation, yielding that mDice-segmentation is reduced to class-specific Dice-segmentation subproblems if overlapping is allowed.

Lemma 6 (The Bayes rule for overlapping mDice-segmentation)

An overlapping (allowing) segmentation operator 𝚫\boldsymbol{\Delta}^{*} is a global maximizer of mDiceγ(𝚫){\rm mDice}_{\gamma}(\boldsymbol{\Delta}) if and only if 𝚫k\boldsymbol{\Delta}^{*}_{k} is the Bayes rule (global maximizer) of Diceγ,k(𝚫k){{\rm Dice}}_{\gamma,k}(\boldsymbol{\Delta}_{k}) on (𝐗,𝐘k)(\mathbf{\bm{X}},\mathbf{\bm{Y}}_{\cdot k}) for all k{1kK:αk>0}k\in\{1\leq k\leq K:\alpha_{k}>0\}.

Therefore, it suffices to consider Dice-segmentation of each class separately in overlapping mDice-segmentation, and the proposed RankDice is readily extended to mRankDice; see Section 3.4.

Next, we investigate the Bayes rule for non-overlapping mDice-segmentation. To proceed, we denote 𝚫\boldsymbol{\Delta}^{*} as the Bayes rule (global maximizer) of non-overlapping mDice-segmentation in (16), and 𝝉()\boldsymbol{\tau}^{*}(\cdot) as the volume function of the Bayes segmentation rule: 𝝉(𝐱)=(τ1(𝐱),,τK(𝐱))=(𝚫1(𝐱)1,,𝚫K(𝐱)1).\boldsymbol{\tau}^{*}(\mathbf{\bm{x}})=(\tau^{*}_{1}(\mathbf{\bm{x}}),\dots,\tau^{*}_{K}(\mathbf{\bm{x}}))^{\intercal}=(\|\boldsymbol{\Delta}^{*}_{1}(\mathbf{\bm{x}})\|_{1},\cdots,\|\boldsymbol{\Delta}^{*}_{K}(\mathbf{\bm{x}})\|_{1})^{\intercal}.

Lemma 7

Suppose 𝛕()\boldsymbol{\tau}^{*}(\cdot) is pre-known, then solving the Bayes rule for non-overlapping mDice-segmentation in (16) is equivalent to an assignment problem.

As indicated in Lemma 7, even when the optimal volume function is pre-given, solving the Bayes rule for non-overlapping segmentation is nontrivial. For example, the most successful assignment algorithms, including the Hungarian method (Kuhn, 1955; Edmonds and Karp, 1972; Tomizawa, 1971) and its variants Jonker-Volgenant algorithm (Crouse, 2016), generally achieves an O(d3)O(d^{3}) running time complexity in our content, which is time-consuming for high-dimensional segmentation. In sharp contrast, for the overlapping case, when τ(𝐱)\tau^{*}(\mathbf{\bm{x}}) is given, the Bayes rule is simply ranking all the conditional probabilities with O(dlog(d))O(d\log(d)). Moreover, when 𝝉(𝐱)\boldsymbol{\tau}^{*}(\mathbf{\bm{x}}) is unknown, the optimization for non-overlapping segmentation becomes a nonlinear integer optimization which is NP-hard in general (Murty and Kabadi, 1985; D’Ambrosio et al., 2020). Therefore, a fast O(dlog(d))O(d\log(d)) greedy approximation algorithm is more feasible in practical implementation. We leave pursuing this topic as future work.

Next, we summarize the proposed mRankDice framework under three scenarios.

3.4 mRankDice

In this section, we present the proposed mRankDice framework for mDice-segmentation. Before we proceed, we highlight the different roles of multiclass/multilabel modeling and the overlapping/non-overlapping constraint. Multiclass/multilabel modeling determines the probabilistic models (say the softmax or the sigmoid activation in a neural network) and an operating loss in probability estimation (say the cross-entropy or the binary cross-entropy). Meanwhile, the overlapping/non-overlapping constraint affects the segmentation prediction after the probabilities are estimated.

Therefore, we consider following segmentation three cases: “multilabel + overlapping”, “multiclass + overlapping”, and “multilabel/multiclass + non-overlapping”.

mRankDice (multilabel + overlapping segmentation) is a straightforward extension of RankDice in Dice-segmentation (inspired by Lemma 6). Given a training dataset (𝐱i,𝐲i,1:d,1,,𝐲i,1:d,K)i=1n(\mathbf{\bm{x}}_{i},\mathbf{\bm{y}}_{i,1:d,1},\cdots,\mathbf{\bm{y}}_{i,1:d,K})_{i=1}^{n}, with the same manner, the conditional probability function is estimated under a multilabel logistic regression (the binary cross-entropy loss):

𝐐^(𝐱)=argmin𝐐𝒬i=1nj=1dk=1K(yijklog(qjk(𝐱i))+(1yijk)log(1qjk(𝐱i)))+λ𝐐2,\widehat{\mathbf{\bm{Q}}}(\mathbf{\bm{x}})=\mathop{\rm argmin}_{\mathbf{\bm{Q}}\in\mathcal{Q}}\sum_{i=1}^{n}\sum_{j=1}^{d}\sum_{k=1}^{K}\Big{(}y_{ijk}\log\big{(}q_{jk}(\mathbf{\bm{x}}_{i})\big{)}+(1-y_{ijk})\log\big{(}1-q_{jk}(\mathbf{\bm{x}}_{i})\big{)}\Big{)}+\lambda\|\mathbf{\bm{Q}}\|^{2}, (17)

where 𝐐=(qjk):d[0,1]d×K\mathbf{\bm{Q}}=(q_{jk}):\mathbb{R}^{d}\to[0,1]^{d\times K} is a matrix function, and qjk(𝐱)q_{jk}(\mathbf{\bm{x}}) is a candidate estimator of pjk(𝐱)p_{jk}(\mathbf{\bm{x}}). Then, given a new instance 𝐱\mathbf{\bm{x}}, the class-specific segmentation 𝚫^k(𝐱)\widehat{\boldsymbol{\Delta}}_{k}(\mathbf{\bm{x}}) is generated based on Steps 2-3 in Section 2.2 with the estimated conditional probabilities 𝐐^k(𝐱)\widehat{\mathbf{\bm{Q}}}_{k}(\mathbf{\bm{x}}) (the kk-th column of 𝐐^(𝐱)\widehat{\mathbf{\bm{Q}}}(\mathbf{\bm{x}})). We summarize the foregoing computational scheme in Algorithm 2.

mRankDice (multiclass + overlapping segmentation) yields a different probability estimation procedure, where the conditional probability function is estimated under a multiclass logistic regression (the multiclass cross-entropy loss):

𝐐^(𝐱)=argmin𝐐𝒬i=1nj=1dk=1Kyijklog(qjk(𝐱i))+λ𝐐2,s.t.k=1Kqjk()=1; for j=1,d,\widehat{\mathbf{\bm{Q}}}(\mathbf{\bm{x}})=\mathop{\rm argmin}_{\mathbf{\bm{Q}}\in\mathcal{Q}}\sum_{i=1}^{n}\sum_{j=1}^{d}\sum_{k=1}^{K}y_{ijk}\log\big{(}q_{jk}(\mathbf{\bm{x}}_{i})\big{)}+\lambda\|\mathbf{\bm{Q}}\|^{2},\ \ \text{s.t.}\ \sum_{k=1}^{K}q_{jk}(\cdot)=1;\text{ for }j=1,\cdots d, (18)

where 𝐐=(qjk):d[0,1]d×K\mathbf{\bm{Q}}=(q_{jk}):\mathbb{R}^{d}\to[0,1]^{d\times K} is a matrix function, and 𝐪j(𝐱)\mathbf{\bm{q}}_{j}(\mathbf{\bm{x}}) is a candidate estimator of 𝐩j(𝐱)\mathbf{\bm{p}}_{j}(\mathbf{\bm{x}}). Although the probability estimation (18) differs from (17), the downstream ranking and volume estimation remain the same according to Lemma 6. We also summarize the foregoing computational scheme in Algorithm 2.

Input : Training set: (𝐱i,𝐲i,1:d,1,,𝐲i,1:d,K)i=1n(\mathbf{\bm{x}}_{i},\mathbf{\bm{y}}_{i,1:d,1},\cdots,\mathbf{\bm{y}}_{i,1:d,K})_{i=1}^{n}; A new testing instance: 𝐱\mathbf{\bm{x}}
Output : The predicted segmentation for the testing instance 𝚫^(𝐱)\widehat{\boldsymbol{\Delta}}(\mathbf{\bm{x}})
1 if multilabel outcome then
2       Multilabel Prob Est. Estimate the prob function 𝐐^\widehat{\mathbf{\bm{Q}}} via (17);
3      
4 end if
5if multiclass outcome then
6       Multiclass Prob Est. Estimate the prob function 𝐐^\widehat{\mathbf{\bm{Q}}} via (18);
7      
8 end if
9
10for k=1,,Kk=1,\cdots,K do
11       Class-specific RankDice. Obtain 𝚫^k(𝐱)\widehat{\boldsymbol{\Delta}}_{k}(\mathbf{\bm{x}}) from Lines 2-22 in Algorithm 1 based on the estimated prob 𝐐^k(𝐱)\widehat{\mathbf{\bm{Q}}}_{k}(\mathbf{\bm{x}}).
12 end for
13return The predicted segmentation 𝚫^(𝐱)=(𝚫^1(𝐱),,𝚫^K(𝐱))\widehat{\boldsymbol{\Delta}}(\mathbf{\bm{x}})=(\widehat{\boldsymbol{\Delta}}_{1}(\mathbf{\bm{x}}),\cdots,\widehat{\boldsymbol{\Delta}}_{K}(\mathbf{\bm{x}}))
Algorithm 2 mRankDice for overlapping mDice-segmentation.

mRankDice (multiclass/multilabel + non-overlapping segmentation) is a quite difficult scenario for developing mRankDice from RankDice in Dice-segmentation. As indicated in Lemma 7, searching for an optimal non-overlapping mDice-segmentation operator is NP-hard in general. We leave pursuing this topic as future work.

4 Theory

In this section, we establish a theoretical foundation of segmentation, including the concept of Dice-calibration, the excess risk of the Dice metric, and the rate of convergence with respect to the excess risk of the proposed RankDice framework. For illustration, we focus on Dice-segmentation, and the results can be extended to mDice-segmentation, and RankIoU in terms of the IoU metric.

4.1 Dice-calibrated segmentation

In Theorem 1, the Bayes rule of Dice-segmentation is obtained. To carry this agenda further, we propose concept of “Dice-calibrated”, following the same philosophy of Fisher consistency or classification-calibration in Lin (2004); Zhang (2004); Bartlett et al. (2006). Intuitively, Dice-calibration is the weakest possible condition on a consistent segmentation method with respect to the Dice metric, this is, at population level, a method ultimately searches for the Bayes rule that achieves the optimal Dice metric. Figure 3 indicates the overview and logical relations among the theoretic results in this section.

Refer to caption
Figure 3: Flowchart of the theory for RankDice in Section 4.1, indicating the logical relations among the developed theorems.

To proceed, let 𝒫\mathcal{P} be the class of all probability measures 𝐗,𝐘\mathbb{P}_{\mathbf{\bm{X}},\mathbf{\bm{Y}}} on (Borel) measurable subsets of d×{0,1}d\mathbb{R}^{d}\times\{0,1\}^{d} such that (𝐗,𝐘)𝐗,𝐘(\mathbf{\bm{X}},\mathbf{\bm{Y}})\sim\mathbb{P}_{\mathbf{\bm{X}},\mathbf{\bm{Y}}}, (𝐗,𝐘)d×{0,1}d(\mathbf{\bm{X}},\mathbf{\bm{Y}})\in\mathbb{R}^{d}\times\{0,1\}^{d}, and YiYj|𝐗Y_{i}\perp Y_{j}|\mathbf{\bm{X}} for iji\neq j. Denote 𝒟\mathcal{D} as the class of all (Borel) measurable segmentation operators 𝜹:𝐱d𝜹(𝐱){0,1}d\boldsymbol{\delta}:\mathbf{\bm{x}}\in\mathbb{R}^{d}\to\boldsymbol{\delta}(\mathbf{\bm{x}})\in\{0,1\}^{d}. The definition of Dice-calibrated segmentation is given as follows.

Definition 8 (Dice-calibrated segmentation)

Given γ0\gamma\geq 0, a (population) segmentation method γ:𝒫𝒟\mathcal{M}_{\gamma}:\mathcal{P}\to\mathcal{D} is Dice-calibrated if, for any 𝐗,𝐘𝒫\mathbb{P}_{\mathbf{\bm{X}},\mathbf{\bm{Y}}}\in\mathcal{P},

Diceγ(γ(𝐗,𝐘))=Diceγ(𝜹),{\rm Dice}_{\gamma}\big{(}\mathcal{M}_{\gamma}(\mathbb{P}_{\mathbf{\bm{X}},\mathbf{\bm{Y}}})\big{)}={\rm Dice}_{\gamma}(\boldsymbol{\delta}^{*}),

where 𝛅\boldsymbol{\delta}^{*} is the Bayes rule for Dice-segmentation defined in Theorem 1.

Now, we show that most existing loss functions, under the framework (2), are not Dice-calibrated.

Lemma 9

Given a loss function l(,)l(\cdot,\cdot), define γ(𝐗,𝐘)\mathcal{M}_{\gamma}(\mathbb{P}_{\mathbf{\bm{X}},\mathbf{\bm{Y}}}) under the framework (2), that is,

γ(𝐗,𝐘)(𝐱):=𝟏(𝐪~l(𝐱)0.5),𝐪~l=argmin𝐪𝔼(l(𝐘,𝐪(𝐗))).\displaystyle\mathcal{M}_{\gamma}(\mathbb{P}_{\mathbf{\bm{X}},\mathbf{\bm{Y}}})(\mathbf{\bm{x}}):=\mathbf{\bm{1}}(\widetilde{\mathbf{\bm{q}}}_{l}(\mathbf{\bm{x}})\geq 0.5),\quad\widetilde{\mathbf{\bm{q}}}_{l}=\mathop{\rm argmin}_{\mathbf{\bm{q}}}\ \mathbb{E}\big{(}l(\mathbf{\bm{Y}},\mathbf{\bm{q}}(\mathbf{\bm{X}}))\big{)}.

Then, γ(𝐗,𝐘)\mathcal{M}_{\gamma}(\mathbb{P}_{\mathbf{\bm{X}},\mathbf{\bm{Y}}}) is not Dice-calibrated for γ=0\gamma=0 when l(,)l(\cdot,\cdot) is any classification-calibrated loss, including the cross-entropy loss and the focal loss.

Lemma 9 indicates that the existing framework (2) with most losses ultimately yields a suboptimal solution to Dice-segmentation, even if a “classification-calibrated” loss, such as the cross-entropy loss or the focal loss (Charoenphakdee et al., 2021), is used. Meanwhile, as indicated in Bertels et al. (2019), the optimization with the soft-Dice loss can introduce a volumetric bias for tasks with high inherent uncertainty. In sharp contrast, the proposed RankDice method is Dice-calibrated (Lemma 10) and its asymptotic convergence rate in terms of the Dice metric is provided in Theorem 11.

To proceed, we give the definition of RankDice at population level in Appendix B.1, which replaces the average in (5) by the population expectation. Moreover, the cross-entropy loss in (5) can be extended to an arbitrary strictly proper loss (Gneiting and Raftery, 2007). The most common strictly proper losses are the cross-entropy loss and the squared error loss.

Lemma 10 (Dice-calibrated)

The proposed RankDice framework with a strictly proper loss is Dice-calibrated.

Next, we present an excess risk bound in terms of the Dice metric, that is, Dice(𝜹)Dice(𝜹^){\rm Dice}(\boldsymbol{\delta}^{*})-{\rm Dice}(\widehat{\boldsymbol{\delta}}).

Theorem 11 (Excess risk bounds)

Given γ0\gamma\geq 0, let 𝐪^()\widehat{\mathbf{\bm{q}}}(\cdot) be an estimated probability of 𝐩()\mathbf{\bm{p}}(\cdot), and 𝛅^()\widehat{\boldsymbol{\delta}}(\cdot) be the RankDice segmentation function defined in (7) based on 𝐪^()\widehat{\mathbf{\bm{q}}}(\cdot), then

Diceγ(𝜹)Diceγ(𝜹^)C1𝔼𝐗𝐪^(𝐗)𝐩(𝐗)1,{\rm Dice}_{\gamma}(\boldsymbol{\delta}^{*})-{\rm Dice}_{\gamma}(\widehat{\boldsymbol{\delta}})\leq C_{1}\mathbb{E}_{\mathbf{\bm{X}}}{\|\widehat{\mathbf{\bm{q}}}(\mathbf{\bm{X}})-\mathbf{\bm{p}}(\mathbf{\bm{X}})\|_{1}}, (19)

where C1>0C_{1}>0 is a universal constant depending only on γ\gamma.

As indicated in Theorem 11, the excess risk of the Dice metric for the proposed RankDice framework is upper bounded by the total variation (TV) distance between the estimated probability 𝐪^\widehat{\mathbf{\bm{q}}} and the true probability 𝐩\mathbf{\bm{p}}. Note that the Kullback-Leibler divergence (the excess risk for the cross-entropy) dominates the TV distance. It follows that if the KL divergence between pjp_{j} and q^j\widehat{q}_{j} goes to 0, then 𝐪^\widehat{\mathbf{\bm{q}}} converges to 𝐩\mathbf{\bm{p}} in the TV sense, and so does Diceγ(𝜹^){\rm Dice}_{\gamma}(\widehat{\boldsymbol{\delta}}) to Diceγ(𝜹){\rm Dice}_{\gamma}(\boldsymbol{\delta}^{*}).

Taken together, we present the rate of convergence for the empirical estimator obtained from the proposed RankDice framework (Steps 1-3) in Section 2.2.

Corollary 12 (Convergence rate)

Let 𝐪^()\widehat{\mathbf{\bm{q}}}(\cdot) and 𝛅^()\widehat{\boldsymbol{\delta}}(\cdot) be obtained by the proposed RankDice framework (Steps 1-3) in Section 2.2, and

CE(𝐪^):=𝔼(lCE(𝐘,𝐪^(𝐗)))𝔼(lCE(𝐘,𝐩(𝐗)))=OP(ϵn),\mathcal{E}_{\text{CE}}(\widehat{\mathbf{\bm{q}}}):=\mathbb{E}\Big{(}l_{\text{CE}}\big{(}\mathbf{\bm{Y}},\widehat{\mathbf{\bm{q}}}(\mathbf{\bm{X}})\big{)}\Big{)}-\mathbb{E}\Big{(}l_{\text{CE}}\big{(}\mathbf{\bm{Y}},\mathbf{\bm{p}}(\mathbf{\bm{X}})\big{)}\Big{)}=O_{P}(\epsilon_{n}),

where lCE(,)l_{\text{CE}}(\cdot,\cdot) is defined in (3). Then,

Diceγ(𝜹)Diceγ(𝜹^)=OP(dϵn1/2).{\rm Dice}_{\gamma}(\boldsymbol{\delta}^{*})-{\rm Dice}_{\gamma}(\widehat{\boldsymbol{\delta}})=O_{P}(\sqrt{d}\epsilon^{1/2}_{n}). (20)

Note that CE\mathcal{E}_{\text{CE}} is the excess risk of the cross-entropy loss or the negative conditional log-likelihood in (5), and its asymptotics as well as a rate of convergence can be established based on statistical learning theory of empirical risk minimization (Pollard, 1984; Shen, 1997; Bartlett et al., 2005; Giné and Koltchinskii, 2006; Cucker and Zhou, 2007), which depends on the sample size and the complexity of the probability class. Then, the rate of convergence of the excess risk in terms of the Dice metric is obtained via (20). Note that both (19) and (20) are derived for a fixed dimension, and the upper bounds can be extended and improved when the dimension of segmentation grows with the sample size.

Finally, we briefly discuss the connections of the developed theory (i.e., Lemma 10, Theorem 11 and Corollary 12) with the existing results. For example, Popordanoska et al. (2021) derived an upper bound for the volume bias 𝔼𝐗(𝐪^(𝐗)𝐩(𝐗))1\|\mathbb{E}_{\mathbf{\bm{X}}}(\widehat{\mathbf{\bm{q}}}(\mathbf{\bm{X}})-\mathbf{\bm{p}}(\mathbf{\bm{X}}))\|_{1} in terms of the TV distance. It is worth noting that the volume bias focuses on conditional probability estimation and a small volume bias may not necessarily yield a consistent segmentation rule in terms of the Dice metric. In contrast, our result on the excess risk Dice(𝜹)Dice(𝜹^){\rm Dice}(\boldsymbol{\delta}^{*})-{\rm Dice}(\widehat{\boldsymbol{\delta}}) characterizes the performance of segmentation rule 𝜹^\widehat{\boldsymbol{\delta}}. Besides, Bao and Sugiyama (2020) proved the consistency of their method under a linear fractional approximation of Dice metric (see Appendix A), which seems not directly comparable to ours.

4.2 Relation between Dice and IoU metrics

In this section, we consider the relation and difference between Dice and IoU metrics, and present the Bayes rule for IoU-segmentation.

Lemma 13

A segmentation rule 𝛅\boldsymbol{\delta}^{*} is a global maximizer of IoUγ(𝛅){{\rm IoU}}_{\gamma}(\boldsymbol{\delta}) if and only if it satisfies that

δj(𝐱)={1if pj(𝐱) ranks top τ(𝐱),0otherwise.\delta_{j}^{*}(\mathbf{\bm{x}})=\begin{cases}1&\text{if }p_{j}(\mathbf{\bm{x}})\text{ ranks top }\tau^{*}(\mathbf{\bm{x}}),\\ 0&\text{otherwise}.\end{cases}

The optimal volume τ(𝐱)\tau^{*}(\mathbf{\bm{x}}) is given as

τ(𝐱)=argmaxτ{0,1,,d}(jJτ(𝐱)pj(𝐱)+γ)l=0dτ1τ+l+γ(ΓJτ(𝐱)(𝐱)=l),\tau^{*}(\mathbf{\bm{x}})=\mathop{\rm argmax}_{\tau\in\{0,1,\cdots,d\}}\ \Big{(}\sum_{j\in J_{\tau}(\mathbf{\bm{x}})}p_{j}(\mathbf{\bm{x}})+\gamma\Big{)}\sum_{l=0}^{d-\tau}\frac{1}{\tau+l+\gamma}\mathbb{P}\big{(}\Gamma_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}})=l\big{)}, (21)

where Jτ(𝐱)={j:j=1d𝕀(pj(𝐱)pj(𝐱))τ}J_{\tau}(\mathbf{\bm{x}})=\big{\{}j:\sum_{j^{\prime}=1}^{d}\mathbb{I}\big{(}p_{j^{\prime}}(\mathbf{\bm{x}})\geq p_{j}(\mathbf{\bm{x}})\big{)}\leq\tau\big{\}} is the index set of the τ\tau-largest conditional probabilities with J0(𝐱)=J_{0}(\mathbf{\bm{x}})=\emptyset, and ΓJτ(𝐱)(𝐱)=jJτ(𝐱)Bj(𝐱){\Gamma}_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}})=\sum_{j^{\prime}\notin J_{\tau}(\mathbf{\bm{x}})}{B}_{j^{\prime}}(\mathbf{\bm{x}}) is a Poisson-binomial random variable, and Bj(𝐱){B}_{j}(\mathbf{\bm{x}}) is a Bernoulli random variable with the success probability pj(𝐱)p_{j}(\mathbf{\bm{x}}).

In view of Lemma 13, IoU-segmentation shares a substantial similarity with Dice-segmentation in terms of the Bayes rule. On this ground, a consistent RankIoU framework is also developed based on a plug-in rule by replacing 𝐩(𝐱)\mathbf{\bm{p}}(\mathbf{\bm{x}}) as 𝐪^(𝐱)\widehat{\mathbf{\bm{q}}}(\mathbf{\bm{x}}). Specifically, RankIoU comprises three steps, where Steps 1-2 are the same as in RankDice; see Section 2.2.

Step 3 (IoU volume estimation): From (4), we estimate the volume τ^(𝐱)\widehat{\tau}(\mathbf{\bm{x}}) by replacing the true conditional probability 𝐩(𝐱)\mathbf{\bm{p}}(\mathbf{\bm{x}}) with the estimated one 𝐪^(𝐱)\widehat{\mathbf{\bm{q}}}(\mathbf{\bm{x}}):

τ^(𝐱)=argmaxτ{0,1,,d}(jJτ(𝐱)q^j(𝐱)+γ)l=0dτ1τ+l+γ(Γ^Jτ(𝐱)(𝐱)=l),\widehat{\tau}(\mathbf{\bm{x}})=\mathop{\rm argmax}_{\tau\in\{0,1,\cdots,d\}}\ \Big{(}\sum_{j\in J_{\tau}(\mathbf{\bm{x}})}\widehat{q}_{j}(\mathbf{\bm{x}})+\gamma\Big{)}\sum_{l=0}^{d-\tau}\frac{1}{\tau+l+\gamma}\mathbb{P}\big{(}\widehat{\Gamma}_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}})=l\big{)},

where Γ^Jτ(𝐱)(𝐱)=jJτ(𝐱)B^j(𝐱)\widehat{\Gamma}_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}})=\sum_{j\notin J_{\tau}(\mathbf{\bm{x}})}\widehat{B}_{j}(\mathbf{\bm{x}}) is a Poisson-binomial random variable, and B^j(𝐱)\widehat{B}_{j}(\mathbf{\bm{x}}) are independent Bernoulli random variables with success probabilities q^j(𝐱)\widehat{q}_{j}(\mathbf{\bm{x}}); for j=1,,dj=1,\cdots,d.

Similar to RankDice, the predicted IoU-segmentation 𝜹^(𝐱)=(δ^1(𝐱),,δ^d(𝐱))\widehat{\boldsymbol{\delta}}(\mathbf{\bm{x}})=(\widehat{\delta}_{1}(\mathbf{\bm{x}}),\cdots,\widehat{\delta}_{d}(\mathbf{\bm{x}}))^{\intercal} is produced by taking the top-τ^(𝐱)\widehat{\tau}(\mathbf{\bm{x}}) conditional probabilities:

δ^j(𝐱)=1, if j{j1,,jτ^(𝐱)};δ^j(𝐱)=0, otherwise.\widehat{\delta}_{j}(\mathbf{\bm{x}})=1,\text{ if }j\in\{j_{1},\cdots,j_{\widehat{\tau}(\mathbf{\bm{x}})}\};\quad\widehat{\delta}_{j}(\mathbf{\bm{x}})=0,\text{ otherwise.}

For multiclass/multilabel segmentation, the conditional probability estimation (17) and (18) are carried over into mRankIoU and the subsequent ranking and volume estimation remain the same as Step 2 and Step 3 in binary segmentation.

Computationally, RankIoU involves the evaluation of (Γ^Jτ(𝐱)(𝐱)=l)\mathbb{P}\big{(}\widehat{\Gamma}_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}})=l\big{)} in Step 3. The FFT algorithm and the truncated refined normal approximation (T-RNA) are applicable after minor modifications; however, the blind approximation (BA) may not be appropriate due to the discrepancy of Γ^(𝐱)\widehat{\Gamma}(\mathbf{\bm{x}}) and Γ^Jτ(𝐱)(𝐱)\widehat{\Gamma}_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}}), especially when the size of Jτ(𝐱)J_{\tau}(\mathbf{\bm{x}}) is large; see Section 2.3. Thus, the computation scheme of RankIoU might be relatively expensive in high-dimensional segmentation. Here, we present a parallel result of Lemma 3 to narrow down the searching range in Step 3 of RankIoU.

Lemma 14

If

s=1τq^js(𝐱)+γq^jτ+1(𝐱)1q^jτ+1(𝐱)max(d+γ,((dτ)q^jτ+1(𝐱)+τ+γ)2τ+γ),\sum_{s=1}^{\tau}\widehat{q}_{j_{s}}(\mathbf{\bm{x}})+\gamma\geq\frac{\widehat{q}_{j_{\tau+1}}(\mathbf{\bm{x}})}{1-\widehat{q}_{j_{\tau+1}}(\mathbf{\bm{x}})}\max\Big{(}d+\gamma,\frac{((d-\tau)\widehat{q}_{j_{\tau+1}}(\mathbf{\bm{x}})+\tau+\gamma)^{2}}{\tau+\gamma}\Big{)},

then ϖτ(𝐱)ϖτ(𝐱)\varpi_{\tau}(\mathbf{\bm{x}})\geq\varpi_{\tau^{\prime}}(\mathbf{\bm{x}}) for all τ>τ\tau^{\prime}>\tau, where

ϖτ(𝐱)=(jJτ(𝐱)q^j(𝐱)+γ)l=0dτ1τ+l+γ(Γ^Jτ(𝐱)(𝐱)=l).\varpi_{\tau}(\mathbf{\bm{x}})=\Big{(}\sum_{j\in J_{\tau}(\mathbf{\bm{x}})}\widehat{q}_{j}(\mathbf{\bm{x}})+\gamma\Big{)}\sum_{l=0}^{d-\tau}\frac{1}{\tau+l+\gamma}\mathbb{P}\big{(}\widehat{\Gamma}_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}})=l\big{)}.

Theoretically, the concept of “IoU-calibrated” can be established (by replacing Dice as IoU in Definition 8) and the excess risk bounds can be derived in parallel to Dice-segmentation.

Theorem 15

Given γ0\gamma\geq 0, let 𝐪^()\widehat{\mathbf{\bm{q}}}(\cdot) be an estimated probability of 𝐩()\mathbf{\bm{p}}(\cdot), and 𝛅^()\widehat{\boldsymbol{\delta}}(\cdot) be the RankIoU segmentation function based on 𝐪^()\widehat{\mathbf{\bm{q}}}(\cdot), then

IoUγ(𝜹)IoUγ(𝜹^)C2𝔼𝐗𝐪^(𝐗)𝐩(𝐗)1,{\rm IoU}_{\gamma}(\boldsymbol{\delta}^{*})-{\rm IoU}_{\gamma}(\widehat{\boldsymbol{\delta}})\leq C_{2}\mathbb{E}_{\mathbf{\bm{X}}}{\|\widehat{\mathbf{\bm{q}}}(\mathbf{\bm{X}})-\mathbf{\bm{p}}(\mathbf{\bm{X}})\|_{1}},

where C2>0C_{2}>0 is a universal constant depending only on γ\gamma. Consequently, if CE(𝐪^)=OP(ϵn)\mathcal{E}_{\text{CE}}(\widehat{\mathbf{\bm{q}}})=O_{P}(\epsilon_{n}), then

IoUγ(𝜹)IoUγ(𝜹^)=OP(dϵn1/2).{\rm IoU}_{\gamma}(\boldsymbol{\delta}^{*})-{\rm IoU}_{\gamma}(\widehat{\boldsymbol{\delta}})=O_{P}(\sqrt{d}\epsilon^{1/2}_{n}).

5 Numerical experiments

This section describes a set of simulations and real datasets that demonstrate the segmentation performance of the proposed RankDice and mRankDice frameworks compared with the existing argmax- and thresholding-based frameworks using various loss functions and network architectures. For illustration, the segmentation performances for all numerical experiments are evaluated by empirical Dice/IoU metrics with γ=0\gamma=0, see Appendix A. For the mDice/mIoU metric, the class-specific weight is defined as in (15). All experiments are conducted using PyTorch and CUDA on an NVIDIA GeForce RTX 3080 GPU. All Python codes are available for download at our GitHub repository (https://github.com/statmlben/rankseg).

5.1 Simulation

In this section, we mainly compare the proposed RankDice framework with the thresholding-based framework (2) in various simulated examples. Note that for Dice-segmentation with binary outcomes, threshold- and argmax-based frameworks yield the same solution. Both frameworks require an estimation of conditional probability function in the first stage. Therefore, in order to convincingly demonstrate the difference between two frameworks, in our simulation, we assume the true conditional probabilities pj(𝐱)=(Yj|𝐱);j=1,,dp_{j}(\mathbf{\bm{x}})=\mathbb{P}(Y_{j}|\mathbf{\bm{x}});j=1,\cdots,d are perfectly estimated, and report the Dice metric of the downstream segmentation produced by two different frameworks.

Example 1. To mimic the spatial smoothness in practical segmentation problem, especially for image segmentation, the simulated dataset based on matrix response (d=W×Hd=W\times H) is generated as follows. First, the true probability matrix 𝐏=(pwh)W×H\mathbf{\bm{P}}=(p_{wh})_{W\times H} are generated by two patterns:

  • Step decay: pwhU(0.5,1)p_{wh}\sim U(0.5,1), if wρWw\leq\lfloor\rho W\rfloor and hρHh\leq\lfloor\rho H\rfloor; pwhN[0,1](β,0.1)p_{wh}\sim N_{[0,1]}(\beta,0.1), otherwise.

  • Exponential decay: pwh=exp(β(w+h))p_{wh}=\exp\big{(}-\beta(w+h)\big{)}, as visualized in the upper panel of Figure 6.

  • Linear decay: pwh=1β(w+h)/(W+H)p_{wh}=1-\beta(w+h)/(W+H), as visualized in the lower panel of Figure 6.

Here β\beta is a decay parameter, U(0,1)U(0,1) is the uniform distribution, and N[0,1](β,0.1)N_{[0,1]}(\beta,0.1) is the truncated normal distribution with mean β\beta and standard deviation 0.1. In our simulation, we consider β=0.1,0.3,0.5\beta=0.1,0.3,0.5 with ρ=0.1\rho=0.1 for step decay, β=1.01,1.05,1.10\beta=1.01,1.05,1.10 for exponential decay, and β=1,2,4\beta=1,2,4 for linear decay. For each case, four different dimensions are considered: W=H=28,64,128,256W=H=28,64,128,256, and therefore dd increases from 784 to 65,536. Then, the proposed RankDice framework and the thresholding-based framework (at 0.5) are conducted on the true probability matrix 𝐏\mathbf{\bm{P}}. Both decay scenarios are replicated 100 times, and the averaged Dice metrics and its standard errors are summarized in Table 9.

Example 2. As indicated in Theorem 1 and Remark 2, the optimal segmentation volume (or the optimal threshold) varies significantly across different inputs. This example aims to illustrate the suboptimality of (tuning a threshold of) a fixed thresholding based on step decay in Example 1. To this end, the conditional probability matrices (𝐏i)i=1,,n(\mathbf{\bm{P}}_{i})_{i=1,\cdots,n} of inputs are generated as follows. First, ρiU(0,1)\rho_{i}\sim U(0,1) to mimic the different segmentation scales/patterns over images in real applications. Next, 𝐏i\mathbf{\bm{P}}_{i} is generated by step decay based on β=0.1\beta=0.1 and ρ=ρi\rho=\rho_{i}. Then, the proposed RankDice framework and the fixed thresholding-based framework (at 0.1. \cdots, 0.9) are applied as the same manner in Example 1. All methods are applied with n=2000n=2000 and W=H=64W=H=64, and the averaged Dice metrics and its standard errors are summarized in Table 10. In this example, the heterogeneous conditional probabilities yield different optimal segmentation volumes (or thresholds) for images, thus (tuning a threshold on) a fixed thresholding leads to a suboptimal solution. To better illustrate the adaptiveness over optimal thresholds, we also present the optimal thresholds for different images with various generating parameters (β,ρ)(\beta,\rho) on segmentation patterns of Example 2 in Figure 7, and similar results are demonstrated in Figure 4 for Pascal VOC 2012 dataset.

The major conclusions on the simulated examples are listed as follows.

  • It is evident that RankDice significantly outperforms the thresholding-based (at 0.5) framework in both decay scenarios with various dimensions (Table 9). The substantial improvement is consistent with the findings of Lemma 1, which indicates that segmentation and classification are entirely distinct problems.

  • Interestingly, the amount of improvement is gradually increased when the decay of probabilities becomes progressively faster (for exponential decay and linear decay). This suggests that the proposed RankDice might be even more advantageous in well-separated segmentation.

  • As suggested in Table 10, the proposed RankDice generally outperforms a fixed thresholding framework (with any threshold). This because that the optimal threshold can vary greatly across inputs, as indicated in Figure 7. Moreover, tuning the threshold may improve the performance of the thresholding-based framework, yet it still leads to a suboptimal solution compared with the proposed RankDice.

Refer to caption
Figure 4: The heatmap for averaged (over all validation samples) optimal thresholding probabilities (i.e. the probability cutpoint q^jτ^\widehat{q}_{j_{\widehat{\tau}}}) provided by the proposed RankDice, see Remark 2 for more details. Here, the xx-axis indicates the class, and the results are provided by a PSPNet trained with the cross-entropy loss.

5.2 Real datasets

This section examines the performance of the proposed RankDice framework in the PASCAL VOC 2012 (Everingham et al., 2012), the fine-annotated CityScapes semantic segmentation benchmark (Cordts et al., 2016), and Kvasir SEG dataset polyp segmentation dataset (Ajtai et al., 1983). Three different neural network architectures are considered: DeepLab-V3+ (Chen et al., 2018) with resnet101 as the backbone, PSPNet with resnet50 as the backbone, and FCN8 with resnet101 as the backbone. We report both mDice and mIoU metrics of the segmentation produced by Threshold, Argmax and RankDice on the same estimated network/probability. Note that only overlapping segmentation is considered.

Fine-annotated Cityscapes dataset contains 5,000 high quality pixel-level annotated images. For all methods, we employ SGD on the learning rate (lr) schedule lr_schedule=‘poly’, and the initial learning rate initial_lr=0.01, weight_decay=100, momentum=0.9, crop size 512x512, batch size 6, and 300 epochs. The performance on validation set is measured in terms of the mDice and mIoU averaged across 19 object classes (Table 2).

Pascal VOC 2012 dataset contains 20 foreground object classes and one background class. The dataset contains 1,464 training and 1,449 validation pixel-level annotated images. We augment the dataset by using the additional annotations provided by Hariharan et al. (2011). For all methods, we employ SGD on lr_schedule=‘poly’, and the initial learning rate initial_lr=0.01, weight_decay=100, momentum=0.9, crop size 480x480, batch size 8, and an early stop with patient 10 based on validation loss. The performance on validation set is measured in terms of the mDice and mIoU averaged across the 20 object classes (Table 3). In this dataset, we also present a heatmap (Figure 4) for averaged minimal estimated probabilities for segmented features (i.e. q^jτ^\widehat{q}_{j_{\widehat{\tau}}}) by the proposed RankDice, to highlight its difference to the thresholding-based framework.

Kvasir SEG dataset contains 1000 polyp images and their ground truth segmentation (a single class) from the Kvasir dataset. The scale of the images varies from 332x487 to 1920x1072 pixels. For all methods, we employ SGD on lr_schedule=‘poly’, and the initial learning rate initial_lr=0.01, weight_decay=100, momentum=0.9, crop size 320x320, batch size 8, and 140 epochs. The performance on testing set is measured in terms of the Dice and IoU (Table 4).

Multiclass/multilabel loss. In both datasets, we use six loss functions (including multiclass and multilabel losses) in the implementation, including the cross-entropy (CE), the focal loss (Focal), the binary cross-entropy (BCE), the soft-Dice loss (Soft-Dice), the binary soft-Dice loss (B-Soft-Dice), and the LovaszSoftmax loss (LovaszSoftmax). For multiclass losses, including CE, Focal, Soft-Dice, and LovaszSoftmax, we use the softmax function as the output layer activation function of a neural network. For multilabel losses, including BCE and B-Soft-Dice, we use the sigmoid function as the output layer activation function of a neural network.

Dice-segmentation based on a single class. For the first two datasets, when we focus on a single object class, it reduces to a Dice-segmentation with binary outcomes. To examine the performance for the proposed RankDice in binary segmentation, we also report the Dice/IoU metric for each label separately. In principle, we need to train a model only on the binary label for an object class, then produce the segmentation prediction by thresholding (or argmax) and RankDice. However, based on our empirical study, a model trained from full labels is significantly better than the one trained from a single binary label. We thus train a model based the same procedure in the aforementioned segmentation with multiclass/multilabel losses on full labels, and then produce the prediction based on the estimated probability for each object class separately. The best two performances (“PSPNet + CE” and “PSPNet + BCE”) for both datasets are summarized in Tables 5 and 6. The performance for other models and losses can be found in the supplementary. In Figure 5, we present the segmentation results on illustrative examples for all methods to demonstrate the difference between the proposed RankDice and the existing methods.

The fixed-thresholding framework with different thresholds. As indicated in Remark 2 and Example 2 in Section 5.1, the optimal segmentation volume (or the optimal thresholding) varies significantly across different images, thus (tuning on) a fixed-thresholding yields a suboptimal solution. In Tables 7 and 10, we also report the numerical performance based on different thresholds for real datasets and Example 2, respectively.

Probability calibration via Temperature scaling (TS). According to Theorem 11, the consistency of the proposed method holds if the estimated conditional probabilities are calibrated. Alternatively, improving the probability calibration may improve the segmentation performance for the proposed framework. Therefore, in Table 8, we examine the numerical results of the proposed method via TS (with different tuning temperatures), which is one of the most effective probability calibration methods as suggested by Guo et al. (2017).

Model Loss Threshold (at 0.5) Argmax mRankDice (our)
(mDice, mIoU) (×.01\times.01) (mDice, mIoU) (×.01\times.01) (mDice, mIoU) (×.01\times.01)
DeepLab-V3+ CE (56.00, 48.40) (54.20, 46.60) (57.80, 49.80)
(resnet101) Focal (54.10, 46.60) (53.30, 45.60) (56.50, 48.70)
BCE (49.80, 24.90) (44.20, 22.10) (54.00, 27.00)
Soft-Dice (39.50, 35.90) (39.50, 35.90) (39.50, 35.90)
B-Soft-Dice (41.00, 20.50) (27.60, 13.80) (41.10, 20.50)
LovaszSoftmax (55.20, 47.60) (52.30, 45.10) (55.50, 47.80)
PSPNet CE (57.50, 49.60) (56.50, 48.50) (59.30, 51.00)
(resnet50) Focal (56.00, 48.20) (55.80, 47.70) (58.20, 50.00)
BCE (51.40, 25.70) (47.60, 23.80) (55.10, 27.60)
Soft-Dice (49.10, 43.50) (48.70, 43.20) (49.30, 43.60)
B-Soft-Dice (46.30, 23.10) (32.70, 16.40) (46.20, 23.10)
LovaszSoftmax (56.80, 48.90) (55.40, 47.70) (56.70, 49.10)
FCN8 CE (51.40, 43.70) (50.50, 42.60) (53.50, 45.30)
(resnet101) Focal (48.50, 41.20) (49.60, 41.60) (51.50, 43.70)
BCE (39.40, 19.70) (39.40, 19.70) (41.30, 20.60)
Soft-Dice (28.30, 24.30) (28.30, 24.30) (28.30, 24.80)
B-Soft-Dice (29.10, 14.60) (29.10, 14.60) (29.10, 14.60)
LovaszSoftmax (48.10, 40.40) (42.90, 35.80) (48.90, 40.90)
Table 2: Averaged mDice and mIoU metrics of Threshold, Argmax, and the proposed mRankDice based on state-of-the-art models/losses on Fine-annotated CityScapes val set. Gray color indicates that RankDice/mRankDice is inappropriately applied to a loss function which is not strictly proper. The best performance in each model is bold-faced.
Model Loss Threshold (at 0.5) Argmax mRankDice (our)
(mDice, mIoU) (×.01\times.01) (mDice, mIoU) (×.01\times.01) (mDice, mIoU) (×.01\times.01)
DeepLab-V3+ CE (63.60, 56.70) (61.90, 55.30) (64.01, 57.01)
(resnet101) Focal (62.70, 55.01) (60.50, 53.20) (62.90, 55.10)
BCE (63.30, 31.70) (59.90, 29.90) (64.60, 32.30)
Soft-Dice
B-Soft-Dice
LovaszSoftmax (57.70, 51.60) (56.20, 50.30) (57.80, 51.60)
PSPNet CE (64.60, 57.10) (63.20, 55.90) (65.40, 57.80)
(resnet50) Focal (64.00, 56.10) (63.90, 56.10) (66.60, 58.50)
BCE (64.20, 32.10) (65.20, 32.60) (67.10, 33.50)
Soft-Dice (59.60, 54.00) (58.80, 53.20) (60.00, 54.30)
B-Soft-Dice (63.30, 31.60) (54.00. 27.00) (64.30, 32.20)
LovaszSoftmax (62.00, 55.20) (60.80, 54.10) (62.20, 55.40)
FCN8 CE (49.50, 41.90) (45.30, 38.40) (50.40, 42.70)
(resnet101) Focal (50.40, 41.80) (47.20, 39.30) (51.50, 42.50)
BCE (46.20, 23.10) (44.20, 22.10) (47.70, 23.80)
Soft-Dice
B-Soft-Dice
LovaszSoftmax (39.80, 34.30) (37.30, 32.20) (40.00, 34.40)
Table 3: Averaged mDice and mIoU of threshold, argmax, and the proposed mRankDice based on state-of-the-art models/losses on PASCAL VOC 2012 val set. “—” indicates that either the performance is significantly worse or the training is unstable. Gray color indicates that RankDice/mRankDice is inappropriately applied to a loss function which is not strictly proper. The best performance in each model is bold-faced.
Model Loss Threshold/Argmax mRankDice (our)
(Dice, IoU) (×.01\times.01) (Dice, IoU) (×.01\times.01)
DeepLab-V3+ CE (87.9, 80.7) (88.3, 80.9)
(resnet101) Focal (86.5, 87.3) (83.1, 73.2)
Soft-Dice (85.7, 77.8) (85.8, 77.9)
LovaszSoftmax (84.3, 77.3) (84.5, 77.4)
PSPNet CE (86.3, 79.2) (87.1, 79.8)
(resnet50) Focal (83.8, 75.4) (81.8, 72.4)
Soft-Dice (83.5, 75.9) (83.7, 76.1)
LovaszSoftmax (86.0, 79.2) (86.0, 79.2)
FCN8 CE (81.9, 73.5) (82.1, 73.6)
(resnet101) Focal (78.5, 69.0) (70.3, 58.3)
Soft-Dice
LovaszSoftmax (82.0, 73.4) (82.0, 73.4)
Table 4: Averaged mDice and mIoU of threshold/argmax, and the proposed mRankDice based on state-of-the-art models/losses on Kvasir SEG dataset (with a single class segmentation). “—” indicates that either the performance is significantly worse. Gray color indicates that RankDice is inappropriately applied to a loss function which is not strictly proper. The best performance in each model is bold-faced.

Overall, the empirical results show that the proposed RankDice/mRankDice framework yields good performance in three segmentation benchmarks. The major empirical conclusions on the proposed RankDice are listed as follows.

  • As suggested in Tables 2 and 3, the proposed RankDice framework consistently outperforms the threshold-based and argmax-based frameworks based on the same trained model/network. The percentage of improvement on the best performance (for each framework) are 3.13% (over threshold) and 4.96% (over argmax) for CityScapes dataset (PSPNet + CE), and 3.87% (over threshold) and 2.91% (over argmax) for Pascal VOC 2012 dataset (PSPNet + CE/BCE), and 0.926% for Kvasir SEG dataset (PSPNet + CE).

  • For Dice-segmentation based on a single class, as suggested in Tables 5 and 6, the proposed RankDice framework consistently outperforms the threshold-/argmax-based framework for each class. The largest percentage of class-specific improvement in terms of the Dice metric on the best performance (for each framework) is 23.6% for CityScapes dataset, and 26.9% for Pascal VOC 2021 dataset.

  • The proposed RankDice works significantly better in “difficult” segmentation. As indicated in Tables 5 and 6, the improvements by RankDice are negatively correlated with the resulting Dice/IoU metrics. It is also suggested by Figure 5, where we illustrate three images from classes cat (no improvement) and chair (26.9% improvement). As presented in all examples of chair and the last example of cat, the improvement is significant when segmentation is difficult (the target object is either occluded or similar with other objects).

  • As suggested in Table 7, the empirical results are in line with Remark 2 (theoretically) and Table 10 (numerically), suggesting that the proposed RankDice generally outperforms a fixed thresholding framework (with any threshold). This because that the optimal threshold can vary greatly across images, as indicated in Figure 4. Moreover, tuning the threshold may improve the performance of the fixed-thresholding framework, yet it still leads to a suboptimal solution compared with the proposed RankDice.

  • As suggested in Table 8, the segmentation performance can be potentially improved by TS probability calibration method, especially 4.59% improvement for CE loss and 3.28% improvement for BCE in Pascal VOC 2021 dataset. Yet, the TS demands an additional validation dataset to tune the optimal temperature, thus more numerical experiments are required to suggest the effectiveness of this promising method. We leave pursuing this topic as future work.

  • As expected, the proposed RankDice/mRankDice performs well for strictly proper loss functions, including CE and BCE. In addition, we show that the performance is continuously improved compared with existing frameworks for some classification calibrated (only) losses, such as the focal loss. It is possible that this phenomenon is due to the relationship between the estimated scores (from focal loss) and the true conditional probabilities (cf. Charoenphakdee et al. (2021); Liu et al. (2021)). We leave this topic as future work.

  • Although the RankDice/mRankDice framework is developed for Dice/mDice optimization, the performance in terms of the IoU/mIoU metric is also consistently improved.

Moreover, we also present some important observations based on our experiments about losses, frameworks, and models.

  • CE, Focal and BCE are the top three losses for Dice-segmentation. While some Dice approximating losses, such as Soft-Dice and binary Soft-Dice, usually lead to suboptimal solutions.

  • It seems that the multiclass modeling and multiclass losses are more preferred for both datasets. Moreover, the threshold-based framework usually outperforms the argmax-based framework for both multiclass and multilabel losses.

  • For multiclass losses, including CE Focal, Soft-Dice, and LovaszSoftmax, the Dice and IoU metrics are consistent, i.e., a higher Dice yields a higher IoU score. For multilabel losses, including BCE and B-Soft-Dice, there is a significant difference between Dice and IoU metrics.

Object class Threshold/Argmax RankDice (our) imps.
(Dice, IoU) (×.01\times.01) (Dice, IoU) (×.01\times.01) (best vs. best)
CE Focal BCE CE Focal BCE (Dice)
road (85.7, 77.2) (86.1, 77.9) (92.2, 46.1) (85.6, 77.1) (86.0, 77.7) (92.2, 46.1) \sim
sidewalk (57.3, 47.8) (53.6, 43.8) (43.8, 21.9) (60.8, 50.8) (58.7, 48.4) (54.0, 27.0) 6.1%
building (84.6, 76.2) (83.4, 74.8) (79.4, 39.7) (85.1, 76.7) (83.6, 74.7) (82.1, 41.0) \sim
wall (17.4, 13.6) (16.1, 12.4) (04.9, 02.4) (21.0, 16.4) (21.5, 16.8) (08.3, 04.2) 23.6%
fence (14.7, 10.9) (12.4, 08.9) (15.2, 07.6) (15.8, 11.7) (13.7, 09.8) (19.1, 09.5) 25.7%
pole (41.9, 29.0) (34.7, 23.4) (27.1, 13.5) (46.0, 31.7) (35.6, 23.1) (36.4, 18.2) 9.8%
traffic light (34.9, 26.5) (31.5, 24.0) (18.7, 09.4) (37.4, 28.3) (33.5, 24.7) (21.3, 10.6) 7.2%
traffic sign (49.9, 39.0) (45.9, 35.1) (35.3, 17.6) (51.4, 40.1) (46.6, 35.1) (39.6, 19.8) 3.0%
vegetation (90.2, 84.1) (90.2, 83.8) (89.0, 44.5) (90.3, 84.1) (89.6, 82.8) (89.4, 44.7) \sim
terrain (25.7, 20.1) (24.1, 18.5) (19.8, 09.9) (29.4, 23.1) (28.7, 22.7) (25.3, 12.7) 14.4%
sky (83.6, 77.0) (82.0, 75.2) (80.1, 40.0) (84.5, 77.8) (83.1, 76.2) (80.7, 40.3) 1.1%
person (45.1, 36.3) (42.6, 34.1) (32.8, 16.4) (49.5, 40.0) (47.6, 38.2) (38.6, 19.3) 9.8%
rider (35.1, 27.3) (31.2, 24.0) (18.6, 09.3) (37.2, 29.2) (33.9, 26.3) (24.0, 12.0) 6.0%
car (84.1, 76.9) (83.4, 76.2) (80.8, 40.4) (84.0, 76.6) (81.8, 74.0) (81.2, 40.6) \sim
truck (24.7, 21.9) (25.6, 22.7) (21.8, 10.9) (26.6, 23.3) (28.1, 24.8) (26.8, 13.4) 9.8%
bus (46.8, 42.2) (48.8, 43.8) (36.3, 18.2) (51.3, 46.5) (51.5, 46.8) (39.2, 19.6) 5.5%
train (34.9, 30.7) (36.3, 31.0) (33.8, 16.9) (35.8, 31.5) (37.3, 32.2) (34.7, 17.4) 2.8%
motorcycle (19.7, 15.8) (20.4, 16.1) (07.0, 03.5) (22.2, 17.7) (21.1, 16.8) (08.7, 04.4) 8.8%
bicycle (41.4, 32.5) (42.1, 32.9) (32.9, 16.5) (41.9, 32.6) (42.0, 32.5) (36.7, 18.4) \sim
Table 5: Averaged class-specific Dice and IoU metrics of Threshold/Argmax, and the proposed RankDice based on various losses of PSPNet + resnet50 on Fine-annotated CityScapes val set. The class-specific improvement in terms of the Dice metric of the proposed RankDice framework is computed in “imps.”, where \sim indicates that the difference between Threshold/Argmax and RankDice is smaller than 1.0%.
Object class Threshold/Argmax RankDice (our) imps.
(Dice, IoU) (×.01\times.01) (Dice, IoU) (×.01\times.01) (best vs. best)
CE Focal BCE CE Focal BCE (Dice)
Aeroplane (71.2, 63.4) (68.4, 59.2) (72.9, 36.5) (71.3, 63.4) (72.7, 64.1) (75.3, 37.6) 3.3%
Bicycle (37.1, 25.9) (19.6, 12.4) (14.6, 7.30) (38.7, 27.3) (30.5, 20.6) (23.1, 11.5) 4.3%
Bird (76.0, 68.2) (74.3, 65.2) (74.2, 37.1) (76.6, 68.7) (75.8, 66.4) (76.3, 38.1) \sim
Boat (51.1, 42.7) (59.5, 49.1) (55.5, 27.8) (51.3, 42.9) (61.9, 51.5) (61.0, 30.5) 4.0%
Bottle (42.8, 35.8) (36.2, 30.0) (39.1, 19.6) (44.2, 36.8) (37.6, 31.4) (41.1, 20.6) 3.3%
Bus (72.8, 68.3) (72.3, 67.5) (74.8, 37.4) (74.1, 69.6) (73.5, 68.8) (75.9, 37.9) 1.5%
Car (53.5, 47.5) (51.1, 45.6) (48.9, 24.4) (55.0, 49.0) (53.6, 47.9) (51.7, 25.9) 2.7%
Cat (75.0, 69.2) (74.1, 67.9) (73.1, 36.6) (75.5, 69.7) (75.4, 68.7) (75.1, 37.6) \sim
Chair (17.5, 12.8) (16.7, 11.6) (10.2, 5.10) (19.6, 14.4) (22.2, 16.1) (14.5, 7.30) 26.9%
Cow (65.3, 58.6) (60.1, 53.7) (64.9, 32.4) (66.5, 59.9) (62.3, 56.0) (68.4, 34.2) 4.8%
Diningtable (32.9, 27.5) (33.6, 27.4) (31.7, 15.9) (34.5, 29.2) (38.6, 32.4) (35.3, 17.6) 14.9%
Dog (64.6, 57.9) (71.0, 63.4) (71.7, 35.9) (65.5, 58.7) (72.5, 64.9) (74.4, 37.2) 3.8%
Horse (63.9, 55.3) (67.3, 58.3) (67.0, 33.5) (65.3, 56.6) (69.5, 60.1) (70.9, 35.4) 5.4%
Motorbike (69.7, 60.6) (65.5, 56.7) (66.9, 33.5) (71.6, 62.6) (67.0, 57.9) (70.1, 35.1) 2.7%
Person (67.0, 57.7) (65.0, 55.4) (67.4, 33.7) (67.4, 58.1) (67.2, 57.6) (69.7, 34.8) 3.4%
Pottedplant (26.9, 20.2) (22.4, 17.3) (25.5, 12.8) (29.1, 22.0) (26.9, 20.7) (28.6, 14.3) 8.2%
Sheep (53.9, 47.4) (62.8, 55.4) (62.1, 31.1) (54.3, 47.9) (66.0, 58.6) (66.9, 33.4) 6.5%
Sofa (29.8, 25.0) (29.8, 24.4) (33.7, 16.8) (32.0, 26.9) (34.6, 29.0) (38.9, 19.4) 14.5%
Train (77.7, 71.0) (75.8, 68.9) (80.3, 40.1) (77.9, 71.1) (77.3, 70.4) (82.1, 41.1) 2.2%
Tvmonitor (48.4, 41.4) (50.7, 41.6) (53.7, 26.8) (49.2, 42.0) (54.1, 45.4) (56.4, 28.2) 5.0%
Table 6: Class-specific Dice and IoU of Threshold/Argmax, and the proposed RankDice based on various losses of PSPNet + resnet50 on PASCAL VOC 2012 val set. The class-specific improvement in terms of the Dice metric of the proposed RankDice framework is computed in “imps.”, where \sim indicates that the difference between Threshold/Argmax and the proposed RankDice is smaller than 1.0%.
Refer to caption
Refer to caption
Refer to caption

  Refer to caption Refer to caption Refer to caption

Figure 5: Comparison of segmentation results between the proposed method and existing methods for classes cat (upper panel) and chair (lower panel). Column 1 indicates original images, Column 2 indicates ground truths, and Columns 3-5 indicate the predicted segmentation produced by argmax, thresholding, and the proposed RankDice, respectively. The results are provided by a PSPNet trained with the cross-entropy loss.
Framework Thold (Dice, IoU) (×.01\times.01)
threshold-based 0.1 (49.10, 24.60)
0.2 (53.00, 26.50)
0.3 (53.60, 26.80)
0.4 (52.90, 26.50)
0.5 (51.40, 25.70)
0.6 (49.60, 24.80)
0.7 (47.00, 23.50)
0.8 (43.40, 21.70)
0.9 (37.40, 18.70)
RankDice(our) (55.10, 27.60)
Framework Thold (Dice, IoU) (×.01\times.01)
threshold-based 0.1 (56.80, 28.40)
0.2 (63.90, 32.00)
0.3 (65.70, 32.80)
0.4 (65.60, 32.80)
0.5 (64.20, 32.10)
0.6 (62.30, 32.00)
0.7 (59.30, 29.60)
0.8 (54.20, 27.10)
0.9 (43.40, 21.70)
RankDice(our) (67.10, 33.50)
Table 7: The averaged Dice and IoU metrics and their standard errors (in parentheses) of the proposed RankDice framework and the fixed-thresholding (with different thresholds) framework in Fine-annotated CityScapes (left) and PASCAL VOC 2012 (right) datasets. The performance is reported based on PSPNet + resnet50 with the BCE loss.
Dataset Temp Threshold RankDice (our)
(Dice, IoU) (×.01\times.01) (Dice, IoU) (×.01\times.01)
CE BCE CE BCE
CityScapes 1.0 (57.50, 49.60) (51.40, 25.70) (59.30, 51.00) (55.10, 27.60)
1.2 (57.40, 49.60) / (59.40, 51.20) (54.70, 26.30)
1.5 (56.90, 49.20) / (59.40, 51.20) (50.60, 25.30)
1.7 (56.20, 48.50) / (59.10, 51.00) (47.20, 23.60)
2.0 (54.40, 46.90) / (58.00, 50.10) (47.20, 23.60)
2.2 (52.80, 45.40) / (57.10, 49.30) (44.70, 22.40)
2.5 (49.80, 42.50) / (55.40, 48.00) (41.50, 20.70)
VOC 2012 1.0 (64.60, 57.10) (64.20, 32.10) (65.40, 57.80) (67.10, 33.50)
1.2 (64.50, 57.50) / (66.10, 58.30) (68.80, 34.40)
1.5 (65.30, 57.70) / (66.90, 59.10) (69.30, 34.60)
1.7 (65.30, 57.70) / (67.60, 59.80) (69.00, 34.50)
2.0 (64.80, 57.10) / (68.30, 60.50) (68.00, 34.00)
2.2 (63.80, 56.00) / (68.40, 60.60) (67.00, 33.50)
2.5 (61.30, 53.40) / (67.90, 60.20) (65.10, 32.50)
Table 8: The averaged Dice and IoU metrics and their standard errors (in parentheses) of the proposed RankDice framework and the fixed-thresholding framework based on temperature-scaling calibration methods with different temperature tuning parameters in Fine-annotated CityScapes and PASCAL VOC 2012 datasets. ‘/’ indicates that the performance of the thresholding-based framework over different temperatures are all the same under BCE loss. The performance is reported based on PSPNet + resnet50.

6 Conclusions and future work

Summary. In this paper, we proposed a ranking-based framework for segmentation called RankSEG that comprises three steps: conditional probability estimation, ranking, and volume estimation. Specifically, we have focused on the Dice metric and developed RankDice, a version of RankSEG for optimal Dice-segmentation. We introduced a key concept “Dice-calibrated” and demonstrated that RankDice is able to recover the optimal segmentation rule, as opposed to the existing fixed-thresholding frameworks that are suboptimal with respect to the Dice metric. Computationally, we have developed efficient exact/approximate numerical methods, including GPU-enabled algorithms, to carry out RankDice. Moreover, we established general theoretical results, including excess risk bounds and a rate of convergence for RankDice, showing that RankDice is consistent when the conditional probability estimation is well-calibrated. Empirical experiments suggested that the proposed framework performs consistently well on a variety of segmentation benchmarks and state-of-the-art deep learning architectures. In parallel to RankDice, we also developed the framework RankIoU for the IoU metric. The theoretical results are similar, while the computation for the optimal IoU-segmentation could be more expensive in high-dimensional situation.

Limitation and future work. (i) For multiclass/multilabel segmentation, our results in this paper cover the overlapping (allowing) case; however, computing the optimal segmentation for the non-overlapping case is NP-hard. Thus, it would be interesting to develop a scalable approximating algorithm to utilize the proposed framework in the non-overlapping setting. (ii) The conditional independence, YiYj|𝐗Y_{i}\perp Y_{j}|\mathbf{\bm{X}} for any iji\neq j, is crucial for Theorem 1 and subsequent theorems in Section 4. In some segmentation applications, it is of interest to extend the proposed frameworks and theorems with locally dependent outcomes. (iii) When given the testing features, the proposed method can be extended to maximize the F1-score in classification tasks.


Acknowledgments and Disclosure of Funding

We would like to acknowledge support for this project from the HK RGC-ECS 24302422 and the CUHK direct fund. Ben Dai developed the main framework (RankDice), theory, algorithms and experiments. Chunlin Li mainly extended RankDice to RankIoU (Section 4.2), and partly contributed to the proof of Lemma 3 and Theorem 11. We would like to thank the referees and the Action Editor for constructive feedback which greatly improved this work.

Appendix A Empirical evaluation of the Dice metric

Recall the definition of the Dice and IoU metrics in (1), their empirical evaluation based on a validation/testing dataset (𝐱~i,𝐲~i)i=1,,m(\tilde{\mathbf{\bm{x}}}_{i},\tilde{\mathbf{\bm{y}}}_{i})_{i=1,\cdots,m}, can be written as:

Dice^γ(𝜹)\displaystyle\widehat{{\rm Dice}}_{\gamma}(\boldsymbol{\delta}) =1mi=1m2𝐲~i𝜹(𝐱~i)+γ𝐲~i1+𝜹(𝐱~i)1+γ=1mi=1m2TPi+γ2TPi+FPi+FNi+γ,\displaystyle=\frac{1}{m}\sum_{i=1}^{m}\frac{2\tilde{\mathbf{\bm{y}}}^{\intercal}_{i}\boldsymbol{\delta}(\tilde{\mathbf{\bm{x}}}_{i})+\gamma}{\|\tilde{\mathbf{\bm{y}}}_{i}\|_{1}+\|\boldsymbol{\delta}(\tilde{\mathbf{\bm{x}}}_{i})\|_{1}+\gamma}=\frac{1}{m}\sum_{i=1}^{m}\frac{2\text{TP}_{i}+\gamma}{2\text{TP}_{i}+\text{FP}_{i}+\text{FN}_{i}+\gamma},
IoU^γ(𝜹)\displaystyle\widehat{{\rm IoU}}_{\gamma}(\boldsymbol{\delta}) =1mi=1m(𝐲~i𝜹(𝐱~i)+γ𝐲~i1+𝜹(𝐱~i)1𝐲~i𝜹(𝐱~i)+γ)=1mi=1mTPi+γTPi+FPi+FNi+γ,\displaystyle=\frac{1}{m}\sum_{i=1}^{m}\Big{(}\frac{\tilde{\mathbf{\bm{y}}}_{i}^{\intercal}\boldsymbol{\delta}(\tilde{\mathbf{\bm{x}}}_{i})+\gamma}{\|\tilde{\mathbf{\bm{y}}}_{i}\|_{1}+\|\boldsymbol{\delta}(\tilde{\mathbf{\bm{x}}}_{i})\|_{1}-\tilde{\mathbf{\bm{y}}}_{i}^{\intercal}\boldsymbol{\delta}(\tilde{\mathbf{\bm{x}}}_{i})+\gamma}\Big{)}=\frac{1}{m}\sum_{i=1}^{m}\frac{\text{TP}_{i}+\gamma}{\text{TP}_{i}+\text{FP}_{i}+\text{FN}_{i}+\gamma}, (22)

where TPi\text{TP}_{i}, FPi\text{FP}_{i} and FNi\text{FN}_{i} are defined at the instance level. In general, the empirical Dice and IoU metrics are not equal to the evaluation criteria used in some literature:

Dice^γ(𝜹)Dice¯γ(𝜹):=1mi=1m2𝐲~i𝜹(𝐱~i)+γ1mi=1m𝐲~i1+1mi=1m𝜹(𝐱~i)1+γ𝔼(2𝐘𝜹(𝐗))+γ𝔼(𝐘1)+𝔼(𝜹(𝐗)1)+γ,\displaystyle\widehat{{\rm Dice}}_{\gamma}(\boldsymbol{\delta})\neq\mkern 1.5mu\overline{\mkern-1.5mu{\rm Dice}\mkern-1.5mu}\mkern 1.5mu_{\gamma}(\boldsymbol{\delta}):=\frac{\frac{1}{m}\sum_{i=1}^{m}2\tilde{\mathbf{\bm{y}}}^{\intercal}_{i}\boldsymbol{\delta}(\tilde{\mathbf{\bm{x}}}_{i})+\gamma}{\frac{1}{m}\sum_{i=1}^{m}\|\tilde{\mathbf{\bm{y}}}_{i}\|_{1}+\frac{1}{m}\sum_{i=1}^{m}\|\boldsymbol{\delta}(\tilde{\mathbf{\bm{x}}}_{i})\|_{1}+\gamma}\overset{\mathbb{P}}{\longrightarrow}\frac{\mathbb{E}\big{(}2\mathbf{\bm{Y}}^{\intercal}\boldsymbol{\delta}(\mathbf{\bm{X}})\big{)}+\gamma}{\mathbb{E}\big{(}\|\mathbf{\bm{Y}}\|_{1})+\mathbb{E}\big{(}\|\boldsymbol{\delta}(\mathbf{\bm{X}})\|_{1})+\gamma},
IoU^γ(𝜹)IoU¯γ(𝜹):=1mi=1m𝐲~i𝜹(𝐱~i)+γ1mi=1m𝐲~i1+1mi=1m𝜹(𝐱~i)11mi=1m𝐲~i𝜹(𝐱~i)+γ\displaystyle\widehat{{\rm IoU}}_{\gamma}(\boldsymbol{\delta})\neq\mkern 1.5mu\overline{\mkern-1.5mu{\rm IoU}\mkern-1.5mu}\mkern 1.5mu_{\gamma}(\boldsymbol{\delta}):=\frac{\frac{1}{m}\sum_{i=1}^{m}\tilde{\mathbf{\bm{y}}}^{\intercal}_{i}\boldsymbol{\delta}(\tilde{\mathbf{\bm{x}}}_{i})+\gamma}{\frac{1}{m}\sum_{i=1}^{m}\|\tilde{\mathbf{\bm{y}}}_{i}\|_{1}+\frac{1}{m}\sum_{i=1}^{m}\|\boldsymbol{\delta}(\tilde{\mathbf{\bm{x}}}_{i})\|_{1}-\frac{1}{m}\sum_{i=1}^{m}\tilde{\mathbf{\bm{y}}}_{i}^{\intercal}\boldsymbol{\delta}(\tilde{\mathbf{\bm{x}}}_{i})+\gamma}
𝔼(2𝐘𝜹(𝐗))+γ𝔼(𝐘1)+𝔼(𝜹(𝐗)1)𝔼(𝐘𝜹(𝐗))+γ.\displaystyle\hskip 142.26378pt\qquad\overset{\mathbb{P}}{\longrightarrow}\frac{\mathbb{E}\big{(}2\mathbf{\bm{Y}}^{\intercal}\boldsymbol{\delta}(\mathbf{\bm{X}})\big{)}+\gamma}{\mathbb{E}\big{(}\|\mathbf{\bm{Y}}\|_{1})+\mathbb{E}\big{(}\|\boldsymbol{\delta}(\mathbf{\bm{X}})\|_{1})-\mathbb{E}\big{(}\mathbf{\bm{Y}}^{\intercal}\boldsymbol{\delta}(\mathbf{\bm{X}})\big{)}+\gamma}. (23)

Here \overset{\mathbb{P}}{\to} denotes convergence in probability following from the law of large numbers and Slutsky’s theorem. Clearly, both empirical and population evaluations in (23) do not match with the empirical verisons in (22) and the population Dice in (1). Although the empirical evaluation in (23) is widely used, it inherently discounts the effects of instances with small segmented features/pixels, leading to bias in the empirical evaluation. The issues of Dice¯γ(𝜹)\mkern 1.5mu\overline{\mkern-1.5mu{\rm Dice}\mkern-1.5mu}\mkern 1.5mu_{\gamma}(\boldsymbol{\delta}) and IoU¯γ(𝜹)\mkern 1.5mu\overline{\mkern-1.5mu{\rm IoU}\mkern-1.5mu}\mkern 1.5mu_{\gamma}(\boldsymbol{\delta}) are also indicated in some recent literature, including Cordts et al. (2016) and Berman et al. (2018).

Therefore, it is highly recommended using the empirical Dice in (22) in implementation, and our numerical results in Section 5 are reported based on (22).

Appendix B Auxiliary definitions

B.1 Population RankSEG

In this section, we present the definition of population RankSEG, including the proposed frameworks RankDice and RankIoU. In other words, we work with population of (𝐗,𝐘)d×{0,1}d(\mathbf{\bm{X}},\mathbf{\bm{Y}})\in\mathbb{R}^{d}\times\{0,1\}^{d}. Denote 𝒬\mathcal{Q} as the class of all measurable functions 𝐪:𝐱d𝐪(𝐱)=(q1(𝐱),,qd(𝐱))[0,1]d\mathbf{\bm{q}}:\mathbf{\bm{x}}\in\mathbb{R}^{d}\to\mathbf{\bm{q}}(\mathbf{\bm{x}})=(q_{1}(\mathbf{\bm{x}}),\cdots,q_{d}(\mathbf{\bm{x}}))^{\intercal}\in[0,1]^{d}.

Step 1 (Conditional probability estimation): Estimate the conditional probability based on a strictly proper loss l(,)l(\cdot,\cdot):

𝐪^=argmin𝐪𝒬𝔼(l(𝐘,𝐪(𝐗))).\widehat{\mathbf{\bm{q}}}=\mathop{\rm argmin}_{\mathbf{\bm{q}}\in\mathcal{Q}}\mathbb{E}\big{(}l(\mathbf{\bm{Y}},\mathbf{\bm{q}}(\mathbf{\bm{X}}))\big{)}. (24)

Step 2 (Ranking): Given a new instance 𝐱\mathbf{\bm{x}}, sort its estimated conditional probabilities decreasingly, and denote the corresponding indices as j1,,jdj_{1},\cdots,j_{d}, that is, q^j1(𝐱)q^j2(𝐱)q^jd(𝐱)\widehat{{q}}_{j_{1}}(\mathbf{\bm{x}})\geq\widehat{{q}}_{j_{2}}(\mathbf{\bm{x}})\geq\cdots\geq\widehat{{q}}_{j_{d}}(\mathbf{\bm{x}}).

Step 3 (Volume estimation): From (4), we estimate the volume τ^(𝐱)\widehat{\tau}(\mathbf{\bm{x}}) by replacing the true conditional probability 𝐩(𝐱)\mathbf{\bm{p}}(\mathbf{\bm{x}}) by the estimated one 𝐪^(𝐱)\widehat{\mathbf{\bm{q}}}(\mathbf{\bm{x}}):

(RankDice)τ^(𝐱)\displaystyle\text{(RankDice)}\ \ \ \widehat{\tau}(\mathbf{\bm{x}}) =argmaxτ{0,,d}s=1τl=0d12τ+l+γ+1q^js(𝐱)(Γ^js(𝐱)=l)+l=0dγτ+l+γ(Γ^(𝐱)=l),\displaystyle=\mathop{\rm argmax}_{\tau\in\{0,\cdots,d\}}\sum_{s=1}^{\tau}\sum_{l=0}^{d-1}\frac{2}{\tau+l+\gamma+1}\widehat{{q}}_{j_{s}}(\mathbf{\bm{x}})\mathbb{P}\big{(}\widehat{\Gamma}_{\sm j_{s}}(\mathbf{\bm{x}})=l\big{)}+\sum_{l=0}^{d}\frac{\gamma}{\tau+l+\gamma}\mathbb{P}\big{(}\widehat{\Gamma}(\mathbf{\bm{x}})=l\big{)},
(RankIoU)τ^(𝐱)\displaystyle\text{(RankIoU)}\ \ \ \widehat{\tau}(\mathbf{\bm{x}}) =argmaxτ{0,1,,d}(jJτ(𝐱)q^j(𝐱)+γ)l=0dτ1τ+l+γ(Γ^Jτ(𝐱)(𝐱)=l),\displaystyle=\mathop{\rm argmax}_{\tau\in\{0,1,\cdots,d\}}\ \Big{(}\sum_{j\in J_{\tau}(\mathbf{\bm{x}})}\widehat{q}_{j}(\mathbf{\bm{x}})+\gamma\Big{)}\sum_{l=0}^{d-\tau}\frac{1}{\tau+l+\gamma}\mathbb{P}\big{(}\widehat{\Gamma}_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}})=l\big{)},

where Γ^(𝐱)=j=1dB^j(𝐱)\widehat{\Gamma}(\mathbf{\bm{x}})=\sum_{j=1}^{d}\widehat{B}_{j}(\mathbf{\bm{x}}), Γ^js(𝐱)=jjsB^j(𝐱)\widehat{\Gamma}_{\sm j_{s}}(\mathbf{\bm{x}})=\sum_{j\neq j_{s}}\widehat{B}_{j}(\mathbf{\bm{x}}), and Γ^Jτ(𝐱)(𝐱)=jJτ(𝐱)B^j(𝐱)\widehat{\Gamma}_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}})=\sum_{j\notin J_{\tau}(\mathbf{\bm{x}})}\widehat{B}_{j}(\mathbf{\bm{x}}) denote Poisson-binomial random variables, and B^j(𝐱)\widehat{B}_{j}(\mathbf{\bm{x}}) is a Bernoulli random variable with the success probability q^j(𝐱)\widehat{q}_{j}(\mathbf{\bm{x}}); for j=1,,dj=1,\cdots,d.

B.2 Poisson-binomial distribution

The Poisson binomial distribution is the discrete probability distribution of a sum of independent non-identical Bernoulli trials. Specifically, suppose B1,,BdB_{1},\cdots,B_{d} are independent Bernoulli random variables, with probabilities of success 𝐩=(p1,,pd)\mathbf{\bm{p}}=(p_{1},\cdots,p_{d})^{\intercal}, then Γ=j=1dBj\Gamma=\sum_{j=1}^{d}B_{j} is a Poisson-Binomial random variable with parameter 𝐩\mathbf{\bm{p}}, denoted as ΓPB(𝐩)\Gamma\sim\text{PB}(\mathbf{\bm{p}}), and its probability mass function is:

(Γ=l)=𝐛:𝐛1=lj=1d(bjpj+(1bj)(1pj)),\mathbb{P}\big{(}\Gamma=l\big{)}=\sum_{\mathbf{\bm{b}}:\|\mathbf{\bm{b}}\|_{1}=l}\prod_{j=1}^{d}\big{(}b_{j}p_{j}+(1-b_{j})\big{(}1-p_{j}\big{)}\big{)},

where 𝐛=(b1,,bd){0,1}d\mathbf{\bm{b}}=(b_{1},\cdots,b_{d})^{\intercal}\in\{0,1\}^{d}. Moreover, the mean, variance, and skewness for ΓPB(𝐩)\Gamma\sim\text{PB}(\mathbf{\bm{p}}) are listed as follows.

μ𝔼(Γ)=j=1dpj,σ2Var(Γ)=j=1dpj(1pj),ηSkew(Γ)=1σ3j=1dpj(1pj)(12pj).\displaystyle\mu\coloneqq\mathbb{E}(\Gamma)=\sum_{j=1}^{d}p_{j},\ \sigma^{2}\coloneqq\mathop{\rm Var}(\Gamma)=\sum_{j=1}^{d}p_{j}(1-p_{j}),\ \eta\coloneqq\text{Skew}(\Gamma)=\frac{1}{\sigma^{3}}\sum_{j=1}^{d}p_{j}(1-p_{j})(1-2p_{j}).

B.3 Conditional independence in segmentation

In this section, we adopt a probabilistic perspective on the likelihood of the segmentation task, to suggest that the conditional independence (YjYj𝐗Y_{j}\perp Y_{j^{\prime}}\mid\mathbf{\bm{X}} for jjj\neq j^{\prime}) is implicitly assumed to ensure the validity of the cross-entropy (CE) loss, and widely accepted due to the high dimensional nature of segmentation data.

Suppose (𝐗i,𝐘i)iid𝐗,𝐘(\mathbf{\bm{X}}_{i},\mathbf{\bm{Y}}_{i})\overset{\text{iid}}{\sim}\mathbb{P}_{\mathbf{\bm{X}},\mathbf{\bm{Y}}}, the negative conditional log-likelihood function of 𝐪\mathbf{\bm{q}} for the probabilistic model is:

n(𝐪)\displaystyle\mathcal{L}_{n}(\mathbf{\bm{q}}) :=log(i=1n𝐪(𝐘=𝐲i𝐗=𝐱i))=log(i=1nj=1dqj(𝐱i)yij(1qj(𝐱i))1yij)\displaystyle:=-\log\Big{(}\prod_{i=1}^{n}\mathbb{P}_{\mathbf{\bm{q}}}(\mathbf{\bm{Y}}=\mathbf{\bm{y}}_{i}\mid\mathbf{\bm{X}}=\mathbf{\bm{x}}_{i})\Big{)}=-\log\Big{(}\prod_{i=1}^{n}\prod_{j=1}^{d}q_{j}(\mathbf{\bm{x}}_{i})^{y_{ij}}(1-q_{j}(\mathbf{\bm{x}}_{i}))^{1-y_{ij}}\Big{)}
=i=1nj=1d(yijlog(qj(𝐱i))+(1yij)log(1qj(𝐱i)))=i=1nlCE(𝐲i,𝐪(𝐱i)),\displaystyle=-\sum_{i=1}^{n}\sum_{j=1}^{d}(y_{ij}\log(q_{j}(\mathbf{\bm{x}}_{i}))+(1-y_{ij})\log(1-q_{j}(\mathbf{\bm{x}}_{i})))=\sum_{i=1}^{n}l_{\text{CE}}(\mathbf{\bm{y}}_{i},\mathbf{\bm{q}}(\mathbf{\bm{x}}_{i})),

where the second equality follows from the conditional independence assumption YjYj𝐗Y_{j}\perp Y_{j^{\prime}}\mid\mathbf{\bm{X}} for jjj\neq j^{\prime}, which connects the CE loss to the negative conditional log-likelihood function. In this sense, the conditional independence is naturally (and implicitly) assumed by CE to presuppose the probabilistic interpretation of its estimator.

Moreover, the conditional independence is widely accepted due to the high dimensional nature of segmentation data. For example, given a 512x512 image, it is infeasible to consider 5124512^{4} pairs of label-dependence, which can even be adaptive with respect to 𝐱\mathbf{\bm{x}}.

Appendix C Simulation results and Implementation details

C.1 Simulation setting and results

This subsection includes the simulation setting demonstration (Figure 6) and the numerical results (Tables 9 and 10, and Figure 7).

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 6: Simulation setting in Section 5.1 with different decay patterns and dimensions/shapes (28x28 - 256x256). Upper panel. Heatmaps for the simulated probabilities with step decay (β=0.1,0.3,0.5\beta=0.1,0.3,0.5). Middle panel. Heatmaps for the simulated probabilities with exponential decay (base=1.01, 1.05, 1.10). Lower panel. Heatmaps for the simulated probabilities with linear decay (slope=1, 2, 4). The performance for the proposed RankDice and thresholding-based frameworks is summarized in Table 9.
Decay Shape Threshold (at 0.5) RankDice (our)
step(0.1) 28x28 0.049(.000) 0.274(.001)
64x64 0.083(.000) 0.279(.000)
128x128 0.081(.000) 0.278(.000)
256x256 0.089(.000) 0.279(.000)
step(0.3) 28x28 0.022(.001) 0.499(.001)
64x64 0.038(.000) 0.517(.001)
128x128 0.036(.000) 0.518(.000)
256x256 0.040(.000) 0.518(.000)
step(0.5) 28x28 0.708(.000) 0.708(.000)
64x64 0.707(.000) 0.707(.000)
128x128 0.708(.000) 0.708(.000)
256x256 0.708(.000) 0.708(.000)
exp(1.01) 28x28 0.870(.000) 0.870(.000)
64x64 0.669(.000) 0.714(.000)
128x128 0.410(.000) 0.551(.000)
256x256 0.286(.000) 0.450(.000)
exp(1.05) 28x28 0.427(.001) 0.551(.001)
64x64 0.296(.001) 0.446(.001)
128x128 0.276(.001) 0.427(.001)
256x256 0.274(.001) 0.427(.001)
exp(1.10) 28x28 0.332(.002) 0.467(.002)
64x64 0.301(.001) 0.439(.002)
128x128 0.300(.002) 0.438(.002)
256x256 0.298 (.002) 0.436(.002)
Decay Shape Threshold (at 0.5) RankDice (our)
linear(1.00) 28x28 0.679(.001) 0.717(.001)
64x64 0.672(.000) 0.711(.000)
128x128 0.669(.000) 0.709(.000)
256x256 0.668(.000) 0.707(.000)
linear(2.00) 28x28 0.578(.001) 0.647(.001)
64x64 0.575(.001) 0.642(.001)
128x128 0.573(.000) 0.638(.000)
256x256 0.573(.000) 0.637(.000)
linear(4.00) 28x28 0.588(.003) 0.663(.002)
64x64 0.580(.001) 0.646(.001)
128x128 0.575(.001) 0.642(.001)
256x256 0.574(.000) 0.639(.000)
Table 9: The averaged Dice metrics and its standard errors (in parentheses) of the proposed RankDice framework and the thresholding-based (or argmax-based) framework in Example 1 (see Fig 6) in Section 5.1.
Framework Threshold Dice
threshold-based 0.1 0.481(.005)
0.2 0.560(.005)
0.3 0.560(.005)
0.4 0.560(.005)
0.5 0.560(.005)
0.6 0.528(.005)
0.7 0.471(.005)
0.8 0.377(.005)
0.9 0.230(.004)
RankDice(our) 0.601(0.005)
Table 10: The averaged Dice metrics and its standard errors (in parentheses) of the proposed RankDice framework and the thresholding-based (with different thresholds) framework in Example 2 in Section 5.1.
[Uncaptioned image]
Figure 7: The optimal thresholds for different images with various generating parameters (β,ρ)(\beta,\rho) in Example 2 in Section 5.1.

C.2 Implementation details

The experiment protocol of our numerical sections basically follows a well-developed Github repository pytorch-segmentation (Ouali, 2022). The major difference lies in the empirical evaluation of the Dice and IoU metrics. In our experiments, we report the unbiased evaluations mDice^γ()\widehat{{\rm mDice}}_{\gamma}(\cdot) and mIoU^γ()\widehat{{\rm mIoU}}_{\gamma}(\cdot), yet the biased evaluations mDice¯γ()\mkern 1.5mu\overline{\mkern-1.5mu{\rm mDice}\mkern-1.5mu}\mkern 1.5mu_{\gamma}(\cdot) and mIoU¯γ()\mkern 1.5mu\overline{\mkern-1.5mu{\rm mIoU}\mkern-1.5mu}\mkern 1.5mu_{\gamma}(\cdot) are usually used for existing literature, see more discussion and definitions in Appendix A.

To justify the effectiveness of our experiment protocol, we also report mDice¯γ()\mkern 1.5mu\overline{\mkern-1.5mu{\rm mDice}\mkern-1.5mu}\mkern 1.5mu_{\gamma}(\cdot) and mIoU¯γ()\mkern 1.5mu\overline{\mkern-1.5mu{\rm mIoU}\mkern-1.5mu}\mkern 1.5mu_{\gamma}(\cdot) under our setting based on the argmax-based framework and compare the performance with the existing benchmarks. Specifically, in our setting, the performance is, DeepLab: (i) CityScapes: mIoU¯\mkern 1.5mu\overline{\mkern-1.5mu{\rm mIoU}\mkern-1.5mu}\mkern 1.5mu 63.20%; mDice¯\mkern 1.5mu\overline{\mkern-1.5mu{\rm mDice}\mkern-1.5mu}\mkern 1.5mu 76.10%; (ii) VOC: mIoU¯\mkern 1.5mu\overline{\mkern-1.5mu{\rm mIoU}\mkern-1.5mu}\mkern 1.5mu 74.40%; mDice¯\mkern 1.5mu\overline{\mkern-1.5mu{\rm mDice}\mkern-1.5mu}\mkern 1.5mu 84.30%; PSPNet: (i) CityScapes: mIoU¯\mkern 1.5mu\overline{\mkern-1.5mu{\rm mIoU}\mkern-1.5mu}\mkern 1.5mu 65.20%; mDice¯\mkern 1.5mu\overline{\mkern-1.5mu{\rm mDice}\mkern-1.5mu}\mkern 1.5mu 77.60%; (ii) VOC test: mIoU¯\mkern 1.5mu\overline{\mkern-1.5mu{\rm mIoU}\mkern-1.5mu}\mkern 1.5mu 79% (which is provided by Ouali (2022) with the same configuration expect batch_size=16); FCN8: VOC: mIoU¯\mkern 1.5mu\overline{\mkern-1.5mu{\rm mIoU}\mkern-1.5mu}\mkern 1.5mu 55.60%; mDice¯\mkern 1.5mu\overline{\mkern-1.5mu{\rm mDice}\mkern-1.5mu}\mkern 1.5mu 70.40%.

The experiment protocol, including learning_rate, crop_size, backbone, and batch_size, on the existing networks are summarized as follows.

DeepLab (Chen et al., 2018). The experiment on the Fine-annotated CityScapes dataset is set as follows: backbone is “Xception-65”. The final mIoU¯\mkern 1.5mu\overline{\mkern-1.5mu{\rm mIoU}\mkern-1.5mu}\mkern 1.5mu is 79.14%; The experiment on the PASCAL VOC 2012 dataset is set as follows: learning_rate is 0.007 with poly schedule; crop_size is 513x513, backbone is “resnet101”, batch_size is 16. The final mIoU¯\mkern 1.5mu\overline{\mkern-1.5mu{\rm mIoU}\mkern-1.5mu}\mkern 1.5mu is 78.21%;

PSPNet (Zhao et al., 2017). The experiment on the Fine-annotated CityScapes dataset is set as follows: learning_rate is 0.01. The final mIoU¯\mkern 1.5mu\overline{\mkern-1.5mu{\rm mIoU}\mkern-1.5mu}\mkern 1.5mu is 78.4%; The experiment on the PASCAL VOC 2012 dataset is set as follows: learning_rate is 0.01, batch_size is 16. The final mIoU¯\mkern 1.5mu\overline{\mkern-1.5mu{\rm mIoU}\mkern-1.5mu}\mkern 1.5mu based on the VOC test datset is 82.6%;

FCN8 (Long et al., 2015). The experiment on the PASCAL VOC 2012 dataset is set as follows: learning_rate is 0.0001, batch_size is 20, backbone is “VGG16”. The final mIoU¯\mkern 1.5mu\overline{\mkern-1.5mu{\rm mIoU}\mkern-1.5mu}\mkern 1.5mu is 62.2%;

Note that the suboptimal performance of our experiment may be caused by a small batch/crop size, different specified backbone models, and other fitting hyperparameters. The current experiment can be further improved by carefully tuning the hyperparameters, yet it provides a fair numerical comparison of all frameworks (threshold-based, argmax-based, and the proposed RankDice).

Appendix D Technical proofs

D.1 Proof of Theorem 1

Proof  It suffices to consider the point-wise maximization:

𝜹(𝐱)=argmax𝐯{0,1}dDiceγ(𝐯|𝐱),Diceγ(𝐯|𝐱)=𝔼(2𝐘𝐯+γ𝐘1+𝐯1+γ|𝐗=𝐱).\boldsymbol{\delta}^{*}(\mathbf{\bm{x}})=\mathop{\rm argmax}_{\mathbf{\bm{v}}\in\{0,1\}^{d}}\ {\rm Dice}_{\gamma}(\mathbf{\bm{v}}|\mathbf{\bm{x}}),\quad{\rm Dice}_{\gamma}(\mathbf{\bm{v}}|\mathbf{\bm{x}})=\mathbb{E}\Big{(}\frac{2\mathbf{\bm{Y}}^{\intercal}\mathbf{\bm{v}}+\gamma}{\|\mathbf{\bm{Y}}\|_{1}+\|\mathbf{\bm{v}}\|_{1}+\gamma}\Big{|}\mathbf{\bm{X}}=\mathbf{\bm{x}}\Big{)}.

Let 𝐲j=(y1,,yj1,yj+1,,yd)\mathbf{\bm{y}}_{\sm j}=(y_{1},\cdots,y_{j-1},y_{j+1},\cdots,y_{d})^{\intercal}, I(𝐯)=I(𝜹(𝐱))={j:vj=1}I(\mathbf{\bm{v}})=I(\boldsymbol{\delta}(\mathbf{\bm{x}}))=\{j:v_{j}=1\} be the index set of segmented features by 𝜹(𝐱)\boldsymbol{\delta}(\mathbf{\bm{x}}), and 𝐯1=τ\|\mathbf{\bm{v}}\|_{1}=\tau, we have

Diceγ(𝐯|𝐱)=𝔼(2𝐘𝐯𝐘1+τ+γ|𝐗=𝐱)+𝔼(γ𝐘1+τ+γ|𝐗=𝐱).{\rm Dice}_{\gamma}(\mathbf{\bm{v}}|\mathbf{\bm{x}})=\mathbb{E}\Big{(}\frac{2\mathbf{\bm{Y}}^{\intercal}\mathbf{\bm{v}}}{\|\mathbf{\bm{Y}}\|_{1}+\tau+\gamma}\Big{|}\mathbf{\bm{X}}=\mathbf{\bm{x}}\Big{)}+\mathbb{E}\Big{(}\frac{\gamma}{\|\mathbf{\bm{Y}}\|_{1}+\tau+\gamma}\Big{|}\mathbf{\bm{X}}=\mathbf{\bm{x}}\Big{)}.

Note that the second term is only related to τ\tau, and the first term can be rewritten as:

𝔼(2𝐘𝐯𝐘1+τ+γ\displaystyle\mathbb{E}\Big{(}\frac{2\mathbf{\bm{Y}}^{\intercal}\mathbf{\bm{v}}}{\|\mathbf{\bm{Y}}\|_{1}+\tau+\gamma} |𝐗=𝐱)=𝐲{0,1}d2𝐲𝐯(𝐘=𝐲|𝐱)τ+𝐲1+γ=𝐲{0,1}dj=1d2yjvj(𝐘=𝐲|𝐱)τ+𝐲1+γ\displaystyle\Big{|}\mathbf{\bm{X}}=\mathbf{\bm{x}}\Big{)}=\sum_{\mathbf{\bm{y}}\in\{0,1\}^{d}}\frac{2\mathbf{\bm{y}}^{\intercal}\mathbf{\bm{v}}\mathbb{P}(\mathbf{\bm{Y}}=\mathbf{\bm{y}}|\mathbf{\bm{x}})}{\tau+\|\mathbf{\bm{y}}\|_{1}+\gamma}=\sum_{\mathbf{\bm{y}}\in\{0,1\}^{d}}\sum_{j=1}^{d}\frac{2y_{j}v_{j}\mathbb{P}(\mathbf{\bm{Y}}=\mathbf{\bm{y}}|\mathbf{\bm{x}})}{\tau+\|\mathbf{\bm{y}}\|_{1}+\gamma}
=jI(𝐯)𝐲{0,1}d2yj(𝐘=𝐲|𝐱)τ+𝐲1+γ=jI(𝐯)𝐲j{0,1}d1yj=12(𝐘=𝐲|𝐱)τ+𝐲1+γ=jI(𝐯)sj(τ).\displaystyle=\sum_{j\in I(\mathbf{\bm{v}})}\sum_{\mathbf{\bm{y}}\in\{0,1\}^{d}}\frac{2y_{j}\mathbb{P}(\mathbf{\bm{Y}}=\mathbf{\bm{y}}|\mathbf{\bm{x}})}{\tau+\|\mathbf{\bm{y}}\|_{1}+\gamma}=\sum_{j\in I(\mathbf{\bm{v}})}\sum_{\begin{subarray}{c}\mathbf{\bm{y}}_{\sm j}\in\{0,1\}^{d-1}\\ y_{j}=1\end{subarray}}\frac{2\mathbb{P}(\mathbf{\bm{Y}}=\mathbf{\bm{y}}|\mathbf{\bm{x}})}{\tau+\|\mathbf{\bm{y}}\|_{1}+\gamma}=\sum_{j\in I(\mathbf{\bm{v}})}s_{j}(\tau). (25)

As indicated in (D.1), when τ\tau is given, Diceγ(𝐯|𝐱){\rm Dice}_{\gamma}(\mathbf{\bm{v}}|\mathbf{\bm{x}}) is an additive function with respect to jI(𝐯)j\in I(\mathbf{\bm{v}}). Therefore, maximizing Diceγ(𝐯|𝐱){\rm Dice}_{\gamma}(\mathbf{\bm{v}}|\mathbf{\bm{x}}) suffices to find the indices of top τ\tau largest sj(τ)s_{j}(\tau). Toward this end, we consider the differenced score function:

Djj(τ)\displaystyle D_{jj^{\prime}}(\tau) =sj(τ)sj(τ)=𝐲j{0,1}d1yj=12(𝐘=𝐲|𝐱)τ+𝐲1+γ𝐲j{0,1}d1yj=12(𝐘=𝐲|𝐱)τ+𝐲1+γ\displaystyle=s_{j}(\tau)-s_{j^{\prime}}(\tau)=\sum_{\begin{subarray}{c}\mathbf{\bm{y}}_{\sm j}\in\{0,1\}^{d-1}\\ y_{j}=1\end{subarray}}\frac{2\mathbb{P}(\mathbf{\bm{Y}}=\mathbf{\bm{y}}|\mathbf{\bm{x}})}{\tau+\|\mathbf{\bm{y}}\|_{1}+\gamma}-\sum_{\begin{subarray}{c}\mathbf{\bm{y}}_{\sm j^{\prime}}\in\{0,1\}^{d-1}\\ y_{j^{\prime}}=1\end{subarray}}\frac{2\mathbb{P}(\mathbf{\bm{Y}}=\mathbf{\bm{y}}|\mathbf{\bm{x}})}{\tau+\|\mathbf{\bm{y}}\|_{1}+\gamma}
=𝐲jj{0,1}d2yj=1;yj=02(𝐘=𝐲|𝐱)τ+𝐲1+γ𝐲jj{0,1}d2yj=0,yj=12(𝐘=𝐲|𝐱)τ+𝐲+γ\displaystyle=\sum_{\begin{subarray}{c}\mathbf{\bm{y}}_{\sm jj^{\prime}}\in\{0,1\}^{d-2}\\ y_{j}=1;y_{j^{\prime}}=0\end{subarray}}\frac{2\mathbb{P}(\mathbf{\bm{Y}}=\mathbf{\bm{y}}|\mathbf{\bm{x}})}{\tau+\|\mathbf{\bm{y}}\|_{1}+\gamma}-\sum_{\begin{subarray}{c}\mathbf{\bm{y}}_{\sm jj^{\prime}}\in\{0,1\}^{d-2}\\ y_{j}=0,y_{j^{\prime}}=1\end{subarray}}\frac{2\mathbb{P}(\mathbf{\bm{Y}}=\mathbf{\bm{y}}|\mathbf{\bm{x}})}{\tau+\|\mathbf{\bm{y}}\|+\gamma}
=𝐲jj{0,1}d2yj=1;yj=02i{j,j}(Yi=yi|𝐱)(Yj=1|𝐱)(Yj=0|𝐱)τ+1+𝐲jj1+γ\displaystyle=\sum_{\begin{subarray}{c}\mathbf{\bm{y}}_{\sm jj^{\prime}}\in\{0,1\}^{d-2}\\ y_{j}=1;y_{j^{\prime}}=0\end{subarray}}\frac{2\prod_{i\neq\{j,j^{\prime}\}}\mathbb{P}(Y_{i}=y_{i}|\mathbf{\bm{x}})\mathbb{P}(Y_{j}=1|\mathbf{\bm{x}})\mathbb{P}(Y_{j^{\prime}}=0|\mathbf{\bm{x}})}{\tau+1+\|\mathbf{\bm{y}}_{\sm jj^{\prime}}\|_{1}+\gamma}
𝐲jj{0,1}d2yj=0;yj=12i{j,j}(Yi=yi|𝐱)(Yj=0|𝐱)(Yj=1|𝐱)τ+1+𝐲jj1+γ\displaystyle\hskip 85.35826pt-\sum_{\begin{subarray}{c}\mathbf{\bm{y}}_{\sm jj^{\prime}}\in\{0,1\}^{d-2}\\ y_{j}=0;y_{j^{\prime}}=1\end{subarray}}\frac{2\prod_{i\neq\{j,j^{\prime}\}}\mathbb{P}(Y_{i}=y_{i}|\mathbf{\bm{x}})\mathbb{P}(Y_{j}=0|\mathbf{\bm{x}})\mathbb{P}(Y_{j^{\prime}}=1|\mathbf{\bm{x}})}{\tau+1+\|\mathbf{\bm{y}}_{\sm jj^{\prime}}\|_{1}+\gamma}
=((Yj=1|𝐱)(Yj=1|𝐱))𝐲jj{0,1}d22ij,j(Yi=yi|𝐱)τ+1+𝐲jj1+γ,\displaystyle=\big{(}\mathbb{P}(Y_{j}=1|\mathbf{\bm{x}})-\mathbb{P}(Y_{j^{\prime}}=1|\mathbf{\bm{x}})\big{)}\sum_{\begin{subarray}{c}\mathbf{\bm{y}}_{\sm jj^{\prime}}\in\{0,1\}^{d-2}\end{subarray}}\frac{2\prod_{i\neq j,j^{\prime}}\mathbb{P}(Y_{i}=y_{i}|\mathbf{\bm{x}})}{\tau+1+\|\mathbf{\bm{y}}_{\sm jj^{\prime}}\|_{1}+\gamma}, (26)

where 𝐲jj\mathbf{\bm{y}}_{\sm jj^{\prime}} is 𝐲\mathbf{\bm{y}} removing yjy_{j} and yjy_{j^{\prime}}, the second last equality follows from the conditional independence of YjY_{j} and YjY_{j^{\prime}} given 𝐗\mathbf{\bm{X}} for any pair jj and jj^{\prime}. According to (D.1), we have

Djj(τ)0(Yj=1|𝐱)(Yj=1|𝐱)0,D_{jj^{\prime}}(\tau)\geq 0\quad\iff\quad\mathbb{P}(Y_{j}=1|\mathbf{\bm{x}})-\mathbb{P}(Y_{j^{\prime}}=1|\mathbf{\bm{x}})\geq 0,

thus sorting sj(τ)s_{j}(\tau) is equivalent to sorting (Yj=1|𝐱)\mathbb{P}(Y_{j}=1|\mathbf{\bm{x}}) for any given τ\tau. Let Jτ={j:j=1d𝟏((Yj=1|𝐱)(Yj=1|𝐱))τ}J_{\tau}=\big{\{}j:\sum_{j^{\prime}=1}^{d}\mathbf{\bm{1}}\big{(}\mathbb{P}(Y_{j^{\prime}}=1|\mathbf{\bm{x}})\geq\mathbb{P}(Y_{j}=1|\mathbf{\bm{x}})\big{)}\leq\tau\big{\}} be the index set of the τ\tau-largest conditional probabilities, it suffices to solve

τ\displaystyle\tau^{*} =argmaxτ=0,,djJτ𝔼(2Yj𝐘1+τ+γ)+𝔼(γ𝐘1+τ+γ)\displaystyle=\mathop{\rm argmax}_{\tau=0,\cdots,d}\sum_{j\in J_{\tau}}\mathbb{E}\big{(}\frac{2Y_{j}}{\|\mathbf{\bm{Y}}\|_{1}+\tau+\gamma}\big{)}+\mathbb{E}\big{(}\frac{\gamma}{\|\mathbf{\bm{Y}}\|_{1}+\tau+\gamma}\big{)}
=argmaxτ=0,,djJτl=0d12pj(𝐱)τ+l+γ+1(𝐘j1=l|𝐗=𝐱)+l=0dγτ+l+γ(𝐘1=l|𝐗=𝐱),\displaystyle=\mathop{\rm argmax}_{\tau=0,\cdots,d}\sum_{j\in J_{\tau}}\sum_{l=0}^{d-1}\frac{2p_{j}(\mathbf{\bm{x}})}{\tau+l+\gamma+1}\mathbb{P}\big{(}\|\mathbf{\bm{Y}}_{\sm j}\|_{1}=l|\mathbf{\bm{X}}=\mathbf{\bm{x}}\big{)}+\sum_{l=0}^{d}\frac{\gamma}{\tau+l+\gamma}\mathbb{P}\big{(}\|\mathbf{\bm{Y}}\|_{1}=l|\mathbf{\bm{X}}=\mathbf{\bm{x}}\big{)},

where 𝐘j1=jjYj\|\mathbf{\bm{Y}}_{\sm j}\|_{1}=\sum_{j^{\prime}\neq j}Y_{j^{\prime}} is a Poisson-binomial random variable with success probabilities 𝐩j(𝐱)\mathbf{\bm{p}}_{\sm j}(\mathbf{\bm{x}}), since YjY_{j} (j=1,,dj=1,\cdots,d) is an independent Bernoulli random variable given 𝐗=𝐱\mathbf{\bm{X}}=\mathbf{\bm{x}}. The desirable result then follows.  

D.2 Proof of Lemma 3

Proof  To proceed, we denote ξl(𝐱)=(Γ^(𝐱)=l)\xi_{l}(\mathbf{\bm{x}})=\mathbb{P}(\widehat{\Gamma}(\mathbf{\bm{x}})=l), and ξjl(𝐱)=(Γ^j(𝐱)=l)\xi_{jl}(\mathbf{\bm{x}})=\mathbb{P}(\widehat{\Gamma}_{\sm j}(\mathbf{\bm{x}})=l), and π¯τ=ω¯τ(𝐱)+ν¯τ(𝐱)\mkern 1.5mu\overline{\mkern-1.5mu\pi\mkern-1.5mu}\mkern 1.5mu_{\tau}=\mkern 1.5mu\overline{\mkern-1.5mu\omega\mkern-1.5mu}\mkern 1.5mu_{\tau}(\mathbf{\bm{x}})+\mkern 1.5mu\overline{\mkern-1.5mu\nu\mkern-1.5mu}\mkern 1.5mu_{\tau}(\mathbf{\bm{x}}). For simplicity in presentation, we assume that q^1(𝐱)q^d(𝐱)\widehat{q}_{1}(\mathbf{\bm{x}})\geq\cdots\geq\widehat{q}_{d}(\mathbf{\bm{x}}). Then, for any τ>τ\tau^{\prime}>\tau, since ν¯τν¯τ\mkern 1.5mu\overline{\mkern-1.5mu\nu\mkern-1.5mu}\mkern 1.5mu_{\tau}\geq\mkern 1.5mu\overline{\mkern-1.5mu\nu\mkern-1.5mu}\mkern 1.5mu_{\tau^{\prime}}, and

π¯τ(𝐱)π¯τ(𝐱)2(ττ)\displaystyle\frac{\mkern 1.5mu\overline{\mkern-1.5mu\pi\mkern-1.5mu}\mkern 1.5mu_{\tau}(\mathbf{\bm{x}})-\mkern 1.5mu\overline{\mkern-1.5mu\pi\mkern-1.5mu}\mkern 1.5mu_{\tau^{\prime}}(\mathbf{\bm{x}})}{2(\tau^{\prime}-\tau)} j=1τq^j(𝐱)l=0d1ξjl(𝐱)(τ+l+1+γ)(τ+l+1+γ)1ττj=τ+1τq^j(𝐱)l=0d1ξjl(𝐱)τ+l+γ+1\displaystyle\geq\sum_{j=1}^{\tau}\widehat{q}_{j}(\mathbf{\bm{x}})\sum_{l=0}^{d-1}\frac{\xi_{jl}(\mathbf{\bm{x}})}{(\tau+l+1+\gamma)(\tau^{\prime}+l+1+\gamma)}-\frac{1}{\tau^{\prime}-\tau}\sum_{j=\tau+1}^{\tau^{\prime}}\widehat{q}_{j}(\mathbf{\bm{x}})\sum_{l=0}^{d-1}\frac{\xi_{jl}(\mathbf{\bm{x}})}{\tau^{\prime}+l+\gamma+1}
j=1τq^j(𝐱)l=0d1ξτl(𝐱)(τ+l+γ+1)(τ+l+γ+1)q^τ+1(𝐱)l=0d1ξτl(𝐱)τ+l+γ+1\displaystyle\geq\sum_{j=1}^{\tau}\widehat{q}_{j}(\mathbf{\bm{x}})\sum_{l=0}^{d-1}\frac{\xi_{\tau l}(\mathbf{\bm{x}})}{(\tau+l+\gamma+1)(\tau^{\prime}+l+\gamma+1)}-\widehat{q}_{\tau+1}(\mathbf{\bm{x}})\sum_{l=0}^{d-1}\frac{\xi_{\tau l}(\mathbf{\bm{x}})}{\tau^{\prime}+l+\gamma+1}
q^τ+1(𝐱)l=0d1(τ+γ+d)ξτl(𝐱)(τ+l+γ+1)(τ+l+γ+1)q^τ+1(𝐱)l=0d1ξτl(𝐱)τ+l+γ+10,\displaystyle\geq\widehat{q}_{\tau+1}(\mathbf{\bm{x}})\sum_{l=0}^{d-1}\frac{(\tau+\gamma+d)\xi_{\tau l}(\mathbf{\bm{x}})}{(\tau+l+\gamma+1)(\tau^{\prime}+l+\gamma+1)}-\widehat{q}_{\tau+1}(\mathbf{\bm{x}})\sum_{l=0}^{d-1}\frac{\xi_{\tau l}(\mathbf{\bm{x}})}{\tau^{\prime}+l+\gamma+1}\geq 0,

where the second inequality follows from Lemma 17 with ζl=(τ+l+γ+1)(τ+l+γ+1)\zeta_{l}=(\tau+l+\gamma+1)(\tau^{\prime}+l+\gamma+1) and ζl=τ+l+γ+1\zeta_{l}=\tau^{\prime}+l+\gamma+1, and q^1(𝐱)q^τ(𝐱)q^τ(𝐱)\widehat{q}_{1}(\mathbf{\bm{x}})\geq\cdots\geq\widehat{q}_{\tau}(\mathbf{\bm{x}})\geq\cdots\geq\widehat{q}_{\tau^{\prime}}(\mathbf{\bm{x}}), and the third inequality follows from the condition that j=1τq^j(𝐱)/q^τ+1(𝐱)τ+γ+d\sum_{j=1}^{\tau}\widehat{q}_{j}(\mathbf{\bm{x}})/\widehat{q}_{\tau+1}(\mathbf{\bm{x}})\geq\tau+\gamma+d. The desirable result then follows. The upper bound provided by Lemma 3 is illustrated in Figure 8 based on a random example of Example 1.

Refer to caption
Figure 8: Dice score vs. τ\tau based on a random example of Example 1. Note that Lemma 3 indicates the optimal τ\tau (red line) is always obtained before the upper bound (green line). Thus, the searching region of τ\tau can be shrunk.

 

D.3 Proof of Lemma 4

Proof  Without loss of generality, assume (ϵ){0,,d1}\mathcal{L}(\epsilon)\subset\{0,\cdots,d-1\}. Denote lL=σ^(𝐱)Ψ1(ϵ)+μ^(𝐱)1l_{L}=\lfloor\widehat{\sigma}(\mathbf{\bm{x}})\Psi^{-1}(\epsilon)+\widehat{\mu}(\mathbf{\bm{x}})\rfloor-1 and lU=σ^(𝐱)Ψ1(ϵ)+μ^(𝐱)l_{U}=\lceil-\widehat{\sigma}(\mathbf{\bm{x}})\Psi^{-1}(\epsilon)+\widehat{\mu}(\mathbf{\bm{x}})\rceil, ξl=(Γ^(𝐱)=l)\xi_{l}=\mathbb{P}\big{(}\widehat{\Gamma}(\mathbf{\bm{x}})=l\big{)}, ξ~l=~(Γ^(𝐱)=l)\widetilde{\xi}_{l}=\widetilde{\mathbb{P}}\big{(}\widehat{\Gamma}(\mathbf{\bm{x}})=l\big{)}, ξj,l=(Γ^j(𝐱)=l)\xi_{\sm j,l}=\mathbb{P}\big{(}\widehat{\Gamma}_{\sm j}(\mathbf{\bm{x}})=l\big{)}, ξ~j,l=~(Γ^j(𝐱)=l)\widetilde{\xi}_{\sm j,l}=\widetilde{\mathbb{P}}\big{(}\widehat{\Gamma}_{\sm j}(\mathbf{\bm{x}})=l\big{)}, we treat |ω~τω¯τ|\big{|}\widetilde{\omega}_{\tau}-\mkern 1.5mu\overline{\mkern-1.5mu\omega\mkern-1.5mu}\mkern 1.5mu_{\tau}\big{|} and |ν~τν¯τ|\big{|}\widetilde{\nu}_{\tau}-\mkern 1.5mu\overline{\mkern-1.5mu\nu\mkern-1.5mu}\mkern 1.5mu_{\tau}\big{|} separately. First,

|ω~τω¯τ|\displaystyle\big{|}\widetilde{\omega}_{\tau}-\mkern 1.5mu\overline{\mkern-1.5mu\omega\mkern-1.5mu}\mkern 1.5mu_{\tau}\big{|} l>lU2τ+l+γ+1ωτ,l+0l<lL2τ+l+γ+1ωτ,l+lLl<lU2τ+l+γ+1|ωτ,lω~τ,l|\displaystyle\leq\sum_{l>l_{U}}\frac{2}{\tau+l+\gamma+1}\omega_{\tau,l}+\sum_{0\leq l<l_{L}}\frac{2}{\tau+l+\gamma+1}\omega_{\tau,l}+\sum_{l_{L}\leq l<l_{U}}\frac{2}{\tau+l+\gamma+1}\Big{|}\omega_{\tau,l}-\widetilde{\omega}_{\tau,l}\Big{|}
=:S1+S2+S3.\displaystyle=:S_{1}+S_{2}+S_{3}.

Next, we turn to bound S1S_{1} - S3S_{3} separately.

S1\displaystyle S_{1} =l>lU2τ+l+γ+1s=1τq^js(𝐱)(Γ^js(𝐱)=l)2τ+lU+γ+1s=1τl>lU(Γ^js(𝐱)=l)\displaystyle=\sum_{l>l_{U}}\frac{2}{\tau+l+\gamma+1}\sum_{s=1}^{\tau}\widehat{q}_{j_{s}}(\mathbf{\bm{x}})\mathbb{P}\big{(}\widehat{\Gamma}_{\sm j_{s}}(\mathbf{\bm{x}})=l\big{)}\leq\frac{2}{\tau+l_{U}+\gamma+1}\sum_{s=1}^{\tau}\sum_{l>l_{U}}\mathbb{P}\big{(}\widehat{\Gamma}_{\sm j_{s}}(\mathbf{\bm{x}})=l\big{)}
2τ+lU+γ+1s=1τ(1(Γ^js(𝐱)lU))2ττ+lU+γ+1(Γ^(𝐱)>lU),\displaystyle\leq\frac{2}{\tau+l_{U}+\gamma+1}\sum_{s=1}^{\tau}\big{(}1-\mathbb{P}\big{(}\widehat{\Gamma}_{\sm j_{s}}(\mathbf{\bm{x}})\leq l_{U}\big{)}\big{)}\leq\frac{2\tau}{\tau+l_{U}+\gamma+1}\mathbb{P}\big{(}\widehat{\Gamma}(\mathbf{\bm{x}})>l_{U}\big{)},

where the last inequality follows from Lemma 16. For S2S_{2}, we have

S2\displaystyle S_{2} 2τ+γ+1s=1τ0l<lL(Γ^js(𝐱)=l)2ττ+γ+1(Γ^(𝐱)lL+1),\displaystyle\leq\frac{2}{\tau+\gamma+1}\sum_{s=1}^{\tau}\sum_{0\leq l<l_{L}}\mathbb{P}\big{(}\widehat{\Gamma}_{\sm j_{s}}(\mathbf{\bm{x}})=l\big{)}\leq\frac{2\tau}{\tau+\gamma+1}\mathbb{P}\big{(}\widehat{\Gamma}(\mathbf{\bm{x}})\leq l_{L}+1\big{)},

where the last inequality follows from Lemma 16. Next, according to Theorem 1.1 in Neammanee (2005),

(Γ^(𝐱)lL+1)(ZΨ1(ϵ))+C0σ^2(𝐱)=ϵ+C0σ^2(𝐱),\mathbb{P}\big{(}\widehat{\Gamma}(\mathbf{\bm{x}})\leq l_{L}+1\big{)}\leq\mathbb{P}\big{(}Z\leq\Psi^{-1}(\epsilon)\big{)}+\frac{C_{0}}{\widehat{\sigma}^{2}(\mathbf{\bm{x}})}=\epsilon+\frac{C_{0}}{\widehat{\sigma}^{2}(\mathbf{\bm{x}})},

and

(Γ^(𝐱)>lU)(ZΨ1(1ϵ))+C0σ^2(𝐱)=ϵ+C0σ^2(𝐱),\mathbb{P}\big{(}\widehat{\Gamma}(\mathbf{\bm{x}})>l_{U}\big{)}\leq\mathbb{P}\big{(}Z\geq\Psi^{-1}(1-\epsilon)\big{)}+\frac{C_{0}}{\widehat{\sigma}^{2}(\mathbf{\bm{x}})}=\epsilon+\frac{C_{0}}{\widehat{\sigma}^{2}(\mathbf{\bm{x}})},

where ZZ is a random variable following the refined normal distribution. For S3S_{3},

S3\displaystyle S_{3} l=0d12τ+l+γ+1s=1τ𝐪^js(𝐱)|ξ~j,lξj,l|s=1τ𝐪^js(𝐱)l=0d12τ+l+γ+1maxj=1,,τ|ξ~j,lξj,l|\displaystyle\leq\sum_{l=0}^{d-1}\frac{2}{\tau+l+\gamma+1}\sum_{s=1}^{\tau}\widehat{\mathbf{\bm{q}}}_{j_{s}}(\mathbf{\bm{x}})|\widetilde{\xi}_{\sm j,l}-\xi_{\sm j,l}|\leq\sum_{s=1}^{\tau}\widehat{\mathbf{\bm{q}}}_{j_{s}}(\mathbf{\bm{x}})\sum_{l=0}^{d-1}\frac{2}{\tau+l+\gamma+1}\max_{j=1,\cdots,\tau}|\widetilde{\xi}_{\sm j,l}-\xi_{\sm j,l}|
min(μ^(𝐱),τ)l=0d12τ+l+γ+1maxs=1,,τC0σ^2(𝐱)q^js(𝐱)(1q^js(𝐱))\displaystyle\leq\min(\widehat{\mu}(\mathbf{\bm{x}}),\tau)\sum_{l=0}^{d-1}\frac{2}{\tau+l+\gamma+1}\max_{s=1,\cdots,\tau}\frac{C_{0}}{\widehat{\sigma}^{2}(\mathbf{\bm{x}})-\widehat{q}_{j_{s}}(\mathbf{\bm{x}})\big{(}1-\widehat{q}_{j_{s}}(\mathbf{\bm{x}})\big{)}}
2C0min(μ^(𝐱),τ)σ^2(𝐱)1/4l=1d1τ+l+γ=2C0min(μ^(𝐱),τ)σ^2(𝐱)1/4(Hτ+d+γHτ+γ)\displaystyle\leq\frac{2C_{0}\min(\widehat{\mu}(\mathbf{\bm{x}}),\tau)}{\widehat{\sigma}^{2}(\mathbf{\bm{x}})-1/4}\sum_{l=1}^{d}\frac{1}{\tau+l+\gamma}=\frac{2C_{0}\min(\widehat{\mu}(\mathbf{\bm{x}}),\tau)}{\widehat{\sigma}^{2}(\mathbf{\bm{x}})-1/4}\big{(}H_{\tau+d+\gamma}-H_{\tau+\gamma}\big{)}
2C0min(μ^(𝐱),τ)σ^2(𝐱)1/4(log(1+dτ+γ)+11τ+γ),\displaystyle\leq\frac{2C_{0}\min(\widehat{\mu}(\mathbf{\bm{x}}),\tau)}{\widehat{\sigma}^{2}(\mathbf{\bm{x}})-1/4}\Big{(}\log\big{(}1+\frac{d}{\tau+\gamma}\big{)}+1-\frac{1}{\tau+\gamma}\Big{)},

where HK=k=1K1/kH_{K}=\sum_{k=1}^{K}1/k is the harmonic number, and the last inequality follows from the fact that log(K)+1/KHKlog(K)+1\log(K)+1/K\leq H_{K}\leq\log(K)+1. Taken together,

|ω~τω¯τ|4ττ+γ+1(ϵ+C0σ^2(𝐱))+2C0min(μ^(𝐱),τ)σ^2(𝐱)1/4(log(1+dτ+γ)+τ+γ1τ+γ).\big{|}\widetilde{\omega}_{\tau}-\mkern 1.5mu\overline{\mkern-1.5mu\omega\mkern-1.5mu}\mkern 1.5mu_{\tau}\big{|}\leq\frac{4\tau}{\tau+\gamma+1}\big{(}\epsilon+\frac{C_{0}}{\widehat{\sigma}^{2}(\mathbf{\bm{x}})}\big{)}+\frac{2C_{0}\min(\widehat{\mu}(\mathbf{\bm{x}}),\tau)}{\widehat{\sigma}^{2}(\mathbf{\bm{x}})-1/4}\Big{(}\log\big{(}1+\frac{d}{\tau+\gamma}\big{)}+\frac{\tau+\gamma-1}{\tau+\gamma}\Big{)}.

Similarly, for |ν~τν¯τ||\widetilde{\nu}_{\tau}-\mkern 1.5mu\overline{\mkern-1.5mu\nu\mkern-1.5mu}\mkern 1.5mu_{\tau}|,

|ν~τν¯τ|2γτ+γϵ+γC0σ^2(𝐱)(log(1+dτ+γ1)+11τ+γ1).\displaystyle|\widetilde{\nu}_{\tau}-\mkern 1.5mu\overline{\mkern-1.5mu\nu\mkern-1.5mu}\mkern 1.5mu_{\tau}|\leq\frac{2\gamma}{\tau+\gamma}\epsilon+\frac{\gamma C_{0}}{\widehat{\sigma}^{2}(\mathbf{\bm{x}})}\Big{(}\log\big{(}1+\frac{d}{\tau+\gamma-1}\big{)}+1-\frac{1}{\tau+\gamma-1}\Big{)}.

This completes the proof.  

D.4 Proof of Lemma 5

Proof  With the same argument in the proof of Lemma 4, it suffices to consider

|~(Γ^(𝐱)=l)~(Γ^j(𝐱)=l)|l=l1l|~(Γ^(𝐱)l)~(Γ^j(𝐱)l)|.\displaystyle\big{|}\widetilde{\mathbb{P}}(\widehat{\Gamma}(\mathbf{\bm{x}})=l)-\widetilde{\mathbb{P}}(\widehat{\Gamma}_{\sm j}(\mathbf{\bm{x}})=l)\big{|}\leq\sum_{l^{\prime}=l-1}^{l}\big{|}\widetilde{\mathbb{P}}(\widehat{\Gamma}(\mathbf{\bm{x}})\leq l^{\prime})-\widetilde{\mathbb{P}}(\widehat{\Gamma}_{\sm j}(\mathbf{\bm{x}})\leq l^{\prime})\big{|}.

Denote I=σ^(𝐱)1(l+1/2μ^(𝐱))I=\widehat{\sigma}(\mathbf{\bm{x}})^{-1}(l+1/2-\widehat{\mu}(\mathbf{\bm{x}})) and Ij=σ^j(𝐱)1(l+1/2μ^j(𝐱))I_{\sm j}=\widehat{\sigma}_{\sm j}(\mathbf{\bm{x}})^{-1}(l+1/2-\widehat{\mu}_{\sm j}(\mathbf{\bm{x}})), we have

|~(Γ^(𝐱)l)\displaystyle\big{|}\widetilde{\mathbb{P}}(\widehat{\Gamma}(\mathbf{\bm{x}})\leq l) ~(Γ^j(𝐱)l)|=|Ψ(I)Ψj(Ij)|\displaystyle-\widetilde{\mathbb{P}}(\widehat{\Gamma}_{\sm j}(\mathbf{\bm{x}})\leq l)\big{|}=\big{|}\Psi(I)-\Psi_{\sm j}(I_{\sm j})\big{|}
|Φ(I)Φ(Ij)|+16|η^(𝐱)(1I2)ϕ(I)η^j(𝐱)(1Ij2)ϕ(Ij)|=:T1+16T2.\displaystyle\leq\big{|}\Phi(I)-\Phi(I_{\sm j})\big{|}+\frac{1}{6}\big{|}\widehat{\eta}(\mathbf{\bm{x}})(1-I^{2})\phi(I)-\widehat{\eta}_{\sm j}(\mathbf{\bm{x}})(1-I_{\sm j}^{2})\phi(I_{\sm j})\big{|}=:T_{1}+\frac{1}{6}T_{2}.

Next, we turn to treat T1T_{1} and T2T_{2} separately. Without loss generalization, we assume |Ij||I||I_{\sm j}|\geq|I|, then

T1\displaystyle T_{1} |IIjϕ(𝐱)𝑑𝐱||IjI|ϕ(I)\displaystyle\leq\big{|}\int_{I}^{I_{\sm j}}\phi(\mathbf{\bm{x}})d\mathbf{\bm{x}}\big{|}\leq|I_{\sm j}-I|\phi(I)
=(|σ^1(𝐱)σ^j1(𝐱)||l+1/2μ^(𝐱)|+σ^j1(𝐱)|μ^(𝐱)μ^j(𝐱)|)ϕ(I)\displaystyle=\Big{(}|\widehat{\sigma}^{-1}(\mathbf{\bm{x}})-\widehat{\sigma}_{\sm j}^{-1}(\mathbf{\bm{x}})|\big{|}l+1/2-\widehat{\mu}(\mathbf{\bm{x}})\big{|}+\widehat{\sigma}_{\sm j}^{-1}(\mathbf{\bm{x}})\big{|}\widehat{\mu}(\mathbf{\bm{x}})-\widehat{\mu}_{\sm j}(\mathbf{\bm{x}})\big{|}\Big{)}\phi(I)
=σ^j1(𝐱)(σ^(𝐱)σ^j(𝐱))|I|ϕ(I)+σ^j1(𝐱)q^j(𝐱)ϕ(I)\displaystyle=\widehat{\sigma}^{-1}_{\sm j}(\mathbf{\bm{x}})(\widehat{\sigma}(\mathbf{\bm{x}})-\widehat{\sigma}_{\sm j}(\mathbf{\bm{x}}))|I|\phi(I)+\widehat{\sigma}_{\sm j}^{-1}(\mathbf{\bm{x}})\widehat{q}_{j}(\mathbf{\bm{x}})\phi(I)
q^j(𝐱)(1q^j(𝐱))σ^j(𝐱)(σ^j(𝐱)+σ^(𝐱))|I|ϕ(I)+12πσ^j(𝐱)\displaystyle\leq\frac{\widehat{q}_{j}(\mathbf{\bm{x}})(1-\widehat{q}_{j}(\mathbf{\bm{x}}))}{\widehat{\sigma}_{\sm j}(\mathbf{\bm{x}})\big{(}\widehat{\sigma}_{\sm j}(\mathbf{\bm{x}})+\widehat{\sigma}(\mathbf{\bm{x}})\big{)}}|I|\phi(I)+\frac{1}{\sqrt{2\pi}\widehat{\sigma}_{\sm j}(\mathbf{\bm{x}})}
142π(12e(σ^2(𝐱)1/4)+4σ^2(𝐱)1/4),\displaystyle\leq\frac{1}{4\sqrt{2\pi}}\Big{(}\frac{1}{2\sqrt{e}(\widehat{\sigma}^{2}(\mathbf{\bm{x}})-1/4)}+\frac{4}{\sqrt{\widehat{\sigma}^{2}(\mathbf{\bm{x}})-1/4}}\Big{)},

where the last inequality follows the fact that |u|ϕ(u)1/2eπ|u|\phi(u)\leq 1/\sqrt{2e\pi} and 0ϕ(u)1/2π0\leq\phi(u)\leq 1/\sqrt{2\pi}. For T2T_{2}, let g(u)=(1u2)ϕ(u)g(u)=(1-u^{2})\phi(u), we have

T2\displaystyle T_{2} |η^(𝐱)(1I2)ϕ(I)|+|η^j(𝐱)(1Ij2)ϕ(Ij)|12πm^3(𝐱)(σ^2(𝐱)1/4)3/2,\displaystyle\leq\big{|}\widehat{\eta}(\mathbf{\bm{x}})(1-I^{2})\phi(I)\big{|}+\big{|}\widehat{\eta}_{\sm j}(\mathbf{\bm{x}})(1-I_{\sm j}^{2})\phi(I_{\sm j})\big{|}\leq\frac{1}{\sqrt{2\pi}}\frac{\widehat{m}_{3}(\mathbf{\bm{x}})}{(\widehat{\sigma}^{2}(\mathbf{\bm{x}})-1/4)^{3/2}},

where the last inequality follows from the fact that |g(u)|1/2π|g(u)|\leq 1/\sqrt{2\pi}, and m^3(𝐱)=j=1dp^j(𝐱)(1p^j(𝐱))(12p^j(𝐱))\widehat{m}_{3}(\mathbf{\bm{x}})=\sum_{j=1}^{d}\widehat{p}_{j}(\mathbf{\bm{x}})(1-\widehat{p}_{j}(\mathbf{\bm{x}}))(1-2\widehat{p}_{j}(\mathbf{\bm{x}})). Taken together, using the same argument in the proof of Lemma 4, we have

|ω~τbω~τ|142π(1/(2e)σ^2(𝐱)1/4+4σ^2(𝐱)1/4+4m^3(𝐱)(σ^2(𝐱)1/4)3/2)(log(1+dτ+γ)+1).\displaystyle\big{|}\widetilde{\omega}^{b}_{\tau}-\widetilde{\omega}_{\tau}\big{|}\leq\frac{1}{4\sqrt{2\pi}}\Big{(}\frac{1/(2\sqrt{e})}{\widehat{\sigma}^{2}(\mathbf{\bm{x}})-1/4}+\frac{4}{\sqrt{\widehat{\sigma}^{2}(\mathbf{\bm{x}})-1/4}}+\frac{4\widehat{m}_{3}(\mathbf{\bm{x}})}{(\widehat{\sigma}^{2}(\mathbf{\bm{x}})-1/4)^{3/2}}\Big{)}\big{(}\log\big{(}1+\frac{d}{\tau+\gamma}\big{)}+1\big{)}.

Combining Lemma 4, the desirable result then follows.  

D.5 Proof of Lemma 6

Proof  Denote 𝒦+={1kK:αk>0}\mathcal{K}_{+}=\{1\leq k\leq K:\alpha_{k}>0\}. We first prove the necessity. Suppose 𝚫k\boldsymbol{\Delta}^{*}_{k} is a global minimizer of Dicek(){{\rm Dice}}_{k}(\cdot), for k𝒦+k\in\mathcal{K}_{+}. Then for any 𝚫=(𝚫1,,𝚫K)\boldsymbol{\Delta}=(\boldsymbol{\Delta}_{1},\cdots,\boldsymbol{\Delta}_{K}), we have

mDiceγ(𝚫)=k𝒦+αkDiceγ,k(𝚫k)k𝒦+αkDiceγ,k(𝚫k),{\rm mDice}_{\gamma}(\boldsymbol{\Delta}^{*})=\sum_{k\in\mathcal{K}_{+}}\alpha_{k}{{\rm Dice}}_{\gamma,k}(\boldsymbol{\Delta}^{*}_{k})\leq\sum_{k\in\mathcal{K}_{+}}\alpha_{k}{{\rm Dice}}_{\gamma,k}(\boldsymbol{\Delta}_{k}),

yields that 𝚫\boldsymbol{\Delta}^{*} is a global minimizer of mDiceγ(){\rm mDice}_{\gamma}(\cdot). We next prove the sufficiency by contradiction. Suppose 𝚫\boldsymbol{\Delta}^{*} is a global minimizer of mDiceγ(){\rm mDice}_{\gamma}(\cdot), yet there exists k0𝒦+k_{0}\in\mathcal{K}_{+} such that 𝚫k0\boldsymbol{\Delta}_{k_{0}}^{*} is not a minimizer of Diceγ,k0(){{\rm Dice}}_{\gamma,k_{0}}(\cdot). Thus, there exists a segmentation rule 𝚫~\widetilde{\boldsymbol{\Delta}} such that Diceγ,k0(𝚫~)<Diceγ,k0(𝚫k0){{\rm Dice}}_{\gamma,k_{0}}(\widetilde{\boldsymbol{\Delta}})<{{\rm Dice}}_{\gamma,k_{0}}(\boldsymbol{\Delta}_{k_{0}}^{*}), then let 𝚫~=(𝚫1,,𝚫k01,𝚫~k0,𝚫k0+1,,𝚫K)\widetilde{\boldsymbol{\Delta}}=(\boldsymbol{\Delta}^{*}_{1},\cdots,\boldsymbol{\Delta}^{*}_{k_{0}-1},\widetilde{\boldsymbol{\Delta}}_{k_{0}},\boldsymbol{\Delta}^{*}_{k_{0}+1},\cdots,\boldsymbol{\Delta}^{*}_{K})

mDiceγ(𝚫)=k𝒦+αkDiceγ,k(𝚫k)>k𝒦+αkDiceγ,k(𝚫~)=mDiceγ(𝚫~),{\rm mDice}_{\gamma}(\boldsymbol{\Delta}^{*})=\sum_{k\in\mathcal{K}_{+}}\alpha_{k}{{\rm Dice}}_{\gamma,k}(\boldsymbol{\Delta}^{*}_{k})>\sum_{k\in\mathcal{K}_{+}}\alpha_{k}{{\rm Dice}}_{\gamma,k}(\widetilde{\boldsymbol{\Delta}})={\rm mDice}_{\gamma}(\widetilde{\boldsymbol{\Delta}}),

which leads to contradiction of that 𝚫\boldsymbol{\Delta}^{*} is a global minimizer of mDiceγ(){\rm mDice}_{\gamma}(\cdot). The desirable result then follows.  

D.6 Proof of Lemma 7

Proof  Given a segmentation rule 𝚫(𝐗)=(𝚫1(𝐗),,𝚫K(𝐗))\boldsymbol{\Delta}(\mathbf{\bm{X}})=(\boldsymbol{\Delta}_{1}(\mathbf{\bm{X}}),\cdots,\boldsymbol{\Delta}_{K}(\mathbf{\bm{X}})), mDiceγ(){\rm mDice}_{\gamma}(\cdot) can be rewritten as

mDiceγ(𝚫)=k=1KαkDiceγ,k(𝚫k(𝐗)).{\rm mDice}_{\gamma}(\boldsymbol{\Delta})=\sum_{k=1}^{K}\alpha_{k}{{\rm Dice}}_{\gamma,k}(\boldsymbol{\Delta}_{k}(\mathbf{\bm{X}})).

Similarly, it is equivalent to consider the point-wise minimization conditional on 𝐗=𝐱\mathbf{\bm{X}}=\mathbf{\bm{x}}:

𝚫(𝐱)\displaystyle\boldsymbol{\Delta}^{*}(\mathbf{\bm{x}}) =argmax𝐕{0,1}d×KmDiceγ(𝐕|𝐱),s.t.k=1K𝐕k=𝟏d,\displaystyle=\mathop{\rm argmax}_{\mathbf{\bm{V}}\in\{0,1\}^{d\times K}}\ {\rm mDice}_{\gamma}(\mathbf{\bm{V}}|\mathbf{\bm{x}}),\quad\text{s.t.}\sum_{k=1}^{K}\mathbf{\bm{V}}_{k}=\mathbf{\bm{1}}_{d},\quad
mDiceγ(𝐕|𝐱)=k=1KαkDiceγ,k(𝐕k|𝐱),\displaystyle{\rm mDice}_{\gamma}(\mathbf{\bm{V}}|\mathbf{\bm{x}})=\sum_{k=1}^{K}\alpha_{k}{{\rm Dice}}_{\gamma,k}(\mathbf{\bm{V}}_{k}|\mathbf{\bm{x}}),

where 𝐕k\mathbf{\bm{V}}_{k} is the kk-th column of 𝐕\mathbf{\bm{V}}. According to (D.1), we have

mDiceγ(𝐕|𝐱)\displaystyle{\rm mDice}_{\gamma}(\mathbf{\bm{V}}|\mathbf{\bm{x}}) =k=1KαkjI(𝐕k)𝐲j,k{0,1}d1yj,k=12(𝐘k=𝐲k|𝐱)τk+𝐲k1+γ+k=1Kαk𝔼(γ𝐘k1+τk+γ)\displaystyle=\sum_{k=1}^{K}\alpha_{k}\sum_{j\in I({\mathbf{\bm{V}}_{k}})}\sum_{\begin{subarray}{c}\mathbf{\bm{y}}_{\sm j,k}\in\{0,1\}^{d-1}\\ y_{j,k}=1\end{subarray}}\frac{2\mathbb{P}(\mathbf{\bm{Y}}_{k}=\mathbf{\bm{y}}_{k}|\mathbf{\bm{x}})}{\tau_{k}+\|\mathbf{\bm{y}}_{k}\|_{1}+\gamma}+\sum_{k=1}^{K}\alpha_{k}\mathbb{E}\Big{(}\frac{\gamma}{\|\mathbf{\bm{Y}}_{\cdot k}\|_{1}+\tau_{k}+\gamma}\Big{)}
=k=1Kj=1dRjk(τk)vjk+k=1Kαkν¯(τk),\displaystyle=\sum_{k=1}^{K}\sum_{j=1}^{d}R_{jk}(\tau_{k})v_{jk}+\sum_{k=1}^{K}\alpha_{k}\mkern 1.5mu\overline{\mkern-1.5mu\nu\mkern-1.5mu}\mkern 1.5mu(\tau_{k}),

where vjk{0,1}v_{jk}\in\{0,1\} is the segmentation indicator of the jj-th feature under the class-kk, and Rjk()R_{jk}(\cdot) is a reward function defined as:

Rjk(τk)=αk𝐲j,k{0,1}d1yj,k=12(𝐘k=𝐲k|𝐱)τk+𝐲k1.R_{jk}(\tau_{k})=\alpha_{k}\sum_{\begin{subarray}{c}\mathbf{\bm{y}}_{\sm j,k}\in\{0,1\}^{d-1}\\ y_{j,k}=1\end{subarray}}\frac{2\mathbb{P}(\mathbf{\bm{Y}}_{k}=\mathbf{\bm{y}}_{k}|\mathbf{\bm{x}})}{\tau_{k}+\|\mathbf{\bm{y}}_{k}\|_{1}}.

Now, suppose the optimal volume function 𝝉(𝐱)=(τ1(𝐱),,τK(𝐱))\boldsymbol{\tau}^{*}(\mathbf{\bm{x}})=(\tau_{1}^{*}(\mathbf{\bm{x}}),\cdots,\tau_{K}^{*}(\mathbf{\bm{x}}))^{\intercal} is given, then ν¯(τk)\mkern 1.5mu\overline{\mkern-1.5mu\nu\mkern-1.5mu}\mkern 1.5mu(\tau^{*}_{k}) becomes a constant, and the point-wise maximization on mDiceγ{\rm mDice}_{\gamma} is equivalent to:

max𝐕{0,1}d×K\displaystyle\max_{\mathbf{\bm{V}}\in\{0,1\}^{d\times K}} k=1Kj=1dRjkvjk,\displaystyle\quad\sum_{k=1}^{K}\sum_{j=1}^{d}R^{*}_{jk}v_{jk},
subject to j=1dvjk=τk(𝐱),k=1Kvjk=1,\displaystyle\quad\sum_{j=1}^{d}v_{jk}=\tau^{*}_{k}(\mathbf{\bm{x}}),\quad\sum_{k=1}^{K}v_{jk}=1,
for k=1,,K;for j=1,,d,\displaystyle\text{for }k=1,\cdots,K;\quad\text{for }j=1,\cdots,d, (28)

where Rjk=Rjk(τk(𝐱))R^{*}_{jk}=R_{jk}(\tau^{*}_{k}(\mathbf{\bm{x}})) is the reward under τk(𝐱)\tau_{k}^{*}(\mathbf{\bm{x}}). Note that (D.6) is the formulation for the assignment problem (Kuhn, 1955). This completes the proof.  

D.7 Proof of Lemma 9

Proof  For simplicity, we construct a counter example based on γ=0\gamma=0 and d=2d=2, that is, 𝐘=(Y1,Y2)\mathbf{\bm{Y}}=(Y_{1},Y_{2})^{\intercal} and 𝐩(𝐱)=(p1(𝐱),p2(𝐱))\mathbf{\bm{p}}(\mathbf{\bm{x}})=(p_{1}(\mathbf{\bm{x}}),p_{2}(\mathbf{\bm{x}}))^{\intercal}. Without loss generality, we assume p1(𝐱)>p2(𝐱)p_{1}(\mathbf{\bm{x}})>p_{2}(\mathbf{\bm{x}}).

First, we derive the Bayes rule in Theorem 1 for this case. Note that it suffices to compare the scores for τ=1\tau=1 (𝐯=(1,0)\mathbf{\bm{v}}=(1,0)^{\intercal}) and τ=2\tau=2 (𝐯=(1,1))(\mathbf{\bm{v}}=(1,1)^{\intercal}).

Dice((1,0)|𝐱)=p1(𝐱)13p1(𝐱)p2(𝐱),Dice((1,1)|𝐱)=23(p1(𝐱)+p2(𝐱))13p1(𝐱)p2(𝐱).\displaystyle{\rm Dice}((1,0)^{\intercal}|\mathbf{\bm{x}})=p_{1}(\mathbf{\bm{x}})-\frac{1}{3}p_{1}(\mathbf{\bm{x}})p_{2}(\mathbf{\bm{x}}),\quad{\rm Dice}((1,1)^{\intercal}|\mathbf{\bm{x}})=\frac{2}{3}(p_{1}(\mathbf{\bm{x}})+p_{2}(\mathbf{\bm{x}}))-\frac{1}{3}p_{1}(\mathbf{\bm{x}})p_{2}(\mathbf{\bm{x}}).

Therefore, the Bayes rule for Dice optimal segmentation is:

𝜹(𝐱)=(1,0), if 12p1(𝐱)>p2(𝐱),𝜹(𝐱)=(1,1), otherwise.\boldsymbol{\delta}^{*}(\mathbf{\bm{x}})=(1,0)^{\intercal},\text{ if }\frac{1}{2}p_{1}(\mathbf{\bm{x}})>p_{2}(\mathbf{\bm{x}}),\quad\boldsymbol{\delta}^{*}(\mathbf{\bm{x}})=(1,1)^{\intercal},\text{ otherwise}.

Now, we check the Dice-calibrated for classification-calibrated losses. For example, p1(𝐱)=0.45p_{1}(\mathbf{\bm{x}})=0.45 and p2(𝐱)=0.44p_{2}(\mathbf{\bm{x}})=0.44, then

𝜹~(𝐱)=𝟏(𝐩(𝐱)0.5)=(0,0)(1,1)=𝜹(𝐱),\widetilde{\boldsymbol{\delta}}(\mathbf{\bm{x}})=\mathbf{\bm{1}}(\mathbf{\bm{p}}(\mathbf{\bm{x}})\geq 0.5)=(0,0)^{\intercal}\neq(1,1)^{\intercal}=\boldsymbol{\delta}^{*}(\mathbf{\bm{x}}),

where the first equality follows from the definition of a classification-calibrated loss: the decision rule must agree with the conditional probabilities. Therefore, Dice(𝜹~)<Dice(𝜹){\rm Dice}(\widetilde{\boldsymbol{\delta}})<{\rm Dice}(\boldsymbol{\delta}^{*}) yields that a classification-calibrated loss with thresholding at 0.5 is not Dice-calibrated.

 

D.8 Proof of Lemma 10

Proof  By the definition of a strictly proper loss, see Gneiting and Raftery (2007) and references herein, and the formulation in (24), we have 𝐪^(𝐱)=𝐩(𝐱)=((Y1=1|𝐗=𝐱),,(Yd=1|𝐗=𝐱))\widehat{\mathbf{\bm{q}}}(\mathbf{\bm{x}})=\mathbf{\bm{p}}(\mathbf{\bm{x}})=\big{(}\mathbb{P}(Y_{1}=1|\mathbf{\bm{X}}=\mathbf{\bm{x}}),\cdots,\mathbb{P}(Y_{d}=1|\mathbf{\bm{X}}=\mathbf{\bm{x}})\big{)}^{\intercal}. Then the estimation of τ^(𝐱)\widehat{\tau}(\mathbf{\bm{x}}) and 𝜹^(𝐱)\widehat{\boldsymbol{\delta}}(\mathbf{\bm{x}}) agrees with the definition of τ(𝐱)\tau^{*}(\mathbf{\bm{x}}) and 𝜹(𝐱)\boldsymbol{\delta}^{*}(\mathbf{\bm{x}}). This completes the proof.  

D.9 Proof of Theorem 11

Proof  First, we consider point-wise approximation of the Dice metric under probabilities 𝐩\mathbf{\bm{p}} and 𝐪^\widehat{\mathbf{\bm{q}}}. For any 𝜹\boldsymbol{\delta}, such that δj(𝐱)=1\delta_{j}(\mathbf{\bm{x}})=1 for j=j1,,jτj=j_{1},\cdots,j_{\tau}, and δj(𝐱)=0\delta_{j}(\mathbf{\bm{x}})=0 otherwise. Define

Dice^γ(𝜹(𝐱)|𝐗=𝐱)\displaystyle\widehat{{\rm Dice}}_{\gamma}({\boldsymbol{\delta}}(\mathbf{\bm{x}})|\mathbf{\bm{X}}=\mathbf{\bm{x}}) :=s=1τl=0d12q^js(𝐱)(Γ^js(𝐱)=l)τ+l+γ+1+l=0dγ(Γ^(𝐗)=l)τ+l+γ\displaystyle:=\sum_{s=1}^{\tau}\sum_{l=0}^{d-1}\frac{2\widehat{q}_{j_{s}}(\mathbf{\bm{x}})\mathbb{P}\big{(}\widehat{\Gamma}_{\sm j_{s}}(\mathbf{\bm{x}})=l\big{)}}{\tau+l+\gamma+1}+\sum_{l=0}^{d}\frac{\gamma\mathbb{P}\big{(}\widehat{\Gamma}(\mathbf{\bm{X}})=l\big{)}}{\tau+l+\gamma}
=s=1τ2q^js(𝐱)𝔼(1τ+γ+1+Γ^js(𝐱))+γ𝔼(1τ+γ+Γ^(𝐱))\displaystyle=\sum_{s=1}^{\tau}2\widehat{q}_{j_{s}}(\mathbf{\bm{x}})\mathbb{E}\big{(}\frac{1}{\tau+\gamma+1+\widehat{\Gamma}_{\sm j_{s}}(\mathbf{\bm{x}})}\big{)}+\gamma\mathbb{E}\big{(}\frac{1}{\tau+\gamma+\widehat{\Gamma}(\mathbf{\bm{x}})}\big{)}
Diceγ(𝜹(𝐱)|𝐗=𝐱)\displaystyle{\rm Dice}_{\gamma}({\boldsymbol{\delta}}(\mathbf{\bm{x}})|\mathbf{\bm{X}}=\mathbf{\bm{x}}) :=s=1τl=0d12pjs(𝐱)(Γjs(𝐱)=l)τ+l+γ+1+l=0dγ(Γ(𝐗)=l)τ+l+γ\displaystyle:=\sum_{s=1}^{\tau}\sum_{l=0}^{d-1}\frac{2p_{j_{s}}(\mathbf{\bm{x}})\mathbb{P}\big{(}\Gamma_{\sm j_{s}}(\mathbf{\bm{x}})=l\big{)}}{\tau+l+\gamma+1}+\sum_{l=0}^{d}\frac{\gamma\mathbb{P}\big{(}{\Gamma}(\mathbf{\bm{X}})=l\big{)}}{\tau+l+\gamma}
=s=1τ2pjs(𝐱)𝔼(1τ+γ+1+Γjs(𝐱))+γ𝔼(1τ+γ+Γ(𝐱)).\displaystyle=\sum_{s=1}^{\tau}2p_{j_{s}}(\mathbf{\bm{x}})\mathbb{E}\big{(}\frac{1}{\tau+\gamma+1+\Gamma_{\sm j_{s}}(\mathbf{\bm{x}})}\big{)}+\gamma\mathbb{E}\big{(}\frac{1}{\tau+\gamma+\Gamma(\mathbf{\bm{x}})}\big{)}.

Now, we have

|Dice^γ(𝜹(𝐱)|𝐗=𝐱)Diceγ(𝜹(𝐱)|𝐗=𝐱)|\displaystyle\big{|}\widehat{{\rm Dice}}_{\gamma}({\boldsymbol{\delta}}(\mathbf{\bm{x}})|\mathbf{\bm{X}}=\mathbf{\bm{x}})-{\rm Dice}_{\gamma}({\boldsymbol{\delta}}(\mathbf{\bm{x}})|\mathbf{\bm{X}}=\mathbf{\bm{x}})\big{|}
|2s=1τ(q^js(𝐱)𝔼(1τ+γ+1+Γ^js(𝐱))pjs(𝐱)𝔼(1τ+γ+1+Γjs(𝐱)))|\displaystyle\leq\Big{|}2\sum_{s=1}^{\tau}\Big{(}\widehat{q}_{j_{s}}(\mathbf{\bm{x}})\mathbb{E}\big{(}\frac{1}{\tau+\gamma+1+\widehat{\Gamma}_{\sm j_{s}}(\mathbf{\bm{x}})}\big{)}-p_{j_{s}}(\mathbf{\bm{x}})\mathbb{E}\big{(}\frac{1}{\tau+\gamma+1+\Gamma_{\sm j_{s}}(\mathbf{\bm{x}})}\big{)}\Big{)}\Big{|}
+γ|𝔼(1τ+γ+Γ^(𝐱))𝔼(1τ+γ+Γ(𝐱))|\displaystyle\qquad+\gamma\Big{|}\mathbb{E}\big{(}\frac{1}{\tau+\gamma+\widehat{\Gamma}(\mathbf{\bm{x}})}\big{)}-\mathbb{E}\big{(}\frac{1}{\tau+\gamma+\Gamma(\mathbf{\bm{x}})}\big{)}\Big{|}
|2s=1τq^js(𝐱)(𝔼(1τ+γ+1+Γ^js(𝐱))𝔼(1τ+γ+1+Γjs(𝐱)))|\displaystyle\leq\Big{|}2\sum_{s=1}^{\tau}\widehat{q}_{j_{s}}(\mathbf{\bm{x}})\Big{(}\mathbb{E}\big{(}\frac{1}{\tau+\gamma+1+\widehat{\Gamma}_{\sm j_{s}}(\mathbf{\bm{x}})}\big{)}-\mathbb{E}\big{(}\frac{1}{\tau+\gamma+1+\Gamma_{\sm j_{s}}(\mathbf{\bm{x}})}\big{)}\Big{)}\Big{|}
+|2s=1τ(q^js(𝐱)pjs(𝐱))𝔼(1τ+γ+1+Γjs(𝐱)))|+γ|𝔼(1τ+γ+Γ^(𝐱))𝔼(1τ+γ+Γ(𝐱))|\displaystyle\quad+\Big{|}2\sum_{s=1}^{\tau}\big{(}\widehat{q}_{j_{s}}(\mathbf{\bm{x}})-p_{j_{s}}(\mathbf{\bm{x}})\big{)}\mathbb{E}\big{(}\frac{1}{\tau+\gamma+1+\Gamma_{\sm j_{s}}(\mathbf{\bm{x}})}\big{)}\Big{)}\Big{|}+\gamma\Big{|}\mathbb{E}\big{(}\frac{1}{\tau+\gamma+\widehat{\Gamma}(\mathbf{\bm{x}})}\big{)}-\mathbb{E}\big{(}\frac{1}{\tau+\gamma+\Gamma(\mathbf{\bm{x}})}\big{)}\Big{|}
2s=1τ|𝔼(Γjs(𝐱)Γ^js(𝐱))(τ+γ+1)2|+2s=1τ|q^js(𝐱)pjs(𝐱)|τ+γ+1+γ|𝔼(Γ(𝐱)Γ^(𝐱))(τ+γ)2|\displaystyle\leq 2\sum_{s=1}^{\tau}\Big{|}\frac{\mathbb{E}\big{(}\Gamma_{\sm j_{s}}(\mathbf{\bm{x}})-\widehat{\Gamma}_{\sm j_{s}}(\mathbf{\bm{x}})\big{)}}{(\tau+\gamma+1)^{2}}\Big{|}+2\sum_{s=1}^{\tau}\frac{\big{|}\widehat{q}_{j_{s}}(\mathbf{\bm{x}})-p_{j_{s}}(\mathbf{\bm{x}})\big{|}}{\tau+\gamma+1}+\gamma\Big{|}\frac{\mathbb{E}\big{(}\Gamma(\mathbf{\bm{x}})-\widehat{\Gamma}(\mathbf{\bm{x}})\big{)}}{(\tau+\gamma)^{2}}\Big{|}
2s=1τ|𝐪^(𝐱)1𝐩(𝐱)1|+|q^js(𝐱)pjs(𝐱)|(τ+γ+1)2+2s=1τ|q^js(𝐱)pjs(𝐱)|τ+γ+1+γ|𝐪^(𝐱)1𝐩(𝐱)1|(τ+γ)2\displaystyle\leq 2\sum_{s=1}^{\tau}\frac{|\|\widehat{\mathbf{\bm{q}}}(\mathbf{\bm{x}})\|_{1}-\|\mathbf{\bm{p}}(\mathbf{\bm{x}})\|_{1}|+|\widehat{q}_{j_{s}}(\mathbf{\bm{x}})-p_{j_{s}}(\mathbf{\bm{x}})|}{(\tau+\gamma+1)^{2}}+2\sum_{s=1}^{\tau}\frac{\big{|}\widehat{q}_{j_{s}}(\mathbf{\bm{x}})-p_{j_{s}}(\mathbf{\bm{x}})\big{|}}{\tau+\gamma+1}+\gamma\frac{|\|\widehat{\mathbf{\bm{q}}}(\mathbf{\bm{x}})\|_{1}-\|\mathbf{\bm{p}}(\mathbf{\bm{x}})\|_{1}|}{(\tau+\gamma)^{2}}
(32(1+γ)+c1)𝐪^(𝐱)𝐩(𝐱)1,\displaystyle\leq\big{(}\frac{3}{{2(1+\gamma)}}+c_{1}\big{)}{\|\widehat{\mathbf{\bm{q}}}(\mathbf{\bm{x}})-\mathbf{\bm{p}}(\mathbf{\bm{x}})\|_{1}},

where the second inequality follows from the triangle inequality, c1=0c_{1}=0 if γ=0\gamma=0, c1=1/γc_{1}=1/\gamma if γ>0\gamma>0. Therefore,

Diceγ(𝜹)\displaystyle{\rm Dice}_{\gamma}(\boldsymbol{\delta}^{*}) Diceγ(𝜹^)=Diceγ(𝜹)Dice^γ(𝜹)+Dice^γ(𝜹)Dice^γ(𝜹^)+Dice^γ(𝜹^)Diceγ(𝜹^)\displaystyle-{\rm Dice}_{\gamma}(\widehat{\boldsymbol{\delta}})={\rm Dice}_{\gamma}(\boldsymbol{\delta}^{*})-\widehat{{\rm Dice}}_{\gamma}(\boldsymbol{\delta}^{*})+\widehat{{\rm Dice}}_{\gamma}(\boldsymbol{\delta}^{*})-\widehat{{\rm Dice}}_{\gamma}(\widehat{\boldsymbol{\delta}})+\widehat{{\rm Dice}}_{\gamma}(\widehat{\boldsymbol{\delta}})-{\rm Dice}_{\gamma}(\widehat{\boldsymbol{\delta}})
Diceγ(𝜹)Dice^γ(𝜹)+Dice^γ(𝜹^)Diceγ(𝜹^)\displaystyle\leq{\rm Dice}_{\gamma}(\boldsymbol{\delta}^{*})-\widehat{{\rm Dice}}_{\gamma}(\boldsymbol{\delta}^{*})+\widehat{{\rm Dice}}_{\gamma}(\widehat{\boldsymbol{\delta}})-{\rm Dice}_{\gamma}(\widehat{\boldsymbol{\delta}})
𝔼𝐗|Dice^γ(𝜹(𝐗)|𝐗)Diceγ(𝜹(𝐗)|𝐗)|+𝔼𝐗|Dice^γ(𝜹^(𝐗)|𝐗)Diceγ(𝜹^(𝐗)|𝐗)|\displaystyle\leq\mathbb{E}_{\mathbf{\bm{X}}}\big{|}\widehat{{\rm Dice}}_{\gamma}({\boldsymbol{\delta}^{*}}(\mathbf{\bm{X}})|\mathbf{\bm{X}})-{\rm Dice}_{\gamma}({\boldsymbol{\delta}^{*}}(\mathbf{\bm{X}})|\mathbf{\bm{X}})\big{|}+\mathbb{E}_{\mathbf{\bm{X}}}\big{|}\widehat{{\rm Dice}}_{\gamma}(\widehat{\boldsymbol{\delta}}(\mathbf{\bm{X}})|\mathbf{\bm{X}})-{\rm Dice}_{\gamma}(\widehat{\boldsymbol{\delta}}(\mathbf{\bm{X}})|\mathbf{\bm{X}})\big{|}
(31+γ+2c1)𝔼𝐪^(𝐗)𝐩(𝐗)1,\displaystyle\leq\big{(}\frac{3}{1+\gamma}+2c_{1}\big{)}\mathbb{E}{\|\widehat{\mathbf{\bm{q}}}(\mathbf{\bm{X}})-\mathbf{\bm{p}}(\mathbf{\bm{X}})\|_{1}},

where the first inequality follows from the definition of 𝜹^\widehat{\boldsymbol{\delta}} such that Dice^γ(𝜹)Dice^γ(𝜹^)0\widehat{{\rm Dice}}_{\gamma}(\boldsymbol{\delta}^{*})-\widehat{{\rm Dice}}_{\gamma}(\widehat{\boldsymbol{\delta}})\leq 0. This completes the proof.  

D.10 Proof of Corollary 12

Proof  According to the Pinsker’s inequality, we have

𝔼𝐗\displaystyle\mathbb{E}_{\mathbf{\bm{X}}} 𝐪^(𝐗)𝐩(𝐗)1=j=1d𝔼𝐗|q^j(𝐗)pj(𝐗)|j=1d𝔼𝐗12KL((Yj|𝐗),^(Yj|𝐗))\displaystyle{\|\widehat{\mathbf{\bm{q}}}(\mathbf{\bm{X}})-\mathbf{\bm{p}}(\mathbf{\bm{X}})\|_{1}}=\sum_{j=1}^{d}\mathbb{E}_{\mathbf{\bm{X}}}{|\widehat{{q}}_{j}(\mathbf{\bm{X}})-{p}_{j}(\mathbf{\bm{X}})|}\leq\sum_{j=1}^{d}\mathbb{E}_{\mathbf{\bm{X}}}\sqrt{\frac{1}{2}{\rm KL}\big{(}\mathbb{P}(Y_{j}|\mathbf{\bm{X}}),\widehat{\mathbb{P}}(Y_{j}|\mathbf{\bm{X}})\big{)}}
d2j=1d𝔼𝐗KL((Yj|𝐗),^(Yj|𝐗))\displaystyle\leq\sqrt{\frac{d}{2}\sum_{j=1}^{d}\mathbb{E}_{\mathbf{\bm{X}}}{\rm KL}\big{(}\mathbb{P}(Y_{j}|\mathbf{\bm{X}}),\widehat{\mathbb{P}}(Y_{j}|\mathbf{\bm{X}})\big{)}}
=d2𝔼(lCE(𝐘,𝐪^(𝐗)))𝔼(lCE(𝐘,𝐩(𝐗))),\displaystyle=\sqrt{\frac{d}{2}}\sqrt{\mathbb{E}\Big{(}l_{\text{CE}}\big{(}\mathbf{\bm{Y}},\widehat{\mathbf{\bm{q}}}(\mathbf{\bm{X}})\big{)}\Big{)}-\mathbb{E}\Big{(}l_{\text{CE}}\big{(}\mathbf{\bm{Y}},\mathbf{\bm{p}}(\mathbf{\bm{X}})\big{)}\Big{)}},

where the last inequality follows from the Cauchy-Schwarz inequality and the Jensen’s inequality, and KL((Yj|𝐱),^(Yj|𝐱)):=pj(𝐱)log(pj(𝐱)/q^j(𝐱))+(1pj(𝐱))log((1pj(𝐱))/(1q^j(𝐱))){\rm KL}\big{(}\mathbb{P}(Y_{j}|\mathbf{\bm{x}}),\widehat{\mathbb{P}}(Y_{j}|\mathbf{\bm{x}})\big{)}:=p_{j}(\mathbf{\bm{x}})\log\big{(}p_{j}(\mathbf{\bm{x}})/\widehat{q}_{j}(\mathbf{\bm{x}})\big{)}+(1-p_{j}(\mathbf{\bm{x}}))\log((1-p_{j}(\mathbf{\bm{x}}))/(1-\widehat{q}_{j}(\mathbf{\bm{x}}))) is the KL divergence between (Yj|𝐱)\mathbb{P}(Y_{j}|\mathbf{\bm{x}}) under 𝐩\mathbf{\bm{p}} and ^(Yj|𝐱)\widehat{\mathbb{P}}(Y_{j}|\mathbf{\bm{x}}) under 𝐪^\widehat{\mathbf{\bm{q}}}. The desirable result then follows by combining (19) in Theorem 11. This completes the proof.  

D.11 Proof of Lemma 13

Proof  Denote 𝐲I=(yj:jI)\mathbf{\bm{y}}_{I}=(y_{j}:j\in I)^{\intercal}, 𝐲I=(yj:jI)\mathbf{\bm{y}}_{\sm I}=(y_{j}:j\notin I)^{\intercal}, and let I(𝐯)=I(𝜹(𝐱))={j:vj=1}I(\mathbf{\bm{v}})=I(\boldsymbol{\delta}(\mathbf{\bm{x}}))=\{j:v_{j}=1\} be a segmentation index set, and 𝐯1=τ\|\mathbf{\bm{v}}\|_{1}=\tau. Again, consider the point-wise maximization:

𝜹(𝐱)=argmax𝐯{0,1}dIoUγ(𝐯|𝐱),\boldsymbol{\delta}^{*}(\mathbf{\bm{x}})=\mathop{\rm argmax}_{\mathbf{\bm{v}}\in\{0,1\}^{d}}\ {{\rm IoU}}_{\gamma}(\mathbf{\bm{v}}|\mathbf{\bm{x}}),

where IoUγ(𝐯|𝐱){{\rm IoU}}_{\gamma}(\mathbf{\bm{v}}|\mathbf{\bm{x}}) is defined as

IoUγ(𝐯|𝐱)\displaystyle{{\rm IoU}}_{\gamma}(\mathbf{\bm{v}}|\mathbf{\bm{x}}) =𝔼(𝐘I(𝐯)1+γ𝐘I(𝐯)1+𝐯1+γ|𝐗=𝐱)\displaystyle=\mathbb{E}\Big{(}\frac{\|\mathbf{\bm{Y}}_{I(\mathbf{\bm{v}})}\|_{1}+\gamma}{\|\mathbf{\bm{Y}}_{\sm I(\mathbf{\bm{v}})}\|_{1}+\|\mathbf{\bm{v}}\|_{1}+\gamma}\Big{|}\mathbf{\bm{X}}=\mathbf{\bm{x}}\Big{)}
=𝔼(𝐘I(𝐯)1𝐘I(𝐯)1+τ+γ|𝐗=𝐱)+𝔼(γ𝐘I(𝐯)1+τ+γ|𝐗=𝐱)\displaystyle=\mathbb{E}\Big{(}\frac{\|\mathbf{\bm{Y}}_{I(\mathbf{\bm{v}})}\|_{1}}{\|\mathbf{\bm{Y}}_{\sm I(\mathbf{\bm{v}})}\|_{1}+\tau+\gamma}\Big{|}\mathbf{\bm{X}}=\mathbf{\bm{x}}\Big{)}+\mathbb{E}\Big{(}\frac{\gamma}{\|\mathbf{\bm{Y}}_{\sm I(\mathbf{\bm{v}})}\|_{1}+\tau+\gamma}\Big{|}\mathbf{\bm{X}}=\mathbf{\bm{x}}\Big{)}
=(𝔼(𝐘I(𝐯)1|𝐗=𝐱)+γ)𝔼(1𝐘I(𝐯)1+τ+γ|𝐗=𝐱),\displaystyle=\Big{(}\mathbb{E}(\|\mathbf{\bm{Y}}_{I(\mathbf{\bm{v}})}\|_{1}|\mathbf{\bm{X}}=\mathbf{\bm{x}})+\gamma\Big{)}\mathbb{E}\Big{(}\frac{1}{\|\mathbf{\bm{Y}}_{\sm I(\mathbf{\bm{v}})}\|_{1}+\tau+\gamma}\Big{|}\mathbf{\bm{X}}=\mathbf{\bm{x}}\Big{)},

where the last equality follows from the fact that 𝐘I(𝐯)𝐘I(𝐯)𝐗\mathbf{\bm{Y}}_{I(\mathbf{\bm{v}})}\perp\mathbf{\bm{Y}}_{\sm I(\mathbf{\bm{v}})}\mid\mathbf{\bm{X}}. Fix τ=𝐯1\tau=\|\mathbf{\bm{v}}\|_{1}. The first term is maximized at 𝐯=(v1,,vd)\mathbf{\bm{v}}^{*}=(v_{1}^{*},\cdots,v_{d}^{*}) with

vj={1if pj(𝐱) ranks top τ,0otherwise,\displaystyle{v}^{*}_{j}=\begin{cases}1&\text{if }p_{j}(\mathbf{\bm{x}})\text{ ranks top }\tau,\\ 0&\text{otherwise},\end{cases}

and the maximum value is jJτ(𝐱)pj(𝐱)+γ\sum_{j\in J_{\tau}(\mathbf{\bm{x}})}p_{j}(\mathbf{\bm{x}})+\gamma. The second term is

𝔼(1𝐘I(𝐯)1+τ+γ|𝐗=𝐱)=𝔼(1ΓI(𝐯)(𝐱)+τ+γ).\displaystyle\mathbb{E}\Big{(}\frac{1}{\|\mathbf{\bm{Y}}_{\sm I(\mathbf{\bm{v}})}\|_{1}+\tau+\gamma}\Big{|}\mathbf{\bm{X}}=\mathbf{\bm{x}}\Big{)}=\mathbb{E}\Big{(}\frac{1}{\Gamma_{\sm I(\mathbf{\bm{v}})}(\mathbf{\bm{x}})+\tau+\gamma}\Big{)}.

Given I(𝐯)I(𝐯)I(\mathbf{\bm{v}})\neq I(\mathbf{\bm{v}}^{*}), we have ΓI(𝐯)(𝐱)=j=1dτBj\Gamma_{\sm I(\mathbf{\bm{v}})}(\mathbf{\bm{x}})=\sum_{j=1}^{d-\tau}B_{j} and ΓI(𝐯)(𝐱)=j=1dτBj\Gamma_{\sm I(\mathbf{\bm{v}}^{*})}(\mathbf{\bm{x}})=\sum_{j=1}^{d-\tau}B^{*}_{j}, where BjB_{j} and BjB^{*}_{j} are independent Bernoulli variables with success probability pjpjp_{j}\geq p^{*}_{j}, respectively, for j=1,,dτj=1,\cdots,d-\tau. By Theorem 2.2.9 of Belzunce et al. (2015), ΓI(𝐯)(𝐱)\Gamma_{\sm I(\mathbf{\bm{v}})}(\mathbf{\bm{x}}) is stochastically greater than ΓI(𝐯)(𝐱)\Gamma_{\sm I(\mathbf{\bm{v}}^{*})}(\mathbf{\bm{x}}), namely

(ΓI(𝐯)(𝐱)ς)(ΓI(𝐯)(𝐱)ς)\mathbb{P}(\Gamma_{\sm I(\mathbf{\bm{v}})}(\mathbf{\bm{x}})\leq\varsigma)\leq\mathbb{P}(\Gamma_{\sm I(\mathbf{\bm{v}}^{*})}(\mathbf{\bm{x}})\leq\varsigma)

for any ς\varsigma\in\mathbb{R}. Thus,

𝔼(1ΓI(𝐯)(𝐱)+τ+γ)\displaystyle\mathbb{E}\Big{(}\frac{1}{\Gamma_{\sm I(\mathbf{\bm{v}})}(\mathbf{\bm{x}})+\tau+\gamma}\Big{)} =0(1ΓI(𝐯)(𝐱)+τ+γς)𝑑ς\displaystyle=\int_{0}^{\infty}\mathbb{P}\Big{(}\frac{1}{\Gamma_{\sm I(\mathbf{\bm{v}})}(\mathbf{\bm{x}})+\tau+\gamma}\geq\varsigma\Big{)}d\varsigma
0(1ΓI(𝐯)(𝐱)+τ+γς)𝑑ς=𝔼(1ΓI(𝐯)(𝐱)+τ+γ).\displaystyle\leq\int_{0}^{\infty}\mathbb{P}\Big{(}\frac{1}{\Gamma_{\sm I(\mathbf{\bm{v}}^{*})}(\mathbf{\bm{x}})+\tau+\gamma}\geq\varsigma\Big{)}d\varsigma=\mathbb{E}\Big{(}\frac{1}{\Gamma_{\sm I(\mathbf{\bm{v}}^{*})}(\mathbf{\bm{x}})+\tau+\gamma}\Big{)}.

As a result, the second term is also maximized at 𝐯\mathbf{\bm{v}}^{*}. Therefore, 𝐯\mathbf{\bm{v}}^{*} maximizes IoUγ(𝐯|𝐱){\rm IoU}_{\gamma}(\mathbf{\bm{v}}|\mathbf{\bm{x}}) for a fixed τ\tau. The desired result follows.  

D.12 Proof of Lemma 14

Proof  Without loss of generality, assuming q^1(𝐱)q^d(𝐱)\widehat{q}_{1}(\mathbf{\bm{x}})\geq\cdots\geq\widehat{q}_{d}(\mathbf{\bm{x}}) are fixed, then for any τ>τ\tau^{\prime}>\tau,

ϖτ(𝐱)ϖτ(𝐱)=(j=1τq^j(𝐱)+γ)𝔼(1Γ^Jτ(𝐱)+τ+γ)(j=1τq^j(𝐱)+γ)𝔼(1Γ^Jτ(𝐱)+τ+γ)=(j=1τq^j(𝐱)+γ)𝔼(1Γ^Jτ(𝐱)+j=τ+1τB^j(𝐱)+τ+γ)(j=1τq^j(𝐱)+γ)𝔼(1Γ^Jτ(𝐱)+τ+γ)=𝔼((j=1τq^j(𝐱)+γ)(Γ^Jτ(𝐱)+τ+γ)(j=1τq^j(𝐱)+γ)(Γ^Jτ(𝐱)+j=τ+1τB^j(𝐱)+τ+γ)(Γ^Jτ(𝐱)+j=τ+1τB^j(𝐱)+τ+γ)(Γ^Jτ(𝐱)+τ+γ))=𝔼((j=1τq^j(𝐱)+γ)(ττj=τ+1τB^j(𝐱))(Γ^Jτ(𝐱)+j=τ+1τB^j(𝐱)+τ+γ)(Γ^Jτ(𝐱)+τ+γ))𝔼(j=τ+1τq^j(𝐱)Γ^Jτ(𝐱)+τ+γ)(j=1τq^j(𝐱)+γ)(ττj=τ+1τq^j(𝐱))((dτ)q^τ+1(𝐱)+τ+γ)2j=τ+1τq^j(𝐱)τ+γ(j=1τq^j(𝐱)+γ)(ττ)(1q^τ+1(𝐱))((dτ)q^τ+1(𝐱)+τ+γ)2(ττ)q^τ+1(𝐱)τ+γ0,\begin{split}&\varpi_{\tau}(\mathbf{\bm{x}})-\varpi_{\tau^{\prime}}(\mathbf{\bm{x}})\\ &=\Big{(}\sum_{j=1}^{\tau}\widehat{q}_{j}(\mathbf{\bm{x}})+\gamma\Big{)}\mathbb{E}\Big{(}\frac{1}{\widehat{\Gamma}_{\sm J_{\tau}(\mathbf{\bm{x}})}+\tau+\gamma}\Big{)}-\Big{(}\sum_{j=1}^{\tau^{\prime}}\widehat{q}_{j}(\mathbf{\bm{x}})+\gamma\Big{)}\mathbb{E}\Big{(}\frac{1}{\widehat{\Gamma}_{\sm J_{\tau^{\prime}}(\mathbf{\bm{x}})}+\tau^{\prime}+\gamma}\Big{)}\\ &=\Big{(}\sum_{j=1}^{\tau}\widehat{q}_{j}(\mathbf{\bm{x}})+\gamma\Big{)}\mathbb{E}\Big{(}\frac{1}{\widehat{\Gamma}_{\sm J_{\tau^{\prime}}(\mathbf{\bm{x}})}+\sum_{j=\tau+1}^{\tau^{\prime}}\widehat{B}_{j}(\mathbf{\bm{x}})+\tau+\gamma}\Big{)}-\Big{(}\sum_{j=1}^{\tau^{\prime}}\widehat{q}_{j}(\mathbf{\bm{x}})+\gamma\Big{)}\mathbb{E}\Big{(}\frac{1}{\widehat{\Gamma}_{\sm J_{\tau^{\prime}}(\mathbf{\bm{x}})}+\tau^{\prime}+\gamma}\Big{)}\\ &=\mathbb{E}\Big{(}\frac{(\sum_{j=1}^{\tau}\widehat{q}_{j}(\mathbf{\bm{x}})+\gamma)(\widehat{\Gamma}_{\sm J_{\tau^{\prime}}(\mathbf{\bm{x}})}+\tau^{\prime}+\gamma)-(\sum_{j=1}^{\tau^{\prime}}\widehat{q}_{j}(\mathbf{\bm{x}})+\gamma)(\widehat{\Gamma}_{\sm J_{\tau^{\prime}}(\mathbf{\bm{x}})}+\sum_{j=\tau+1}^{\tau^{\prime}}\widehat{B}_{j}(\mathbf{\bm{x}})+\tau+\gamma)}{(\widehat{\Gamma}_{\sm J_{\tau^{\prime}}(\mathbf{\bm{x}})}+\sum_{j=\tau+1}^{\tau^{\prime}}\widehat{B}_{j}(\mathbf{\bm{x}})+\tau+\gamma)(\widehat{\Gamma}_{\sm J_{\tau^{\prime}}(\mathbf{\bm{x}})}+\tau^{\prime}+\gamma)}\Big{)}\\ &=\mathbb{E}\Big{(}\frac{(\sum_{j=1}^{\tau}\widehat{q}_{j}(\mathbf{\bm{x}})+\gamma)(\tau^{\prime}-\tau-\sum_{j=\tau+1}^{\tau^{\prime}}\widehat{B}_{j}(\mathbf{\bm{x}}))}{(\widehat{\Gamma}_{\sm J_{\tau^{\prime}}(\mathbf{\bm{x}})}+\sum_{j=\tau+1}^{\tau^{\prime}}\widehat{B}_{j}(\mathbf{\bm{x}})+\tau+\gamma)(\widehat{\Gamma}_{\sm J_{\tau^{\prime}}(\mathbf{\bm{x}})}+\tau^{\prime}+\gamma)}\Big{)}-\mathbb{E}\Big{(}\frac{\sum_{j=\tau+1}^{\tau^{\prime}}\widehat{q}_{j}(\mathbf{\bm{x}})}{\widehat{\Gamma}_{\sm J_{\tau^{\prime}}(\mathbf{\bm{x}})}+\tau^{\prime}+\gamma}\Big{)}\\ &\geq\frac{(\sum_{j=1}^{\tau}\widehat{q}_{j}(\mathbf{\bm{x}})+\gamma)(\tau^{\prime}-\tau-\sum_{j=\tau+1}^{\tau^{\prime}}\widehat{q}_{j}(\mathbf{\bm{x}}))}{((d-\tau^{\prime})\widehat{q}_{\tau+1}(\mathbf{\bm{x}})+\tau^{\prime}+\gamma)^{2}}-\frac{\sum_{j=\tau+1}^{\tau^{\prime}}\widehat{q}_{j}(\mathbf{\bm{x}})}{\tau^{\prime}+\gamma}\\ &\geq\frac{(\sum_{j=1}^{\tau}\widehat{q}_{j}(\mathbf{\bm{x}})+\gamma)(\tau^{\prime}-\tau)(1-\widehat{q}_{\tau+1}(\mathbf{\bm{x}}))}{((d-\tau^{\prime})\widehat{q}_{\tau+1}(\mathbf{\bm{x}})+\tau^{\prime}+\gamma)^{2}}-\frac{(\tau^{\prime}-\tau)\widehat{q}_{\tau+1}(\mathbf{\bm{x}})}{\tau^{\prime}+\gamma}\geq 0,\end{split}

whenever

j=1τq^j(𝐱)+γq^τ+1(𝐱)1q^τ+1(𝐱)maxτ>τ((dτ)q^τ+1(𝐱)+τ+γ)2τ+γ=q^τ+1(𝐱)1q^τ+1(𝐱)max(d+γ,((dτ)q^τ+1(𝐱)+τ+γ)2τ+γ).\begin{split}\sum_{j=1}^{\tau}\widehat{q}_{j}(\mathbf{\bm{x}})+\gamma&\geq\frac{\widehat{q}_{\tau+1}(\mathbf{\bm{x}})}{1-\widehat{q}_{\tau+1}(\mathbf{\bm{x}})}\max_{\tau^{\prime}>\tau}\frac{((d-\tau^{\prime})\widehat{q}_{\tau+1}(\mathbf{\bm{x}})+\tau^{\prime}+\gamma)^{2}}{\tau^{\prime}+\gamma}\\ &=\frac{\widehat{q}_{\tau+1}(\mathbf{\bm{x}})}{1-\widehat{q}_{\tau+1}(\mathbf{\bm{x}})}\max\Big{(}d+\gamma,\frac{((d-\tau)\widehat{q}_{\tau+1}(\mathbf{\bm{x}})+\tau+\gamma)^{2}}{\tau+\gamma}\Big{)}.\end{split}

This completes the proof.  

D.13 Proof of Theorem 15

Proof  Similar to the proof of Theorem 11, we consider point-wise approximation of IoU metric under 𝐩\mathbf{\bm{p}} and 𝐪^\widehat{\mathbf{\bm{q}}}. For any 𝜹\mathbf{\bm{\delta}} such that δj(𝐱)=1\delta_{j}(\mathbf{\bm{x}})=1 for jJτ(𝐱)j\in J_{\tau}(\mathbf{\bm{x}}) and δj(𝐱)=0\delta_{j}(\mathbf{\bm{x}})=0 otherwise. Define

IoU^γ(𝜹(𝐱)|𝐗=𝐱)\displaystyle\widehat{{\rm IoU}}_{\gamma}({\boldsymbol{\delta}}(\mathbf{\bm{x}})|\mathbf{\bm{X}}=\mathbf{\bm{x}}) :=(s=1τq^js(𝐱)+γ)l=0dτ(Γ^Jτ(𝐱)(𝐱)=l)τ+l+γ\displaystyle:=\Big{(}\sum_{s=1}^{\tau}\widehat{q}_{j_{s}}(\mathbf{\bm{x}})+\gamma\Big{)}\sum_{l=0}^{d-\tau}\frac{\mathbb{P}\big{(}\widehat{\Gamma}_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}})=l\big{)}}{\tau+l+\gamma}
=(s=1τq^js(𝐱)+γ)𝔼(1τ+γ+Γ^Jτ(𝐱)(𝐱)),\displaystyle=\Big{(}\sum_{s=1}^{\tau}\widehat{q}_{j_{s}}(\mathbf{\bm{x}})+\gamma\Big{)}\mathbb{E}\big{(}\frac{1}{\tau+\gamma+\widehat{\Gamma}_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}})}\big{)},
IoUγ(𝜹(𝐱)|𝐗=𝐱)\displaystyle{\rm IoU}_{\gamma}({\boldsymbol{\delta}}(\mathbf{\bm{x}})|\mathbf{\bm{X}}=\mathbf{\bm{x}}) :=(s=1τpj(𝐱)+γ)l=0dτ(ΓJτ(𝐱)(𝐱)=l)τ+l+γ\displaystyle:=\Big{(}\sum_{s=1}^{\tau}p_{j}(\mathbf{\bm{x}})+\gamma\Big{)}\sum_{l=0}^{d-\tau}\frac{\mathbb{P}\big{(}\Gamma_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}})=l\big{)}}{\tau+l+\gamma}
=(s=1τpj(𝐱)+γ)𝔼(1τ+γ+ΓJτ(𝐱)(𝐱)).\displaystyle=\Big{(}\sum_{s=1}^{\tau}p_{j}(\mathbf{\bm{x}})+\gamma\Big{)}\mathbb{E}\big{(}\frac{1}{\tau+\gamma+\Gamma_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}})}\big{)}.

We have

|IoU^γ(𝜹(𝐱)|𝐗=𝐱)IoUγ(𝜹(𝐱)|𝐗=𝐱)|\displaystyle\big{|}\widehat{{\rm IoU}}_{\gamma}({\boldsymbol{\delta}}(\mathbf{\bm{x}})|\mathbf{\bm{X}}=\mathbf{\bm{x}})-{\rm IoU}_{\gamma}({\boldsymbol{\delta}}(\mathbf{\bm{x}})|\mathbf{\bm{X}}=\mathbf{\bm{x}})\big{|}
|s=1τ(q^js(𝐱)𝔼(1τ+γ+Γ^Jτ(𝐱)(𝐱))pjs(𝐱)𝔼(1τ+γ+ΓJτ(𝐱)(𝐱)))|\displaystyle\leq\Big{|}\sum_{s=1}^{\tau}\Big{(}\widehat{q}_{j_{s}}(\mathbf{\bm{x}})\mathbb{E}\big{(}\frac{1}{\tau+\gamma+\widehat{\Gamma}_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}})}\big{)}-p_{j_{s}}(\mathbf{\bm{x}})\mathbb{E}\big{(}\frac{1}{\tau+\gamma+\Gamma_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}})}\big{)}\Big{)}\Big{|}
+γ|𝔼(1τ+γ+Γ^Jτ(𝐱)(𝐱))𝔼(1τ+γ+ΓJτ(𝐱)(𝐱))|\displaystyle\qquad+\gamma\Big{|}\mathbb{E}\big{(}\frac{1}{\tau+\gamma+\widehat{\Gamma}_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}})}\big{)}-\mathbb{E}\big{(}\frac{1}{\tau+\gamma+\Gamma_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}})}\big{)}\Big{|}
|s=1τq^js(𝐱)(𝔼(1τ+γ+1+Γ^Jτ(𝐱)(𝐱))𝔼(1τ+γ+1+ΓJτ(𝐱)(𝐱)))|\displaystyle\leq\Big{|}\sum_{s=1}^{\tau}\widehat{q}_{j_{s}}(\mathbf{\bm{x}})\Big{(}\mathbb{E}\big{(}\frac{1}{\tau+\gamma+1+\widehat{\Gamma}_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}})}\big{)}-\mathbb{E}\big{(}\frac{1}{\tau+\gamma+1+\Gamma_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}})}\big{)}\Big{)}\Big{|}
+|s=1τ(q^js(𝐱)pjs(𝐱))𝔼(1τ+γ+1+ΓJτ(𝐱)(𝐱)))|\displaystyle\qquad+\Big{|}\sum_{s=1}^{\tau}\big{(}\widehat{q}_{j_{s}}(\mathbf{\bm{x}})-p_{j_{s}}(\mathbf{\bm{x}})\big{)}\mathbb{E}\big{(}\frac{1}{\tau+\gamma+1+\Gamma_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}})}\big{)}\Big{)}\Big{|}
+γ|𝔼(1τ+γ+Γ^Jτ(𝐱)(𝐱))𝔼(1τ+γ+ΓJτ(𝐱)(𝐱))|\displaystyle\qquad+\gamma\Big{|}\mathbb{E}\big{(}\frac{1}{\tau+\gamma+\widehat{\Gamma}_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}})}\big{)}-\mathbb{E}\big{(}\frac{1}{\tau+\gamma+\Gamma_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}})}\big{)}\Big{|}
s=1τ|𝔼(ΓJτ(𝐱)(𝐱)Γ^Jτ(𝐱)(𝐱))(τ+γ+1)2|+s=1τ|q^js(𝐱)pjs(𝐱)|τ+γ+1+γ|𝔼(ΓJτ(𝐱)(𝐱)Γ^Jτ(𝐱)(𝐱))(τ+γ)2|\displaystyle\leq\sum_{s=1}^{\tau}\Big{|}\frac{\mathbb{E}\big{(}\Gamma_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}})-\widehat{\Gamma}_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}})\big{)}}{(\tau+\gamma+1)^{2}}\Big{|}+\sum_{s=1}^{\tau}\frac{\big{|}\widehat{q}_{j_{s}}(\mathbf{\bm{x}})-p_{j_{s}}(\mathbf{\bm{x}})\big{|}}{\tau+\gamma+1}+\gamma\Big{|}\frac{\mathbb{E}\big{(}\Gamma_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}})-\widehat{\Gamma}_{\sm J_{\tau}(\mathbf{\bm{x}})}(\mathbf{\bm{x}})\big{)}}{(\tau+\gamma)^{2}}\Big{|}
s=1τ𝐪^(𝐱)𝐩(𝐱)1(τ+γ+1)2+s=1τ|q^js(𝐱)pjs(𝐱)|τ+γ+1+γ𝐪^(𝐱)𝐩(𝐱)1(τ+γ)2\displaystyle\leq\sum_{s=1}^{\tau}\frac{\|\widehat{\mathbf{\bm{q}}}(\mathbf{\bm{x}})-\mathbf{\bm{p}}(\mathbf{\bm{x}})\|_{1}}{(\tau+\gamma+1)^{2}}+\sum_{s=1}^{\tau}\frac{\big{|}\widehat{q}_{j_{s}}(\mathbf{\bm{x}})-p_{j_{s}}(\mathbf{\bm{x}})\big{|}}{\tau+\gamma+1}+\gamma\frac{\|\widehat{\mathbf{\bm{q}}}(\mathbf{\bm{x}})-\mathbf{\bm{p}}(\mathbf{\bm{x}})\|_{1}}{(\tau+\gamma)^{2}}
C2𝐪^(𝐱)𝐩(𝐱)1,\displaystyle\leq C_{2}{\|\widehat{\mathbf{\bm{q}}}(\mathbf{\bm{x}})-\mathbf{\bm{p}}(\mathbf{\bm{x}})\|_{1}},

where C2C_{2} is a constant that may depend on γ\gamma. Then the rest of the proof follows from the same arguments for Theorems 11 and 12 with Dice replaced by IoU, and is omitted.  

Appendix E Auxiliary Lemmas

Lemma 16

Suppose Γ=j=1dBj\Gamma=\sum_{j=1}^{d}B_{j} is a Poisson binomial random variable, and Bj(j=1,,d)B_{j}(j=1,\cdots,d) are independent Bernoulli trials with success probabilities p1,,pdp_{1},\cdots,p_{d}. Then, for any j=1,,dj=1,\cdots,d, we have

(Γjl1)(Γl)(Γjl),\mathbb{P}\big{(}\Gamma_{\sm j}\leq l-1\big{)}\leq\mathbb{P}\big{(}\Gamma\leq l\big{)}\leq\mathbb{P}\big{(}\Gamma_{\sm j}\leq l\big{)},

where Γj=jjBj\Gamma_{\sm j}=\sum_{j^{\prime}\neq j}B_{j^{\prime}}.

Proof  Note that Γ=Γj+Bj\Gamma=\Gamma_{\sm j}+B_{j}, then

(Γl)\displaystyle\mathbb{P}(\Gamma\leq l) =(Γj+Bjl)=(ΓjlBj=0)(1pj)+(Γjl1Bj=1)pj\displaystyle=\mathbb{P}(\Gamma_{\sm j}+B_{j}\leq l)=\mathbb{P}(\Gamma_{\sm j}\leq l\mid B_{j}=0)(1-p_{j})+\mathbb{P}(\Gamma_{\sm j}\leq l-1\mid B_{j}=1)p_{j}
=(Γjl)(1pj)+(Γjl1)pj,\displaystyle=\mathbb{P}(\Gamma_{\sm j}\leq l)(1-p_{j})+\mathbb{P}(\Gamma_{\sm j}\leq l-1)p_{j},

where the last equality follows from the fact that BjBjB_{j}\perp B_{j^{\prime}} for jjj\neq j^{\prime}. The desirable result then follows since (Γjl)(Γjl1)\mathbb{P}(\Gamma_{\sm j}\leq l)\geq\mathbb{P}(\Gamma_{\sm j}\leq l-1).  

Lemma 17

Suppose Γ=j=1dBj\Gamma=\sum_{j=1}^{d}B_{j} is a Poisson binomial random variable, and Bj(j=1,,d)B_{j}(j=1,\cdots,d) are independent Bernoulli trials with success probabilities p1p2pdp_{1}\geq p_{2}\geq\cdots\geq p_{d}. Then, for an arbitrary positive non-decreasing sequence ζl\zeta_{l} (l=0,,d1)(l=0,\cdots,d-1),

Qj=l=0d11ζl(Γj=l)Q_{j}=\sum_{l=0}^{d-1}\frac{1}{\zeta_{l}}\mathbb{P}(\Gamma_{\sm j}=l)

is a non-increasing function with respect to j=1,,dj=1,\cdots,d.

Proof  Note that for any jjj^{\prime}\neq j, Γj=ijBi=ij,jBi+Bj=:Γjj+Bj\Gamma_{\sm j}=\sum_{i\neq j}B_{i}=\sum_{i\neq j,j^{\prime}}B_{i}+B_{j^{\prime}}=:\Gamma_{\sm jj^{\prime}}+B_{j^{\prime}}. Denote ξjj,l=(Γjj=l)\xi_{\sm jj^{\prime},l}=\mathbb{P}\big{(}\Gamma_{\sm jj^{\prime}}=l\big{)}, we have

(Γj=l)=(1pj)ξjj,l+pjξjj,l1.\displaystyle\mathbb{P}\big{(}\Gamma_{\sm j}=l\big{)}=(1-p_{j^{\prime}})\xi_{\sm jj^{\prime},l}+p_{j^{\prime}}\xi_{\sm jj^{\prime},l-1}.

Then,

QjQj\displaystyle Q_{j}-Q_{j^{\prime}} =l=0d1ζl((1pj)ξjj,l+pjξjj,l1(1pj)ξjj,lpjξjj,l1)\displaystyle=\sum_{l=0}^{d}\frac{1}{\zeta_{l}}\big{(}(1-p_{j^{\prime}})\xi_{\sm jj^{\prime},l}+p_{j^{\prime}}\xi_{\sm jj^{\prime},l-1}-(1-p_{j})\xi_{\sm jj^{\prime},l}-p_{j}\xi_{\sm jj^{\prime},l-1}\big{)}
=(pjpj)(l=0d1ζlξjj,ll=0d1ζlξjj,l1)=(pjpj)(l=0d(1ζl1ζl+1)ξjj,l),\displaystyle=(p_{j}-p_{j^{\prime}})\Big{(}\sum_{l=0}^{d}\frac{1}{\zeta_{l}}\xi_{\sm jj^{\prime},l}-\sum_{l=0}^{d}\frac{1}{\zeta_{l}}\xi_{\sm jj^{\prime},l-1}\Big{)}=(p_{j}-p_{j^{\prime}})\Big{(}\sum_{l=0}^{d}\big{(}\frac{1}{\zeta_{l}}-\frac{1}{\zeta_{l+1}})\xi_{\sm jj^{\prime},l}\Big{)},

where the first equality follows from the fact that (Γj=d)=0\mathbb{P}(\Gamma_{\sm j}=d)=0, and the last equality follows from the fact that ξjj,l=0\xi_{\sm jj^{\prime},l}=0 for l<0l<0 or l=dl=d. Hence, we have QjQjQ_{j}-Q_{j^{\prime}} has the same sign with pjpjp_{j}-p_{j^{\prime}}, and the desirable result then follows.  

Lemma 18

Let Γ\Gamma be a Poisson-binomial random variable with the success probability (pj)j=1,,d(p_{j})_{j=1,\cdots,d}, then for any τ1\tau\geq 1, we have

(j=1dpj+τ)1𝔼(1Γ+τ)(d+1dj=1dpj+τ1)1.(\sum_{j=1}^{d}p_{j}+\tau)^{-1}\leq\mathbb{E}\big{(}\frac{1}{\Gamma+\tau}\big{)}\leq(\frac{d+1}{d}\sum_{j=1}^{d}p_{j}+\tau-1)^{-1}.

Proof  According to the corollary in Chao and Strawderman (1972), we have

𝔼(1Γ+τ)\displaystyle\mathbb{E}\big{(}\frac{1}{\Gamma+\tau}\big{)} =01tτ1PΓ(t)𝑑t=01tτ1(j=1d(1pj+pjt))𝑑t01tτ1(1p¯+p¯t)d𝑑t\displaystyle=\int_{0}^{1}t^{\tau-1}P_{\Gamma}(t)dt=\int_{0}^{1}t^{\tau-1}\big{(}\prod_{j=1}^{d}(1-p_{j}+p_{j}t)\big{)}dt\leq\int_{0}^{1}t^{\tau-1}(1-\bar{p}+\bar{p}t)^{d}dt
=𝔼(1Λ+τ)(d+1dj=1dpj+τ1)1,\displaystyle=\mathbb{E}\big{(}\frac{1}{\Lambda+\tau}\big{)}\leq(\frac{d+1}{d}\sum_{j=1}^{d}p_{j}+\tau-1)^{-1},

where p¯=d1j=1dpj\bar{p}=d^{-1}\sum_{j=1}^{d}p_{j}, the first inequality follows from the inequality of arithmetic and geometric means, the last equality follows from Section 3.1 of Chao and Strawderman (1972), and the last inequality follows from (25) in Wooff (1985). This completes the proof.  


References

  • Ajtai et al. (1983) Miklós Ajtai, János Komlós, and Endre Szemerédi. An O{O}(nlognnlogn) sorting network. In Proceedings of the Fifteenth Annual ACM Symposium on Theory of Computing, pages 1–9, 1983.
  • Assidiq et al. (2008) Abdulhakam AM Assidiq, Othman O Khalifa, Md Rafiqul Islam, and Sheroz Khan. Real time lane detection for autonomous vehicles. In 2008 International Conference on Computer and Communication Engineering, pages 82–88. IEEE, 2008.
  • Badrinarayanan et al. (2017) Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
  • Bao and Sugiyama (2020) Han Bao and Masashi Sugiyama. Calibrated surrogate maximization of linear-fractional utility in binary classification. In International Conference on Artificial Intelligence and Statistics, pages 2337–2347. PMLR, 2020.
  • Bartlett et al. (2005) Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson. Local rademacher complexities. Annals of Statistics, 33(4):1497–1537, 2005.
  • Bartlett et al. (2006) Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
  • Belzunce et al. (2015) Felix Belzunce, Carolina Martinez Riquelme, and Julio Mulero. An Introduction to Stochastic Orders. Academic Press, 2015.
  • Berman et al. (2018) Maxim Berman, Amal Rannen Triki, and Matthew B Blaschko. The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4413–4421, 2018.
  • Bertels et al. (2019) Jeroen Bertels, David Robben, Dirk Vandermeulen, and Paul Suetens. Optimization with soft dice can lead to a volumetric bias. In International MICCAI Brainlesion Workshop, pages 89–97. Springer, 2019.
  • Bice et al. (2021) Noah Bice, Neil Kirby, Ruiqi Li, Dan Nguyen, Tyler Bahr, Christopher Kabat, Pamela Myers, Niko Papanikolaou, and Mohamad Fakhreddine. A sensitivity analysis of probability maps in deep-learning-based anatomical segmentation. Journal of Applied Clinical Medical Physics, 22(8):105–119, 2021.
  • Bracewell and Bracewell (1986) Ronald Newbold Bracewell and Ronald N Bracewell. The Fourier Transform and Its Applications, volume 31999. McGraw-hill New York, 1986.
  • Chao and Strawderman (1972) Min-Te Chao and WE Strawderman. Negative moments of positive random variables. Journal of the American Statistical Association, 67(338):429–431, 1972.
  • Charoenphakdee et al. (2021) Nontawat Charoenphakdee, Jayakorn Vongkulbhisal, Nuttapong Chairatanakul, and Masashi Sugiyama. On focal loss for class-posterior probability estimation: A theoretical perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5202–5211, 2021.
  • Chen et al. (2018) Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 801–818, 2018.
  • Cordts et al. (2016) Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
  • Cortes and Vapnik (1995) Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
  • Cox (1958) David R Cox. The regression analysis of binary sequences. Journal of the Royal Statistical Society: Series B (Methodological), 20(2):215–232, 1958.
  • Crouse (2016) David F Crouse. On implementing 2d rectangular assignment algorithms. IEEE Transactions on Aerospace and Electronic Systems, 52(4):1679–1696, 2016.
  • Cucker and Zhou (2007) Felipe Cucker and Ding Xuan Zhou. Learning theory: an approximation theory viewpoint, volume 24. Cambridge University Press, 2007.
  • D’Ambrosio et al. (2020) Claudia D’Ambrosio, Silvano Martello, and Michele Monaci. Lower and upper bounds for the non-linear generalized assignment problem. Computers & Operations Research, 120:104933, 2020.
  • Edmonds and Karp (1972) Jack Edmonds and Richard M Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM (JACM), 19(2):248–264, 1972.
  • Everingham et al. (2012) M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html, 2012.
  • Giné and Koltchinskii (2006) Evarist Giné and Vladimir Koltchinskii. Concentration inequalities and asymptotic results for ratio type empirical processes. Annals of Probability, 34(3):1143–1216, 2006.
  • Gneiting and Raftery (2007) Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
  • Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321–1330. PMLR, 2017.
  • Hariharan et al. (2011) Bharath Hariharan, Pablo Arbeláez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In 2011 International Conference on Computer Vision, pages 991–998. IEEE, 2011.
  • Hong (2013) Yili Hong. On computing the distribution function for the poisson binomial distribution. Computational Statistics & Data Analysis, 59:41–51, 2013.
  • Jadon (2021) Shruti Jadon. Semsegloss: A python package of loss functions for semantic segmentation. Software Impacts, 9:100078, 2021.
  • Kuhn (1955) Harold W Kuhn. The hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
  • Lin et al. (2017) Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
  • Lin (2004) Yi Lin. A note on margin-based loss functions in classification. Statistics & Probability Letters, 68(1):73–82, 2004.
  • Lipton et al. (2014) Zachary C Lipton, Charles Elkan, and Balakrishnan Naryanaswamy. Optimal thresholding of classifiers to maximize F1 measure. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2014, Nancy, France, September 15-19, 2014. Proceedings, Part II 14, pages 225–239. Springer, 2014.
  • Liu et al. (2021) Sheng Liu, Aakash Kaku, Weicheng Zhu, Matan Leibovich, Sreyas Mohan, Boyang Yu, Haoxiang Huang, Laure Zanna, Narges Razavian, Jonathan Niles-Weed, et al. Deep probability estimation. arXiv preprint arXiv:2111.10734, 2021.
  • Long et al. (2015) Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
  • Milletari et al. (2016) Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pages 565–571. IEEE, 2016.
  • Murty and Kabadi (1985) Katta G Murty and Santosh N Kabadi. Some np-complete problems in quadratic and nonlinear programming. Technical report, 1985.
  • Neammanee (2005) Kritsana Neammanee. A refinement of normal approximation to poisson binomial. International Journal of Mathematics and Mathematical Sciences, 2005(5):717–728, 2005.
  • Nickolls et al. (2008) John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with cuda: Is cuda the parallel programming model that application developers have been waiting for? Queue, 6(2):40–53, 2008.
  • Nordström et al. (2020) Marcus Nordström, Han Bao, Fredrik Löfman, Henrik Hult, Atsuto Maki, and Masashi Sugiyama. Calibrated surrogate maximization of Dice. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part IV 23, pages 269–278. Springer, 2020.
  • Ouali (2022) Yassine Ouali. Github repository: pytorch-segmentation. https://github.com/yassouali/pytorch-segmentation, 2022.
  • Pollard (1984) David Pollard. Convergence of Stochastic Processes. Springer Science & Business Media, 1984.
  • Popordanoska et al. (2021) Teodora Popordanoska, Jeroen Bertels, Dirk Vandermeulen, Frederik Maes, and Matthew B Blaschko. On the relationship between calibrated predictors and unbiased volume estimation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, pages 678–688. Springer, 2021.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
  • Salehi et al. (2017) Seyed Sadegh Mohseni Salehi, Deniz Erdogmus, and Ali Gholipour. Tversky loss function for image segmentation using 3d fully convolutional deep networks. In International Workshop on Machine Learning in Medical Imaging, pages 379–387. Springer, 2017.
  • Shen (1997) Xiaotong Shen. On methods of sieves and penalization. Annals of Statistics, 25(6):2555–2591, 1997.
  • Shou et al. (2017) Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, and Shih-Fu Chang. Cdc: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5734–5743, 2017.
  • Sudre et al. (2017) Carole H Sudre, Wenqi Li, Tom Vercauteren, Sebastien Ourselin, and M Jorge Cardoso. Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 240–248. Springer, 2017.
  • Tahmasebi et al. (2012) Pejman Tahmasebi, Ardeshir Hezarkhani, and Muhammad Sahimi. Multiple-point geostatistical modeling based on the cross-correlation functions. Computational Geosciences, 16(3):779–797, 2012.
  • Tewari and Bartlett (2007) Ambuj Tewari and Peter L Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8(5), 2007.
  • Tomizawa (1971) Nobuaki Tomizawa. On some techniques useful for solution of transportation network problems. Networks, 1(2):173–194, 1971.
  • Wang et al. (2018) Guotai Wang, Maria A Zuluaga, Wenqi Li, Rosalind Pratt, Premal A Patel, Michael Aertsen, Tom Doel, Anna L David, Jan Deprest, Sébastien Ourselin, et al. Deepigeos: a deep interactive geodesic framework for medical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(7):1559–1572, 2018.
  • Wooff (1985) David A Wooff. Bounds on reciprocal moments with applications and developments in stein estimation and post-stratification. Journal of the Royal Statistical Society: Series B (Methodological), 47(2):362–371, 1985.
  • Xin et al. (2018) Yang Xin, Lingshuang Kong, Zhi Liu, Chunhua Wang, Hongliang Zhu, Mingcheng Gao, Chensu Zhao, and Xiaoke Xu. Multimodal feature-level fusion for biometrics identification system on iomt platform. IEEE Access, 6:21418–21426, 2018.
  • Zhang (2004) Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1):56–85, 2004.
  • Zhao et al. (2017) Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2881–2890, 2017.