
thanks: Georgios Fellouris is with the Department of Statistics, the Electrical and Computer Engineering Department, and the Coordinated Science Laboratory, University of Illinois at Urbana–Champaign, Champaign, IL 61820 USA (e-mail: [email protected]). Aristomenis Tsopelakos is with the Electrical and Computer Engineering Department and the Coordinated Science Laboratory, University of Illinois at Urbana–Champaign, Urbana, IL 61801 USA (e-mail: [email protected]). The work of the two authors was supported in part by the NSF under Grant CIF 1514245, through the University of Illinois at Urbana–Champaign. The work of Georgios Fellouris was also supported in part by the NSF under Grant DMS 1737962, through the University of Illinois at Urbana–Champaign.

Sequential anomaly detection
under sampling constraints

Aristomenis Tsopelakos and Georgios Fellouris, Member, IEEE
Abstract

The problem of sequential anomaly detection is considered, where multiple data sources are monitored in real time and the goal is to identify the “anomalous” ones among them, when it is not possible to sample all sources at all times. A detection scheme in this context requires specifying not only when to stop sampling and which sources to identify as anomalous upon stopping, but also which sources to sample at each time instance until stopping. A novel formulation for this problem is proposed, in which the number of anomalous sources is not necessarily known in advance and the number of sampled sources per time instance is not necessarily fixed. Instead, an arbitrary lower bound and an arbitrary upper bound are assumed on the number of anomalous sources, and the ratio of the expected number of samples over the expected time until stopping is required not to exceed an arbitrary, user-specified level. In addition to this sampling constraint, the probabilities of at least one false alarm and at least one missed detection are controlled below user-specified tolerance levels. A general criterion is established for a policy to achieve the minimum expected time until stopping to a first-order asymptotic approximation as the two familywise error rates go to zero. Moreover, asymptotic optimality is established for a family of policies that sample each source at each time instance with a probability that depends on past observations only through the current estimate of the subset of anomalous sources. This family includes, in particular, a novel policy that requires minimal computation under any setup of the problem.

Index Terms:
Active sensing; Anomaly detection; Asymptotic optimality; Controlled sensing; Sequential design of experiments; Sequential detection; Sequential sampling; Sequential testing.

I Introduction

In various engineering and scientific areas data are often collected in real time over multiple streams, and it is of interest to quickly identify those data streams, if any, that exhibit outlying behavior. In brain science, for example, it is desirable to identify groups of cells with large vibration frequency, as this is a symptom of the development of a particular malfunction [1]. In fraud prevention security systems in e-commerce, it is desirable to identify transition links with low transition rate, as this may be an indication that a link is tapped [2]. Such applications, among many others, motivate the study of sequential multiple testing problems where the data for the various hypotheses are generated by distinct sources, there are two hypotheses for each data source, and the goal is to identify as quickly as possible the “anomalous” sources, i.e., those in which the alternative hypothesis is correct. In certain works, e.g., [3, 4, 5, 6, 7, 8], it is assumed that all sources are sampled at each time instance, whereas in others, e.g., [9, 10, 11, 12, 13, 14, 15, 16, 17], only a fixed number of sources (typically, only one) can be sampled at each time instance. In the latter case, apart from when to stop sampling and which data sources to identify as anomalous upon stopping, one also needs to specify which sources to sample at every time instance until stopping.

The latter problem, which is often called sequential anomaly detection in the literature, can be viewed as a special case of the sequential multi-hypothesis testing problem with controlled sensing (or observation control), where the goal is to solve a sequential multi-hypothesis testing problem while taking at each time instance an action that influences the distribution of the observations [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28]. In the anomaly detection case, the action is the selection of the sources to be sampled, whereas the hypotheses correspond to the possible subsets of anomalous sources. Therefore, policies and results in the context of sequential multi-hypothesis testing with controlled sensing are applicable, in principle at least, to the sequential anomaly detection problem. Such a policy was first proposed in [18] in the case of two hypotheses, and subsequently generalized in [20, 21] to the case of an arbitrary, finite number of hypotheses. When applied to the sequential anomaly detection problem, this policy samples each subset of sources of the allowed size at each time instance with a certain probability that depends on the past observations only through the currently estimated subset of anomalous sources.

In general, the implementation of the policy in [20] requires solving, for each subset of anomalous sources, a linear system where the number of equations is equal to the number of sources and the number of unknowns is equal to the number of all subsets of sources of the allowed size. Moreover, its asymptotic optimality has been established only under restrictive assumptions, such as when the following hold simultaneously: it is known a priori that there is only one anomalous source, it is possible to sample only one source at a time, the testing problems are identical, and the sources generate Bernoulli random variables under each hypothesis [21, Appendix A]. To avoid such restrictions, it has been proposed to modify the policy in [20] at an appropriate subsequence of time instances, at which each subset of sources of the allowed size is sampled with the same probability [18, Remark 7],[25]. Such a modified policy was shown in [25] to always be asymptotically optimal, as long as the log-likelihood ratio statistic of each observation has a finite second moment.

A goal of the present work is to show that the unmodified policy in [20] is always asymptotically optimal in the context of the above sequential anomaly detection problem, as long as the log-likelihood ratio statistic of each observation has a finite first moment. However, our main goal in this paper is to propose a more general framework for the problem of sequential anomaly detection with sampling constraints that (i) does not rely on the restrictive assumption that the number of anomalous sources is known in advance, (ii) allows for two distinct error constraints and captures the asymmetry between a false alarm, i.e., falsely identifying a source as anomalous, and a missed detection, i.e., failing to detect an anomalous source, and most importantly, (iii) relaxes the hard sampling constraint that the same number of sources must be sampled at each time instance, and (iv) admits an asymptotically optimal solution that is convenient to implement under any setup of the problem.

To be more specific, in this paper we assume an arbitrary, user-specified lower bound and an arbitrary, user-specified upper bound on the number of anomalous sources. This setup includes the case of no prior information, the case where the number of anomalous sources is known in advance, as well as more realistic cases of prior information, such as when there is only a non-trivial upper bound on the number of anomalous sources. Moreover, we require control of the probabilities of at least one false alarm and at least one missed detection below arbitrary, user-specified levels. Both these features are taken into account in [6] when all sources are observed at all times. Thus, the present paper can be seen as a generalization of [6] to the case that it is not possible to observe all sources at all times. However, instead of demanding that the number of sampled sources per time instance be fixed, as in [9, 10, 11, 14, 12, 13, 15, 16, 17], we only require that the ratio of the expected number of observations over the expected time until stopping not exceed a user-specified level. This leads to a more general formulation for sequential anomaly detection compared to those in [9, 10, 11, 14, 12, 13, 15, 16, 17], which at the same time is not a special case of the sequential multi-hypothesis testing problem with controlled sensing in [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28]. Thus, while existing policies in the literature are applicable to the proposed setup, this is not the case for the existing universal lower bounds.

Our first main result on the proposed problem is a criterion for a policy (that employs the stopping and decision rules in [6] and satisfies the sampling constraint) to achieve the optimal expected time until stopping to a first-order asymptotic approximation as the two familywise error probabilities go to 0. Indeed, we show that such a policy is asymptotically optimal in the above sense, if it samples each source with a certain minimum long-run frequency that depends on the source itself and the true subset of anomalous sources. Our second main result is that the latter condition holds, simultaneously for every possible scenario regarding the anomalous sources, by policies that sample each source at each time instance with a probability that is not smaller than the above minimum long-run frequency that corresponds to the current estimate of the subset of anomalous sources. This implies the asymptotic optimality of the unmodified policy in [20], as well as of a much simpler policy according to which the sources are sampled at each time instance conditionally independent of one another given the current estimate of the subset of anomalous sources. Indeed, the implementation of the latter policy, unlike that in [20], involves minimal computational and storage requirements under any setup of the problem. Moreover, we present simulation results that suggest that this computational simplicity does not come at the price of performance deterioration (relative to the policy in [20]).

Finally, to illustrate the gains of asymptotic optimality, we consider the straightforward policy in which the sources are sampled in tandem. We compute its asymptotic relative efficiency and we show that it is asymptotically optimal only in a very special setup of the problem. Moreover, our simulation results suggest that, apart from this special setup, its actual performance loss relative to the above asymptotically optimal rules is (much) larger than the one implied by its asymptotic relative efficiency when the target error probabilities are not (very) small.

The remainder of the paper is organized as follows: in Section II we formulate the proposed problem. In Section III we present a family of policies that satisfy the error constraints, and we introduce an auxiliary consistency property. In Section IV we introduce a family of sampling rules on which we focus in this paper. In Section V we present the asymptotic optimality theory of this work. In Section VI we discuss alternative sampling approaches in the literature, and in Section VII we present the results of our simulation studies. In Section VIII we conclude and discuss potential generalizations, as well as directions for further research. The proofs of all main results are presented in Appendices B, C, D, whereas in Appendix A we state and prove two supporting lemmas.

We end this section with some notations we use throughout the paper. We use $:=$ to indicate the definition of a new quantity and $\equiv$ to indicate a duplication of notation. We set $\mathbb{N}:=\{1,2,\ldots\}$ and $[n]:=\{1,\ldots,n\}$ for $n\in\mathbb{N}$, we denote by $A^{c}$ the complement, by $|A|$ the size and by $2^{A}$ the powerset of a set $A$, by $\lfloor a\rfloor$ the floor and by $\lceil a\rceil$ the ceiling of a positive number $a$, and by $\mathbf{1}$ the indicator of an event. We write $x\sim y$ when $\lim(x/y)=1$, $x\gtrsim y$ when $\liminf(x/y)\geq 1$, and $x\lesssim y$ when $\limsup(x/y)\leq 1$, where the limit is taken in some sense that will be specified. Moreover, iid stands for independent and identically distributed, and we say that a sequence of positive numbers $(a_{n})$ is summable if $\sum_{n=1}^{\infty}a_{n}<\infty$ and exponentially decaying if there are real numbers $c,d>0$ such that $a_{n}\leq c\exp\{-d\,n\}$ for every $n\in\mathbb{N}$. A property that we use in our proofs is that if $(a_{n})$ is exponentially decaying, so is the sequence $(\sum_{m\geq\zeta n}a_{m})$, for any $\zeta\in(0,1]$.

II Problem formulation

Let $(\mathbb{S},\mathcal{S})$ be an arbitrary measurable space and let $(\Omega,\mathcal{F},\mathsf{P})$ be a probability space that hosts $M$ independent sequences of iid $\mathbb{S}$-valued random elements, $\{X_{i}(n):n\in\mathbb{N}\}$, $i\in[M]$, which are generated by $M$ distinct data sources, as well as an independent sequence of iid random vectors, $\{Z(n):n=0,1,\ldots\}$, to be used for randomization purposes. Specifically, each $Z(n):=(Z_{0}(n),Z_{1}(n),\ldots,Z_{M}(n))$ is a vector of independent random variables, uniform in $(0,1)$, and each $X_{i}(n)$ has density $f_{i}$, with respect to some $\sigma$-finite measure $\nu_{i}$, that is equal to either $f_{1i}$ or $f_{0i}$. For every $i\in[M]$ we say that source $i$ is “anomalous” if $f_{i}=f_{1i}$ and we assume that the Kullback-Leibler divergences of $f_{1i}$ and $f_{0i}$ are positive and finite, i.e.,

I_{i}:=\int_{\mathbb{S}}\log(f_{1i}/f_{0i})\,f_{1i}\,d\nu_{i}\in(0,\infty),\qquad J_{i}:=\int_{\mathbb{S}}\log(f_{0i}/f_{1i})\,f_{0i}\,d\nu_{i}\in(0,\infty). \quad (1)

We assume that it is known a priori that there are at least $\ell$ and at most $u$ anomalous sources, where $\ell$ and $u$ are given, user-specified integers such that $0\leq\ell\leq u\leq M$, with the understanding that if $\ell=u$, then $0<\ell<M$. Thus, the family of all possible subsets of anomalous sources is $\mathcal{P}_{\ell,u}:=\{A\subseteq[M]:\ell\leq|A|\leq u\}$. In what follows, we denote by $\mathsf{P}_{A}$ the underlying probability measure and by $\mathsf{E}_{A}$ the corresponding expectation when the subset of anomalous sources is $A\in\mathcal{P}_{\ell,u}$, and we simply write $\mathsf{P}$ and $\mathsf{E}$ whenever the identity of the subset of anomalous sources is not relevant.

The problem we consider in this work is the identification of the anomalous sources, if any, on the basis of sequentially acquired observations from all sources, when however it is not possible to observe all of them at every sampling instance. Specifically, we have to specify a random time $T$, at which sampling is terminated, and two random sequences, $R:=\{R(n),\,n\geq 1\}$ and $\Delta:=\{\Delta(n),\,n\in\mathbb{N}\}$, so that $R(n)\subseteq[M]$ represents the subset of sources that are sampled at time $n$ when $n\leq T$, and $\Delta(n)\equiv\Delta_{n}\in\mathcal{P}_{\ell,u}$ represents the subset of sources that are identified as anomalous when $T=n$. The decisions whether to stop or not at each time instance, which sources to sample next in the latter case, and which ones to identify as anomalous in the former, must be based on the already available information. Thus, we say that $R$ is a sampling rule if $R(n)$ is $\mathcal{F}_{n-1}^{R}$-measurable for every $n\in\mathbb{N}$, where

\mathcal{F}_{n}^{R}:=\sigma\left(\mathcal{F}_{n-1}^{R},\,Z(n),\,\{X_{i}(n):i\in R(n)\}\right),\quad n\in\mathbb{N},\qquad \mathcal{F}_{0}^{R}:=\sigma(Z(0)). \quad (2)

Moreover, we say that the triplet $(R,T,\Delta)$ is a policy if $R$ is a sampling rule, $\{T=n\}\in\mathcal{F}^{R}_{n}$ and $\Delta_{n}$ is $\mathcal{F}^{R}_{n}$-measurable for every $n\in\mathbb{N}$, in which case we refer to $T$ as a stopping rule and to $\Delta$ as a decision rule. For any sampling rule $R$, we denote by $R_{i}(n)$ the indicator of whether source $i$ is sampled at time $n$, i.e., $R_{i}(n):=\mathbf{1}\{i\in R(n)\}$, and by $N_{i}^{R}(n)$ (resp. $\pi_{i}^{R}(n)$) the number (resp. proportion) of times source $i$ is sampled in the first $n$ time instances, i.e.,

N^{R}_{i}(n):=\sum_{m=1}^{n}R_{i}(m),\qquad \pi_{i}^{R}(n):=N^{R}_{i}(n)/n.

We say that a policy $(R,T,\Delta)$ belongs to class $\mathcal{C}(\alpha,\beta,\ell,u,K)$ if its probabilities of at least one false alarm and at least one missed detection upon stopping do not exceed $\alpha$ and $\beta$ respectively, i.e.,

\mathsf{P}_{A}\left(T<\infty,\,\Delta_{T}\setminus A\neq\emptyset\right)\leq\alpha\quad\forall\,A\in\mathcal{P}_{\ell,u},\qquad \mathsf{P}_{A}\left(T<\infty,\,A\setminus\Delta_{T}\neq\emptyset\right)\leq\beta\quad\forall\,A\in\mathcal{P}_{\ell,u}, \quad (3)

where $\alpha,\beta$ are user-specified numbers in $(0,1)$, and the ratio of its expected total number of observations over its expected time until stopping does not exceed $K$, i.e.,

\sum_{i=1}^{M}\mathsf{E}\left[N_{i}^{R}(T)\right]\leq K\,\mathsf{E}[T], \quad (4)

where $K$ is a user-specified, real number in $(0,M]$. Note that, in view of the identity

\sum_{i=1}^{M}\mathsf{E}\left[N_{i}^{R}(T)\right]=\mathsf{E}\left[\sum_{n=1}^{T}\mathsf{E}\left[|R(n)|\;\big|\;\mathcal{F}_{n-1}^{R}\right]\right],

constraint (4) is clearly satisfied when

\sup_{n\leq T}\,\mathsf{E}\left[|R(n)|\;\big|\;\mathcal{F}_{n-1}^{R}\right]\leq K. \quad (5)

This is the case, for example, when at most $\lfloor K\rfloor$ sources are sampled at each time instance up to stopping, i.e., when $|R(n)|\leq\lfloor K\rfloor$ for every $n\leq T$.

Our main goal in this work is, for any given $\ell,u,K$, to obtain policies that attain the smallest possible expected time until stopping,

\mathcal{J}_{A}(\alpha,\beta,\ell,u,K):=\inf_{(R,T,\Delta)\in\mathcal{C}(\alpha,\beta,\ell,u,K)}\;\mathsf{E}_{A}[T], \quad (6)

simultaneously under every $A\in\mathcal{P}_{\ell,u}$, to a first-order asymptotic approximation as $\alpha$ and $\beta$ go to 0. Specifically, when $\ell=u$, we allow $\alpha$ and $\beta$ to go to 0 at arbitrary rates, but when $\ell<u$, we assume that

\exists\;r\in(0,\infty):\quad|\log\alpha|\sim r\,|\log\beta|. \quad (7)

III A family of policies

In this section we introduce the statistics that we use in this work, a family of policies that satisfy the error constraint (3), as well as an auxiliary consistency property.

III-A Log-likelihood ratio statistics

Let $A,C\in\mathcal{P}_{\ell,u}$ and $n\in\mathbb{N}$. We denote by $\Lambda^{R}_{A,C}(n)$ the log-likelihood ratio of $\mathsf{P}_{A}$ versus $\mathsf{P}_{C}$ based on the first $n$ time instances when the sampling rule is $R$, i.e.,

\Lambda^{R}_{A,C}(n):=\log\frac{d\mathsf{P}_{A}}{d\mathsf{P}_{C}}\left(\mathcal{F}^{R}_{n}\right), \quad (8)

and we observe that it admits the following recursion:

\Lambda^{R}_{A,C}(n)=\Lambda^{R}_{A,C}(n-1)+\sum_{i\in A\setminus C}g_{i}(X_{i}(n))\,R_{i}(n)-\sum_{j\in C\setminus A}g_{j}(X_{j}(n))\,R_{j}(n), \quad (9)

where $\Lambda^{R}_{A,C}(0):=0$ and $g_{i}:=\log\left(f_{1i}/f_{0i}\right)$, $i\in[M]$. Indeed, this follows by (2) and the fact that $R(n)$ is $\mathcal{F}_{n-1}^{R}$-measurable, $X_{i}(n)$ is independent of $\mathcal{F}_{n-1}^{R}$ and its density under $\mathsf{P}_{A}$ is $f_{1i}$ if $i\in A$ and $f_{0i}$ if $i\notin A$, and $Z(n)$ is independent of both $\mathcal{F}_{n-1}^{R}$ and $\{X_{i}(n):i\in[M]\}$ and has the same density under $\mathsf{P}_{A}$ and $\mathsf{P}_{C}$.
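To make the recursion concrete, the following minimal Python sketch performs one update of the local statistics in (10); as an illustration we assume Gaussian observations with unit variance, mean 0 under $f_{0i}$ and mean $\mu_{i}$ under $f_{1i}$, so that $g_{i}(x)=\mu_{i}x-\mu_{i}^{2}/2$. The densities, the means and all variable names are our own assumptions, not part of the model above.

```python
import numpy as np

def llr_increment(x, mu):
    """g_i(x) = log(f_1i(x)/f_0i(x)) for the Gaussian example assumed above."""
    return mu * x - mu ** 2 / 2.0

def update_llrs(llr, obs, sampled, mu):
    """One step of (10): add g_i(X_i(n)) only for the sources i sampled at time n."""
    llr = llr.copy()
    for i in sampled:
        llr[i] += llr_increment(obs[i], mu[i])
    return llr

# Example: M = 4 sources, sources {0, 2} sampled at this time instance.
M = 4
mu = np.array([1.0, 0.5, 1.0, 0.5])
rng = np.random.default_rng(0)
obs = rng.normal(size=M)            # X_i(n); only the sampled entries are used
llr = update_llrs(np.zeros(M), obs, sampled={0, 2}, mu=mu)
```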

For any $i,j\in[M]$ we write

\Lambda^{R}_{A,C}(n)\equiv\Lambda^{R}_{ij}(n)\quad\text{when}\quad A=\{i\},\;C=\{j\},\qquad \Lambda^{R}_{A,C}(n)\equiv\Lambda^{R}_{i}(n)\quad\text{when}\quad A=\{i\},\;C=\emptyset,

and we observe that the above recursion implies that

\Lambda^{R}_{i}(n)=\sum_{m=1}^{n}g_{i}(X_{i}(m))\,R_{i}(m), \quad (10)

\Lambda^{R}_{A,C}(n)=\sum_{i\in A\setminus C}\Lambda^{R}_{i}(n)-\sum_{j\in C\setminus A}\Lambda^{R}_{j}(n). \quad (11)

In particular, for any $i,j\in[M]$ we have

\Lambda^{R}_{ij}(n)=\Lambda^{R}_{i}(n)-\Lambda^{R}_{j}(n).

In what follows, we refer to $\Lambda^{R}_{i}(n)$ as the local log-likelihood ratio (LLR) of source $i$ at time $n$. We introduce the order statistics of the LLRs at time $n$,

\Lambda^{R}_{(1)}(n)\geq\ldots\geq\Lambda^{R}_{(M)}(n),

and we denote by $w^{R}_{i}(n)$, $i\in[M]$, the corresponding indices, i.e.,

\Lambda^{R}_{(i)}(n):=\Lambda^{R}_{w^{R}_{i}(n)}(n),\quad i\in[M].

Moreover, we denote by $p^{R}(n)$ the number of positive LLRs at time $n$, i.e.,

p^{R}(n):=\sum_{i=1}^{M}\mathbf{1}\{\Lambda^{R}_{i}(n)>0\},

and we also set

\Lambda^{R}_{(0)}(n):=+\infty,\qquad \Lambda^{R}_{(M+1)}(n):=-\infty.
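The following short Python sketch (ours; variable names assumed) computes the ordered LLRs, the corresponding indices $w^{R}_{i}(n)$ and the count $p^{R}(n)$ from a vector of current LLR values.

```python
import numpy as np

def summarize_llrs(llr):
    """Return Lambda_(1) >= ... >= Lambda_(M), the indices w_i(n), and p(n)."""
    order = np.argsort(-llr)              # w_1(n), ..., w_M(n)
    ordered = llr[order]                  # Lambda_(1)(n) >= ... >= Lambda_(M)(n)
    num_positive = int(np.sum(llr > 0))   # p(n)
    return ordered, order, num_positive

ordered, order, p = summarize_llrs(np.array([0.7, -1.2, 2.3, 0.1]))
```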

III-B Stopping and decision rules

We next show that for any sampling rule $R$ there is a stopping rule, which we will denote by $T^{R}$, and a decision rule, which we will denote by $\Delta^{R}$, such that the policy $(R,T^{R},\Delta^{R})$ satisfies the error constraint (3). Their forms depend on whether the number of anomalous sources is known in advance or not, i.e., on whether $\ell=u$ or $\ell<u$. Specifically, when $\ell=u$, we stop as soon as the $\ell^{th}$ largest LLR exceeds the next one by at least $c>0$, i.e.,

T^{R}:=\inf\left\{n\in\mathbb{N}:\;\Lambda^{R}_{(\ell)}(n)-\Lambda^{R}_{(\ell+1)}(n)\geq c\right\}, \quad (12)

and we identify as anomalous the sources with the $\ell$ largest LLRs, i.e.,

\Delta^{R}_{n}:=\left\{w^{R}_{1}(n),\ldots,w^{R}_{\ell}(n)\right\},\quad n\in\mathbb{N}. \quad (13)

When the number of anomalous sources is completely unknown ($\ell=0$ and $u=M$), we stop as soon as the value of every LLR is outside $(-a,b)$ for some $a,b>0$, i.e.,

T^{R}:=\inf\left\{n\in\mathbb{N}:\;\Lambda^{R}_{i}(n)\notin(-a,b)\;\;\text{for all}\;\;i\in[M]\right\}, \quad (14)

and we identify as anomalous the sources with positive LLRs, i.e.,

\Delta^{R}_{n}:=\left\{i\in[M]:\;\Lambda^{R}_{i}(n)>0\right\},\quad n\in\mathbb{N}. \quad (15)

When $\ell<u$, in general, we combine the stopping rules of the two previous cases and we set

T^{R}:=\inf\{n\in\mathbb{N}:\;\text{either}\;\;\Lambda^{R}_{(\ell+1)}(n)\leq-a\;\;\&\;\;\Lambda^{R}_{(\ell)}(n)-\Lambda^{R}_{(\ell+1)}(n)\geq c,\;\;\text{or}\;\;\ell\leq p^{R}(n)\leq u\;\;\&\;\;\Lambda^{R}_{i}(n)\notin(-a,b)\;\;\forall\,i\in[M],\;\;\text{or}\;\;\Lambda^{R}_{(u)}(n)\geq b\;\;\&\;\;\Lambda^{R}_{(u)}(n)-\Lambda^{R}_{(u+1)}(n)\geq d\}, \quad (16)

where $a,b,c,d>0$, and we use the following decision rule:

\Delta^{R}_{n}:=\left\{w^{R}_{i}(n):\;i=1,\ldots,(p^{R}(n)\vee\ell)\wedge u\right\},\quad n\in\mathbb{N}. \quad (17)

That is, we identify as anomalous the sources with positive LLRs as long as their number is between $\ell$ and $u$. If this number is larger than $u$ (resp. smaller than $\ell$), then we declare as anomalous the sources with the $u$ (resp. $\ell$) largest LLRs.
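As an illustration, the next Python sketch implements our reading of the stopping rule (16) and the decision rule (17) for the case $\ell<u$, using the conventions $\Lambda^{R}_{(0)}(n)=+\infty$ and $\Lambda^{R}_{(M+1)}(n)=-\infty$; the function and variable names are ours, and the thresholds $a,b,c,d$ are passed in as arguments.

```python
import numpy as np

def should_stop(ordered, l, u, a, b, c, d):
    """Stopping condition (16); `ordered` is Lambda_(1) >= ... >= Lambda_(M)."""
    M = len(ordered)
    pad = np.concatenate(([np.inf], ordered, [-np.inf]))   # Lambda_(0), ..., Lambda_(M+1)
    p = int(np.sum(ordered > 0))                            # p(n)
    cond1 = pad[l + 1] <= -a and pad[l] - pad[l + 1] >= c
    cond2 = l <= p <= u and np.all((ordered <= -a) | (ordered >= b))
    cond3 = pad[u] >= b and pad[u] - pad[u + 1] >= d
    return cond1 or cond2 or cond3

def decide(ordered, order, l, u):
    """Decision rule (17): the sources with the (p v l) ^ u largest LLRs."""
    p = int(np.sum(ordered > 0))
    k = min(max(p, l), u)
    return set(order[:k])
```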

Proposition III.1

Let $R$ be an arbitrary sampling rule.

  • When $\ell=u$, $(R,T^{R},\Delta^{R})$ satisfies the error constraint (3) if

    c=|\log(\alpha\wedge\beta)|+\log(\ell(M-\ell)). \quad (18)

  • When $\ell<u$, $(R,T^{R},\Delta^{R})$ satisfies the error constraint (3) if

    a=|\log\beta|+\log M,\quad c=|\log\alpha|+\log((M-\ell)M),\quad b=|\log\alpha|+\log M,\quad d=|\log\beta|+\log(uM). \quad (19)

Proof:

When all sources are sampled at all times, this is shown in [6, Theorems 3.1, 3.2]. The same proof applies when $K<M$, for any sampling rule $R$.

In view of the above result, in what follows we assume that the thresholds in $T^{R}$ are selected according to (18) when $\ell=u$ and according to (19) when $\ell<u$. While this is a rather conservative choice, it will be sufficient for obtaining asymptotically optimal policies. For this reason, in what follows we say that a sampling rule $R$ that satisfies the sampling constraint (4) with $T=T^{R}$ is asymptotically optimal under $\mathsf{P}_{A}$, for some $A\in\mathcal{P}_{\ell,u}$, if

\mathsf{E}_{A}\left[T^{R}\right]\sim\mathcal{J}_{A}(\alpha,\beta,\ell,u,K)

as $\alpha$ and $\beta$ go to 0, at arbitrary rates when $\ell=u$ and so that (7) holds when $\ell<u$. We simply say that $R$ is asymptotically optimal if it is asymptotically optimal under $\mathsf{P}_{A}$ for every $A\in\mathcal{P}_{\ell,u}$. We next introduce a weaker property, which will be useful for establishing asymptotic optimality.

III-C Exponential consistency

For any sampling rule $R$ and any subset $A\in\mathcal{P}_{\ell,u}$ we denote by $\sigma_{A}^{R}$ the random time starting from which the sources in $A$ are the ones estimated as anomalous by $\Delta^{R}$, i.e.,

\sigma^{R}_{A}:=\inf\left\{n\in\mathbb{N}:\;\Delta^{R}_{m}=A\;\;\text{for all}\;\;m\geq n\right\}, \quad (20)

and we say that $R$ is exponentially consistent under $\mathsf{P}_{A}$ if $\mathsf{P}_{A}(\sigma_{A}^{R}>n)$ is an exponentially decaying sequence. We simply say that $R$ is exponentially consistent if it is exponentially consistent under $\mathsf{P}_{A}$ for every $A\in\mathcal{P}_{\ell,u}$. The following theorem states sufficient conditions for exponential consistency under $\mathsf{P}_{A}$.

Theorem III.1

Let $A\in\mathcal{P}_{\ell,u}$ and let $R$ be an arbitrary sampling rule.

  • When $\ell<u$, $R$ is exponentially consistent under $\mathsf{P}_{A}$ if there exists a $\rho>0$ such that $\mathsf{P}_{A}\left(\pi^{R}_{i}(n)<\rho\right)$ is exponentially decaying for every $i\in A$ if $|A|>\ell$, and for every $i\notin A$ if $|A|<u$.

  • When $\ell=u$, $R$ is exponentially consistent under $\mathsf{P}_{A}$ if there exists a $\rho>0$ such that $\mathsf{P}_{A}\left(\pi^{R}_{i}(n)<\rho\right)$ is exponentially decaying either for every $i\in A$ or for every $i\notin A$.

Proof:

Appendix B.

Remark: Theorem III.1 reveals that when $|A|=\ell>0$ or $|A|=u<M$, it is possible to achieve exponential consistency under $\mathsf{P}_{A}$ without sampling certain sources at all. Specifically, when $|A|=\ell<u$ (resp. $|A|=u>\ell$) it is not necessary to sample any source in $A$ (resp. $A^{c}$). On the other hand, when $|A|=\ell=u$, it suffices to sample either all sources in $A$ or all sources in $A^{c}$.

IV Probabilistic sampling rules

In this section we introduce a family of sampling rules and we show how to design them in order to satisfy the sampling constraint (5) and to be exponentially consistent. Specifically, we say that a sampling rule $R$ is probabilistic if there exists a function $q^{R}:2^{[M]}\times\mathcal{P}_{\ell,u}\to[0,1]$ such that for every $n\in\mathbb{N}$, $D\in\mathcal{P}_{\ell,u}$, and $B\subseteq[M]$ we have

q^{R}\left(B;D\right):=\mathsf{P}\left(R(n+1)=B\,|\,\mathcal{F}^{R}_{n},\,\Delta^{R}_{n}=D\right), \quad (21)

i.e., $q^{R}\left(B;D\right)$ is the probability that $B$ is the subset of sampled sources when $D$ is the currently estimated subset of anomalous sources. If $R$ is a probabilistic sampling rule, then for every $i\in[M]$ and $D\in\mathcal{P}_{\ell,u}$ we set

c^{R}_{i}\left(D\right):=\mathsf{P}\left(R_{i}(n+1)=1\,|\,\mathcal{F}^{R}_{n},\,\Delta^{R}_{n}=D\right)=\sum_{B\subseteq[M]:\,i\in B}q^{R}\left(B;D\right), \quad (22)

i.e., $c^{R}_{i}(D)$ is the probability that source $i$ is sampled when $D$ is the currently estimated subset of anomalous sources.

We refer to a probabilistic sampling rule $R$ as Bernoulli if for every $D\in\mathcal{P}_{\ell,u}$ and $B\subseteq[M]$ we have

q^{R}(B;D)=\prod_{i\in B}c^{R}_{i}(D)\prod_{j\notin B}\left(1-c^{R}_{j}(D)\right), \quad (23)

i.e., if the sources are sampled at each time instance conditionally independently of one another given the currently estimated subset of anomalous sources. Indeed, such a sampling rule admits a representation of the form

R_{i}(n+1)=\mathbf{1}\left\{Z_{i}(n)\leq c^{R}_{i}\left(\Delta_{n}^{R}\right)\right\},\quad i\in[M],\;n\in\mathbb{N}, \quad (24)

where $Z_{1}(n),\ldots,Z_{M}(n)$ are iid and uniform in $(0,1)$; thus, its implementation at each time instance requires the realization of $M$ Bernoulli random variables.
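A minimal Python sketch of the representation (24) is given below; the map from the current estimate to the sampling probabilities $c^{R}_{i}(D)$ is left abstract, and the illustrative numbers in the usage example are our own.

```python
import numpy as np

def bernoulli_sample(c, estimate, rng):
    """One step of (24): sample source i independently with probability c_i(D)."""
    probs = c(estimate)                     # (c_1^R(D), ..., c_M^R(D))
    z = rng.uniform(size=len(probs))        # the randomization variables Z_i(n)
    return {i for i, (zi, pi) in enumerate(zip(z, probs)) if zi <= pi}

# Usage: sample with probability 0.5 any source currently estimated as anomalous
# and with probability 0.25 any other source (illustrative numbers only).
rng = np.random.default_rng(1)
c = lambda D: np.array([0.5 if i in D else 0.25 for i in range(4)])
sampled = bernoulli_sample(c, estimate={0, 2}, rng=rng)
```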

The following proposition provides a sufficient condition for a probabilistic sampling rule to satisfy the sampling constraint (5), and consequently (4), for any $\{\mathcal{F}_{n}^{R}\}$-stopping time $T$.

Proposition IV.1

If $R$ is a probabilistic sampling rule such that

\sum_{i=1}^{M}c^{R}_{i}(D)\leq K\quad\text{for every}\;\;D\in\mathcal{P}_{\ell,u}, \quad (25)

then (5) holds for any $\{\mathcal{F}_{n}^{R}\}$-stopping time $T$.

Proof:

For any $n\in\mathbb{N}$ and any probabilistic sampling rule $R$ we have

\mathsf{E}\left[|R(n)|\;\big|\;\mathcal{F}_{n-1}^{R}\right]=\sum_{i=1}^{M}\mathsf{P}\left(R_{i}(n)=1\;\big|\;\mathcal{F}_{n-1}^{R}\right)=\sum_{i=1}^{M}c^{R}_{i}(\Delta_{n-1}^{R}).

As a result, if (25) is satisfied, then (5) holds for any $\{\mathcal{F}_{n}^{R}\}$-stopping time $T$.

Finally, we establish sufficient conditions for the exponential consistency of a probabilistic sampling rule.

Theorem IV.1

Let $R$ be a probabilistic sampling rule.

  • When $\ell<u$, $R$ is exponentially consistent if, for every $D\in\mathcal{P}_{\ell,u}$, $c_{i}^{R}(D)$ is positive for every $i\in D$ if $|D|>\ell$, and for every $i\notin D$ if $|D|<u$.

  • When $\ell=u$, $R$ is exponentially consistent if, for every $D\in\mathcal{P}_{\ell,u}$, $c_{i}^{R}(D)$ is positive either for every $i\in D$ or for every $i\notin D$.

Proof:

The proof consists in showing that the sufficient conditions of Theorem III.1 are satisfied for every $A\in\mathcal{P}_{\ell,u}$, and is presented in Appendix B.

Remark: Theorem IV.1 implies that when $\ell<u$, a probabilistic sampling rule is exponentially consistent if, whenever the number of estimated anomalous sources is larger than $\ell$ (resp. smaller than $u$), it samples with positive probability any source that is currently estimated as anomalous (resp. non-anomalous). When $\ell=u$, on the other hand, it suffices to sample with positive probability at any time instance either every source that is currently estimated as anomalous or every source that is currently estimated as non-anomalous.

Remark: Unlike the proof of Proposition IV.1, the proof of Theorem IV.1 is quite challenging, and it is one of the main contributions of this work from a technical point of view. It relies on two Lemmas, B.1 and B.2, which we state in Appendix B. The proof becomes much easier if we strengthen the assumption of the theorem and require that every source is sampled with positive probability at every time instance, i.e., that $c_{i}^{R}(D)>0$ for every $i\in[M]$ and $D\in\mathcal{P}_{\ell,u}$. Indeed, in this case Lemma B.2 becomes redundant, the proof simplifies considerably, and it essentially relies on ideas developed in [18]. However, such an assumption would limit considerably the scope of the asymptotic optimality theory we develop in the next section.

V Asymptotic optimality

In this section we present the asymptotic optimality theory of this work and discuss some of its implications. For this, we first need to introduce some additional notation.

V-A Notation

For $A\subseteq[M]$ with $A\neq\emptyset$ we set

I^{*}_{A}:=\min_{i\in A}I_{i},\qquad I_{A}:=\frac{|A|}{\sum_{i\in A}(1/I_{i})},\qquad \hat{K}_{A}:=|A|\,\frac{I^{*}_{A}}{I_{A}}, \quad (26)

and for $A\subseteq[M]$ with $A\neq[M]$ we set

J^{*}_{A}:=\min_{i\notin A}J_{i},\qquad J_{A}:=\frac{|A^{c}|}{\sum_{i\notin A}(1/J_{i})},\qquad \check{K}_{A}:=|A^{c}|\,\frac{J^{*}_{A}}{J_{A}}. \quad (27)

That is, $I^{*}_{A}$ is the minimum and $I_{A}$ the harmonic mean of the Kullback-Leibler information numbers in $\{I_{i}:i\in A\}$, whereas $J^{*}_{A}$ is the minimum and $J_{A}$ the harmonic mean of the Kullback-Leibler information numbers in $\{J_{i}:i\notin A\}$. Moreover, $\hat{K}_{A}$ (resp. $\check{K}_{A}$) is a positive real number smaller than or equal to the size of $A$ (resp. $A^{c}$), with equality attained when $I_{i}=I$ (resp. $J_{i}=J$) for every $i$ in $A$ (resp. $A^{c}$).

Moreover, for $A\subseteq[M]$ with $0<|A|<M$ we set

\theta_{A}:=I^{*}_{A}/J^{*}_{A}; \quad (28)

thus, $\theta_{A}$ is a positive real number that is equal to 1 when $I_{i}=J_{i}$ for every $i\in[M]$.

In Theorem V.1 we also introduce, for each $A\in\mathcal{P}_{\ell,u}$, two quantities,

x_{A}(r,\ell,u,K)\quad\text{and}\quad y_{A}(r,\ell,u,K), \quad (29)

which will play an important role in the formulation of all results in this section. Although their values depend on $r,\ell,u,K$, in order to lighten the notation we simply write $x_{A}$ and $y_{A}$, unless we want to emphasize their dependence on one or more of these parameters.

V-B A universal asymptotic lower bound

Theorem V.1

Let $A\in\mathcal{P}_{\ell,u}$.

(i) Suppose that $\ell=u$. Then, as $\alpha,\beta\to 0$ we have

\mathcal{J}_{A}(\alpha,\beta,\ell,u,K)\gtrsim\frac{|\log(\alpha\wedge\beta)|}{x_{A}\,I^{*}_{A}+y_{A}\,J^{*}_{A}}, \quad (30)

where, if $\hat{K}_{A}\leq\theta_{A}\check{K}_{A}$,

x_{A}:=(K/\hat{K}_{A})\wedge 1,\qquad y_{A}:=\bigl((K-\hat{K}_{A})^{+}/\check{K}_{A}\bigr)\wedge 1, \quad (31)

and if $\hat{K}_{A}>\theta_{A}\check{K}_{A}$,

x_{A}:=\bigl((K-\check{K}_{A})^{+}/\hat{K}_{A}\bigr)\wedge 1,\qquad y_{A}:=(K/\check{K}_{A})\wedge 1. \quad (32)

(ii) Suppose that $\ell<u$ and let $\alpha,\beta\to 0$ so that (7) holds.

  • If $\ell<|A|<u$, then

    \mathcal{J}_{A}(\alpha,\beta,\ell,u,K)\gtrsim\frac{|\log\alpha|}{x_{A}\,I_{A}^{*}}\sim\frac{|\log\beta|}{y_{A}\,J_{A}^{*}},\quad\text{where}\quad x_{A}:=\frac{K}{\hat{K}_{A}+(\theta_{A}/r)\check{K}_{A}}\wedge(r/\theta_{A})\wedge 1,\qquad y_{A}:=\frac{K}{\check{K}_{A}+(r/\theta_{A})\hat{K}_{A}}\wedge(\theta_{A}/r)\wedge 1. \quad (33)

  • If $|A|=\ell$, we distinguish two cases:

    • If either $\ell=0$ or $r\leq 1$, then

      \mathcal{J}_{A}(\alpha,\beta,\ell,u,K)\gtrsim\frac{|\log\beta|}{x_{A}\,I_{A}^{*}+y_{A}\,J_{A}^{*}},\quad\text{where}\quad x_{A}:=0,\qquad y_{A}:=(K/\check{K}_{A})\wedge 1. \quad (34)

      If $\ell>0$ and $r>1$, we set $z_{A}:=\theta_{A}/(r-1)$ and we distinguish two further cases:

      • If either $z_{A}\geq 1$ or $K\leq\hat{K}_{A}+z_{A}\,\check{K}_{A}$, then

        \mathcal{J}_{A}(\alpha,\beta,\ell,u,K)\gtrsim\frac{|\log\beta|}{y_{A}\,J_{A}^{*}}\sim\frac{|\log\alpha|}{x_{A}\,I_{A}^{*}+y_{A}\,J_{A}^{*}},\quad\text{where}\quad x_{A}:=\frac{K}{\hat{K}_{A}+z_{A}\,\check{K}_{A}}\wedge(1/z_{A})\wedge 1,\qquad y_{A}:=\frac{K}{\check{K}_{A}+(1/z_{A})\,\hat{K}_{A}}\wedge z_{A}\wedge 1. \quad (35)

      • If $z_{A}<1$ and $K>\hat{K}_{A}+z_{A}\,\check{K}_{A}$, then

        \mathcal{J}_{A}(\alpha,\beta,\ell,u,K)\gtrsim\frac{|\log\alpha|}{x_{A}\,I^{*}_{A}+y_{A}\,J^{*}_{A}},\quad\text{where}\quad x_{A}:=1,\qquad y_{A}:=\bigl((K-\hat{K}_{A})/\check{K}_{A}\bigr)\wedge 1. \quad (36)

  • If $|A|=u$, then we distinguish again two cases:

    • If either $u=M$ or $r\geq 1$, then

      \mathcal{J}_{A}(\alpha,\beta,\ell,u,K)\gtrsim\frac{|\log\alpha|}{x_{A}\,I_{A}^{*}+y_{A}\,J_{A}^{*}},\quad\text{where}\quad y_{A}:=0,\qquad x_{A}:=(K/\hat{K}_{A})\wedge 1. \quad (37)

      If $u<M$ and $r<1$, we set $w_{A}:=(1/\theta_{A})/(1/r-1)$ and we distinguish two further cases:

      • If either $w_{A}\geq 1$ or $K\leq\check{K}_{A}+w_{A}\hat{K}_{A}$, then

        \mathcal{J}_{A}(\alpha,\beta,\ell,u,K)\gtrsim\frac{|\log\alpha|}{x_{A}\,I_{A}^{*}}\sim\frac{|\log\beta|}{x_{A}\,I_{A}^{*}+y_{A}\,J_{A}^{*}},\quad\text{where}\quad y_{A}:=\frac{K}{\check{K}_{A}+w_{A}\hat{K}_{A}}\wedge(1/w_{A})\wedge 1,\qquad x_{A}:=\frac{K}{\hat{K}_{A}+(1/w_{A})\check{K}_{A}}\wedge w_{A}\wedge 1. \quad (38)

      • If $w_{A}<1$ and $K>\check{K}_{A}+w_{A}\hat{K}_{A}$, then

        \mathcal{J}_{A}(\alpha,\beta,\ell,u,K)\gtrsim\frac{|\log\beta|}{x_{A}\,I_{A}^{*}+y_{A}\,J_{A}^{*}},\quad\text{where}\quad y_{A}:=1,\qquad x_{A}:=\bigl((K-\check{K}_{A})/\hat{K}_{A}\bigr)\wedge 1. \quad (39)
Proof:

The proof is presented in Appendix C. It follows steps similar to those in the full-sampling case in [6, Theorem 5.1], with the difference that it uses a version of Doob’s optional sampling theorem in place of Wald’s identity and requires the solution of a max-min optimization problem, in each of the cases we distinguish, in order to determine the denominator in the lower bound.

Remark: By the definition of $x_{A}$ and $y_{A}$ in Theorem V.1 we can see that:

  • they both take values in $[0,1]$ and at least one of them is positive,

  • they are both increasing as functions of $K$,

  • at least one of them is equal to 1 when $K=M$,

  • if $x_{A}=0$ (resp. $y_{A}=0$), then $|A|=\ell$ (resp. $|A|=u$).

In the next subsection we obtain an interpretation of the values of $x_{A}$ and $y_{A}$.

V-C A criterion for asymptotic optimality

Based on the universal asymptotic lower bound of Theorem V.1, we next establish the asymptotic optimality under $\mathsf{P}_{A}$ of a sampling rule $R$ that satisfies the sampling constraint and samples each source $i\in[M]$, when the true subset of anomalous sources is $A$, with a long-run frequency that is not smaller than

c_{i}^{*}(A):=\begin{cases}x_{A}\,I_{A}^{*}/I_{i},&\text{if}\;\;i\in A,\\ y_{A}\,J_{A}^{*}/J_{i},&\text{if}\;\;i\notin A.\end{cases} \quad (40)
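For concreteness, the following Python sketch (ours, written directly from (26)-(28), (31)-(32) and (40)) computes the frequencies $c_{i}^{*}(A)$ in the case $\ell=u$, i.e., when $x_{A}$ and $y_{A}$ are given by (31)-(32). In the usage example the resulting frequencies sum to $K$, in agreement with (42)-(44).

```python
def target_frequencies(I, J, A, K):
    """Compute c_i^*(A) of (40) for the case l = u (so A is neither empty nor [M]).

    I, J: dicts of Kullback-Leibler numbers I_i, J_i; A: set of anomalous sources;
    K: sampling budget.  Written from (26)-(28) and (31)-(32); names are ours.
    """
    Ac = set(I) - set(A)
    I_star = min(I[i] for i in A)
    J_star = min(J[j] for j in Ac)
    K_hat = sum(I_star / I[i] for i in A)        # |A| I*_A / I_A
    K_check = sum(J_star / J[j] for j in Ac)     # |A^c| J*_A / J_A
    theta = I_star / J_star                      # theta_A
    if K_hat <= theta * K_check:                 # case (31)
        x = min(K / K_hat, 1.0)
        y = min(max(K - K_hat, 0.0) / K_check, 1.0)
    else:                                        # case (32)
        x = min(max(K - K_check, 0.0) / K_hat, 1.0)
        y = min(K / K_check, 1.0)
    return {i: (x * I_star / I[i] if i in A else y * J_star / J[i]) for i in I}

# Example with M = 4 heterogeneous sources, A = {0, 1}, and K = 2.
I = {0: 1.0, 1: 0.5, 2: 1.0, 3: 0.5}
J = dict(I)
c_star = target_frequencies(I, J, A={0, 1}, K=2)   # sums to K = 2
```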
Theorem V.2

Let $A\in\mathcal{P}_{\ell,u}$ and let $R$ be a sampling rule that satisfies (4) with $T=T^{R}$. If for every $i\in[M]$ such that $c_{i}^{*}(A)>0$ the sequence $\mathsf{P}_{A}\left(\pi^{R}_{i}(n)<\rho\right)$ is summable for every $\rho\in(0,c_{i}^{*}(A))$, then $R$ is asymptotically optimal under $\mathsf{P}_{A}$.

Proof:

The proof consists in establishing asymptotic upper bounds for $\mathsf{E}_{A}[T^{R}]$, when the thresholds are selected according to (18)-(19), that match the universal asymptotic lower bounds of Theorem V.1. The proof is presented in Appendix D.

Remark: Let $A\in\mathcal{P}_{\ell,u}$. By the definition of $\{c_{i}^{*}(A):i\in[M]\}$ in (40) it follows that:

  • $c_{i}^{*}(A)\in[0,1]$ for every $i\in[M]$, since $x_{A},y_{A}\in[0,1]$,

  • $x_{A}=0$ (resp. $y_{A}=0$) $\Leftrightarrow$ $c_{i}^{*}(A)=0$ for every $i$ in $A$ (resp. $A^{c}$),

  • $x_{A}>0$ (resp. $y_{A}>0$) $\Leftrightarrow$ $c_{i}^{*}(A)>0$ for every $i$ in $A$ (resp. $A^{c}$).

Therefore, the above theorem implies that when $x_{A}$ (resp. $y_{A}$) is equal to 0, it is not necessary to sample any source in $A$ (resp. $A^{c}$) in order to achieve asymptotic optimality under $\mathsf{P}_{A}$. Moreover, recalling the definitions of $I_{A}$ and $J_{A}$ in (26) and (27), we can see that:

x_{A}=\frac{I_{A}/I^{*}_{A}}{|A|}\sum_{i\in A}c_{i}^{*}(A)\quad\text{when}\quad A\neq\emptyset,\qquad y_{A}=\frac{J_{A}/J^{*}_{A}}{|A^{c}|}\sum_{i\notin A}c_{i}^{*}(A)\quad\text{when}\quad A\neq[M]. \quad (41)

These identities reveal that $x_{A}$ (resp. $y_{A}$) is equal to the average of the minimum limiting sampling frequencies of the sources in $A$ (resp. $A^{c}$) required for asymptotic optimality under $\mathsf{P}_{A}$, multiplied by $I_{A}/I_{A}^{*}$ (resp. $J_{A}/J_{A}^{*}$).

Remark: By the definitions of $\hat{K}_{A}$ and $\check{K}_{A}$ in (26) and (27), we can see that (41) implies

\sum_{i=1}^{M}c_{i}^{*}(A)=x_{A}\hat{K}_{A}+y_{A}\check{K}_{A}. \quad (42)

Moreover, by a direct inspection of the values of $x_{A}$ and $y_{A}$ we have

x_{A}\hat{K}_{A}+y_{A}\check{K}_{A}\leq K, \quad (43)

and consequently

\sum_{i=1}^{M}c^{*}_{i}(A)\leq K. \quad (44)

V-D Asymptotically optimal probabilistic sampling rules

We next specialize Theorem V.2 to the case of probabilistic sampling rules. Specifically, we obtain a sufficient condition for the asymptotic optimality, simultaneously under every possible scenario, of an arbitrary probabilistic sampling rule $R$, in terms of the quantities $\{c_{i}^{R}(A):i\in[M],\,A\in\mathcal{P}_{\ell,u}\}$ defined in (22).

Theorem V.3

If $R$ is a probabilistic sampling rule and for every $A\in\mathcal{P}_{\ell,u}$ it satisfies

c^{R}_{i}(A)\geq c_{i}^{*}(A)\quad\text{for every}\;\;i\in[M], \quad (45)

then it is exponentially consistent. If, in addition, $R$ satisfies (25), then it is asymptotically optimal under $\mathsf{P}_{A}$ for every $A\in\mathcal{P}_{\ell,u}$.

Proof:

Exponential consistency is established by showing that the conditions of Theorem IV.1 are satisfied, and asymptotic optimality by showing that the conditions of Theorem V.2 are satisfied. The proof is presented in Appendix D.

In the next corollary we show that both conditions of Theorem V.3 are satisfied when equality holds in (45) for every $A\in\mathcal{P}_{\ell,u}$.

Corollary V.1

If $R$ is a probabilistic sampling rule such that for every $A\in\mathcal{P}_{\ell,u}$ we have

c^{R}_{i}(A)=c^{*}_{i}(A)\quad\text{for every}\;\;i\in[M], \quad (46)

then $R$ is exponentially consistent and asymptotically optimal under $\mathsf{P}_{A}$ for every $A\in\mathcal{P}_{\ell,u}$.

Proof:

By Theorem V.3 it clearly suffices to show that (25) holds, which follows directly by (44).

While (46) suffices for the asymptotic optimality of a probabilistic sampling rule under $\mathsf{P}_{A}$, it is not always necessary. In the following proposition we characterize the setups for which the necessity holds.

Proposition V.1

Let $A\in\mathcal{P}_{\ell,u}$. Then, (46) holds for any probabilistic sampling rule $R$ that satisfies (25) and (45) if and only if $x_{A}\hat{K}_{A}+y_{A}\check{K}_{A}=K$.

Proof:

For any probabilistic sampling rule $R$ that satisfies (25) and (45) we have

K\geq\sum_{i=1}^{M}c^{R}_{i}(A)\geq\sum_{i=1}^{M}c_{i}^{*}(A)=x_{A}\hat{K}_{A}+y_{A}\check{K}_{A}, \quad (47)

where the equality follows by (42).

$(\Leftarrow)$ If $K=x_{A}\hat{K}_{A}+y_{A}\check{K}_{A}$, then by (47) we obtain

\sum_{i=1}^{M}c^{R}_{i}(A)=\sum_{i=1}^{M}c^{*}_{i}(A).

In view of (45), this proves that $c^{R}_{i}(A)=c_{i}^{*}(A)$ for every $i\in[M]$.

$(\Rightarrow)$ We argue by contradiction and assume that $x_{A}\hat{K}_{A}+y_{A}\check{K}_{A}=K$ does not hold. By (42) and (43) it then follows that $\sum_{i=1}^{M}c^{*}_{i}(A)<K$. This implies that there is a probabilistic sampling rule $R$ that satisfies (45), with strict inequality for at least one $i\in[M]$, and also satisfies (25). Thus, we have reached a contradiction, which completes the proof. ∎

V-E The optimal performance under full sampling

Corollary V.1 implies that the asymptotic lower bound in Theorem V.1 is always achieved, and as a result it is a first-order asymptotic approximation to the optimal expected time until stopping. By this approximation it follows that for any $A\in\mathcal{P}_{\ell,u}$ we have

\mathcal{J}_{A}(\alpha,\beta,\ell,u,K)\sim\mathcal{J}_{A}(\alpha,\beta,\ell,u,M)\;\Leftrightarrow\;K\geq Q_{A}, \quad (48)

where $Q_{A}\equiv Q_{A}(r,\ell,u)$ is defined as follows:

Q_{A}:=x_{A}(r,\ell,u,M)\,\hat{K}_{A}+y_{A}(r,\ell,u,M)\,\check{K}_{A}. \quad (49)

Moreover, by an inspection of the values of $x_{A}$ and $y_{A}$ in Theorem V.1 it follows that

Q_{A}=\min\{K\in(0,M]:\;x_{A}(r,\ell,u,K)=x_{A}(r,\ell,u,M)\;\;\&\;\;y_{A}(r,\ell,u,K)=y_{A}(r,\ell,u,M)\}. \quad (50)

The equivalence in (48) implies that if $Q_{A}<M$ and $K\in[Q_{A},M)$, then the optimal expected time until stopping under $\mathsf{P}_{A}$ in the case of full sampling can be achieved, to a first-order asymptotic approximation as $\alpha,\beta\to 0$, without sampling all sources at all times.

Corollary V.2

If $Q_{A}<M$ and $K\in[Q_{A},M)$ for some $A\in\mathcal{P}_{\ell,u}$, then there is a probabilistic sampling rule $R$ such that $(R,T^{R},\Delta^{R})\in\mathcal{C}(\alpha,\beta,\ell,u,K)$ and $\mathsf{E}_{A}[T^{R}]\sim\mathcal{J}_{A}(\alpha,\beta,\ell,u,M)$, while $c_{i}^{R}(A)<1$ for some $i\in[M]$.

Proof:

This follows by Corollary V.1 and (48).

The next proposition, in conjunction with Proposition V.1, shows that $Q_{A}$ can also be used to characterize the setups for which the equality must hold in (45). In particular, it shows that this is always the case when $K\leq 1$.

Proposition V.2

For every $A\in\mathcal{P}_{\ell,u}$ we have $Q_{A}\geq 1$ and

x_{A}\hat{K}_{A}+y_{A}\check{K}_{A}=K\;\Leftrightarrow\;K\leq Q_{A}.

Proof:

By the definition of $\hat{K}_{A}$, $\check{K}_{A}$ in (26), (27) it follows that

\hat{K}_{A}=\sum_{i\in A}(I_{A}^{*}/I_{i})\geq 1\quad\text{for}\;A\neq\emptyset,\qquad \check{K}_{A}=\sum_{i\notin A}(J_{A}^{*}/J_{i})\geq 1\quad\text{for}\;A\neq[M].

Since also at least one of $x_{A}(r,\ell,u,M)$ and $y_{A}(r,\ell,u,M)$ is equal to $1$, by the definition of $Q_{A}$ in (49) we conclude that $Q_{A}\geq 1$. It remains to prove the equivalence. $(\Rightarrow)$ It suffices to observe that, since $x_{A}$ and $y_{A}$ are both increasing in $K$, by the definition of $Q_{A}$ in (49) we have $x_{A}\hat{K}_{A}+y_{A}\check{K}_{A}\leq Q_{A}$. $(\Leftarrow)$ Follows by direct verification. ∎

V-F The Chernoff sampling rule

When $K$ is an integer, we will refer to a probabilistic sampling rule as Chernoff if it satisfies the conditions of Theorem V.3 and samples exactly $K$ sources per time instance. Indeed, such a sampling rule is implied by [18], [20], [21] when these works are applied to the sequential anomaly detection problem (with a fixed number of sampled sources per time instance). In fact, if the class $\mathcal{C}(\alpha,\beta,\ell,u,K)$ is restricted to policies that sample exactly $K$ sources per time instance, the asymptotic optimality of this rule under $\mathsf{P}_{A}$ is implied by the general results in [20], as long as $x_{A}$ and $y_{A}$ are both positive. However, this is not always the case. Indeed, in the simplest formulation of the sequential anomaly detection problem, where it is known a priori that there is only one anomaly ($\ell=u=1$), only one source can be sampled per time instance ($K=1$), and the sources are homogeneous, i.e., $f_{0i}=f_{0}$ and $f_{1i}=f_{1}$ for every $i\in[M]$, one of $x_{A}$ and $y_{A}$ is 0 for any $A\in\mathcal{P}_{\ell,u}$. In this setup, the asymptotic optimality of a Chernoff rule was shown in [21, Appendix A] when, in addition, $f_{0}$ and $f_{1}$ are both Bernoulli pmfs. Our results in this section remove all these restrictions and establish the asymptotic optimality of the Chernoff rule for any values of $\ell,u,r,K$, and without artificially modifying it at a subsequence of sampling instances, as in [25].

On the other hand, to implement a Chernoff rule one needs to determine a function $q^{R}$ that simultaneously satisfies the conditions of Theorem V.3 and

q^{R}\left(B;D\right)=0\quad\text{for all}\;\;D\in\mathcal{P}_{\ell,u}\;\;\text{and}\;\;B\subseteq[M]\;\;\text{with}\;\;|B|\neq K, \quad (51)

which can be a computationally demanding task unless the problem has a special structure or $K=1$. This should be compared with the corresponding asymptotically optimal Bernoulli sampling rule, whose implementation requires essentially no computation under any setup of the problem.

V-G The homogeneous setup

We now specialize the previous results to the case that $I_{i}=I$ and $J_{i}=J$ for every $i\in[M]$, where the quantities in (26), (27), (28), (40) simplify as follows:

I_{A}=I^{*}_{A}\equiv I,\quad J_{A}=J^{*}_{A}\equiv J,\quad \hat{K}_{A}=|A|,\quad \check{K}_{A}=|A^{c}|,\quad \theta_{A}=I/J\equiv\theta,\quad c^{*}_{i}(A)=\begin{cases}x_{A},&\text{if}\;\;i\in A,\\ y_{A},&\text{if}\;\;i\notin A.\end{cases} \quad (52)

$\bullet$ Suppose first that the number of anomalous sources is known in advance, i.e., $\ell=u$. Then $x_{A}$ and $y_{A}$ do not depend on $A$; we denote them simply by $x$ and $y$, respectively, and present their values in Table I.

TABLE I: $x\equiv x_{A}$ and $y\equiv y_{A}$ when $\ell=u$ and $I_{i}=I$, $J_{i}=J$ for every $i\in[M]$.

  If $(M-\ell)I\geq J\ell$: $\quad x=\min\{K/\ell,1\}$, $\quad y=(K-\ell)^{+}/(M-\ell)$.
  If $(M-\ell)I\leq J\ell$: $\quad x=(K-M+\ell)^{+}/\ell$, $\quad y=\min\{K/(M-\ell),1\}$.

From Table I and (49) it follows that $Q_{A}=M$ for every $A\in\mathcal{P}_{\ell,u}$. Then, from Propositions V.1 and V.2 it follows that if $R$ is a probabilistic sampling rule that satisfies the conditions of Theorem V.3, then it samples at each time instance each source that is currently estimated as anomalous (resp. non-anomalous) with probability $x$ (resp. $y$), i.e.,

c^{R}_{i}(A)=\begin{cases}x,&\text{if}\;\;i\in A,\\ y,&\text{if}\;\;i\notin A,\end{cases}\qquad\forall\;A\in\mathcal{P}_{\ell,u}. \quad (53)

Moreover, we observe that the first-order asymptotic approximation to the optimal performance is independent of the true subset of anomalous sources. Specifically, for every $A\in\mathcal{P}_{\ell,u}$ we have

\mathcal{J}_{A}(\alpha,\beta,\ell,u,K)\sim\frac{|\log(\alpha\wedge\beta)|}{x\,I+y\,J}. \quad (54)
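For a concrete illustration (numbers ours), take $M=10$, $\ell=u=2$, $K=3$ and $I=J$. Then $(M-\ell)I\geq J\ell$, so Table I gives $x=\min\{3/2,1\}=1$ and $y=(3-2)^{+}/(10-2)=1/8$, and (54) becomes

\mathcal{J}_{A}(\alpha,\beta,\ell,u,K)\sim\frac{|\log(\alpha\wedge\beta)|}{I+I/8}=\frac{8}{9}\,\frac{|\log(\alpha\wedge\beta)|}{I}.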

$\bullet$ When the number of anomalous sources is not known a priori, i.e., $\ell<u$, we focus on the special case that $r=1$ and $\theta=1$, or equivalently, $I=J$. Then the values of $x_{A}$ and $y_{A}$, presented in Table II, do not depend on $I$, and the optimal asymptotic performance under $\mathsf{P}_{A}$ takes the following form:

\mathcal{J}_{A}(\alpha,\beta,\ell,u,K)\sim\frac{|\log\alpha|}{I}\cdot\begin{cases}\max\{(M-\ell)/K,1\},&\text{if}\quad|A|=\ell,\\ M/K,&\text{if}\quad\ell<|A|<u,\\ \max\{u/K,1\},&\text{if}\quad|A|=u.\end{cases} \quad (55)

From (48) and (55) we further obtain

Q_{A}=\begin{cases}M-\ell,&\text{if}\quad|A|=\ell,\\ M,&\text{if}\quad\ell<|A|<u,\\ u,&\text{if}\quad|A|=u.\end{cases} \quad (56)

Note that, in this setup, $Q_{A}=M$ if and only if one of the following holds:

\ell<|A|<u,\qquad |A|=\ell=0,\qquad |A|=u=M.

Moreover, from Theorem V.2 it follows that, in each of these three cases, asymptotic optimality is achieved by any sampling rule, not necessarily probabilistic, that satisfies the sampling constraint and samples all sources with the same long-run frequency. This is the content of the following corollary.

Corollary V.3

Let $R$ be a sampling rule such that

  • the sampling constraint (4) holds with $T=T^{R}$,

  • $\mathsf{P}\left(|\pi^{R}_{i}(n)-K/M|>\epsilon\right)$ is summable for every $i\in[M]$ and $\epsilon>0$.

If $\ell<u$, $r=1$, and $I_{i}=J_{j}$ for every $i,j\in[M]$, then $R$ is asymptotically optimal under $\mathsf{P}_{A}$ for every $A\in\mathcal{P}_{\ell,u}$ with $\ell<|A|<u$. If also $\ell=0$ and $u=M$, then $R$ is asymptotically optimal under $\mathsf{P}_{A}$ for every $A\subseteq[M]$.

Proof:

This follows directly by Theorem V.2, in view of Table II.

The conditions of the previous corollary are satisfied when $R$ is a probabilistic sampling rule with $c_{i}^{R}(A)=K/M$ for every $i\in[M]$ and $A\in\mathcal{P}_{\ell,u}$, e.g., when $R$ is a Bernoulli sampling rule that samples each source at each time instance with probability $K/M$, independently of the other sources. Moreover, in the setup of Corollary V.3 it is quite convenient to find and implement a Chernoff rule (Subsection V-F). Indeed, when $K$ is an integer, the conditions of Corollary V.3 are satisfied when we take a simple random sample of $K$ sources at each time instance, i.e., when

q^{R}(B;D)=1/\binom{M}{K}\quad\text{for all}\;\;D\in\mathcal{P}_{\ell,u}\;\;\text{and}\;\;B\subseteq[M]\;\;\text{with}\;\;|B|=K.
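A minimal Python sketch of this simple-random-sample rule is given below; it draws $K$ of the $M$ sources uniformly at random at each time instance, irrespective of the current estimate $D$, and the names are ours.

```python
import numpy as np

def simple_random_sample(M, K, rng):
    """Draw K of the M sources uniformly at random, without replacement."""
    return set(rng.choice(M, size=K, replace=False))

rng = np.random.default_rng(2)
sampled = simple_random_sample(M=10, K=3, rng=rng)   # e.g. {1, 5, 8}
```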

Finally, in Subsection VI-A we will introduce a non-probabilistic sampling rule that satisfies the conditions of Corollary V.3.

TABLE II: $x_{A}$ and $y_{A}$ when $\ell<u$, $r=1$, and $I_{i}=J_{j}$ for every $i,j\in[M]$.

  If $|A|=\ell=0$: $\quad x_{A}=0$, $\quad y_{A}=K/M$.
  If $|A|=u=M$: $\quad x_{A}=K/M$, $\quad y_{A}=0$.
  If $\ell<|A|<u$: $\quad x_{A}=K/M$, $\quad y_{A}=K/M$.
  If $0<\ell=|A|<u$: $\quad x_{A}=0$, $\quad y_{A}=\min\{K/(M-\ell),1\}$.
  If $\ell<|A|=u<M$: $\quad x_{A}=\min\{K/u,1\}$, $\quad y_{A}=0$.

V-H A heterogeneous example

We end this section by considering a setup where $M$ is an even number, $\ell<M/2<u$, $r=1$, and

I_{i}=J_{i}=\begin{cases}I,&1\leq i\leq M/2,\\ I/\phi,&M/2<i\leq M,\end{cases} \quad (57)

for some $\phi\in(0,1]$. Moreover, we focus on the case that the subset of anomalous sources is of the form $A=\{1,\ldots,|A|\}$. Then, the optimal asymptotic performance under $\mathsf{P}_{A}$ takes the following form:

  • when $|A|=\ell$,

    \mathcal{J}_{A}(\alpha,\beta,\ell,u,K)\sim\frac{|\log\alpha|}{I/\phi}\,\max\left\{\frac{(\phi+1)M/2-\ell}{K},1\right\}, \quad (58)

  • when $\ell<|A|<u$,

    \mathcal{J}_{A}(\alpha,\beta,\ell,u,K)\sim\frac{|\log\alpha|}{I}\,\max\left\{\frac{(\phi+1)M/2}{K},1\right\}, \quad (59)

  • when $|A|=u$,

    \mathcal{J}_{A}(\alpha,\beta,\ell,u,K)\sim\frac{|\log\alpha|}{I}\,\max\left\{\frac{(1-\phi)(M/2)+\phi u}{K},1\right\}. \quad (60)

From these expressions and (48) we also obtain

Q_{A}=\begin{cases}(\phi+1)M/2-\ell,&|A|=\ell,\\ (\phi+1)M/2,&\ell<|A|<u,\\ (1-\phi)(M/2)+\phi u,&|A|=u,\end{cases} \quad (61)

and we note that $Q_{A}$ is always strictly smaller than $M$ in this setup as long as $\phi<1$.

VI Non-probabilistic sampling rules

In this section, we discuss certain non-probabilistic sampling rules.

VI-A Sampling in tandem

Suppose that KK is an integer and consider the straightforward sampling approach according to which the sources are sampled in tandem, KK of them at a time. Specifically, sources 11 to KK are sampled at time n=1n=1, and if 2KM2K\leq M, then sources K+1K+1 to 2K2K are sampled at time n=2n=2, whereas if 2K>M2K>M, then sources K+1K+1 to MM and 11 to 2KM2K-M are sampled at time n=2n=2, etc. In this way, each source is sampled exactly KK times in an interval of the form ((m1)M,mM]((m-1)M,mM], where mm\in\mathbb{N}, to which we will refer as a cycle. This sampling rule satisfies the conditions of Corollary V.3, which means that in the special case that <u\ell<u, r=1r=1, and Ii=JjI_{i}=J_{j} for every i,j[M]i,j\in[M], it achieves asymptotic optimality under 𝖯A\mathsf{P}_{A} when <|A|<u\ell<|A|<u, and for every A[M]A\subseteq[M] when =0\ell=0 and u=Mu=M.
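For illustration, here is a minimal sketch (ours) of the tandem schedule just described; for each time instance $n$ it returns the indices of the $K$ sources that are sampled, so that over any cycle of $M$ consecutive time instances each source is sampled exactly $K$ times.

```python
def tandem_schedule(M, K, n):
    """0-based indices of the K sources sampled at time instance n = 1, 2, ...

    Sources are visited in a round-robin fashion, K at a time, so that over
    any cycle of M consecutive time instances each source is sampled K times.
    """
    start = ((n - 1) * K) % M
    return [(start + k) % M for k in range(K)]

# With M = 6 sources and K = 4 samples per time instance:
# n = 1 -> [0, 1, 2, 3], n = 2 -> [4, 5, 0, 1], n = 3 -> [2, 3, 4, 5], ...
for n in range(1, 4):
    print(n, tandem_schedule(6, 4, n))
```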

In general, by the formula for the optimal asymptotic performance under full sampling, which is obtained by the lower bound of Theorem V.1 with K=MK=M, we can see that if sampling is terminated at a time that is a multiple of MM, the expected number of cycles until stopping is, to a first-order asymptotic approximation, equal to 𝒥(α,β,,u,M)/K\mathcal{J}(\alpha,\beta,\ell,u,M)/K. Since each cycle is of length MM, the expected time until stopping is, again to a first-order asymptotic approximation, equal to (M/K)𝒥(α,β,,u,M)(M/K)\mathcal{J}(\alpha,\beta,\ell,u,M). Thus, the asymptotic relative efficiency of this sampling approach can be defined as follows:

𝖠𝖱𝖤:=MKlimα,β0𝒥A(α,β,,u,M)𝒥A(α,β,,u,K),{\sf ARE}:=\frac{M}{K}\lim_{\alpha,\beta\to 0}\frac{\mathcal{J}_{A}(\alpha,\beta,\ell,u,M)}{\mathcal{J}_{A}(\alpha,\beta,\ell,u,K)}, (62)

where the limit is taken so that (7) holds when <u\ell<u.

Consider in particular the homogeneous setup of Subsection V-G, where (52) holds. When =u\ell=u, by (53) and (54) we have:

  • if θ/(M)\theta\geq\ell/(M-\ell),

    𝖠𝖱𝖤=θ1+θMmax{,K}+11+θ(1/K)+1/M,{\sf ARE}=\frac{\theta}{1+\theta}\,\frac{M}{\max\{\ell,K\}}+\frac{1}{1+\theta}\frac{(1-\ell/K)^{+}}{1-\ell/M}, (63)
  • if θ</(M)\theta<\ell/(M-\ell),

    𝖠𝖱𝖤=11+θMmax{M,K}+θ1+θ(1(M)/K)+/M.{\sf ARE}=\frac{1}{1+\theta}\,\frac{M}{\max\{M-\ell,K\}}+\frac{\theta}{1+\theta}\frac{(1-(M-\ell)/K)^{+}}{\ell/M}. (64)

When <u\ell<u and r=θ=1r=\theta=1, by (55) we have

𝖠𝖱𝖤={M/max{M,K},if|A|=,1,if<|A|<u,M/max{u,K}if|A|=u.{\sf ARE}=\begin{cases}M/\max\{M-\ell,K\},\quad&\text{if}\quad|A|=\ell,\\ 1,\quad&\text{if}\quad\ell<|A|<u,\\ M/\max\{u,K\}\quad&\text{if}\quad|A|=u.\end{cases} (65)

On the other hand, in the heterogeneous setup of Subsection V-H, by (58), (59), (60) we have:

𝖠𝖱𝖤={M/max{(ϕ+1)M/2,K},if|A|=,M/max{(ϕ+1)M/2,K},if<|A|<u,M/max{(1ϕ)M/2+ϕu,K}if|A|=u.{\sf ARE}=\begin{cases}M/\max\{(\phi+1)M/2-\ell,\;K\},\;&\text{if}\quad|A|=\ell,\\ M/\max\{(\phi+1)M/2,\;K\},\;&\text{if}\quad\ell<|A|<u,\\ M/\max\{(1-\phi)M/2+\phi u,\;K\}\;&\text{if}\quad|A|=u.\end{cases} (66)

VI-B Equalizing empirical and limiting sampling frequencies

We next consider a sampling approach, which has been applied to a general controlled sensing problem [28], as well as to a bandit problem [29], and we show that not only may it fail to achieve asymptotic optimality in the sequential anomaly detection problem, but it may even lead to a detection procedure that fails to terminate with positive probability.

To be more specific, we consider the homogeneous setup of Subsection V-G with $K=1$. In this setup, a probabilistic sampling rule that satisfies the conditions of Theorem V.3 samples a source in $D$ (resp. $D^c$) with probability $x_D$ (resp. $y_D$), whenever $D\in\mathcal{P}_{\ell,u}$ is the subset of sources currently identified as anomalous. The sampling rule $R$ that we consider in this subsection is not probabilistic, as it requires knowledge not only of the currently estimated subset of anomalous sources, but also of the current empirical sampling frequencies. Specifically, if $D$ is the subset of sources currently estimated as anomalous, then for every source in $D$ (resp. $D^c$) the rule computes the distance between its current empirical sampling frequency and $x_D$ (resp. $y_D$), and it next samples a source for which this distance is maximal. That is, for every $n\in\mathbb{N}$ and $D\in\mathcal{P}_{\ell,u}$ we have

R(n+1)argmax{|πiR(n)xD|,|πjR(n)yD|:iD,jD}on{ΔnR=D}.\displaystyle\begin{split}R(n+1)&\in\text{argmax}\left\{|\pi_{i}^{R}(n)-x_{D}|,|\pi_{j}^{R}(n)-y_{D}|:i\in D,\;j\notin D\right\}\\ &\text{on}\quad\{\Delta^{R}_{n}=D\}.\end{split} (67)

Without loss of generality, we also assume that each source has positive probability to be sampled at the first time instance, i.e.,

𝖯A(iR(1))>0i[M],A𝒫,u.\displaystyle\mathsf{P}_{A}(i\in R(1))>0\quad\forall\;i\in[M],\;A\in\mathcal{P}_{\ell,u}. (68)
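A sketch (ours) of the rule in (67) for $K=1$: given the vector of current empirical sampling frequencies and the currently estimated anomalous subset $D$, it samples a source whose empirical frequency is farthest from its target limit ($x_D$ for sources in $D$, $y_D$ for the rest). The toy call at the end previews the "stickiness" exploited in the proposition that follows.

```python
import numpy as np

def max_distance_rule(pi, D, x_D, y_D):
    """Index of the single source to sample next (K = 1).

    pi  : current empirical sampling frequencies pi_i(n), one entry per source
    D   : set of (0-based) sources currently estimated as anomalous
    x_D : target limiting frequency for sources in D
    y_D : target limiting frequency for sources outside D
    """
    M = len(pi)
    targets = np.array([x_D if i in D else y_D for i in range(M)])
    # sample a source whose empirical frequency is farthest from its target
    return int(np.argmax(np.abs(pi - targets)))

# Toy call: only source 0 has been sampled so far, so pi = (1, 0, ..., 0).
# If x_D + y_D < 1, then |1 - x_D| > y_D and source 0 is selected yet again.
pi = np.array([1.0, 0.0, 0.0, 0.0])
print(max_distance_rule(pi, D={0}, x_D=0.3, y_D=0.2))  # -> 0
```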
Proposition VI.1

Consider the homogeneous setup of Subsection V-G with $K=1$ and let $R$ be a sampling rule that satisfies (67)-(68). Suppose further that there is only one anomalous source, i.e., that the subset of anomalous sources, $A$, is a singleton, and also that $x_A+y_A<1$. Then, there is an event of positive probability under $\mathsf{P}_A$ on which

  1. (i)

    the same source is sampled at every time instance,

  2. (ii)

    if, in addition, $\ell=0$ and $u=M$, then $T^R$ fails to terminate for any selection of its thresholds.

Proof:

If (i) holds, there is an event of positive probability under 𝖯A\mathsf{P}_{A} on which all LLRs but one are always equal to 0. Thus, (ii) follows directly from (i) and the fact that when =0\ell=0 and u=Mu=M, the stopping rule (14) requires that all LLRs be non-zero upon stopping. Therefore, it remains to prove (i). Without loss of generality, we set A={1}A=\{1\}. The event {R1(1)=1}\{R_{1}(1)=1\} has positive probability under 𝖯A\mathsf{P}_{A}, by (68), and so does the event

Γ{m=1ng1(X1(m))>0n},\Gamma\equiv\left\{\sum_{m=1}^{n}g_{1}(X_{1}(m))>0\;\;\;\forall\;n\in\mathbb{N}\right\}, (69)

since {g1(X1(n)):n}\{g_{1}(X_{1}(n)):n\in\mathbb{N}\} is an iid sequence with expectation I>0I>0 under 𝖯A\mathsf{P}_{A} (see, e.g., [30, Proposition 7.2.1]). Since these two events are also independent, their intersection has positive probability under 𝖯A\mathsf{P}_{A}. Therefore, it suffices to show that only source 1 is sampled on Γ{R1(1)=1}\Gamma\cap\{R_{1}(1)=1\}. Indeed, on this event sampling starts from source 1, and as a result the vector of empirical frequencies, (π1R(n),,πMR(n))(\pi_{1}^{R}(n),\ldots,\pi_{M}^{R}(n)) at time n=1n=1 is (1,0,,0)(1,0,\ldots,0). Moreover, the estimated subset of anomalous sources at n=1n=1 is A={1}A=\{1\}, independently of whether <u\ell<u or =u\ell=u. Since xA+yA<1x_{A}+y_{A}<1, or equivalently,

|π1(1)xA|=|1xA|=1xA>yA=|yA0|=|πi(1)yA||\pi_{1}(1)-x_{A}|=|1-x_{A}|=1-x_{A}>y_{A}=|y_{A}-0|=|\pi_{i}(1)-y_{A}|

for every i1i\neq 1, by (67) it follows that source 11 is sampled again at time n=2n=2, and as a result the vector of empirical frequencies at time n=2n=2 remains (1,0,,0)(1,0,\ldots,0). Applying the same reasoning as before, we conclude that source 1 is sampled again at time n=3n=3. The same argument can be repeated indefinitely, and this proves (i).

Remark: When =0\ell=0, u=Mu=M, each f0if_{0i} is exponential with rate 11, and each f1if_{1i} is exponential with rate λ>0\lambda>0, the conditions of the previous proposition are satisfied as long as M>2M>2. Indeed, in this case we have

θ\displaystyle\theta =I/J=log(λ)+λ1log(λ)+1/λ1,\displaystyle=I/J=\frac{-\log(\lambda)+\lambda-1}{\log(\lambda)+1/\lambda-1},
xA\displaystyle x_{A} =11+(M1)θ,yA=θxA\displaystyle=\frac{1}{1+(M-1)\theta},\quad y_{A}=\theta x_{A}

and as a result xA+yA<1M>2x_{A}+y_{A}<1\Leftrightarrow M>2.
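The algebra in this remark is easy to check numerically; the short sketch below (ours) evaluates $\theta$, $x_A$, $y_A$ exactly as given in the display above and confirms that $x_A+y_A<1$ precisely when $M>2$, irrespective of the rate $\lambda\neq 1$.

```python
import math

def remark_quantities(lam, M):
    """theta, x_A, y_A of the remark, for Exp(1) versus Exp(lam) observations."""
    theta = (-math.log(lam) + lam - 1) / (math.log(lam) + 1 / lam - 1)
    x_A = 1 / (1 + (M - 1) * theta)
    y_A = theta * x_A
    return theta, x_A, y_A

# x_A + y_A = (1 + theta) / (1 + (M - 1) * theta), which is < 1 iff M > 2.
for M in (2, 3, 5):
    theta, x_A, y_A = remark_quantities(lam=2.0, M=M)
    print(M, round(x_A + y_A, 4), x_A + y_A < 1)   # 1.0 False, then True, True
```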

VI-C Sampling based on the ordering of the LLRs

A different, non-probabilistic sampling approach, which goes back to [18, Remark 5], suggests sampling at each time instance the sources with the smallest, in absolute value, LLRs among those estimated as anomalous/non-anomalous. Such a sampling rule was proposed in [12], in the homogeneous setup of Subsection V-G, under the assumption that the number of anomalous sources is known a priori, i.e., $\ell=u$. An extension of this rule to the heterogeneous setup was studied in [13], under the assumption that $\ell=u$ and $K=1$. A similar sampling rule, which also has a randomization feature, was proposed in [16] when $\ell=u$, as well as in the case of no prior information, i.e., when $\ell=0,u=M$. The criterion of Theorem V.2 can be applied to each of these rules to establish its asymptotic optimality. The verification of this criterion, however, is quite a difficult task that we plan to consider in the future.

VII Simulation study

In this section we present the results of a simulation study in which we compare two probabilistic sampling rules, Bernoulli (Section IV) and Chernoff (Subsection V-F), with each other and with sampling in tandem (Subsection VI-A). Throughout this section, for every $i\in[M]$ we have $f_{0i}=\mathcal{N}(0,1)$ and $f_{1i}=\mathcal{N}(\mu_i,1)$, i.e., all observations from source $i$ are Gaussian with variance $1$ and mean equal to $\mu_i$ if the source is anomalous and $0$ otherwise, and as a result $I_i=J_i=(\mu_i)^2/2$. We consider a homogeneous setup where

μi=μ,i[M],\mu_{i}=\mu,\quad\forall\;i\in[M], (70)

as well as a heterogeneous setup where

μi={μ,1iM/22μ,M/2<iM,\mu_{i}=\begin{cases}\mu,\quad&1\leq i\leq M/2\\ 2\mu,\quad&M/2<i\leq M,\end{cases} (71)

in which case (57) holds with $I=\mu^2/2$ and $\phi=0.25$. In both setups we set $\alpha=\beta$, $M=10$, $K=5$, $\ell=1$, $u=6$, $\mu=0.5$, we assume that the subset of anomalous sources is of the form $A=\{1,\ldots,|A|\}$, and we consider different values for its size. The two probabilistic sampling rules are designed so that (46) holds for every $A\in\mathcal{P}_{\ell,u}$. As a result, by Corollary V.1 it follows that in both setups they are asymptotically optimal under $\mathsf{P}_A$ for every $A\in\mathcal{P}_{\ell,u}$. On the other hand, by Corollary V.3 it follows that sampling in tandem is asymptotically optimal under $\mathsf{P}_A$ only in the homogeneous setup (70) and only when $\ell<|A|<u$ (since $0<\ell<u<M$).
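To make the simulation setup concrete, the following sketch (ours; the Bernoulli rule is used purely for illustration, and the anomalous subset is written with 0-based indices) shows how one sampling step is simulated: for each sampled source $i$, an observation is drawn from $\mathcal{N}(\mu_i,1)$ if $i\in A$ and from $\mathcal{N}(0,1)$ otherwise, and the running LLR statistic is updated with the Gaussian increment $g_i(x)=\mu_i x-\mu_i^2/2$.

```python
import numpy as np

rng = np.random.default_rng(1)

def llr_increment(x, mu):
    """Log-likelihood ratio of one observation x for N(mu, 1) versus N(0, 1)."""
    return mu * x - mu ** 2 / 2

# One sampling step in the homogeneous setup (70): M = 10 sources, budget K = 5,
# mu_i = 0.5 for all i, anomalous subset A = {0, 1, 2} (0-based).
M, K, mu = 10, 5, 0.5
A = {0, 1, 2}
Lambda = np.zeros(M)                              # running LLR statistics
sampled = np.flatnonzero(rng.random(M) < K / M)   # Bernoulli sampling rule
for i in sampled:
    x = rng.normal(mu if i in A else 0.0, 1.0)    # observation from source i
    Lambda[i] += llr_increment(x, mu)
```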

In Figure 1 we plot the expected value of the stopping time that is induced by each of the three sampling rules against the true number of anomalous sources. Specifically, in Figure 1a we consider the homogeneous setup (70) and in Figure 1b the heterogeneous setup (71). In all cases, the thresholds are chosen, via Monte Carlo simulation, so that the familywise error probability of each kind is (approximately) equal to α=β=103\alpha=\beta=10^{-3}. The Monte Carlo standard error that corresponds to each estimated expected value is approximately 10210^{-2}. From Figure 1 we can see that the performance implied by the two probabilistic sampling rules is always essentially the same. Sampling in tandem performs significantly worse in all cases apart from the homogeneous setup with <|A|<u\ell<|A|<u, where all three sampling rules lead to essentially the same performance. We note also that, in both setups and for all three sampling rules, the expected time until stopping is much smaller when the number of anomalous sources is equal to either \ell or uu than when it is between \ell and uu.

Figure 1: Expected value of the stopping time that corresponds to each of the three sampling rules versus the number of anomalous sources ($\alpha=\beta=10^{-3}$). (a) Homogeneous case. (b) Heterogeneous case.
Figure 2: In both graphs the $x$-axis represents $|\log_{10}(\alpha)|$ and the $y$-axis the ratio of the expected value of the stopping time implied by sampling in tandem over that implied by the asymptotically optimal Bernoulli rule in the homogeneous setup (70). The horizontal line refers to the value of (65). (a) $|A|=1$. (b) $|A|=6$.

In Figures 2 and 3 we plot the ratio of the expected value of the stopping time induced by sampling in tandem over that induced by the Bernoulli sampling rule against |log10(α)||\log_{10}(\alpha)| as α\alpha ranges from 10110^{-1} to 101010^{-10}. (We do not present the corresponding results for the Chernoff rule, as they are almost identical). Specifically, in Figure 2 we consider the homogeneous setup (70) when the number of anomalous sources is 11 and 66, whereas in Figure 3 we consider the heterogeneous setup (71) when the number of anomalous sources is 1, 3,1,\,3, and 6. For each value of α\alpha, the thresholds are selected according to (19) and each expectation is computed using 10410^{4} simulation runs. The standard error for each estimated expectation is approximately 11, whereas the standard error for each ratio is approximately 10210^{-2} in the homogeneous setup and 31023\cdot 10^{-2} in the heterogeneous setup. Moreover, in each case we plot the limiting value of this ratio, which is the limit defined in (62). In the homogeneous case this is given by (65) and is equal to

{10/max{101, 5}=10/91.1when|A|=1,10/max{6, 5}=10/61.6when|A|=6.\begin{cases}10/\max\{10-1,\;5\}=10/9\approx 1.1&\;\text{when}\quad|A|=1,\\ 10/\max\{6,\;5\}=10/6\approx 1.6\;&\;\text{when}\quad|A|=6.\\ \end{cases} (72)

In the heterogeneous case it is given by (66) and is equal to

{10/max{(1+0.25)51, 5}=10/5.251.9when|A|=1,10/max{(1+0.25)5, 5}=10/6.251.6when1<|A|<6,10/max{(10.25)5+0.256, 5}=10/5.251.9when|A|=6.\begin{cases}10/\max\{(1+0.25)\cdot 5-1,\;5\}=10/5.25\approx 1.9\;&\text{when}\quad|A|=1,\\ 10/\max\{(1+0.25)\cdot 5,\;5\}=10/6.25\approx 1.6\;&\text{when}\quad 1<|A|<6,\\ 10/\max\{(1-0.25)\cdot 5+0.25\cdot 6,\;5\}=10/5.25\approx 1.9\;&\text{when}\quad|A|=6.\end{cases} (73)
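The numerical values in (72) and (73) follow directly from (65) and (66); a short check (ours) with the parameters of the simulation study is given below.

```python
def are_homogeneous(M, K, ell, u, size_A):
    """Asymptotic relative efficiency (65) of sampling in tandem (ell < u, r = theta = 1)."""
    if size_A == ell:
        return M / max(M - ell, K)
    if size_A == u:
        return M / max(u, K)
    return 1.0

def are_heterogeneous(M, K, ell, u, phi, size_A):
    """Asymptotic relative efficiency (66) in the setup of Subsection V-H."""
    if size_A == ell:
        return M / max((phi + 1) * M / 2 - ell, K)
    if size_A == u:
        return M / max((1 - phi) * M / 2 + phi * u, K)
    return M / max((phi + 1) * M / 2, K)

# Parameters of the simulation study: M = 10, K = 5, ell = 1, u = 6, phi = 0.25.
print(are_homogeneous(10, 5, 1, 6, 1), are_homogeneous(10, 5, 1, 6, 6))    # compare with (72)
print(are_heterogeneous(10, 5, 1, 6, 0.25, 1),
      are_heterogeneous(10, 5, 1, 6, 0.25, 3),
      are_heterogeneous(10, 5, 1, 6, 0.25, 6))                             # compare with (73)
```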

From Figures 2 and 3 we can see that, in all cases, the efficiency loss due to sampling in tandem is (much) larger than the one suggested by the corresponding asymptotic relative efficiency when α\alpha is not (very) small.

Figure 3: In all graphs the $x$-axis represents $|\log_{10}(\alpha)|$ and the $y$-axis the ratio of the expected value of the stopping time implied by sampling in tandem over that implied by the asymptotically optimal Bernoulli rule in the heterogeneous setup (71). The horizontal line refers to the value of (66). (a) $|A|=1$. (b) $|A|=3$. (c) $|A|=6$.

VIII Conclusions and extensions

In this paper we propose a novel formulation of the sequential anomaly detection problem with sampling constraints, in which arbitrary, user-specified bounds are assumed on the number of anomalous sources, the probabilities of at least one false alarm and at least one missed detection are controlled below distinct tolerance levels, and the number of sampled sources per time instance is not necessarily fixed. We obtain a general criterion for achieving the optimal expected time until stopping, to a first-order asymptotic approximation as the error probabilities go to 0, as long as the log-likelihood ratio statistic of each observation has a finite first moment. We show that asymptotic optimality is achieved, simultaneously under every possible subset of anomalous sources, for any version of the proposed problem, using the unmodified sampling rule in [20], to which we refer as Chernoff, but also using a novel sampling rule whose implementation requires minimal computations, to which we refer in this work as Bernoulli. Despite their very different computational requirements, these two rules are found in simulation studies to lead to essentially the same performance. On the other hand, it has been shown in simulation studies under various setups [12, 13, 16] that the Chernoff rule leads to significantly worse performance in practice compared to non-probabilistic sampling rules such as the ones discussed in Subsection VI-C. The study of sampling rules of this nature in the general sequential anomaly detection framework we propose in this work is an interesting problem that we plan to consider in the future. Another interesting problem is that of establishing stronger than first-order asymptotic optimality, which has been solved in certain special cases of the sequential multi-hypothesis testing problem with controlled sensing [23, 24].

The results of this paper can be shown, using the techniques in [8], to remain valid for a variety of error metrics beyond the familywise error rates that we consider in this work, such as the false discovery rate and the false non-discovery rate. However, this is not the case for the generalized familywise error rates proposed in [31, 32], for which different policies and a different analysis are required. These error metrics have been considered in [4, 5, 7] in the case of full sampling, whereas certain results in the presence of sampling constraints have been presented in [33].

The results in this work can also be generalized in a natural way when the sampling cost varies per source, as in [15], or when the two hypotheses in each source are not completely specified, as is done for example in [7] in the case of full sampling. Another potential generalization is the removal of the assumption that the acquired observations are conditionally independent of the past given the current sampling choice, as is done in [27] in a general controlled sensing setup. Finally, another direction of interest is a setup where the focus is on the dependence structure of the sources rather than their marginal distributions, as for example in [34].

IX Acknowledgments

The authors would like to thank Prof. Venugopal Veeravalli for stimulating discussions.

Appendix A

In this Appendix we state and prove two auxiliary lemmas that are used in the proofs of various results of this paper. Specifically, we fix A𝒫,uA\in\mathcal{P}_{\ell,u}, and for any i[M]i\in[M], nn\in\mathbb{N} and any sampling rule RR we set

Λ~iR(n):=Λ~iR(n1)+(gi(Xi(n))𝖤A[gi(Xi(n))])Ri(n),Λ~iR(0):=0,\displaystyle\begin{split}\widetilde{\Lambda}^{R}_{i}(n)&:=\widetilde{\Lambda}^{R}_{i}(n-1)+\Bigl{(}g_{i}(X_{i}(n))-\mathsf{E}_{A}[g_{i}(X_{i}(n))]\Bigr{)}\,R_{i}(n),\\ \widetilde{\Lambda}^{R}_{i}(0)&:=0,\end{split} (74)

and comparing with (10) we observe that

Λ~iR(n)\displaystyle\widetilde{\Lambda}^{R}_{i}(n) ={ΛiR(n)IiNiR(n),iAΛiR(n)+JiNiR(n),iA.\displaystyle=\begin{cases}\Lambda^{R}_{i}(n)-I_{i}\,N^{R}_{i}(n),\quad i\in A\\ \Lambda^{R}_{i}(n)+J_{i}\,N^{R}_{i}(n),\quad i\notin A.\end{cases} (75)
Lemma A.1

Let RR be an arbitrary sampling rule, iAi\in A, jAj\notin A, ρ(0,1]\rho\in(0,1], and ϵ>0\epsilon>0. Then, the sequences

𝖯A(1nΛiR(n)<ρIiϵ,πiR(n)ρ),\displaystyle\mathsf{P}_{A}\left(\frac{1}{n}\Lambda^{R}_{i}(n)<\rho I_{i}-\epsilon,\;\pi_{i}^{R}(n)\geq\rho\right),
𝖯A(1nΛjR(n)>ρJj+ϵ,πjR(n)ρ),\displaystyle\mathsf{P}_{A}\left(\frac{1}{n}\Lambda^{R}_{j}(n)>-\rho J_{j}+\epsilon,\;\pi_{j}^{R}(n)\geq\rho\right),
𝖯A(1nΛijR(n)<ρIiϵ,πiR(n)ρ)\displaystyle\mathsf{P}_{A}\left(\frac{1}{n}\Lambda^{R}_{ij}(n)<\rho I_{i}-\epsilon,\;\pi^{R}_{i}(n)\geq\rho\right)
𝖯A(1nΛijR(n)<ρJjϵ,πjR(n)ρ)\displaystyle\mathsf{P}_{A}\left(\frac{1}{n}\Lambda^{R}_{ij}(n)<\rho J_{j}-\epsilon,\;\pi^{R}_{j}(n)\geq\rho\right)

are exponentially decaying.

Proof:

We prove the result for the third probability, as the proofs for the other ones are similar. To lighten the notation, we suppress the dependence on RR and we write Λ~ij(n),Λ~i(n),πi(n),n\widetilde{\Lambda}_{ij}(n),\widetilde{\Lambda}_{i}(n),\pi_{i}(n),\mathcal{F}_{n} instead of Λ~ijR(n),Λ~iR(n),πiR(n),nR\widetilde{\Lambda}^{R}_{ij}(n),\widetilde{\Lambda}^{R}_{i}(n),\pi_{i}^{R}(n),\mathcal{F}^{R}_{n}. By (75), for every nn\in\mathbb{N} we have

Λij(n)\displaystyle\Lambda_{ij}(n) =Λi(n)Λj(n)\displaystyle=\Lambda_{i}(n)-\Lambda_{j}(n)
=Λ~i(n)Λ~j(n)+n(Iiπi(n)+Jjπj(n)),\displaystyle=\widetilde{\Lambda}_{i}(n)-\widetilde{\Lambda}_{j}(n)+n(I_{i}\pi_{i}(n)+J_{j}\pi_{j}(n)),

which shows that if πi(n)ρ\pi_{i}(n)\geq\rho, then Λij(n)Λ~i(n)Λ~j(n)+nρIi.\Lambda_{ij}(n)\geq\widetilde{\Lambda}_{i}(n)-\widetilde{\Lambda}_{j}(n)+n\rho I_{i}. Thus, for every nn\in\mathbb{N} we have

𝖯A(1nΛij(n)<ρIiϵ,πi(n)ρ)\displaystyle\mathsf{P}_{A}\left(\frac{1}{n}\Lambda_{ij}(n)<\rho I_{i}-\epsilon,\;\pi_{i}(n)\geq\rho\right)
𝖯A(Λ~i(n)Λ~j(n)<nϵ)\displaystyle\leq\mathsf{P}_{A}\left(\widetilde{\Lambda}_{i}(n)-\widetilde{\Lambda}_{j}(n)<-n\,\epsilon\right)
𝖯A(Λ~i(n)<nϵ/2)+𝖯A(Λ~j(n)<nϵ/2),\displaystyle\leq\mathsf{P}_{A}\left(\widetilde{\Lambda}_{i}(n)<-n\,\epsilon/2\right)+\mathsf{P}_{A}\left(-\widetilde{\Lambda}_{j}(n)<-n\,\epsilon/2\right),

and it suffices to show that the two terms in the upper bound are exponentially decaying. We show this only for the first one, as the proof for the second is similar. For this, we fix δ(0,ϵ/2)\delta\in(0,\epsilon/2) and we observe that, for any t>0t>0 and nn\in\mathbb{N}, by Markov’s inequality we have

𝖯A(Λ~i(n)<nϵ/2)exp{ntϵ/2}𝖤A[exp{tΛ~i(n)}].\displaystyle\begin{split}&\mathsf{P}_{A}\left(\widetilde{\Lambda}_{i}(n)<-n\,\epsilon/2\right)\\ &\leq\exp\{-n\,t\,\epsilon/2\}\,\mathsf{E}_{A}\left[\exp\left\{-t\,\widetilde{\Lambda}_{i}(n)\right\}\right].\end{split} (76)

By (74) and the law of iterated expectation it follows that the expectation in the upper bound can be written as follows:

𝖤A[exp{tΛ~i(n1)}𝖤A[exp{t(gi(Xi(n))Ii)Ri(n)}|n1]].\displaystyle\mathsf{E}_{A}\left[\exp\left\{-t\,\widetilde{\Lambda}_{i}(n-1)\right\}\cdot\mathsf{E}_{A}\left[\exp\left\{-t\,(g_{i}(X_{i}(n))-I_{i})\,R_{i}(n)\right\}\,|\,\mathcal{F}_{n-1}\right]\right]. (77)

Since $R_i(n)$ is an $\mathcal{F}_{n-1}$-measurable Bernoulli random variable and $X_i(n)$ is independent of $\mathcal{F}_{n-1}$ and has the same distribution as $X_i(1)$, we also have

𝖤A[exp{t(gi(Xi(n))Ii)Ri(n)}|n1]\displaystyle\mathsf{E}_{A}\left[\exp\left\{-t\,(g_{i}(X_{i}(n))-I_{i})\,R_{i}(n)\right\}\,|\,\mathcal{F}_{n-1}\right]
=𝖤A[exp{t(gi(Xi(1))Ii)}]Ri(n)\displaystyle=\mathsf{E}_{A}\left[\exp\left\{-t\,(g_{i}(X_{i}(1))-I_{i})\right\}\right]^{R_{i}(n)}
=𝖤A[exp{t(gi(Xi(1))+Iiδ+δ)}]Ri(n)\displaystyle=\mathsf{E}_{A}\left[\exp\left\{t\,(-g_{i}(X_{i}(1))+I_{i}-\delta+\delta)\right\}\right]^{R_{i}(n)}
exp{Ri(n)(ψi(t)+tδ)},\displaystyle\leq\exp\left\{R_{i}(n)\,(\psi_{i}(t)+t\,\delta)\right\},

where ψi\psi_{i} is the cumulant generating function of gi(Xi(1))+Iiδ-g_{i}(X_{i}(1))+I_{i}-\delta, i.e.,

ψi(t):=log𝖤A[exp{t(gi(Xi(1))+Iiδ)}],t>0.\psi_{i}(t):=\log\mathsf{E}_{A}\left[\exp\left\{t(-g_{i}(X_{i}(1))+I_{i}-\delta)\right\}\right],\;t>0.

Since ψi\psi_{i} is convex on (0,1)(0,1), ψi(1)=Iiδ<\psi_{i}(1)=I_{i}-\delta<\infty, and

ψi(0+)=𝖤A[gi(Xi(1))+Iiδ]=δ<0,\psi_{i}^{\prime}(0+)=\mathsf{E}_{A}[-g_{i}(X_{i}(1))+I_{i}-\delta]=-\delta<0,

there is an s>0s>0 such that ψi(s)<0\psi_{i}(s)<0, and as a result

exp{Ri(n)(ψi(s)+sδ)}\displaystyle\exp\left\{R_{i}(n)\,(\psi_{i}(s)+s\,\delta)\right\} exp{sδ}.\displaystyle\leq\exp\left\{s\,\delta\right\}.

Therefore, setting t=st=s in (76)-(77) we obtain

𝖯A(Λ~i(n)<nϵ/2)\displaystyle\mathsf{P}_{A}\left(\widetilde{\Lambda}_{i}(n)<-n\,\epsilon/2\right) exp{n(ϵ/2)s+δs}𝖤A[exp{sΛ~i(n1)}].\displaystyle\leq\exp\{-n\,(\epsilon/2)s+\delta s\}\;\mathsf{E}_{A}\left[\exp\left\{-s\,\widetilde{\Lambda}_{i}(n-1)\right\}\right].

Repeating the same argument n1n-1 times we conclude that there exists an s>0s>0 such that

𝖯A(Λ~i(n)<nϵ/2)exp{n(ϵ/2δ)s}.\mathsf{P}_{A}\left(\widetilde{\Lambda}_{i}(n)<-n\epsilon/2\right)\leq\exp\left\{-n(\epsilon/2-\delta)s\right\}.

Since δ(0,ϵ/2)\delta\in(0,\epsilon/2), this completes the proof.

Lemma A.2

Let iAi\in A, jAj\notin A, ζ>0\zeta>0, ρi,ρj[0,1]\rho_{i},\rho_{j}\in[0,1], and let RR be an arbitrary sampling rule.

  • (i)

    If ρi>0\rho_{i}>0 and 𝖯A(πiR(n)<ρi)\mathsf{P}_{A}(\pi_{i}^{R}(n)<\rho_{i}) is summable, then so is

    𝖯A(1nΛiR(n)<ρiIiζ).\mathsf{P}_{A}\left(\frac{1}{n}\Lambda^{R}_{i}(n)<\rho_{i}I_{i}-\zeta\right).
  • (ii)

    If ρj>0\rho_{j}>0 and 𝖯A(πjR(n)<ρj)\mathsf{P}_{A}(\pi_{j}^{R}(n)<\rho_{j}) is summable, then so is

    𝖯A(1nΛjR(n)>ρjJj+ζ).\mathsf{P}_{A}\left(\frac{1}{n}\Lambda^{R}_{j}(n)>-\rho_{j}J_{j}+\zeta\right).
  • (iii)

    If $\rho_i+\rho_j>0$ and both $\mathsf{P}_A(\pi_i^R(n)<\rho_i)$ and $\mathsf{P}_A(\pi_j^R(n)<\rho_j)$ are summable, then so is

    𝖯A(1nΛijR(n)<ρiIi+ρjJjζ).\mathsf{P}_{A}\left(\frac{1}{n}\Lambda^{R}_{ij}(n)<\rho_{i}I_{i}+\rho_{j}J_{j}-\zeta\right).
Proof:

For (i), by the law of total probability we have

𝖯A(1nΛiR(n)<ρiIiζ)\displaystyle\mathsf{P}_{A}\left(\frac{1}{n}\Lambda^{R}_{i}(n)<\rho_{i}I_{i}-\zeta\right)
𝖯A(πiR(n)<ρi)+𝖯A(1nΛiR(n)<ρiIiζ,πiR(n)ρi).\displaystyle\leq\mathsf{P}_{A}\left(\pi_{i}^{R}(n)<\rho_{i}\right)+\mathsf{P}_{A}\left(\frac{1}{n}\Lambda^{R}_{i}(n)<\rho_{i}I_{i}-\zeta,\,\pi^{R}_{i}(n)\geq\rho_{i}\right).

The first term in the upper bound is summable by assumption, whereas the second one is summable, as exponentially decaying, by Lemma A.1.

The proof of (ii) is similar and is omitted. Since (iii) follows directly from (i) and (ii) when $\rho_i$ and $\rho_j$ are both positive, it suffices to consider the case where only one of them is positive. Without loss of generality, we assume that $\rho_i>0=\rho_j$. Then, by the law of total probability again we have

𝖯A(1nΛijR(n)<ρiIi+ρjJjζ)\displaystyle\mathsf{P}_{A}\left(\frac{1}{n}\Lambda^{R}_{ij}(n)<\rho_{i}I_{i}+\rho_{j}J_{j}-\zeta\right)
𝖯A(πiR(n)<ρi)+𝖯A(1nΛijR(n)<ρiIiζ,πiR(n)ρi).\displaystyle\leq\mathsf{P}_{A}\left(\pi_{i}^{R}(n)<\rho_{i}\right)+\mathsf{P}_{A}\left(\frac{1}{n}\Lambda^{R}_{ij}(n)<\rho_{i}I_{i}-\zeta,\,\pi^{R}_{i}(n)\geq\rho_{i}\right).

The first term in the upper bound is summable by assumption, whereas the second one is summable, as exponentially decaying, by Lemma A.1.

Appendix B

In this Appendix we prove Theorems III.1 and IV.1, which provide sufficient conditions for the exponential consistency of a sampling rule. In order to lighten the notation, throughout this Appendix we suppress dependence on RR and we write πi(n)\pi_{i}(n), Λi(n),Λij(n),Δn\Lambda_{i}(n),\Lambda_{ij}(n),\Delta_{n}, ci(A)c_{i}(A), σA\sigma_{A} instead of πiR(n)\pi_{i}^{R}(n), ΛiR(n)\Lambda_{i}^{R}(n), ΛijR(n)\Lambda_{ij}^{R}(n), ΔnR\Delta^{R}_{n}, ciR(A)c^{R}_{i}(A), σAR\sigma^{R}_{A}.

Proof:

We prove the result first when =u\ell=u. By the definition of the decision rule in (13) it follows that when σA>n\sigma_{A}>n, there are mnm\geq n, iAi\in A, jAj\notin A such that Λij(m)0\Lambda_{ij}(m)\leq 0, and as a result by the union bound we have

𝖯A(σA>n)iA,jAm=n𝖯A(Λij(m)0).\mathsf{P}_{A}(\sigma_{A}>n)\leq\sum\limits_{i\in A,\,j\notin A}\;\sum_{m=n}^{\infty}\mathsf{P}_{A}\left(\Lambda_{ij}(m)\leq 0\right).

Therefore, to show that 𝖯A(σA>n)\mathsf{P}_{A}(\sigma_{A}>n) is exponentially decaying, it suffices to show that this is the case for 𝖯A(Λij(n)0)\mathsf{P}_{A}\left(\Lambda_{ij}(n)\leq 0\right) for every iAi\in A and jAj\notin A. To show this, we fix such ii and jj and we note that, by assumption, either 𝖯A(πi(n)<ρ)\mathsf{P}_{A}(\pi_{i}(n)<\rho) or 𝖯A(πj(n)<ρ)\mathsf{P}_{A}(\pi_{j}(n)<\rho) is exponentially decaying for ρ>0\rho>0 small enough. Without loss of generality, suppose that this is the case for the former. By an application of the law of total probability we then obtain

𝖯A(Λij(n)0)\displaystyle\mathsf{P}_{A}\left(\Lambda_{ij}(n)\leq 0\right) 𝖯A(Λij(n)0,πi(n)ρ)\displaystyle\leq\mathsf{P}_{A}(\Lambda_{ij}(n)\leq 0,\pi_{i}(n)\geq\rho)
+𝖯A(πi(n)<ρ).\displaystyle+\mathsf{P}_{A}(\pi_{i}(n)<\rho).

As mentioned earlier, the second term in the upper bound is, by assumption, exponentially decaying for ρ>0\rho>0 small enough. By Lemma A.1 it follows that this is also the case for the first one.

Suppose next that <|A|<u\ell<|A|<u. By the definition of the decision rule in (17) it follows that when σA>n\sigma_{A}>n, there is an mnm\geq n such that either Λi(m)<0\Lambda_{i}(m)<0 for some iAi\in A, or Λj(m)>0\Lambda_{j}(m)>0 for some jAj\notin A. As a result, by the union bound we have

𝖯A(σA>n)\displaystyle\mathsf{P}_{A}({\sigma}_{A}>n) iAm=n𝖯A(Λi(m)<0)\displaystyle\leq\sum\limits_{i\in A}\;\sum_{m=n}^{\infty}\mathsf{P}_{A}(\Lambda_{i}(m)<0)
+jAm=n𝖯A(Λj(m)0).\displaystyle+\sum\limits_{j\notin A}\;\sum_{m=n}^{\infty}\mathsf{P}_{A}(\Lambda_{j}(m)\geq 0).

Therefore, to prove that 𝖯A(σA>n)\mathsf{P}_{A}(\sigma_{A}>n) is exponentially decaying, it suffices to show that this is true for 𝖯A(Λi(n)<0)\mathsf{P}_{A}\left(\Lambda_{i}(n)<0\right) and 𝖯A(Λj(n)0)\mathsf{P}_{A}\left(\Lambda_{j}(n)\geq 0\right) for every iAi\in A and jAj\notin A. This can be shown similarly to the case =u\ell=u, in view of the fact that the sequences 𝖯A(πi(n)<ρ)\mathsf{P}_{A}(\pi_{i}(n)<\rho) and 𝖯A(πj(n)<ρ)\mathsf{P}_{A}(\pi_{j}(n)<\rho) are, by assumption, both exponentially decaying for ρ>0\rho>0 small enough and for every iAi\in A and jAj\notin A.

The two remaining cases are <|A|=u\ell<|A|=u and =|A|<u\ell=|A|<u. We consider only the former, as the proof for the latter is similar, and assume that <|A|=u\ell<|A|=u. By the definition of the decision rule in (13) it follows that when σA>n\sigma_{A}>n, there is an mnm\geq n such that either Λij(m)0\Lambda_{ij}(m)\leq 0 for some iAi\in A and jAj\notin A, or Λi(m)<0\Lambda_{i}(m)<0 for some iAi\in A. As a result, by the union bound we have

\mathsf{P}_{A}(\sigma_{A}>n)\leq\sum_{i\in A}\sum_{m=n}^{\infty}\mathsf{P}_{A}(\Lambda_{i}(m)<0)
+\sum_{i\in A,\,j\notin A}\sum_{m=n}^{\infty}\mathsf{P}_{A}(\Lambda_{ij}(m)\leq 0).

Since |A|>|A|>\ell, the sequence 𝖯A(πi(n)<ρ)\mathsf{P}_{A}(\pi_{i}(n)<\rho) is, by assumption, exponentially decaying for every iAi\in A and ρ>0\rho>0 small enough. Therefore, similarly to the previous cases we can show that each term in the upper bound is exponentially decaying.

In the remainder of this Appendix we prove Theorem IV.1, whose proof relies on two preliminary lemmas. To state them, for every $D\in\mathcal{P}_{\ell,u}$ and $n\in\mathbb{N}$ we denote by $\tau^D_n$ the first time instance $m$ at which $D$ has been estimated as the subset of anomalous sources at least $2\zeta m$ times since $\lceil n/2\rceil$, i.e.

τnD:=inf{mn/2:u=n/2m𝟏{Δu1=D}2ζm},\tau^{D}_{n}:=\inf\left\{m\geq\lceil n/2\rceil:\sum_{u=\lceil n/2\rceil}^{m}\mathbf{1}\{\Delta_{u-1}=D\}\geq 2\zeta m\right\}, (78)

where ζ>0\zeta>0 is an arbitrary constant, which in the proof of Theorem IV.1 will be selected to be small enough.

Lemma B.1

Let D𝒫,uD\in\mathcal{P}_{\ell,u} and i[M]i\in[M]. If ci(D)>0c_{i}(D)>0, then 𝖯(τnDn,πi(τnD)<ρ)\mathsf{P}(\tau^{D}_{n}\leq n,\;\pi_{i}(\tau^{D}_{n})<\rho) and 𝖯(τnDn,πi(n)<ρ)\mathsf{P}(\tau^{D}_{n}\leq n,\;\pi_{i}(n)<\rho) are both exponentially decaying for all ρ>0\rho>0 small enough.

Proof:

We prove the two claims together by showing that 𝖯(τnDn,πi(σn)<ρ)\mathsf{P}(\tau^{D}_{n}\leq n,\;\pi_{i}(\sigma_{n})<\rho) is exponentially decaying for all ρ>0\rho>0 small enough, where σn\sigma_{n} stands for either nn or τnD\tau_{n}^{D}. For any given nn\in\mathbb{N} we set

π~i(n)\displaystyle\widetilde{\pi}_{i}(n) :=1nm=1n(Ri(m)ci(Δm1))\displaystyle:=\frac{1}{n}\sum_{m=1}^{n}(R_{i}(m)-c_{i}(\Delta_{m-1}))
=πi(n)1nm=1nci(Δm1),n.\displaystyle=\pi_{i}(n)-\frac{1}{n}\sum_{m=1}^{n}c_{i}(\Delta_{m-1}),\quad n\in\mathbb{N}.

Then, on the event {τnDn}\{\tau^{D}_{n}\leq n\} we have n/2τnDσnnn/2\leq\tau^{D}_{n}\leq\sigma_{n}\leq n and consequently

πi(σn)π~i(σn)\displaystyle\pi_{i}(\sigma_{n})-\widetilde{\pi}_{i}(\sigma_{n}) =1σnm=1σnci(Δm1)\displaystyle=\frac{1}{\sigma_{n}}\sum_{m=1}^{\sigma_{n}}c_{i}(\Delta_{m-1})
ci(D)nm=n/2τnD 1{Δm1=D}\displaystyle\geq\frac{c_{i}(D)}{n}\sum_{m=\lceil n/2\rceil}^{\tau^{D}_{n}}\,\mathbf{1}\{\Delta_{m-1}=D\}
ci(D)n 2ζτnDci(D)ζ,\displaystyle\geq\frac{c_{i}(D)}{n}\,2\,\zeta\,\tau_{n}^{D}\geq\;c_{i}(D)\,\zeta,

where in the last inequality we have used the definition of τnD\tau_{n}^{D}. Consequently, for every ρ>0\rho>0 and nn\in\mathbb{N} we obtain

{τnDn,πi(σn)ρ}{τnDn,π~i(σn)ρci(D)ζ}.\{\tau^{D}_{n}\leq n,\;\pi_{i}(\sigma_{n})\leq\rho\}\subseteq\{\tau^{D}_{n}\leq n,\,\widetilde{\pi}_{i}(\sigma_{n})\leq\rho-c_{i}(D)\,\zeta\}.

Since ci(D)>0c_{i}(D)>0, there is an ϵ>0\epsilon>0 such that for all ρ(0,ci(D)ζϵ)\rho\in(0,c_{i}(D)\zeta-\epsilon) we have

{τnDn,πi(σn)ρ}{maxn/2mn|π~i(m)|ϵ}.\{\tau^{D}_{n}\leq n,\;\pi_{i}(\sigma_{n})\leq\rho\}\subseteq\left\{\max_{\lceil n/2\rceil\leq m\leq n}|\widetilde{\pi}_{i}(m)|\geq\epsilon\right\}.

Therefore, it suffices to show that, for all ϵ>0\epsilon>0, the sequence

𝖯(maxn/2mn|π~i(m)|ϵ)\displaystyle\mathsf{P}\left(\max_{\lceil n/2\rceil\leq m\leq n}|\widetilde{\pi}_{i}(m)|\geq\epsilon\right) (79)

is exponentially decaying. This follows by the fact that (Ri(n)ci(Δn1))(R_{i}(n)-c_{i}(\Delta_{n-1})) is a uniformly bounded martingale difference, in view of (22), and an application of the maximal Azuma-Hoeffding submartingale inequality [35, Remark 2.3].

Lemma B.2

Let $A,D\in\mathcal{P}_{\ell,u}$ be such that $|D|\in\{\ell,u\}$ and $D\neq A$. If $R$ is a sampling rule that satisfies the conditions of Theorem IV.1, then the sequence $\mathsf{P}_A(\tau^D_n\leq n)$ is exponentially decaying.

Proof:

We consider first the case that <u\ell<u, where we only prove the result when |D|=u|D|=u, as the proof when |D|=|D|=\ell is similar. Since A𝒫,uA\in\mathcal{P}_{\ell,u}, |D|=u|D|=u and DAD\neq A, there exists a jDAj\in D\setminus A. Since also |D|>|D|>\ell, the assumptions of Theorem IV.1 imply that cj(D)>0c_{j}(D)>0. Thus, by Lemma B.1 it follows that

𝖯A(τnDn,πj(τnD)<ρ)\displaystyle\mathsf{P}_{A}\left(\tau^{D}_{n}\leq n,\;\pi_{j}(\tau^{D}_{n})<\rho\right) (80)

is an exponentially decaying sequence for ρ>0\rho>0 small enough, and it suffices to show that this is also the case for

𝖯A(τnDn,πj(τnD)ρ).\displaystyle\mathsf{P}_{A}\left(\tau^{D}_{n}\leq n,\;\;\pi_{j}(\tau^{D}_{n})\geq\rho\right). (81)

Indeed, by the definition of τnD\tau^{D}_{n} in (78) it follows that on the event {τnD<}\{\tau^{D}_{n}<\infty\} we have Δ(τnD1)=D\Delta(\tau_{n}^{D}-1)=D. Since |D|=u|D|=u and jDj\in D, by the definition of the decision rule in (17) it follows that Λj(τnD1)>0\Lambda_{j}(\tau^{D}_{n}-1)>0. Therefore,

𝖯A(τnDn,πj(τnD)ρ)=𝖯A(τnDn,Λj(τnD1)>0,πj(τnD)ρ)m=n/2n𝖯A(Λj(m1)>0,πj(m)ρ),\displaystyle\begin{split}&\mathsf{P}_{A}\left(\tau^{D}_{n}\leq n,\;\;\pi_{j}(\tau^{D}_{n})\geq\rho\right)\\ &=\mathsf{P}_{A}(\tau^{D}_{n}\leq n,\,\Lambda_{j}(\tau_{n}^{D}-1)>0,\,\pi_{j}(\tau_{n}^{D})\geq\rho)\\ &\leq\sum_{m=\lceil n/2\rceil}^{n}\mathsf{P}_{A}\left(\Lambda_{j}(m-1)>0,\,\pi_{j}(m)\geq\rho\right),\end{split}

where the inequality holds because τnD\tau_{n}^{D} takes values in [n/2,n][n/2,n] on the event {τnDn}\{\tau_{n}^{D}\leq n\}. Therefore, it remains to show that the sequence

𝖯A(Λj(n1)>0,πj(n)ρ)\displaystyle\mathsf{P}_{A}\left(\Lambda_{j}(n-1)>0,\,\pi_{j}(n)\geq\rho\right) (82)

is exponentially decaying for ρ>0\rho>0 small enough. Indeed, for large enough nn, πj(n)ρ\pi_{j}(n)\geq\rho implies that

πj(n1)nπj(n)1n1ρn1n1ρ1n1.\pi_{j}(n-1)\geq\frac{n\pi_{j}(n)-1}{n-1}\geq\frac{\rho n-1}{n-1}\geq\rho-\frac{1}{n-1}.

For any given ρ>0\rho>0, there exists a ρ>0\rho^{\prime}>0 so that for all nn large enough we have

ρ1n1>ρ,\displaystyle\rho-\frac{1}{n-1}>\rho^{\prime}, (83)

and so that the probability in (82) is bounded by

𝖯A(Λj(n1)>0,πj(n1)ρ).\mathsf{P}_{A}\left(\Lambda_{j}(n-1)>0,\,\pi_{j}(n-1)\geq\rho^{\prime}\right).

For ρ>0\rho>0 small enough, ρ>0\rho^{\prime}>0 is small enough, and by Lemma A.1 it follows that the latter probability, and consequently (82), is exponentially decaying in nn.

It remains to prove the lemma when =u\ell=u and ADA\neq D. In this case there are iADi\in A\setminus D and jDAj\in D\setminus A, and by the assumptions of Theorem IV.1 it follows that either ci(D)>0c_{i}(D)>0 or cj(D)>0c_{j}(D)>0. Without loss of generality, we assume that the latter holds. Then, by Lemma B.1 it follows that (80) is exponentially decaying for all ρ>0\rho>0 small enough, and it suffices to show that this is also the case for (81). Indeed, by the definition of τnD\tau^{D}_{n} in (78) it follows that on the event {τnD<}\{\tau^{D}_{n}<\infty\} we have Δ(τnD1)=D\Delta(\tau_{n}^{D}-1)=D. Consequently, by the definition of the decision rule in (13) it follows that there is an iADi\in A\setminus D such that Λij(τnD1)<0\Lambda_{ij}(\tau^{D}_{n}-1)<0. Therefore, by the union bound we have

𝖯A(τnDn,πj(τnD)ρ)\displaystyle\mathsf{P}_{A}\left(\tau^{D}_{n}\leq n,\;\;\pi_{j}(\tau^{D}_{n})\geq\rho\right)
\leq\sum_{i\in A\setminus D}\mathsf{P}_{A}\left(\tau^{D}_{n}\leq n,\,\Lambda_{ij}(\tau^{D}_{n}-1)<0,\,\pi_{j}(\tau_{n}^{D})\geq\rho\right)
iADm=n/2n𝖯A(Λij(m1)<0,πj(m)ρ),\displaystyle\leq\sum_{i\in A\setminus D}\sum_{m=\lceil n/2\rceil}^{n}\mathsf{P}_{A}\left(\Lambda_{ij}(m-1)<0,\,\pi_{j}(m)\geq\rho\right),

where as before the inequality holds because τnD\tau_{n}^{D} takes values in [n/2,n][n/2,n] on the event {τnDn}\{\tau_{n}^{D}\leq n\}. Therefore, it remains to show that the sequence

𝖯A(Λij(n1)<0,πj(n)ρ)\displaystyle\mathsf{P}_{A}\left(\Lambda_{ij}(n-1)<0,\,\pi_{j}(n)\geq\rho\right)

is exponentially decaying for ρ>0\rho>0 small enough. As before, this follows by an application of Lemma A.1. ∎

Proof:

Fix A𝒫,uA\in\mathcal{P}_{\ell,u}. By Theorem III.1 it suffices to show that, for all ρ>0\rho>0 small enough, 𝖯A(πi(n)<ρ)\mathsf{P}_{A}(\pi_{i}(n)<\rho) is an exponentially decaying sequence

  • for every iAi\in A, if |A|>|A|>\ell, and for every iAi\notin A, if |A|<u|A|<u, when <u\ell<u,

  • either for every iAi\in A or for every iAi\notin A, when =u\ell=u.

In order to do so, we select the positive constant ζ\zeta in (78) to be smaller than 1/|𝒫,u|1/|\mathcal{P}_{\ell,u}|. Then, for every nn\in\mathbb{N} there is at least one D𝒫,uD\in\mathcal{P}_{\ell,u} for which {τnDn}\{\tau^{D}_{n}\leq n\}\neq\emptyset. As a result, for every i[M]i\in[M] and ρ>0\rho>0, by the union bound we have

𝖯A(πi(n)<ρ)D𝒫,u𝖯A(πi(n)<ρ,τnDn).\mathsf{P}_{A}(\pi_{i}(n)<\rho)\leq\sum_{D\in\mathcal{P}_{\ell,u}}\mathsf{P}_{A}\left(\pi_{i}(n)<\rho,\;\tau_{n}^{D}\leq n\right).

Suppose first that <u\ell<u. Then, it suffices to show that, for every D𝒫,uD\in\mathcal{P}_{\ell,u} and all ρ>0\rho>0 small enough,

𝖯A(πi(n)<ρ,τnDn)\mathsf{P}_{A}\left(\pi_{i}(n)<\rho,\tau_{n}^{D}\leq n\right) (84)

is exponentially decaying for every iAi\in A when |A|>|A|>\ell and for every iAi\notin A when |A|<u|A|<u. We only consider the former case, as the proof for the latter is similar. Thus, suppose that |A|>|A|>\ell and let iAi\in A.

  • If ci(D)>0c_{i}(D)>0, by Lemma B.1 it follows that (84) is an exponentially decaying sequence.

  • If ci(D)=0c_{i}(D)=0, the assumption of the theorem implies that either |D|=u|D|=u and iDi\notin D, or |D|=|D|=\ell and iDi\in D. In either case, ADA\neq D and by Lemma B.2 it follows that 𝖯A(τnDn)\mathsf{P}_{A}(\tau_{n}^{D}\leq n), and consequently (84), is an exponentially decaying sequence.

Suppose now that =u\ell=u. Then, it suffices to show that (84) is exponentially decaying for every D𝒫,uD\in\mathcal{P}_{\ell,u} and all ρ>0\rho>0 small enough, either for every iAi\in A or for every iAi\notin A.

  • When DAD\neq A, by Lemma B.2 it follows that 𝖯A(τnDn)\mathsf{P}_{A}(\tau_{n}^{D}\leq n), and consequently (84), is exponentially decaying.

  • When D=AD=A, then by assumption ci(A)>0c_{i}(A)>0 holds either for every iAi\in A or for every iAi\notin A. By Lemma B.1 it then follows that (84) is exponentially decaying either for every iAi\in A or for every iAi\notin A, and this completes the proof.

Appendix C

In this Appendix we fix A𝒫,uA\in\mathcal{P}_{\ell,u} and prove the universal asymptotic lower bound of Theorem V.2. The proof relies on two lemmas, for the statement of which we need to introduce the following function:

ϕ(α,β):=αlog(α1β)+(1α)log(1αβ),\displaystyle\phi(\alpha,\beta):=\alpha\log\left(\frac{\alpha}{1-\beta}\right)+(1-\alpha)\log\left(\frac{1-\alpha}{\beta}\right),

where α,β(0,1)\alpha,\beta\in(0,1), i.e., the Kullback-Leibler divergence between a Bernoulli distribution with parameter α\alpha and one with parameter 1β1-\beta. Moreover, we set ϕ(α)ϕ(α,α)\phi(\alpha)\equiv\phi(\alpha,\alpha).

The first lemma states a non-asymptotic, information-theoretic inequality that generalizes the one used in Wald’s universal lower bound in the problem of testing two simple hypotheses [36, p. 156].

Lemma C.1

Let α+β<1\alpha+\beta<1 and let (R,T,Δ)(R,T,\Delta) be a policy that satisfies the error constraint (3) and 𝖯A(T<)=1\mathsf{P}_{A}(T<\infty)=1. Then, for any C𝒫,uC\in\mathcal{P}_{\ell,u} such that CAC\neq A we have

𝖤A[ΛA,CR(T)]{ϕ(α,β)if CA, AC=,ϕ(β,α)if CA=, AC,ϕ(αβ)if CA, AC.\mathsf{E}_{A}\left[\Lambda^{R}_{A,{C}}(T)\right]\geq\begin{cases}\phi(\alpha,\beta)\;&\text{if }\;{C}{\setminus}A\neq\emptyset,\mbox{ }A{\setminus}{C}=\emptyset,\\ \phi(\beta,\alpha)\;&\text{if }\;{C}{\setminus}A=\emptyset,\mbox{ }A{\setminus}{C}\neq\emptyset,\\ \phi(\alpha\wedge\beta)\quad&\text{if }\;{C}{\setminus}A\neq\emptyset,\mbox{ }A{\setminus}{C}\neq\emptyset.\end{cases} (85)
Proof:

The proof is identical to that in the full sampling case in [6, Theorem 5.1], and can be obtained by an application of the data processing inequality for Kullback-Leibler divergences (see, e.g., [37, Lemma 3.2.1]). Indeed, the left-hand side is the Kullback-Leibler divergence between 𝖯A\mathsf{P}_{A} and 𝖯C\mathsf{P}_{C} given the available information up to time TT, when the sampling rule RR is utilized, whereas the right hand side is obtained by considering the Kullback-Leibler divergence between 𝖯A\mathsf{P}_{A} and 𝖯C\mathsf{P}_{C} based on a single event of TR\mathcal{F}^{R}_{T}. ∎

We next make use of the previous inequality to establish lower bounds on the expected number of samples taken from each source until stopping.

Lemma C.2

Let α+β<1\alpha+\beta<1 and let (R,T,Δ)(R,T,\Delta) be a policy that satisfies the error constraint (3) and 𝖤A[T]<\mathsf{E}_{A}[T]<\infty.

  1. (i)

    If |A|<u|A|<u, then

    minjA(Jj𝖤A[NjR(T)])ϕ(α,β).\min_{j\notin A}\left(J_{j}\,\mathsf{E}_{A}\left[N_{j}^{R}(T)\right]\right)\geq\phi(\alpha,\beta). (86)
  2. (ii)

    If |A|>|A|>\ell, then

    miniA(Ii𝖤A[NiR(T)])ϕ(β,α).\min_{i\in A}\left(I_{i}\,\mathsf{E}_{A}\left[N_{i}^{R}(T)\right]\right)\geq\phi(\beta,\alpha). (87)
  3. (iii)

    If either |A|=>0|A|=\ell>0 or |A|=u<M|A|=u<M, then

    miniA(Ii𝖤A[NiR(T)])+minjA(Jj𝖤A[NjR(T)])ϕ(αβ).\displaystyle\min_{i\in A}\left(I_{i}\,\mathsf{E}_{A}\left[N_{i}^{R}(T)\right]\right)+\min_{j\notin A}\left(J_{j}\,\mathsf{E}_{A}\left[N_{j}^{R}(T)\right]\right)\geq\phi(\alpha\wedge\beta). (88)
Proof:

We recall the sequence Λ~iR,\widetilde{\Lambda}^{R}_{i}, defined in (74), and note that it is a zero-mean, {nR}\{\mathcal{F}^{R}_{n}\}-martingale under 𝖯A\mathsf{P}_{A}. Moreover, by the finiteness of the Kullback-Leibler divergences in (1) we have:

supn𝖤A[|Λ~iR(n)Λ~iR(n1)||n1R]<.\sup_{n\in\mathbb{N}}\mathsf{E}_{A}\left[|\widetilde{\Lambda}^{R}_{i}(n)-\widetilde{\Lambda}^{R}_{i}(n-1)|\;|\;\mathcal{F}^{R}_{n-1}\right]<\infty.

Since also TT is an {nR}\{\mathcal{F}^{R}_{n}\}-stopping time such that 𝖤A[T]<\mathsf{E}_{A}[T]<\infty, by the Optional Sampling Theorem [38, pg. 251] we obtain:

𝖤A[Λ~iR(T)]=0for everyi[M].\mathsf{E}_{A}\left[\widetilde{\Lambda}^{R}_{i}(T)\right]=0\quad\text{for every}\quad i\in[M]. (89)

(i) If |A|<u|A|<u, there is a jAj\notin A such that C=A{j}𝒫,uC=A\cup\{j\}\in\mathcal{P}_{\ell,u}. By representation (11) and decomposition (75) it follows that the log-likelihood ratio process in (9) takes the form

ΛA,CR(T)=ΛjR(T)=Λ~jR(T)+JjNjR(T).\Lambda_{A,C}^{R}(T)=-\Lambda_{j}^{R}(T)=-\widetilde{\Lambda}_{j}^{R}(T)+J_{j}\,N_{j}^{R}(T).

By (85) and (89) we then obtain

Jj𝖤A[NjR(T)]ϕ(α,β).J_{j}\,\mathsf{E}_{A}\left[N_{j}^{R}(T)\right]\geq\phi(\alpha,\beta).

Since this inequality holds for every jAj\notin A, this completes the proof.

(ii) The proof is similar to (i) and is omitted.

(iii) If |A|=>0|A|=\ell>0 or |A|=u<M|A|=u<M, then there are iAi\in A and jAj\notin A such that C=A{j}{i}𝒫,uC=A\cup\{j\}\setminus\{i\}\in\mathcal{P}_{\ell,u}. By representation (11) and decomposition (75) we have

ΛA,CR(T)\displaystyle\Lambda^{R}_{A,C}(T) =ΛiR(T)ΛjR(T)\displaystyle=\Lambda_{i}^{R}(T)-\Lambda_{j}^{R}(T)
=Λ~iR(T)Λ~jR(T)+JjNjR(T)+IiNiR(T).\displaystyle=\widetilde{\Lambda}_{i}^{R}(T)-\widetilde{\Lambda}^{R}_{j}(T)+J_{j}\,N_{j}^{R}(T)+I_{i}\,N_{i}^{R}(T).

By (85) and (89) we then obtain

Ii𝖤A[NiR(T)]+Jj𝖤A[NjR(T)]ϕ(αβ).I_{i}\,\mathsf{E}_{A}\left[N_{i}^{R}(T)\right]+J_{j}\,\mathsf{E}_{A}\left[N_{j}^{R}(T)\right]\geq\phi(\alpha\wedge\beta).

Since this inequality holds for every iAi\in A and jAj\notin A, this proves (88).

For the proof of Theorem V.1 we introduce the following notation:

𝒟K:={(c1,,cM)[0,1]M:i=1MciK}𝒟K:={(p,q)[0,1]2:pK^A+qKˇAK}.\displaystyle\begin{split}\mathcal{D}_{K}&:=\left\{(c_{1},\ldots,c_{M})\in[0,1]^{M}:\sum_{i=1}^{M}c_{i}\leq K\right\}\\ \mathcal{D}^{\prime}_{K}&:=\{(p,q)\in[0,1]^{2}:\;p\hat{K}_{A}+q\check{K}_{A}\leq K\}.\end{split} (90)

Moreover, we observe that as α,β0\alpha,\beta\to 0 we have

ϕ(α,β)\displaystyle\phi(\alpha,\beta) |logβ|,\displaystyle\sim|\log\beta|, (91)
r(α,β)ϕ(β,α)ϕ(α,β)\displaystyle r(\alpha,\beta)\equiv\frac{\phi(\beta,\alpha)}{\phi(\alpha,\beta)} |logα||logβ|.\displaystyle\sim\frac{|\log\alpha|}{|\log\beta|}. (92)
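These asymptotic equivalences are easy to verify numerically; the sketch below (ours) evaluates the two ratios for progressively smaller error probabilities and shows that both tend to 1.

```python
import math

def phi(a, b):
    """Kullback-Leibler divergence between Bernoulli(a) and Bernoulli(1 - b)."""
    return a * math.log(a / (1 - b)) + (1 - a) * math.log((1 - a) / b)

# Check of (91)-(92): phi(a, b) ~ |log b| and
# r(a, b) = phi(b, a) / phi(a, b) ~ |log a| / |log b|, as a, b -> 0.
for a, b in [(1e-2, 1e-3), (1e-4, 1e-6), (1e-8, 1e-12)]:
    ratio_91 = phi(a, b) / abs(math.log(b))
    ratio_92 = (phi(b, a) / phi(a, b)) / (abs(math.log(a)) / abs(math.log(b)))
    print(round(ratio_91, 4), round(ratio_92, 4))   # both approach 1
```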
Proof:

(i) Let α,β(0,1)\alpha,\beta\in(0,1) such that α+β<1\alpha+\beta<1 and (R,T,Δ)𝒞(α,β,,u,K)(R,T,\Delta)\in\mathcal{C}(\alpha,\beta,\ell,u,K) such that 𝖤A[T]<\mathsf{E}_{A}[T]<\infty. By Lemma C.2(iii) it then follows that

𝖤A[T]WA(T)ϕ(αβ),where\displaystyle\mathsf{E}_{A}[T]\;W_{A}(T)\geq\phi(\alpha\wedge\beta),\quad\text{where}
W_{A}(T):=\min_{i\in A}\left\{I_{i}\,\frac{\mathsf{E}_{A}[N^{R}_{i}(T)]}{\mathsf{E}_{A}[T]}\right\}+\min_{j\notin A}\left\{J_{j}\,\frac{\mathsf{E}_{A}[N^{R}_{j}(T)]}{\mathsf{E}_{A}[T]}\right\},

and by constraint (4) we conclude that

𝖤A[T]VA\displaystyle\mathsf{E}_{A}[T]\;V_{A} ϕ(αβ),\displaystyle\geq\phi(\alpha\wedge\beta),
whereVA\displaystyle\text{where}\qquad V_{A} :=max(c1,,cM)𝒟K{miniA(ciIi)+minjA(cjJj)}.\displaystyle:=\max_{(c_{1},\ldots,c_{M})\in\mathcal{D}_{K}}\left\{\min\limits_{i\in A}(c_{i}I_{i})+\min\limits_{j{\notin}A}(c_{j}J_{j})\right\}.

Since the lower bound is independent of the policy $(R,T,\Delta)$, we have

𝒥A(α,β,,u,K)VA\displaystyle\mathcal{J}_{A}(\alpha,\beta,\ell,u,K)\;V_{A} ϕ(αβ).\displaystyle\geq\phi(\alpha\wedge\beta).

Comparing with (30) and recalling (91), we can see that it suffices to show that VA=xAIA+yAJAV_{A}=x_{A}I_{A}^{*}+y_{A}J_{A}^{*} with xAx_{A} and yAy_{A} as in (31)-(32). Indeed, the maximum in VAV_{A} is achieved by cic_{i}’s of the form

ciIi=pIA,iA,cjJj=qJA,jA,\displaystyle\begin{split}c_{i}I_{i}&=pI_{A}^{*},\quad i\in A,\\ c_{j}J_{j}&=qJ_{A}^{*},\quad j\notin A,\end{split} (93)

for p,q[0,1]p,q\in[0,1] such that the constraint in 𝒟K\mathcal{D}_{K} is satisfied, i.e.,

Ki=1Mci=piAIAIi+qjAJAJj=pK^A+qKˇA,K\geq\sum_{i=1}^{M}c_{i}=p\sum_{i\in A}\frac{I^{*}_{A}}{I_{i}}+q\sum_{j\notin A}\frac{J^{*}_{A}}{J_{j}}=p\,\hat{K}_{A}+q\,\check{K}_{A}, (94)

and as a result,

VA=max(p,q)𝒟K{pIA+qJA}.V_{A}=\max_{(p,q)\in\mathcal{D}^{\prime}_{K}}\{pI_{A}^{*}+qJ_{A}^{*}\}.

This maximum is achieved by p,q[0,1]p,q\in[0,1] such that pK^A+qKˇA=K(K^A+KˇA)p\hat{K}_{A}+q\check{K}_{A}=K\wedge(\hat{K}_{A}+\check{K}_{A}), in particular by pp and qq equal to xAx_{A} and yAy_{A} as in (31)-(32), which completes the proof.

(ii) Suppose first that <|A|<u\ell<|A|<u. As before, let α,β(0,1)\alpha,\beta\in(0,1) such that α+β<1\alpha+\beta<1 and (R,T,Δ)𝒞(α,β,,u,K)(R,T,\Delta)\in\mathcal{C}(\alpha,\beta,\ell,u,K) such that 𝖤A[T]<\mathsf{E}_{A}[T]<\infty. Then, by Lemma C.2(i) and Lemma C.2(ii) we obtain:

𝖤A[T]WA(T)ϕ(β,α),where\displaystyle\mathsf{E}_{A}[T]\;W_{A}(T)\geq\phi(\beta,\alpha),\;\text{where}
W_{A}(T):=\min\left\{\min_{i\in A}\left\{I_{i}\,\frac{\mathsf{E}_{A}[N^{R}_{i}(T)]}{\mathsf{E}_{A}[T]}\right\},\;r(\alpha,\beta)\,\min_{j\notin A}\left\{J_{j}\,\frac{\mathsf{E}_{A}[N^{R}_{j}(T)]}{\mathsf{E}_{A}[T]}\right\}\right\},

and by constraint (4) we conclude that

𝖤A[T]VA(α,β)ϕ(β,α)whereVA(α,β):=max(c1,,cM)𝒟Kmin{miniA(ciIi),r(α,β)minjA(cjJj)}.\displaystyle\begin{split}&\mathsf{E}_{A}[T]\;V_{A}(\alpha,\beta)\geq\phi(\beta,\alpha)\\ \text{where}\qquad V_{A}(\alpha,\beta)&:=\max_{(c_{1},\ldots,c_{M})\in\mathcal{D}_{K}}\min\left\{\min_{i\in A}\left(c_{i}I_{i}\right),\,r(\alpha,\beta)\;\min_{j\notin A}\left(c_{j}J_{j}\right)\right\}.\end{split} (95)

Since the lower bound does not depend on the policy (R,T,Δ)(R,T,\Delta) we have

𝒥A(α,β,,u,K)VA(α,β)\displaystyle\mathcal{J}_{A}(\alpha,\beta,\ell,u,K)\;V_{A}(\alpha,\beta) ϕ(β,α).\displaystyle\geq\phi(\beta,\alpha).

Comparing with (33) and recalling (91) we can see that it suffices to show that

VA(α,β)xAIA=ryAJAV_{A}(\alpha,\beta)\rightarrow x_{A}\,I_{A}^{*}=r\,y_{A}\,J_{A}^{*}

as α,β0\alpha,\beta\to 0 according to (7), with xAx_{A} and yAy_{A} as in (33). The equality follows directly from the values of xAx_{A} and yAy_{A} in (35). Moreover, the maximum in VA(α,β)V_{A}(\alpha,\beta) is achieved by c1,,cMc_{1},\ldots,c_{M} of the form (93) that satisfy (94). Therefore:

VA(α,β)=max(p,q)𝒟Kmin{pIA,r(α,β)qJA},V_{A}(\alpha,\beta)=\max_{(p,q)\in\mathcal{D}^{\prime}_{K}}\min\left\{pI_{A}^{*},\,r(\alpha,\beta)\,qJ_{A}^{*}\right\},

and this maximum is achieved for pp and qq such that the two terms in the minimum are equal. As a result, we obtain

VA(α,β)=pIA=r(α,β)qJA,V_{A}(\alpha,\beta)=pI_{A}^{*}=r(\alpha,\beta)\,qJ_{A}^{*},

where pp and qq are equal to xAx_{A} and yAy_{A} in (33), with rr replaced by r(α,β)r(\alpha,\beta). As α\alpha and β\beta go to 0 according to (7), we have r(α,β)rr(\alpha,\beta)\to r (recall (92)), and consequently VA(α,β)xAIAV_{A}(\alpha,\beta)\rightarrow x_{A}\,I_{A}^{*} with xAx_{A} as in (35).

Finally, we consider the case |A|=|A|=\ell and omit the proof when |A|=u|A|=u, as it is similar. When either =0\ell=0 or r1r\leq 1, we have to show (34). Indeed, working as before, using Lemma C.2(i), we obtain

𝒥A(α,β,,u,K)VA\displaystyle\mathcal{J}_{A}(\alpha,\beta,\ell,u,K)\;V_{A} ϕ(α,β),\displaystyle\geq\phi(\alpha,\beta),
whereVA\displaystyle\text{where}\quad V_{A} :=max(c1,,cM)𝒟KminjA{cjJj}.\displaystyle:=\max_{(c_{1},\ldots,c_{M})\in\mathcal{D}_{K}}\;\min_{j\notin A}\;\left\{c_{j}J_{j}\right\}.

Comparing with (34), and recalling (91), we can see that it suffices to show that VA=JAyA,V_{A}=J_{A}^{*}\;y_{A}, with yAy_{A} as in (34). Indeed, the maximum in VAV_{A} is achieved by c1,,cMc_{1},\ldots,c_{M} of the form (93) with p=0p=0 and q[0,1]q\in[0,1] such that (94) is satisfied, i.e.,

VA=JAmaxq[0,1]:qKˇAKq,V_{A}=J_{A}^{*}\;\max_{q\in[0,1]:\;q\check{K}_{A}\leq K}q,

which shows that VA=JAyA,V_{A}=J_{A}^{*}\;y_{A}, with yAy_{A} as in (34) and completes the proof in this case.

It remains to establish the asymptotic lower bound when >0\ell>0 and r>1r>1, in which case we have to show (35)-(36). The asymptotic equivalence in (35) can be shown by direct evaluation, therefore it suffices to show only the asymptotic lower bound in this case.

Working as before, using Lemma C.2(i) and Lemma C.2(iii), for any α,β(0,1)\alpha,\beta\in(0,1) such that α+β<1\alpha+\beta<1 we obtain

𝒥A(α,β,,u,K)VA(α,β)ϕ(β,α),whereVA(α,β):=max(c1,,cM)𝒟Kmin{r(α,β)minjA(cjJj),miniA(ciIi)+minjA(cjJj)}.\displaystyle\begin{split}&\mathcal{J}_{A}(\alpha,\beta,\ell,u,K)\;\;V_{A}(\alpha,\beta)\geq\phi(\beta,\alpha),\quad\text{where}\\ V_{A}(\alpha,\beta)&:=\max_{(c_{1},\ldots,c_{M})\in\mathcal{D}_{K}}\min\left\{r(\alpha,\beta)\;\min_{j\notin A}\left(c_{j}J_{j}\right),\,\min_{i{\in}A}\left(c_{i}I_{i}\right)+\min_{j\notin A}\left(c_{j}J_{j}\right)\right\}.\end{split} (96)

The latter maximum is achieved by c1,,cMc_{1},\ldots,c_{M} of the form (93) that satisfy (94), thus,

VA(α,β)\displaystyle V_{A}(\alpha,\beta) =max(p,q)𝒟Kmin{r(α,β)qJA,pIA+qJA}.\displaystyle=\max_{(p,q)\in\mathcal{D}^{\prime}_{K}}\;\min\left\{r(\alpha,\beta)\;qJ_{A}^{*},\;pI_{A}^{*}+qJ_{A}^{*}\right\}.
  • If θAr(α,β)1\theta_{A}\geq r(\alpha,\beta)-1 or KK^A+(θA/(r(α,β)1))KˇAK\leq\hat{K}_{A}+(\theta_{A}/(r(\alpha,\beta)-1))\check{K}_{A}, the maximum in VA(α,β)V_{A}(\alpha,\beta) is achieved when the two terms in the minimum are equal. As a result, we have

    VA(α,β)=pIA+qJA=r(α,β)qJAV_{A}(\alpha,\beta)=p\,I_{A}^{*}+q\,J_{A}^{*}=r(\alpha,\beta)\;qJ_{A}^{*} (97)

    with pp and qq equal to xAx_{A} and yAy_{A} as in (35), but with rr replaced by r(α,β)r(\alpha,\beta).

  • Otherwise, the second term in the minimum is smaller and the first equality in (97) holds with pp and qq equal to xAx_{A} and yAy_{A} as in (36), but with rr replaced by r(α,β)r(\alpha,\beta).

Therefore, letting α\alpha and β\beta go to 0 in (96) according to (7), and recalling (91)-(92), proves the asymptotic lower bounds in both (35) and (36).

Appendix D

In this Appendix we prove Theorems V.2 and V.3, which provide sufficient conditions for asymptotic optimality. In both proofs we recall that: ci(A)>0c_{i}^{*}(A)>0 for every ii in AA (resp. AcA^{c}) when xA>0x_{A}>0 (resp. yA>0y_{A}>0), xAyA>0x_{A}\vee y_{A}>0, xA>0x_{A}>0 when |A|>|A|>\ell, and yA>0y_{A}>0 when |A|<u|A|<u.

Proof:

We prove the theorem first when =u\ell=u, where α\alpha and β\beta go to 0 at arbitrary rates. By the asymptotic lower bound (30) in Theorem V.1 it follows that in this case it suffices to show

𝖤A[TR]|log(αβ)|xAIA+yAJA.\mathsf{E}_{A}[T^{R}]\lesssim\frac{|\log(\alpha\wedge\beta)|}{x_{A}\,I_{A}^{*}+y_{A}\,J_{A}^{*}}. (98)

Then, for any ϵ>0\epsilon>0 small enough and c>0c>0 we set

Lϵ(c):=maxiA,jAc(ci(A)ϵ)Ii+(cj(A)ϵ)Jjϵ,L_{\epsilon}(c):=\max_{i\in A,j\notin A}\frac{c}{(c_{i}^{*}(A)-\epsilon)I_{i}+(c_{j}^{*}(A)-\epsilon)J_{j}-\epsilon}, (99)

and observe that

𝖤A[TR]Lϵ(c)+n>Lϵ(c)𝖯A(TR>n).\mathsf{E}_{A}[T^{R}]\leq L_{\epsilon}(c)+\sum_{n>L_{\epsilon}(c)}\mathsf{P}_{A}(T^{R}>n). (100)

For any nn\in\mathbb{N}, by the definition of TRT^{R} in (12) it follows that on the event {TR>n}\{T^{R}>n\} there are iAi\in A and jAj\notin A such that ΛijR(n)<c\Lambda^{R}_{ij}(n)<c, and as a result

𝖯A(TR>n)\displaystyle\mathsf{P}_{A}(T^{R}>n) iA,jA𝖯A(ΛijR(n)<c).\displaystyle\leq\sum_{i\in A,j\notin A}\mathsf{P}_{A}(\Lambda^{R}_{ij}(n)<c).

Moreover, for any n>Lϵ(c)n>L_{\epsilon}(c) and iA,jAi\in A,j\notin A we have

c<n((ci(A)ϵ)Ii+(cj(A)ϵ)Jjϵ),\displaystyle c<n\left((c_{i}^{*}(A)-\epsilon)I_{i}+(c_{j}^{*}(A)-\epsilon)J_{j}-\epsilon\right), (101)

and consequently for every c>0c>0 the series in (100) is bounded by

iA,jAn=1𝖯A(ΛijR(n)n<(ci(A)ϵ)Ii+(cj(A)ϵ)Jjϵ).\displaystyle\sum_{i\in A,j\notin A}\sum_{n=1}^{\infty}\mathsf{P}_{A}\left(\frac{\Lambda^{R}_{ij}(n)}{n}<(c_{i}^{*}(A)-\epsilon)I_{i}+(c_{j}^{*}(A)-\epsilon)J_{j}-\epsilon\right). (102)

By the assumption of the theorem and an application of Lemma A.2(iii) with ρi\rho_{i} equal to ci(A)ϵc_{i}^{*}(A)-\epsilon (resp. 0) when xA>0x_{A}>0 (resp. xA=0x_{A}=0) and ρj\rho_{j} equal to cj(A)ϵc_{j}^{*}(A)-\epsilon (resp. 0) when yA>0y_{A}>0 (resp. yA=0y_{A}=0), it follows that the series in (102) converges. Thus, letting first cc\to\infty and then ϵ0\epsilon\to 0 in (100) proves that as cc\to\infty we have

𝖤A[TR]maxiA,jAcci(A)Ii+cj(A)Jj.\mathsf{E}_{A}[T^{R}]\lesssim\max_{i\in A,j\notin A}\frac{c}{c_{i}^{*}(A)I_{i}+c_{j}^{*}(A)\,J_{j}}. (103)

In view of (40) and the selection of threshold cc according to (18), this proves (98).

We next consider the case <u\ell<u, where we let α,β0\alpha,\beta\to 0 so that (7) holds for some r(0,)r\in(0,\infty). We prove the result when |A|<u\ell\leq|A|<u, as the proof when <|A|u\ell<|A|\leq u is similar. Thus, in what follows, |A|<u\ell\leq|A|<u, and as a result yA>0y_{A}>0 and cj(A)>0c_{j}^{*}(A)>0 for every jAj\notin A. By the universal asymptotic lower bounds (33) and (34) it follows that when either |A|>|A|>\ell or |A|==0|A|=\ell=0, it suffices to show that

\mathsf{E}_{A}[T^{R}]\lesssim\frac{|\log\beta|}{y_{A}\,J^{*}_{A}}. (104)

On the other hand, by the universal asymptotic lower bounds (35) and (36) it follows that when $|A|=\ell>0$, it suffices to show that

\mathsf{E}_{A}[T^{R}]\lesssim\frac{|\log\beta|}{y_{A}\,J_{A}^{*}}\bigvee\frac{|\log\alpha|}{x_{A}\,I_{A}^{*}+y_{A}\,J_{A}^{*}}. (105)

In particular, when $r\leq 1$ the maximum is attained strictly by the first term; when $r>1$, $z_{A}<1$ and $K>\hat{K}_{A}+z_{A}\check{K}_{A}$, it is attained strictly by the second term; in all other cases the two terms are equal to a first-order asymptotic approximation.
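To see concretely why the first term dominates when $r\leq 1$, assume, for the purpose of this remark only, that $r(\alpha,\beta)$ stands for the ratio $|\log\alpha|/|\log\beta|$ in (7). Then the two terms in (105) compare as

\frac{|\log\alpha|}{x_{A}\,I_{A}^{*}+y_{A}\,J_{A}^{*}}=\frac{r(\alpha,\beta)\,|\log\beta|}{x_{A}\,I_{A}^{*}+y_{A}\,J_{A}^{*}}\leq\frac{|\log\beta|}{y_{A}\,J_{A}^{*}}\quad\Longleftrightarrow\quad\left(r(\alpha,\beta)-1\right)y_{A}\,J_{A}^{*}\leq x_{A}\,I_{A}^{*},

and the condition on the right holds automatically when $r(\alpha,\beta)\leq 1$, since $x_{A}\,I_{A}^{*}\geq 0$.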

We start by proving (104) when $\ell<|A|<u$. In this case we also have $x_{A}>0$, and consequently $c_{i}^{*}(A)>0$ for every $i\in A$. Then, for $\epsilon>0$ small enough and $a,b>0$ we set

N_{\epsilon}(a,b):=\max_{j\notin A,\,i\in A}\left\{\frac{a}{(c_{j}^{*}(A)-\epsilon)J_{j}-\epsilon},\,\frac{b}{(c_{i}^{*}(A)-\epsilon)I_{i}-\epsilon}\right\}, (106)

and observe that

\mathsf{E}_{A}[T^{R}]\leq N_{\epsilon}(a,b)+\sum_{n>N_{\epsilon}(a,b)}\mathsf{P}_{A}(T^{R}>n). (107)

By the definition of $T^{R}$ in (16) it follows that, for any $n\in\mathbb{N}$, on the event $\{T^{R}>n\}$ there is either a $j\notin A$ such that $\Lambda^{R}_{j}(n)>-a$, or an $i\in A$ such that $\Lambda^{R}_{i}(n)<b$. As a result, by the union bound we obtain

\mathsf{P}_{A}(T^{R}>n)\leq\sum_{j\notin A}\mathsf{P}_{A}(-\Lambda^{R}_{j}(n)<a)+\sum_{i\in A}\mathsf{P}_{A}(\Lambda^{R}_{i}(n)<b).

Moreover, for any $n>N_{\epsilon}(a,b)$ and any $i\in A$, $j\notin A$ we have $a<n\left((c_{j}^{*}(A)-\epsilon)J_{j}-\epsilon\right)$ and $b<n\left((c_{i}^{*}(A)-\epsilon)I_{i}-\epsilon\right)$, which implies that the series in (107) is bounded by

\sum_{j\notin A}\sum_{n=1}^{\infty}\mathsf{P}_{A}\left(-\frac{1}{n}\Lambda^{R}_{j}(n)<(c_{j}^{*}(A)-\epsilon)J_{j}-\epsilon\right)+\sum_{i\in A}\sum_{n=1}^{\infty}\mathsf{P}_{A}\left(\frac{1}{n}\Lambda^{R}_{i}(n)<(c_{i}^{*}(A)-\epsilon)I_{i}-\epsilon\right). (108)

By the assumption of the theorem and an application of Lemma A.2(i) with $\rho_{i}=c_{i}^{*}(A)-\epsilon$ and of Lemma A.2(ii) with $\rho_{j}=c_{j}^{*}(A)-\epsilon$, it follows that (108) converges. Thus, letting first $a,b\to\infty$ and then $\epsilon\to 0$ in (107) proves that as $a,b\to\infty$

\mathsf{E}_{A}[T^{R}]\lesssim\max_{j\notin A,\,i\in A}\left\{\frac{a}{c_{j}^{*}(A)\,J_{j}},\,\frac{b}{c_{i}^{*}(A)\,I_{i}}\right\}.

In view of (40) and the selection of the thresholds $a,b$ according to (19), this implies that

\mathsf{E}_{A}[T^{R}]\lesssim\frac{|\log\beta|}{y_{A}\,J^{*}_{A}}\sim\frac{|\log\alpha|}{x_{A}\,I^{*}_{A}}, (109)

which proves (104).

The proof when $|A|=\ell=0$, in which case $x_{A}=0$, is similar, with the difference that we use

N_{\epsilon}(a):=\max_{j\notin A}\left\{\frac{a}{(c_{j}^{*}(A)-\epsilon)J_{j}-\epsilon}\right\} (110)

in place of $N_{\epsilon}(a,b)$, and apply only Lemma A.2(i).

It remains to show that (105) holds when $|A|=\ell>0$, in which case $x_{A}$ is not always positive. We recall the definitions of $L_{\epsilon}(c)$ and $N_{\epsilon}(a)$ in (99) and (110) and observe that for any $\epsilon>0$ small enough and $a,c>0$ we have

\mathsf{E}_{A}[T^{R}]\leq L_{\epsilon}(c)\vee N_{\epsilon}(a)+\sum_{n>L_{\epsilon}(c)\vee N_{\epsilon}(a)}\mathsf{P}_{A}(T^{R}>n). (111)

By the definition of $T^{R}$ in (16) it follows that, for any $n\in\mathbb{N}$, on the event $\{T^{R}>n\}$ there are either $i\in A$ and $j\notin A$ such that $\Lambda^{R}_{ij}(n)<c$, or a $j\notin A$ such that $\Lambda^{R}_{j}(n)>-a$, and as a result

\mathsf{P}_{A}(T^{R}>n)\leq\sum_{i\in A,\,j\notin A}\mathsf{P}_{A}(\Lambda^{R}_{ij}(n)<c)+\sum_{j\notin A}\mathsf{P}_{A}(\Lambda^{R}_{j}(n)>-a).

Following similar steps as in the previous cases, and applying in particular Lemma A.2(ii) with $\rho_{j}=c_{j}^{*}(A)-\epsilon$ and Lemma A.2(iii) with $\rho_{j}=c_{j}^{*}(A)-\epsilon$ and $\rho_{i}$ equal to $c_{i}^{*}(A)-\epsilon$ (resp. 0) when $x_{A}>0$ (resp. $x_{A}=0$), we conclude that as $a,c\to\infty$

\mathsf{E}_{A}[T^{R}]\lesssim\max_{i\in A,\,j\notin A}\left\{\frac{a}{c_{j}^{*}(A)\,J_{j}}\bigvee\frac{c}{c_{i}^{*}(A)\,I_{i}+c_{j}^{*}(A)\,J_{j}}\right\}.

In view of (40) and the selection of the thresholds $a,c$ according to (19), this proves (105). ∎

Proof:

Fix $A\in\mathcal{P}_{\ell,u}$ and a probabilistic sampling rule $R$ that satisfies (45). Since $c_{i}^{*}(A)>0$ for every $i$ in $A$ (resp. $A^{c}$) when $x_{A}>0$ (resp. $y_{A}>0$), $x_{A}\vee y_{A}>0$, $x_{A}>0$ when $|A|>\ell$, and $y_{A}>0$ when $|A|<u$, the exponential consistency of $R$ under $\mathsf{P}_{A}$ follows by an application of Theorem IV.1. To establish its asymptotic optimality, by Theorem V.2 it suffices to show that $\mathsf{P}_{A}(\pi_{i}^{R}(n)<\rho)$ is an exponentially decaying sequence for every $\rho\in(0,c_{i}^{*}(A))$ and every $i\in[M]$ such that $c_{i}^{*}(A)>0$. Fix such $i$ and $\rho$. Then, there is an $\epsilon>0$ such that

\rho+\epsilon<c_{i}^{*}(A). (112)

By the definition of a probabilistic rule (recall (21)), $R(n+1)$ is conditionally independent of $\mathcal{F}_{n}^{R}$ given $\Delta_{n}^{R}$, and its conditional distribution, $q^{R}$, does not depend on $n$. Thus, by [39, Prop. 6.13] there is a measurable function $h:\mathcal{P}_{\ell,u}\times[0,1]\to 2^{[M]}$, which does not depend on $n$, such that

R(n+1)=h\left(\Delta_{n}^{R},Z_{0}(n)\right),\quad n\in\mathbb{N},

where $\{Z_{0}(n),\,n\in\mathbb{N}\}$ is a sequence of iid random variables, uniformly distributed in $(0,1)$. Consequently, there is a measurable function $h_{i}:\mathcal{P}_{\ell,u}\times[0,1]\to\{0,1\}$ such that

R_{i}(n+1)=h_{i}\left(\Delta_{n}^{R},Z_{0}(n)\right),\quad n\in\mathbb{N}. (113)
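For intuition, (113) says that, conditionally on the current estimate, the next sampled subset is produced by pushing a single uniform random variable through a fixed measurable map. The following Python sketch illustrates this mechanism with an inverse-CDF construction; the number of sources, the current estimate, and the conditional distribution `q_R` are placeholders chosen only for illustration, not the sampling rule analyzed in the paper.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
M = 4  # number of sources (illustrative)

def q_R(subset, estimate):
    """Placeholder conditional distribution q^R(subset | estimate):
    each source is included independently, with a higher probability
    if it belongs to the current estimate of the anomalous set."""
    p = [0.8 if i in estimate else 0.4 for i in range(M)]
    prob = 1.0
    for i in range(M):
        prob *= p[i] if i in subset else 1.0 - p[i]
    return prob

def h(estimate, z0):
    """Measurable map of (estimate, single uniform) to the sampled subset,
    obtained via the inverse CDF of q^R(. | estimate), as in (113)."""
    cdf = 0.0
    for size in range(M + 1):
        for subset in itertools.combinations(range(M), size):
            cdf += q_R(set(subset), estimate)
            if z0 <= cdf:
                return set(subset)
    return set(range(M))  # numerical safeguard

estimate = {0, 2}  # hypothetical current estimate Delta_n
R_next = h(estimate, rng.uniform())
print("sources sampled at time n+1:", sorted(R_next))
```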

Using this representation, for every $n\in\mathbb{N}$ we have

\{\pi_{i}^{R}(n)<\rho\}=\left\{\sum_{m=1}^{n}h_{i}\left(A,Z_{0}(m-1)\right)+\sum_{m=1}^{n}\left(R_{i}(m)-h_{i}(A,Z_{0}(m-1))\right)<n(\rho+\epsilon)-n\epsilon\right\},

and as a result

\mathsf{P}_{A}(\pi_{i}^{R}(n)<\rho)\leq\mathsf{P}_{A}\left(\sum_{m=1}^{n}h_{i}(A,Z_{0}(m-1))<n(\rho+\epsilon)\right)+\mathsf{P}_{A}\left(\sum_{m=1}^{n}\left(R_{i}(m)-h_{i}(A,Z_{0}(m-1))\right)<-n\epsilon\right). (114)

From (22) and (113) it follows that $\{h_{i}(A,Z_{0}(n-1)),\,n\in\mathbb{N}\}$ is a sequence of iid Bernoulli random variables with parameter $c^{R}_{i}(A)$, whereas by (45) and (112) it follows that $\rho+\epsilon<c^{R}_{i}(A)$. Therefore, by the Chernoff bound we conclude that the first term in the upper bound in (114) is exponentially decaying. The second term is bounded as follows:

\mathsf{P}_{A}\left(\sum_{m=1}^{n}\left(R_{i}(m)-h_{i}(A,Z_{0}(m-1))\right)<-n\epsilon\right)\leq\mathsf{P}_{A}\left(\sum_{m=1}^{n}\left|R_{i}(m)-h_{i}(A,Z_{0}(m-1))\right|>n\epsilon\right)\leq\mathsf{P}_{A}\left(\sigma^{R}_{A}>n\right)+\mathsf{P}_{A}\left(\sigma^{R}_{A}>n\epsilon\right), (115)

where the first inequality follows from the triangle inequality and the second by an application of the total probability rule on the event $\{\sigma^{R}_{A}\leq n\}$, in view of the fact that

\sigma^{R}_{A}\leq n\quad\Rightarrow\quad\sum_{m=1}^{n}\left|R_{i}(m)-h_{i}(A,Z_{0}(m-1))\right|\leq\sigma^{R}_{A}.

By the exponential consistency of $R$, the upper bound in (115) is exponentially decaying, which means that the second term in the upper bound in (114) is also exponentially decaying, and this completes the proof. ∎
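For completeness, the Chernoff-bound step used for the first term in (114) can be checked numerically: for iid Bernoulli($p$) variables and a level $a<p$, $\mathsf{P}(\sum_{m=1}^{n}X_{m}<na)\leq e^{-n\,d(a\|p)}$, where $d(a\|p)$ is the binary Kullback-Leibler divergence. The Python sketch below compares a Monte Carlo estimate with this bound; the parameter values are arbitrary illustrative choices, with $p$ playing the role of $c_{i}^{R}(A)$ and $a$ that of $\rho+\epsilon$.

```python
import numpy as np

rng = np.random.default_rng(2)

def binary_kl(a, p):
    """Binary Kullback-Leibler divergence d(a || p)."""
    return a * np.log(a / p) + (1 - a) * np.log((1 - a) / (1 - p))

p, a, n = 0.6, 0.45, 100       # p ~ c_i^R(A), a ~ rho + eps, with a < p
trials = 200_000
counts = rng.binomial(n, p, size=trials)   # each draw is a sum of n Bernoulli(p) variables
empirical = np.mean(counts < n * a)        # Monte Carlo estimate of P(sum < n*a)
chernoff = np.exp(-n * binary_kl(a, p))    # Chernoff (large-deviation) upper bound
print(f"empirical tail probability: {empirical:.2e}")
print(f"Chernoff upper bound      : {chernoff:.2e}")
```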

References

  • [1] J. Stiles and T. Jernigan, “The basics of brain development,” Neuropsychology Review, vol. 20, pp. 327–348, 2010.
  • [2] R. J. Bolton and D. J. Hand, “Statistical Fraud Detection: A Review,” Statistical Science, vol. 17, no. 3, pp. 235 – 255, 2002. [Online]. Available: https://doi.org/10.1214/ss/1042727940
  • [3] S. K. De and M. Baron, “Sequential Bonferroni methods for multiple hypothesis testing with strong control of family-wise error rates I and II,” Sequential Analysis, vol. 31, no. 2, pp. 238–262, 2012.
  • [4] ——, “Step-up and step-down methods for testing multiple hypotheses in sequential experiments,” Journal of Statistical Planning and Inference, vol. 142, no. 7, pp. 2059–2070, 2012.
  • [5] J. Bartroff and J. Song, “Sequential tests of multiple hypotheses controlling type I and II familywise error rates,” Journal of Statistical Planning and Inference, vol. 153, pp. 100–114, 2014.
  • [6] Y. Song and G. Fellouris, “Asymptotically optimal, sequential, multiple testing procedures with prior information on the number of signals,” Electronic Journal of Statistics, vol. 11, 2016.
  • [7] ——, “Sequential multiple testing with generalized error control: An asymptotic optimality theory,” The Annals of Statistics, vol. 47, no. 3, pp. 1776 – 1803, 2019. [Online]. Available: https://doi.org/10.1214/18-AOS1737
  • [8] J. Bartroff and J. Song, “Sequential tests of multiple hypotheses controlling false discovery and nondiscovery rates,” Sequential Analysis, vol. 39, no. 1, pp. 65–91, 2020. [Online]. Available: https://doi.org/10.1080/07474946.2020.1726686
  • [9] K. S. Zigangirov, “On a problem in optimal scanning,” Theory of Probability & Its Applications, vol. 11, no. 2, pp. 294–298, 1966. [Online]. Available: https://doi.org/10.1137/1111025
  • [10] E. M. Klimko and J. Yackel, “Optimal search strategies for Wiener processes,” Stochastic Processes and their Applications, vol. 3, pp. 19–33, 1975.
  • [11] V. Dragalin, “A simple and effective scanning rule for a multi-channel system,” Metrika, vol. 43, pp. 165–182, 1996.
  • [12] K. Cohen and Q. Zhao, “Asymptotically optimal anomaly detection via sequential testing,” IEEE Transactions on Signal Processing, vol. 63, no. 11, pp. 2929–2941, 2015.
  • [13] B. Huang, K. Cohen, and Q. Zhao, “Active anomaly detection in heterogeneous processes,” IEEE Transactions on Information Theory, vol. 65, no. 4, pp. 2284–2301, 2019.
  • [14] N. K. Vaidhiyan and R. Sundaresan, “Learning to detect an oddball target,” IEEE Transactions on Information Theory, vol. 64, no. 2, pp. 831–852, 2018.
  • [15] A. Gurevich, K. Cohen, and Q. Zhao, “Sequential anomaly detection under a nonlinear system cost,” IEEE Transactions on Signal Processing, vol. 67, no. 14, pp. 3689–3703, 2019.
  • [16] A. Tsopelakos, G. Fellouris, and V. V. Veeravalli, “Sequential anomaly detection with observation control,” in 2019 IEEE International Symposium on Information Theory (ISIT), 2019, pp. 2389–2393.
  • [17] B. Hemo, T. Gafni, K. Cohen, and Q. Zhao, “Searching for anomalies over composite hypotheses,” IEEE Transactions on Signal Processing, vol. 68, pp. 1181–1196, 2020.
  • [18] H. Chernoff, “Sequential design of experiments,” Ann. Math. Statist., vol. 30, no. 3, pp. 755–770, 1959. [Online]. Available: http://dx.doi.org/10.1214/aoms/1177706205
  • [19] A. E. Albert, “The Sequential Design of Experiments for Infinitely Many States of Nature,” The Annals of Mathematical Statistics, vol. 32, no. 3, pp. 774 – 799, 1961. [Online]. Available: https://doi.org/10.1214/aoms/1177704973
  • [20] S. A. Bessler, “Theory and applications of the sequential design of experiments, k-actions and infinitely many experiments, Part I: Theory,” Department of Statistics, Stanford University, Technical Report 55, 1960.
  • [21] ——, “Theory and applications of the sequential design of experiments, k-actions and infinitely many experiments, Part II: Applications,” Department of Statistics, Stanford University, Technical Report 56, 1960.
  • [22] J. Kiefer and J. Sacks, “Asymptotically Optimum Sequential Inference and Design,” The Annals of Mathematical Statistics, vol. 34, no. 3, pp. 705 – 750, 1963. [Online]. Available: https://doi.org/10.1214/aoms/1177704000
  • [23] R. Keener, “Second Order Efficiency in the Sequential Design of Experiments,” The Annals of Statistics, vol. 12, no. 2, pp. 510 – 532, 1984. [Online]. Available: https://doi.org/10.1214/aos/1176346503
  • [24] S. P. Lalley and G. Lorden, “A Control Problem Arising in the Sequential Design of Experiments,” The Annals of Probability, vol. 14, no. 1, pp. 136 – 172, 1986. [Online]. Available: https://doi.org/10.1214/aop/1176992620
  • [25] S. Nitinawarat, G. K. Atia, and V. V. Veeravalli, “Controlled sensing for multihypothesis testing,” IEEE Transactions on Automatic Control, vol. 58, no. 10, pp. 2451–2464, 2013.
  • [26] M. Naghshvar and T. Javidi, “Active sequential hypothesis testing,” The Annals of Statistics, vol. 41, no. 6, pp. 2703–2738, 2013.
  • [27] S. Nitinawarat and V. V. Veeravalli, “Controlled sensing for sequential multihypothesis testing with controlled markovian observations and non-uniform control cost,” Sequential Analysis, vol. 34, no. 1, pp. 1–24, 2015. [Online]. Available: https://doi.org/10.1080/07474946.2014.961864
  • [28] A. Deshmukh, V. V. Veeravalli, and S. Bhashyam, “Sequential controlled sensing for composite multihypothesis testing,” Sequential Analysis, vol. 40, no. 2, pp. 259–289, 2021. [Online]. Available: https://doi.org/10.1080/07474946.2021.1912525
  • [29] A. Garivier and E. Kaufmann, “Optimal best arm identification with fixed confidence,” in Conference on Learning Theory.   PMLR, 2016, pp. 998–1027.
  • [30] S. I. Resnick, Adventures in stochastic processes.   Springer Science & Business Media, 1992.
  • [31] G. Hommel and T. Hoffmann, “Controlled uncertainty,” in Multiple Hypothesenprüfung/Multiple Hypotheses Testing.   Springer, 1988, pp. 154–161.
  • [32] E. L. Lehmann and J. P. Romano, “Generalizations of the familywise error rate,” Ann. Statist., vol. 33, no. 3, pp. 1138–1154, 2005. [Online]. Available: http://dx.doi.org/10.1214/009053605000000084
  • [33] A. Tsopelakos and G. Fellouris, “Sequential anomaly detection with observation control under a generalized error metric,” in 2020 IEEE International Symposium on Information Theory (ISIT), 2020, pp. 1165–1170.
  • [34] J. Heydari, A. Tajer, and H. V. Poor, “Quickest linear search over correlated sequences,” IEEE Transactions on Information Theory, vol. 62, no. 10, pp. 5786–5808, 2016.
  • [35] M. Raginsky and I. Sason, “Concentration of measure inequalities in information theory, communications, and coding,” Foundations and Trends® in Communications and Information Theory, vol. 10, no. 1-2, pp. 1–246, 2013. [Online]. Available: http://dx.doi.org/10.1561/0100000064
  • [36] A. Wald, “Sequential tests of statistical hypotheses,” The Annals of Mathematical Statistics, vol. 16, no. 2, pp. 117–186, 1945.
  • [37] A. Tartakovsky, I. Nikiforov, and M. Basseville, Sequential analysis: Hypothesis testing and changepoint detection.   CRC press, 2015.
  • [38] Y. Chow and H. Teicher, Probability Theory: Independence, Interchangeability, Martingales, ser. Springer Texts in Statistics.   Springer New York, 2012. [Online]. Available: https://books.google.com/books?id=213dBwAAQBAJ
  • [39] O. Kallenberg, Foundations of Modern Probability, ser. Probability and Its Applications.   Springer New York, 2002. [Online]. Available: https://books.google.com/books?id=TBgFslMy8V4C