
Sparsity-Exploiting Distributed Projections onto a Simplex

Yongzheng Dai, Chen Chen
Abstract

Projecting a vector onto a simplex is a well-studied problem that arises in a wide range of optimization problems. Numerous algorithms have been proposed for determining the projection; however, the primary focus of the literature has been on serial algorithms. We present a parallel method that decomposes the input vector and distributes it across multiple processors for local projection. Our method is especially effective when the resulting projection is highly sparse, which is the case, for instance, in large-scale problems with i.i.d. entries. Moreover, the method can be adapted to parallelize a broad range of serial algorithms from the literature. We fill in theoretical gaps in serial algorithm analysis, and develop similar results for our parallel analogues. Numerical experiments conducted on a wide range of large-scale instances, both real-world and simulated, demonstrate the practical effectiveness of the method.

1 Introduction

Given a vector $d\in\mathbb{R}^{n}$, we consider the following projection of $d$:

$$\mbox{proj}_{\Delta_{b}}(d):=\mbox{argmin}_{v\in\Delta_{b}}\|v-d\|_{2}, \qquad (1)$$

where $\Delta_{b}$ is a scaled standard simplex parameterized by some scaling factor $b>0$,

$$\Delta_{b}:=\{v\in\mathbb{R}^{n}\ |\ \sum_{i=1}^{n}v_{i}=b,\ v\geq 0\}.$$

1.1 Applications

Projection onto a simplex can be leveraged as a subroutine to determine projections onto more complex polyhedra. Such projections arise in numerous settings such as: image processing, e.g. labeling (Lellmann et al. 2009), or multispectral unmixing (Bioucas-Dias et al. 2012); portfolio optimization (Brodie et al. 2009); and machine learning (Blondel et al. 2014). As a particular example, projection onto a simplex can be used to project onto the parity polytope (Wasson et al. 2019):

$$\mbox{proj}_{\mathbb{PP}_{n}}(d):=\mbox{argmin}_{v\in\mathbb{PP}_{n}}\|v-d\|_{2}, \qquad (2)$$

where $\mathbb{PP}_{n}$ is the $n$-dimensional parity polytope:

$$\mathbb{PP}_{n}:=\mbox{conv}(\{v\in\{0,1\}^{n}\ |\ \sum_{i=1}^{n}v_{i}=0\ (\mbox{mod}\ 2)\}).$$

Projection onto the parity polytope arises in linear programming (LP) decoding (Liu and Draper 2016, Barman et al. 2013, Zhang and Siegel 2013, Zhang et al. 2013, Wei et al. 2015), which is used for signal processing.

Another example is projection onto an $\ell_{1}$ ball:

$$\mathcal{B}_{b}:=\{v\in\mathbb{R}^{n}\ |\ \sum_{i=1}^{n}|v_{i}|\leq b\}. \qquad (3)$$

Duchi et al. (2008) demonstrate that the solution to this problem can be easily recovered from projection onto a simplex. Furthermore, projection onto an $\ell_{1}$ ball can, in turn, be used as a subroutine in gradient-projection methods (see e.g. van den Berg (2020)) for a variety of machine learning problems that use an $\ell_{1}$ penalty, such as: Lasso (Tibshirani 1996); basis-pursuit denoising (Chen et al. 1998, van den Berg and Friedlander 2009, van den Berg 2020); sparse representation in dictionaries (Donoho and Elad 2003); variable selection (Tibshirani 1997); and classification (Barlaud et al. 2017).

Finally, we note that methods for projection onto the scaled standard simplex and $\ell_{1}$ ball can be extended to projection onto the weighted simplex and weighted $\ell_{1}$ ball (Perez et al. 2020a) (see Section B.2). Projection onto the weighted simplex can, in turn, be used to solve the continuous quadratic knapsack problem (Robinson et al. 1992). Moreover, $\ell_{p}$ regularization can be handled by iteratively solving weighted $\ell_{1}$ projections (Candès et al. 2008, Chartrand and Yin 2008, Chen and Zhou 2014).

1.2 Contributions

This paper presents a method to decompose the projection problem and distribute work across (up to $n$) processors. The key insight to our approach is captured by Proposition 3.4: the projection of any subvector of $d$ onto a simplex (in the corresponding space with scale factor $b$) will have zero-valued entries only if the full-dimension projection has corresponding zero-valued entries. The method can be interpreted as a sparsity-exploiting distributed preprocessing method, and thus it can be adapted to parallelize a broad range of serial projection algorithms. We furthermore provide extensive theoretical and empirical analyses of several such adaptations. We also fill in gaps in the literature on serial algorithm complexity. Our computational results demonstrate significant speedups from our distributed method compared to the state-of-the-art over a wide range of large-scale problems involving both real-world and simulated data.

Our paper contributes to the limited literature on parallel computation for large-scale projection onto a simplex. Most algorithms for projection onto a simplex are designed for serial computing. Indeed, to our knowledge, there is only one published parallel method and one distributed method for projection problem (1). Wasson et al. (2019) parallelize a basic sort-and-scan (specifically prefix sum) approach, a method that we use as a benchmark in our experiments. We also develop a modest but practically significant enhancement to their approach. Iutzeler and Condat (2018) propose a gossip-based distributed ADMM algorithm for projection onto a simplex. In this gossip-based setup, one entry of $d$ is given to each agent (e.g. processor), and communication is restricted according to a particular network topology. This differs fundamentally from our approach both in context and intended use, as we aim to solve large-scale problems and moreover our method can accommodate any number of processors up to $n$.

The remainder of the paper is organized as follows. Section 2 describes serial algorithms from the literature and develops new complexity results to fill in gaps in the literature. Section 3 develops parallel analogues of the aforementioned algorithms using our novel distributed method. Section 4 extends these parallelized algorithms to various applications of projection onto a simplex. Section 5 describes computational experiments. Section 6 concludes. Note that all appendices, mathematical proofs, as well as our code and data can be found in the online supplement.

2 Background and Serial Algorithms

This section begins with a presentation of some fundamental results regarding projection onto a simplex, followed by analysis of serial algorithms for the problem, filling in various gaps in the literature. The final subsection, Section 2.5, provides a summary. Note that, for the purposes of average-case analysis, we assume uniformly distributed inputs, i.e. $d_{1},\dots,d_{n}$ are $\mathrm{i.i.d.}\sim U[l,u]$, a typical choice of distribution in the literature (e.g. Condat (2016)).

2.1 Properties of Simplex Projection

KKT conditions characterize the unique optimal solution $v^{*}$ to problem (1):

Proposition 2.1 (Held et al. (1974)).

For a vector $d\in\mathbb{R}^{n}$ and a scaled standard simplex $\Delta_{b}\subset\mathbb{R}^{n}$, there exists a unique $\tau\in\mathbb{R}$ such that

$$v^{*}_{i}=\max\{d_{i}-\tau,0\},\ \forall i=1,\cdots,n, \qquad (4)$$

where $v^{*}:=\mathrm{proj}_{\Delta_{b}}(d)\in\mathbb{R}^{n}$.

Hence, (1) can be reduced to a univariate search problem for the optimal pivot $\tau$. Note that the nonzero (positive) entries of $v^{*}$ correspond to entries where $d_{i}>\tau$. So for a given value $t\in\mathbb{R}$ and the index set $\mathcal{I}:=\{1,...,n\}$, we denote the active index set

$$\mathcal{I}_{t}:=\{i\in\mathcal{I}\ |\ d_{i}>t\},$$

as the set containing all indices of active elements where $d_{i}>t$. Now consider the following function, which will be used to provide an alternate characterization of $\tau$:

$$f(t):=\begin{cases}\frac{\sum_{i\in\mathcal{I}_{t}}d_{i}-b}{|\mathcal{I}_{t}|}-t,&t<\max_{i}\{d_{i}\}\\ -b,&t\geq\max_{i}\{d_{i}\}\end{cases} \qquad (5)$$
Corollary 2.2.

For any $t_{1},t_{2}\in\mathbb{R}$ such that $t_{1}<\tau<t_{2}$, we have

$$f(t_{1})>f(\tau)=0>f(t_{2}).$$

The sign of $f$ only changes once, and $\tau$ is its unique root. These results can be leveraged to develop search algorithms for $\tau$, which are presented next. This corollary and the use of $f$ are our own contribution, as we have found it a convenient organizing principle for the sake of exposition. We note, however, that the root-finding framework has been in use in the more general constrained quadratic programming literature (see (Cominetti et al. 2014, Equation 5) and (Dai et al. 2006, Section 2, Paragraph 2)).

2.2 Sort and Scan

Observe that only the greatest $|\mathcal{I}_{\tau}|$ terms of $d$ are indexed in $\mathcal{I}_{\tau}$. Now suppose we sort $d$ in non-increasing order:

$$d_{\pi_{1}}\geq d_{\pi_{2}}\geq...\geq d_{\pi_{n}}.$$

We can sequentially test these values in descending order, $f(d_{\pi_{1}}),f(d_{\pi_{2}})$, etc. to determine $|\mathcal{I}_{\tau}|$. In particular, from Corollary 2.2 we know there exists some $\kappa:=|\mathcal{I}_{\tau}|$ such that $f(d_{\pi_{\kappa}})<0\leq f(d_{\pi_{\kappa+1}})$. Thus the projection must have $\kappa$ active elements, and since $f(\tau)=0$, we have $\tau=\frac{\sum_{i=1}^{\kappa}d_{\pi_{i}}-b}{\kappa}$. We also note that, rather than recalculating $f$ at each iteration, one can keep a running cumulative/prefix sum or scan of $\sum_{i=1}^{j}d_{\pi_{i}}$ as $j$ increments.

Input: vector $d=(d_{1},\cdots,d_{n})$, scaling factor $b$.
Output: projection $v^{*}$.
1 Sort $d$ as $d_{\pi_{1}}\geq\cdots\geq d_{\pi_{n}}$;
2 Set $\kappa:=\max_{1\leq j\leq n}\{j:\frac{\sum_{i=1}^{j}d_{\pi_{i}}-b}{j}<d_{\pi_{j}}\}$ (set $\kappa:=n$ if the maximum does not exist);
3 Set $\tau:=\frac{\sum_{i=1}^{\kappa}d_{\pi_{i}}-b}{\kappa}$;
4 Set $\mathcal{I}_{\tau}:=\{i\ |\ d_{i}>\tau\}$, $V_{\tau}:=\{d_{i}-\tau\ |\ i\in\mathcal{I}_{\tau}\}$;
return $SparseVector(\mathcal{I}_{\tau},V_{\tau})$.
Algorithm 1 Sort and Scan (Held et al. (1974))

The bottleneck is sorting, as all other operations are linear time; for instance, QuickSort executes the sort with average complexity $O(n\log n)$ and worst-case $O(n^{2})$, while MergeSort has worst-case $O(n\log n)$ (see, e.g. (Bentley and McIlroy 1993)). Moreover, non-comparison sorting methods can achieve $O(n)$ (see, e.g. (Mahmoud 2000)), albeit typically with a high constant factor as well as dependence on the bit-size of $d$.
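For concreteness, the following is a minimal Julia sketch of Algorithm 1; the function name sort_and_scan and the dense return value are our own illustrative choices, not the implementation used in our experiments.

```julia
# Sketch of Algorithm 1 (Sort and Scan); returns the dense projection vector.
function sort_and_scan(d::AbstractVector{<:Real}, b::Real)
    ds = sort(d, rev=true)             # d_{π_1} ≥ ... ≥ d_{π_n}
    s, τ = 0.0, 0.0
    for j in eachindex(ds)
        s += ds[j]                     # running prefix sum Σ_{i ≤ j} d_{π_i}
        t = (s - b) / j
        t < ds[j] || break             # stop at the first j with f(d_{π_j}) ≥ 0
        τ = t
    end
    return max.(d .- τ, 0.0)           # v*_i = max(d_i − τ, 0)
end

# Example: project onto the unit simplex (b = 1).
v = sort_and_scan([0.3, 1.2, -0.5, 0.9], 1.0)
@assert abs(sum(v) - 1.0) < 1e-12      # v = [0.0, 0.65, 0.0, 0.35]
```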

2.3 Pivot and Partition

Sort and Scan begins by sorting all elements of $d$, but only the greatest $|\mathcal{I}_{\tau}|$ terms are actually needed to calculate $\tau$. Pivot and Partition, proposed by Duchi et al. (2008), can be interpreted as a hybrid sort-and-scan that attempts to avoid sorting all elements. We present as Algorithm 2 a variant of this approach given by Condat (2016).

Input: vector $d=(d_{1},\cdots,d_{n})$, scaling factor $b$.
Output: projection $v^{*}$.
1 Set $\mathcal{I}:=\{1,...,n\}$, $\mathcal{I}_{\tau}:=\emptyset$, $\mathcal{I}_{p}:=\emptyset$;
2 while $\mathcal{I}\neq\emptyset$ do
3       Select a pivot $p\in[\min_{i\in\mathcal{I}}\{d_{i}\},\max_{i\in\mathcal{I}}\{d_{i}\}]$;
4       Set $\mathcal{I}_{p}:=\{i\ |\ d_{i}>p,i\in\mathcal{I}\}$;
5       if $(\sum_{i\in{\mathcal{I}_{p}\cup\mathcal{I}_{\tau}}}d_{i}-b)/(|\mathcal{I}_{p}|+|\mathcal{I}_{\tau}|)>p$ then
6             Set $\mathcal{I}:=\mathcal{I}_{p}$;
7      else
8             Set $\mathcal{I}_{\tau}:=\mathcal{I}_{\tau}\cup\mathcal{I}_{p}$, $\mathcal{I}:=\mathcal{I}\setminus\mathcal{I}_{p}$;
10       end if
12 end while
13 Set $\tau:=\frac{\sum_{i\in\mathcal{I}_{\tau}}d_{i}-b}{|\mathcal{I}_{\tau}|}$;
14 Set $V_{\tau}:=\{d_{i}-\tau\ |\ i\in\mathcal{I}_{\tau}\}$;
return $SparseVector(\mathcal{I}_{\tau},V_{\tau})$.
Algorithm 2 Pivot and Partition

The algorithm selects a pivot $p\in[\min_{i}\{d_{i}\},\max_{i}\{d_{i}\}]$, which is intended as a candidate value for $\tau$; the corresponding value $f(p)$ is calculated. From Corollary 2.2, if $f(p)>0$, then $p<\tau$ and so $\mathcal{I}_{p}\supset\mathcal{I}_{\tau}$; consequently, a new pivot is chosen in the (tighter) interval $[\min_{i\in\mathcal{I}_{p}}\{d_{i}\},\max_{i\in\mathcal{I}_{p}}\{d_{i}\}]$, which is known to contain $\tau$. Otherwise, if $f(p)\leq 0$, then $p\geq\tau$, and so we can find a new pivot $p\in[\min_{i\in\bar{\mathcal{I}}_{p}}\{d_{i}\},\max_{i\in\bar{\mathcal{I}}_{p}}\{d_{i}\}]$, where $\bar{\mathcal{I}}_{p}:=\{1,...,n\}\setminus\mathcal{I}_{p}$ is the complement set. Repeatedly selecting new pivots and creating partitions in this manner results in a binary search to determine the correct active set $\mathcal{I}_{\tau}$, and consequently $\tau$.

Several strategies have been proposed for selecting a pivot within a given interval. Duchi et al. (2008) choose a random value in the interval, while Kiwiel (2008) uses the median value. The classical approach of Michelot (1986) can be interpreted as a special case that sets the initial pivot as $p^{(1)}=(\sum_{i\in\mathcal{I}}d_{i}-b)/|\mathcal{I}|$, and subsequently $p^{(i+1)}=f(p^{(i)})+p^{(i)}$. This ensures that $p^{(i)}\leq\tau$, which avoids extraneous re-evaluation of sums in the if condition. Note that $p^{(i)}$ generates an increasing sequence converging to $\tau$ (Condat 2016, Page 579, Paragraph 2). Michelot's algorithm is presented separately as Algorithm 3.

Input: vector $d=(d_{1},\cdots,d_{n})$, scaling factor $b$.
Output: projection $v^{*}$.
1 Set $\mathcal{I}_{p}:=\{1,...,n\}$, $\mathcal{I}:=\emptyset$;
2 do
3       Set $\mathcal{I}:=\mathcal{I}_{p}$;
4       Set $p:=(\sum_{i\in\mathcal{I}}d_{i}-b)/|\mathcal{I}|$;
5       Set $\mathcal{I}_{p}:=\{i\in\mathcal{I}\ |\ d_{i}>p\}$;
7 while $|\mathcal{I}|>|\mathcal{I}_{p}|$;
8 Set $\tau:=p$, $\mathcal{I}_{\tau}:=\mathcal{I}_{p}$, $V_{\tau}:=\{d_{i}-\tau\ |\ i\in\mathcal{I}_{\tau}\}$;
return $SparseVector(\mathcal{I}_{\tau},V_{\tau})$.
Algorithm 3 Michelot's method
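Similarly, a minimal Julia sketch of Michelot's method (Algorithm 3) follows; the function name and dense output are again our own conventions for illustration.

```julia
# Sketch of Michelot's method (Algorithm 3): the pivot is the mean of the remaining
# candidates (shifted by b), and entries below the pivot are discarded each pass.
function michelot(d::AbstractVector{<:Real}, b::Real)
    active = collect(Float64, d)            # candidate entries (I_p)
    τ = (sum(active) - b) / length(active)  # initial pivot p^(1)
    while true
        kept = filter(x -> x > τ, active)   # discard entries with d_i ≤ p
        length(kept) == length(active) && break
        active = kept
        τ = (sum(active) - b) / length(active)
    end
    return max.(d .- τ, 0.0)
end
```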

Condat (2016) provided worst-case runtimes for each of the aforementioned pivot rules, as well as average-case complexity (over the uniform distribution) for the random pivot rule (see Table 1). We fill in the gaps here and establish $O(n)$ runtimes for the median rule as well as Michelot's method. We note that the median pivot method is a linear-time algorithm, but relies on a median-of-medians subroutine (Blum et al. 1973), which has a high constant factor. For Michelot's method, we assume uniformly distributed inputs, i.e. $d_{1},\dots,d_{n}$ are $\mathrm{i.i.d.}\sim U[l,u]$, and we have

Proposition 2.3.

Michelot's method has an average runtime of $O(n)$.

The same argument holds for the median pivot rule, as half the elements are guaranteed to be removed each iteration, and the operations per iteration are within a constant factor of Michelot's; we omit a formal proof of its $O(n)$ average runtime for brevity.

2.4 Condat’s Method

Condat's method (Condat 2016), presented as Algorithm 5, can be seen as a modification of Michelot's method in two ways. First, Condat replaces the initial scan with a Filter to find an initial pivot, presented as Algorithm 4. Lemma 2 (see Appendix A) establishes that the Filter provides a greater (or equal) initial starting pivot compared to Michelot's initialization; furthermore, since Michelot approaches $\tau$ from below, this results in fewer iterations (see proof of Proposition 2.4). Second, Condat's method dynamically updates the pivot value whenever an inactive entry is removed from $\mathcal{I}_{p}$, whereas Michelot's method updates the pivot every iteration by summing over all entries.

Input: vector $d=(d_{1},\cdots,d_{n})$, scaling factor $b$.
Output: $\mathcal{I}_{p}$.
1 Set $\mathcal{I}_{p}:=\{1\}$, $\mathcal{I}_{w}:=\emptyset$, $p:=d_{1}-b$;
2 for $i=2:n$ do
3       if $d_{i}>p$ then
4             Set $p:=p+\frac{d_{i}-p}{|\mathcal{I}_{p}|+1}$;
5             if $p>d_{i}-b$ then
6                   Set $\mathcal{I}_{p}:=\mathcal{I}_{p}\cup\{i\}$;
8            else
9                   Set $\mathcal{I}_{w}:=\mathcal{I}_{w}\cup\mathcal{I}_{p}$, $\mathcal{I}_{p}:=\{i\}$, $p:=d_{i}-b$;
11             end if
13       end if
15 end for
17 for $i\in\mathcal{I}_{w}$ do
18       if $d_{i}>p$ then
19             Set $\mathcal{I}_{p}:=\mathcal{I}_{p}\cup\{i\}$, $p:=p+\frac{d_{i}-p}{|\mathcal{I}_{p}|}$;
21       end if
23 end for
return $\mathcal{I}_{p}$.
Algorithm 4 Filter
Input: vector $d=(d_{1},\cdots,d_{n})$, scaling factor $b$.
Output: projection $v^{*}$.
1 Set $\mathcal{I}_{p}:=\texttt{Filter}(d,b)$, $p:=\frac{\sum_{i\in\mathcal{I}_{p}}d_{i}-b}{|\mathcal{I}_{p}|}$, $\mathcal{I}:=\emptyset$;
2 do
3       Set $\mathcal{I}:=\mathcal{I}_{p}$;
4       for $i\in\mathcal{I}:d_{i}\leq p$ do
5             Set $\mathcal{I}_{p}:=\mathcal{I}_{p}\backslash\{i\}$;
6             Set $p:=p+\frac{p-d_{i}}{|\mathcal{I}_{p}|}$;
8       end for
10 while $|\mathcal{I}|>|\mathcal{I}_{p}|$;
11 Set $\tau:=p$, $\mathcal{I}_{\tau}:=\mathcal{I}_{p}$, $V_{\tau}:=\{d_{i}-\tau\ |\ i\in\mathcal{I}_{\tau}\}$;
return $SparseVector(\mathcal{I}_{\tau},V_{\tau})$.
Algorithm 5 Condat's method
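A compact Julia sketch of Algorithms 4 and 5 is given below; this is our own simplified rendering, which works on values rather than index sets for brevity (condat_filter corresponds to the Filter, and condat to the full method).

```julia
# Sketch of the Filter (Algorithm 4): returns candidate values and the running pivot p.
function condat_filter(d::AbstractVector{<:Real}, b::Real)
    v = Float64[d[1]]                       # current candidates (I_p), as values
    w = Float64[]                           # values set aside for a second look (I_w)
    p = Float64(d[1]) - b
    for i in 2:length(d)
        x = Float64(d[i])
        x > p || continue
        p += (x - p) / (length(v) + 1)
        if p > x - b
            push!(v, x)
        else                                # restart the candidate list at d_i
            append!(w, v)
            v = Float64[x]
            p = x - b
        end
    end
    for x in w                              # second pass over the set-aside values
        if x > p
            push!(v, x)
            p += (x - p) / length(v)
        end
    end
    return v, p
end

# Sketch of Condat's method (Algorithm 5): cleanup passes with dynamic pivot updates.
function condat(d::AbstractVector{<:Real}, b::Real)
    v, p = condat_filter(d, b)
    while true
        removed = false
        kept = Float64[]
        m = length(v)
        for x in v
            if x > p
                push!(kept, x)
            else
                m -= 1
                p += (p - x) / m            # dynamic pivot update after removing x
                removed = true
            end
        end
        v = kept
        removed || break
    end
    return max.(d .- p, 0.0)                # p = τ at termination
end
```

On the earlier example d = [0.3, 1.2, -0.5, 0.9] with b = 1, condat returns [0.0, 0.65, 0.0, 0.35], matching the sort-and-scan sketch.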

Condat (2016) supplies a worst-case complexity of $O(n^{2})$. We supplement this with average-case analysis under uniformly distributed inputs, i.e. $d_{1},\dots,d_{n}$ are $\mathrm{i.i.d.}\sim U[l,u]$.

Proposition 2.4.

Condat's method has an average runtime of $O(n)$.

2.5 Summary of Results

Table 1: Time complexity of serial algorithms for projection onto a simplex (new results bolded).
Method | Worst Case | Average Case
(Quick)Sort and Scan | $O(n^{2})$ | $O(n)$
Michelot's method | $O(n^{2})$ | $\mathbf{O(n)}$
Pivot and Partition (Median) | $O(n)$ | $\mathbf{O(n)}$
Pivot and Partition (Random) | $O(n^{2})$ | $O(n)$
Condat's method | $O(n^{2})$ | $\mathbf{O(n)}$
Bucket method | $O(cn)$ | $\mathbf{O(cn)}$

Table 1 shows that all presented algorithms attain $O(n)$ performance on average given uniformly i.i.d. inputs. The methods are ordered by publication date, starting from the oldest result. As described in Section 2.2, Sort and Scan can be implemented with non-comparison sorting to achieve $O(n)$ worst-case performance. However, as with the linear-time median pivot rule, there are tradeoffs: increased memory, overhead, dependence on factors such as input bit-size, etc.

Both sorting and scanning are (separately) well-studied in parallel algorithm design, so the Sort and Scan idea lends itself to a natural decomposition for parallelism (discussed in Section 3.1). The other methods integrate sorting and scanning in each iteration, and it is no longer clear how best to exploit parallelism directly. We develop in the next section a distributed preprocessing scheme that works around this issue in the case of sparse projections. Note that the table includes the Bucket Method; details on the algorithm are provided in Appendix B.1.

3 Parallel Algorithms

In Section 3.1 we consider the parallel method proposed by Wasson et al. (2019) and propose a modification. In Section 3.2 we develop a novel distributed scheme that can be used to preprocess and reduce the input vector $d$. The remainder of this section analyzes how our method can be used to enhance Pivot and Partition, as well as Condat's method via parallelization of the Filter method. Results are summarized in Section 3.5. We note that the parallel time complexities presented are all unaffected by the underlying PRAM model (e.g. EREW vs CRCW, see (Xavier and Iyengar 1998, Chapter 1.4) for further exposition). This is well-known for parallel mergesort and parallel scan; moreover, our distributed scheme (see Section 3.2) distributes work for the projection such that memory reads/writes of each core are exclusive to that core's partition of $d$.

3.1 Parallel Sort and Parallel Scan

Wasson et al. (2019) parallelize Sort and Scan in a natural way: first applying a parallel merge sort (see e.g. (Cormen et al. 2009, p. 797)) and then a parallel scan (Ladner and Fischer 1980) on the input vector. However, their scan calculates $\sum_{i=1}^{j}d_{\pi_{i}}$ for all $j\in\mathcal{I}$, but only $\sum_{i=1}^{\kappa}d_{\pi_{i}}$ is needed to calculate $\tau$. We modify the algorithm accordingly, presented as Algorithm 6: checks are added (lines 7 and 14) in the for-loops to allow for possible early termination of scans. As we are adding constant operations per loop, Algorithm 6 has the same complexity as the original Parallel Sort and Scan. We combine this with parallel mergesort in Algorithm 7 and empirically benchmark this method against the original (parallel) version in Section 5.

Input: sorted vector $d_{\pi_{1}},\cdots,d_{\pi_{n}}$, scaling factor $b$
Output: $\tau$
1 Set $T:=\lceil\log_{2}n\rceil$, $s[1],...,s[n]=d_{\pi_{1}},...,d_{\pi_{n}}$;
2 for $j=1:T$ do
3       for $i=2^{j}:2^{j}:\min(n,2^{T})$ do Parallel
4             Set $s[i]:=s[i]+s[i-2^{j-1}]$;
6       end for
7      Set $\kappa:=\min(n,2^{j})$;
8       if $\frac{s[\kappa]-b}{\kappa}\geq d_{\pi_{\kappa}}$ then
9             break loop;
11       end if
13 end for
14 Set $p:=2^{j-1}$;
15 for $i=j-1:-1:1$ do
16       Set $\kappa:=\min(p+2^{i-1},n)$, $s[\kappa]:=s[\kappa]+s[p]$;
17       if $\frac{s[\kappa]-b}{\kappa}<d_{\pi_{\kappa}}$ then
18             break loop;
20       end if
22 end for
23 Set $\tau:=\frac{s[\kappa]-b}{\kappa}$;
return $\tau$.
Algorithm 6 Parallel Partial Scan
Input: vector $d=(d_{1},\cdots,d_{n})$, scaling factor $b$.
Output: projection $v^{*}$.
1 Parallel mergesort $d$ so that $d_{\pi_{1}}\geq\cdots\geq d_{\pi_{n}}$;
2 Set $\tau:=\texttt{PPScan}(\{d_{\pi_{i}}\}_{1\leq i\leq n},b)$;
3 Set $\mathcal{I}_{\tau}:=\{i\ |\ d_{i}>\tau\}$, $V_{\tau}:=\{d_{i}-\tau\ |\ i\in\mathcal{I}_{\tau}\}$;
return $SparseVector(\mathcal{I}_{\tau},V_{\tau})$.
Algorithm 7 Parallel Mergesort and Partial Scan

3.2 Sparsity-Exploiting Distributed Projections

Our main idea is motivated by the following two theorems, which establish that projections with i.i.d. inputs and fixed right-hand side $b$ become increasingly sparse as the problem size $n$ increases.

Theorem 3.1.

$E[|\mathcal{I}_{\tau}|]<\sqrt{\frac{2b(n+1)}{u-l}+\frac{1}{4}}+\frac{1}{2}$.

Theorem 3.1 establishes that, for i.i.d. uniformly distributed inputs, the projection has $O(\sqrt{n})$ active entries in expectation and thus has considerable sparsity as $n$ grows; we also show this in the computational experiments of Appendix E.1.
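For example, with $U[0,1]$ inputs ($u-l=1$), $b=1$, and $n=10^{8}$, the bound evaluates to $\sqrt{2(10^{8}+1)+\frac{1}{4}}+\frac{1}{2}\approx 1.4\times 10^{4}$, i.e., fewer than roughly 0.015% of the entries are expected to be active.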

Theorem 3.2.

Suppose $d_{1},...,d_{n}$ are $\mathrm{i.i.d.}$ from an arbitrary distribution $X$, with PDF $f_{X}$ and CDF $F_{X}$. Then, for any $\epsilon>0$, $P(\frac{|\mathcal{I}_{\tau}|}{n}\leq\epsilon)\rightarrow 1$ as $n\rightarrow\infty$.

Theorem 3.2 establishes arbitrarily sparse projections over arbitrary i.i.d. distributions given fixed $b$. Note that if $b$ is sufficiently large with respect to $n$ (rather than fixed), then the resulting projection could be too dense to attain the theorem result. However, sparsity can be assured provided $b$ does not grow too quickly with respect to $n$, namely:

Corollary 3.3.

Theorem 3.2 holds true if $b\in o(n)$.

We apply Theorem 3.2 to example distributions in Appendix C, and test the bounds empirically in Appendix E.1.

Proposition 3.4.

Let $\hat{d}$ be a subvector of $d$ with $m\leq n$ entries; moreover, without loss of generality suppose the subvector contains the first $m$ entries. Let $\hat{v}^{*}$ be the projection of $\hat{d}$ onto the simplex $\hat{\Delta}:=\{v\in\mathbb{R}^{m}\ |\ \sum_{i=1}^{m}v_{i}=b,v\geq 0\}$, and $\hat{\tau}$ be the corresponding pivot value. Then, $\tau\geq\hat{\tau}$. Consequently, for $1\leq i\leq m$ we have that $\hat{v}_{i}^{*}=0\implies v_{i}^{*}=0$.

Proposition 3.4 tells us that if we project a subvector of some length $m\leq n$ onto the same $b$-scaled simplex in the corresponding $\mathbb{R}^{m}$ space, the zero entries in the projected subvector must also be zero entries in the projected full vector.

Our idea is to partition and distribute the vector $d$ across cores (broadcast); have each core find the projection of its subvector (local projection); combine the entries of $d$ that remain active (nonzero) in the local projections to form a reduced vector $\hat{v}$ (reduce); and apply a final (global) projection to $\hat{v}$. The method is outlined in Figure 1. Provided the projection $v^{*}$ is sufficiently sparse, which (for instance) we have established is the case for large-scale problems with i.i.d. inputs, we can expect $\hat{v}$ to have far fewer than $n$ entries. We demonstrate the practical advantages of this procedure with various computational experiments in Section 5.

Figure 1: Distributed Projection Algorithm
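The following is a minimal multithreaded Julia sketch of the scheme in Figure 1 (our own illustration, assuming $k\leq n$; sort-and-scan is used here as the local and global solver, but any serial projection algorithm could be substituted).

```julia
using Base.Threads

# Pivot τ of proj_Δb(d) via sort and scan (any serial method could be used instead).
function simplex_tau(d::AbstractVector{Float64}, b::Float64)
    ds = sort(d, rev=true)
    s, τ = 0.0, 0.0
    for j in eachindex(ds)
        s += ds[j]
        t = (s - b) / j
        t < ds[j] || break
        τ = t
    end
    return τ
end

# Distributed projection (Figure 1): local projections shrink the candidate set,
# then one global projection over the surviving entries of d gives proj_Δb(d).
function distributed_projection(d::Vector{Float64}, b::Float64, k::Int)
    chunks = [collect(i:k:length(d)) for i in 1:k]     # partition of the indices
    survivors = Vector{Vector{Int}}(undef, k)
    @threads for c in 1:k                              # broadcast + local projection
        idx = chunks[c]
        τc = simplex_tau(d[idx], b)
        survivors[c] = [i for i in idx if d[i] > τc]   # zeros stay zero (Prop. 3.4)
    end
    keep = reduce(vcat, survivors)                     # reduce
    τ = simplex_tau(d[keep], b)                        # global projection
    v = zeros(length(d))
    v[keep] .= max.(d[keep] .- τ, 0.0)
    return v
end
```

With $k=1$ this reduces to a serial projection; on i.i.d. inputs most entries are discarded by the local projections, so the final (global) projection operates on a much shorter vector.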

3.3 Parallel Pivot and Partition

The distributed method outlined in Figure 1 can be applied directly to Pivot and Partition, as described in Algorithm 8. Note that, as presented, $v^{*}$ is a sparse vector: entries not processed in the final Pivot and Project iteration are set to zero (recall Proposition 3.4).

Input: vector $d=(d_{1},\cdots,d_{n})$, scaling factor $b$, number of cores $k$.
Output: projection $v^{*}$.
1 Partition $d$ into subvectors $d^{1},...,d^{k}$ of dimension $\leq\frac{n}{k}$;
2 Let $\mathcal{I}_{i}$ be the active set from $\texttt{Pivot\_Project}(d^{i},b)$   (distributed across cores $i=1,...,k$);
3 Set $\hat{\mathcal{I}}:=\cup_{i=1}^{k}\mathcal{I}_{i}$;
return Sparse vector from the projection of $\{d_{i}\}_{i\in\hat{\mathcal{I}}}$.
Algorithm 8 Parallel Pivot and Partition

We assume uniformly distributed inputs, $d_{1},\dots,d_{n}$ $\mathrm{i.i.d.}\sim U[l,u]$, and we have

Proposition 3.5.

Parallel Pivot and Partition, with either the median, random, or Michelot's pivot rule, has an average runtime of $O(\frac{n}{k}+\sqrt{kn})$.

In the worst case we may assume the distributed projections are ineffective, $\mathrm{dim}(\hat{v})\in O(n)$, and so the final projection is bounded above by $O(n^{2})$ with random pivots and Michelot's method, and $O(n)$ with the median pivot rule.

3.4 Parallel Condat’s Method

We could apply the distributed sparsity idea as a preprocessing step for Condat's method. However, due to Proposition 3.6 (and confirmed via computational experiments) we have found that the Filter already tends to discard many non-active elements. Therefore, we propose instead to apply our distributed method to parallelize the Filter itself. Our Distributed Filter is presented in Algorithm 9: we partition $d$ and broadcast it to the cores, and in each core we apply the (serial) Filter to its subvector. Condat's method with the Distributed Filter is presented as Algorithm 10.

Input: vector $d=(d_{1},\cdots,d_{n})$, scaling factor $b$, $k$ cores.
Output: index set $\mathcal{I}$.
1 Partition $\mathcal{I}$ into index sets $\{\mathcal{I}_{1},\cdots,\mathcal{I}_{k}\}$ such that $|\mathcal{I}_{i}|\leq\frac{n}{k},\ i=1,...,k$;
2 for $i=1:k$ do parallel
3       Update $\mathcal{I}_{i}$ with (serial) $\texttt{Filter}(d_{\mathcal{I}_{i}},b)$;
4       Set $p^{i}:=\frac{\sum_{j\in\mathcal{I}_{i}}d_{j}-b}{|\mathcal{I}_{i}|}$;
5       for $j\in\mathcal{I}_{i}$ do
6             if $d_{j}\leq p^{i}$ then
7                   Set $p^{i}:=p^{i}+\frac{p^{i}-d_{j}}{|\mathcal{I}_{i}|-1}$, $\mathcal{I}_{i}:=\mathcal{I}_{i}\backslash\{j\}$;
8             end if
10       end for
12 end for
return $\mathcal{I}:=\cup_{i=1}^{k}\mathcal{I}_{i}$.
Algorithm 9 Distributed Filter (Dfilter)
Algorithm 9 Distributed Filter (Dfilter)
Input: vector $d=(d_{1},\cdots,d_{n})$, scaling factor $b$, $k$ cores.
Output: projection $v^{*}$.
1 Set $\mathcal{I}_{p}:=\texttt{Dfilter}(d,b,k)$, $\mathcal{I}:=\emptyset$, $p:=\frac{\sum_{i\in\mathcal{I}_{p}}d_{i}-b}{|\mathcal{I}_{p}|}$;
2 do
3       Set $\mathcal{I}:=\mathcal{I}_{p}$;
4       for $i\in\mathcal{I}:d_{i}\leq p$ do
5             Set $\mathcal{I}_{p}:=\mathcal{I}_{p}\backslash\{i\}$, $p:=p+\frac{p-d_{i}}{|\mathcal{I}_{p}|}$;
7       end for
9 while $|\mathcal{I}|>|\mathcal{I}_{p}|$;
10 Set $\tau:=p$, $\mathcal{I}_{\tau}:=\mathcal{I}_{p}$, $V_{\tau}:=\{d_{i}-\tau\ |\ i\in\mathcal{I}_{\tau}\}$;
return $SparseVector(\mathcal{I}_{\tau},V_{\tau})$.
Algorithm 10 Parallel Condat's method
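A minimal Julia sketch of the Distributed Filter follows; it assumes the serial condat_filter function from the sketch of Condat's method in Section 2.4 is already defined and, like that sketch, works on values rather than index sets.

```julia
using Base.Threads

# Sketch of the Distributed Filter (Algorithm 9), assuming k ≤ length(d) and that
# condat_filter (the serial Filter sketched in Section 2.4) is already defined.
function distributed_filter(d::Vector{Float64}, b::Float64, k::Int)
    chunks = [collect(i:k:length(d)) for i in 1:k]      # partition of the indices
    kept = Vector{Vector{Float64}}(undef, k)
    @threads for c in 1:k
        vals, _ = condat_filter(d[chunks[c]], b)        # serial Filter on this chunk
        m = length(vals)
        p = (sum(vals) - b) / m                         # local pivot p^i
        out = Float64[]
        for x in vals                                   # one local cleanup pass
            if x > p
                push!(out, x)
            else
                p += (p - x) / (m - 1)                  # dynamic pivot update
                m -= 1
            end
        end
        kept[c] = out
    end
    return reduce(vcat, kept)       # candidate values; Algorithm 10's cleanup follows
end
```

The cleanup passes of Algorithm 10 (as in the condat sketch above) are then applied to the returned candidate values.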

We assume uniformly distributed inputs, $d_{1},\dots,d_{n}$ $\mathrm{i.i.d.}\sim U[l,u]$, and we have

Proposition 3.6.

Let $\mathcal{I}_{p}$ be the output of $\mathrm{Filter}(d,b)$. Then $E[|\mathcal{I}_{p}|]\in O(n^{\frac{2}{3}})$.

Under the same assumption (uniformly distributed inputs) as in Proposition 3.6, we have

Proposition 3.7.

Parallel Condat's method has an average complexity of $O(\frac{n}{k}+\sqrt[3]{kn^{2}})$.

In the worst case we can assume the Distributed Filter is ineffective ($|\mathcal{I}_{p}|\in O(n)$), and so the complexity of Parallel Condat's method is $O(n^{2})$, the same as the serial method.

3.5 Summary of Results

Table 2: Time complexity of serial vs parallel algorithms with problem dimension $n$ and $k$ cores
Method | Worst case complexity | Average complexity
Quicksort + Scan | $O(n^{2})$ | $O(n\log n)$
(P)Mergesort + Scan | $O(\frac{n}{k}\log n)$ | $O(\frac{n}{k}\log n)$
(P)Mergesort + Partial Scan | $O(\frac{n}{k}\log n)$ | $O(\frac{n}{k}\log n)$
Michelot | $O(n^{2})$ | $O(n)$
(P)Michelot | $O(n^{2})$ | $O(\frac{n}{k}+\sqrt{kn})$
Condat | $O(n^{2})$ | $O(n)$
(P)Condat | $O(n^{2})$ | $O(\frac{n}{k}+\sqrt[3]{kn^{2}})$

Complexity results for parallel algorithms developed throughout this section, as well as their serial counterparts, are presented in Table 2. Parallelized Sort and Scan has a dependence on $\frac{1}{k}$; indeed, sorting (and scanning) are very well-studied problems from the perspective of parallel computing. Parallel (Bitonic) Merge Sort has an average-case (and worst-case) complexity in $O((n\log n)/k)$ (Greb and Zachmann 2006), and Parallel Scan has an average-case (and worst-case) complexity in $\Theta(n/k+\log k)$ (Wang et al. 2020); thus, running parallel sort followed by parallel scan is $O(\frac{n}{k}\log n)$. Now, Michelot's and Condat's serial methods are observed to have favorable practical performance in our computational experiments; this is expected, as these more modern approaches were explicitly developed to gain practical advantages in e.g. the constant runtime factor. Conversely, our distributed method does not improve upon the worst-case complexity of Michelot's and Condat's methods, but is able to attain a $\frac{1}{k}$ factor in average complexity for $n\gg k$, which is the case on large-scale instances, i.e. for all practical purposes. The average-case analyses were conducted under the admittedly limited (typical) assumption of uniform i.i.d. entries, but our computational experiments over other distributions and real-world data confirm favorable practical speedups from our parallel algorithms.

4 Parallelization for Extensions of Projection onto a Simplex

This section develops extensions involving projection onto a simplex, to be used for experiments in Section 5.

4.1 Projection onto the $\ell_{1}$ Ball

Consider projection onto an $\ell_{1}$ ball:

$$\mbox{proj}_{\mathcal{B}_{b}}(d):=\mbox{argmin}_{v\in\mathcal{B}_{b}}\|v-d\|_{2}, \qquad (6)$$

where $\mathcal{B}_{b}$ is given by Equation (3). Duchi et al. (2008) show that Problem (6) is linear-time reducible to projection onto a simplex (see (Duchi et al. 2008, Section 4)). Hence, any parallel method for projection onto a simplex can be applied to Problem (6).
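A minimal Julia sketch of this reduction follows (our own illustration; the sort-and-scan pivot search is inlined here, though any of the projection methods above could be substituted).

```julia
# ℓ1-ball projection via simplex projection (Duchi et al. 2008): if ‖d‖₁ ≤ b the
# point is already feasible; otherwise project |d| onto Δ_b and restore the signs.
function project_l1_ball(d::Vector{Float64}, b::Float64)
    sum(abs, d) <= b && return copy(d)
    a = sort(abs.(d), rev=true)        # sort-and-scan pivot search for proj_Δb(|d|)
    s, τ = 0.0, 0.0
    for j in eachindex(a)
        s += a[j]
        t = (s - b) / j
        t < a[j] || break
        τ = t
    end
    return sign.(d) .* max.(abs.(d) .- τ, 0.0)
end
```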

As mentioned in Section 1, Problem (6) can itself be used as a subroutine in solving the Lasso problem, via (e.g.) Projected Gradient Descent (PGD) (see e.g. (Boyd and Vandenberghe 2004, Exercise 10.2)). To handle large-scale datasets, we instead use the mini-batch gradient descent method (Zhang et al. 2023, Sec. 12.5) within PGD in Section 5.

4.2 Centered Parity Polytope Projection

Leveraging the solution to Problem (1), Wasson et al. (Wasson et al. 2019, Algorithm 2) develop a method to project a vector onto the centered parity polytope, $\mathbb{PP}_{n}-\frac{1}{2}$ (recall Problem (2)); we present a slightly modified version as Algorithm 2 in Appendix D. The modification is on line 11, where we determine whether a simplex projection is required in order to avoid unnecessary operations; the original method executes line 14 before line 11.

5 Numerical Experiments

All algorithms were implemented in Julia 1.5.3 and run on a single node from the Ohio Supercomputer Center (Center 1987). This node includes 4 sockets, each containing a 20-core Intel Xeon Gold 6148 CPU; thus there are 80 cores on this node. The node has 3 TB of memory and runs 64-bit Red Hat Enterprise Linux 3.10.0. All code and data can be found on GitHub (https://github.com/foreverdyz/Parallel_Projection) or in the IJOC repository (Dai and Chen 2023).

5.1 Testing Algorithms

In this subsection, we compare runtime results for serial methods and their parallel versions with two measures: absolute speedup and relative speedup. Absolute speedup is the parallel method's runtime vs the fastest serial method's runtime, e.g. serial Condat time divided by parallel Sort and Scan time; relative speedup is the parallel method's runtime vs its serial equivalent's runtime, e.g. serial Sort and Scan time divided by parallel Sort and Scan time. We test parallel implementations using $8,16,24,...,80$ cores. Note that we have verified that each parallel algorithm run on a single core is slower than its serial equivalent (see Appendix E.4).

5.1.1 Projection onto Simplex

Instances in Figures 2 and 3 are generated with a size of $n=10^{8}$ and scaling factor $b=1$. Inputs $d_{i}$ are drawn i.i.d. from three benchmark distributions: $U[0,1]$, $N(0,1)$, and $N(0,10^{-3})$. This is a common benchmark used in previous works, e.g. (Duchi et al. 2008, Condat 2016). Serial Condat is the benchmark serial algorithm (i.e. with the fastest performance), with the dotted line representing a speedup of 1. Our parallel Condat algorithm achieves up to 25x absolute speedup over this state-of-the-art serial method. Parallel Sort and Scan ran slower than serial Condat's method, since its serial counterpart is dramatically slower than Condat's to begin with. In terms of relative speedup, our method offers superior performance compared to the Sort and Scan approach. Although not visible on the absolute speedup graph, in relative speedup it can be seen that our partial Scan technique offers some modest improvements over the standard parallel Sort and Scan.

Instances in Figures 4 and 5 have varying input sizes of $n=10^{7},10^{8},10^{9}$, with $d_{i}$ drawn i.i.d. from $N(0,1)$. This demonstrates that the speedup per core is a function of problem size. At $10^{7}$, parallel Condat tails off in absolute speedup around 40 cores (where communication costs become marginally higher than overall gains), while consistently increasing speedups are observed up to 80 cores on the $10^{9}$-sized instances. Similar patterns are observed for all algorithms in the relative speedups. For a fixed number of cores, larger instances yield larger partition sizes; hence the subvector projection problem given to each core tends to remove more of the original vector, producing the observed effect for our parallel methods. More severe tailoff effects are observed in the Sort and Scan algorithms, which use an entirely different parallelization scheme.

For additional experiments varying $b$, please see Appendix E.2.

Figure 2: Absolute speedup vs cores in simplex projection. Each line represents a different algorithm, and each graph represents a different input distribution: $N(0,1)$ (left top), $U[0,1]$ (right top), and $N(0,10^{-3})$ (below).
Figure 3: Relative speedup vs cores in simplex projection. Each line represents a different input distribution, and the 4 graphs represent 4 different projection methods: Parallel Sort and Scan (left top), Parallel Sort and Partial Scan (right top), Parallel Michelot (left below), and Parallel Condat (right below).
Figure 4: Absolute speedup vs cores in simplex projection. Each line represents a different projection method, and the 3 graphs represent 3 different sizes of input vector: $10^{7}$ (left top), $10^{8}$ (right top), and $10^{9}$ (below).
Figure 5: Relative speedup vs cores in simplex projection. Each line represents a different size of input vector, and the 4 graphs represent 4 different projection methods: Parallel Sort and Scan (left top), Parallel Sort and Partial Scan (right top), Parallel Michelot (left below), and Parallel Condat (right below).

5.1.2 $\ell_{1}$ Ball

We conduct $\ell_{1}$ ball projection experiments with problem size $n=10^{8}$. Inputs $d_{i}$ are drawn i.i.d. from $N(0,1)$. Algorithms were implemented as described in Section 4.1. Results are shown in Figure 6. Similar to the standard simplex projection problems, our parallel Condat implementation attains considerably superior results over the benchmark, with nearly 50x speedup.

Figure 6: Speedup vs cores in $\ell_{1}$ ball projection. Each line represents a different projection method.

5.1.3 Weighted Simplex and Weighted $\ell_{1}$ Ball

We have conducted additional experiments using the weighted versions of simplex and $\ell_{1}$ ball projections. These have been placed in the online supplement, as results are similar. A description of the algorithms is given in Appendix B.2; pseudocode in Appendix D; and experimental results in Appendix E.3.

5.1.4 Parity Polytope

We conduct parity polytope projection experiments with a problem size of $n=10^{8}-1$. Inputs $d_{i}$ are drawn i.i.d. from $U[1,2]$. Algorithms were implemented as described in Section 4.2. Results are shown in Figure 7. Our parallel Condat has worse relative speedup on projections compared to parallel Michelot, but overall this still results in higher absolute speedups since the baseline serial Condat runs quickly. With up to around 20x absolute speedup in simplex projection subroutines from parallel Condat's method, we report an overall absolute speedup of up to around 2.75x for parity polytope projections. We note that the overall absolute speedup tails off more quickly; this is an expected effect from Amdahl's law (Amdahl 1967): diminishing returns due to partial parallelization of the algorithm.

Figure 7: Speedup vs cores in projection onto a parity polytope. Each line represents a different projection method, and the 4 graphs represent 4 different experiments: absolute speedup and relative speedup in parity polytope projection (top row), and absolute and relative speedup for only the simplex projection component of parity polytope projection (bottom row).

5.1.5 Lasso on Real-World Data

We selected a dataset from a paper implementing a Lasso method (Wang et al. 2022): kdd2010 (named kdda in the cited paper), and its updated version kdd2012; both of them can be found in the LIBSVM data sets (Chang and Lin 2011). There are $n=20{,}216{,}830$ features in kdd2010 and $n=54{,}686{,}452$ features in kdd2012.

We implemented PGD with mini-batch sampling (Zhang et al. 2023, Sec. 12.5), and measure the runtime of the subroutine projecting onto the $\ell_{1}$ ball, which is

$$x_{t+1}=\mbox{proj}_{\mathcal{B}_{1}}(x_{t}-\alpha\cdot 2\tilde{A}^{T}(\tilde{A}x_{t}-\beta)),$$

where the initial point $x_{0}$ is drawn i.i.d. from a sparse $U[0,1]$ distribution with a 0.5 sparsity rate, $\tilde{A}\in\mathbb{R}^{m\times n}$ has sample size $m$ and $n$ features, and $\beta$ contains the labels for the samples; moreover, $\alpha=0.05$ and we use $m=128$ samples in each iteration. We measure the runtime for $\ell_{1}$ ball projections for the first 10 iterations. In later iterations the projected vector $x$ will be dense, at which point it is better to use serial projection methods (see Appendix E.5). Runtimes are measured with the time_ns() function, and the absolute speedup and relative speedup are reported in Figure 8 (for kdd2010) and Figure 9 (for kdd2012). Considerably more speedup is obtained for kdd2012, which may be due to problem size, since the projection input vectors are more than twice the size of inputs from kdd2010.
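For reference, the following is a simplified Julia sketch of one mini-batch PGD step as described above; the dense matrix A, the uniform row sampling, and the project_l1_ball routine (from the sketch in Section 4.1) are illustrative stand-ins rather than our actual implementation.

```julia
using Random

# One mini-batch projected-gradient step for Lasso:
#   x ← proj_{B_b}( x − α · 2·Ãᵀ(Ãx − β̃) ),
# where Ã, β̃ are the sampled mini-batch rows (dense stand-in for the sparse data).
function minibatch_pgd_step(A::Matrix{Float64}, β::Vector{Float64},
                            x::Vector{Float64}, α::Float64, b::Float64, m::Int)
    rows = randperm(size(A, 1))[1:m]        # sample a mini-batch of m rows
    Am, βm = A[rows, :], β[rows]
    g = 2 .* (Am' * (Am * x .- βm))         # gradient of ‖Ãx − β̃‖² at x
    return project_l1_ball(x .- α .* g, b)  # ℓ1-ball projection subroutine
end
```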

Figure 8: Speedup vs cores of $\ell_{1}$ ball projection in Lasso on kdd2010. Each line represents a different projection method.
Figure 9: Speedup vs cores of $\ell_{1}$ ball projection in Lasso on kdd2012. Each line represents a different projection method.

5.1.6 Discussion

We observed consistent patterns of performance across a wide range of test instances, varying in size, distributions, and underlying projections. In terms of relative speedups, our parallelization scheme was surprisingly at least as effective as that of Sort and Scan. Parallel sort and parallel scan are well-studied problems in parallel computing, so we expected a priori for such speedups to be a strict upper bound on what our method could achieve. Thus our method is not simply benefiting from its compatibility with more advanced serial projection algorithms—the distributed method itself appears to be highly effective in practice.

6 Conclusion

We proposed a distributed preprocessing method for projection onto a simplex. Our method distributes subvector projection problems across processors in order to reduce the candidate index set on large-scale instances. One advantage of our method is that it is compatible with all major serial projection algorithms. Empirical results demonstrate, across a wide range of simulated distributions and real-world instances, that our parallelizations of well-known serial algorithms are comparable and at times superior in relative speedups to highly developed and well-studied parallelization schemes for sorting and scanning. Moreover, the sort-and-scan serial approach involves substantially more work than e.g. Condat's method; hence our parallelization scheme provides considerable absolute speedups in our experiments versus the state of the art.

The effectiveness depends on the sparsity of the projection; in the case of large-scale problems with i.i.d. inputs and $b\in o(n)$, we can expect high levels of sparsity. A wide range of large-scale computational experiments demonstrates the consistent benefits of our method, which can be combined with any serial projection algorithm. Our experiments on real-world data suggest that significant sparsity can be exploited even when such distributional assumptions may be violated. We also note that, due to Proposition 2.1 and Corollary 2.2, highly dense projections occur when $\tau$ has a low value relative to the entries of $d$. Now, iterative (serial) methods such as Michelot's and Condat's can be interpreted as starting with a pivot value that is a lower bound on $\tau$ and increasing the pivot iteratively until the true value $\tau$ is attained. Hence, when our distributed method performs poorly due to density of the projection, the problem can simply be solved in serial using a small number of iterations, and vice-versa. Such density might be expected, for instance, when the input vector $d$ itself is sparse (see Appendix E.5).

7 Acknowledgements

This work was funded in part by the Office of Naval Research under grant N00014-23-1-2632. We also thank an anonymous reviewer for various code optimization suggestions.

References

  • Amdahl (1967) Amdahl GM (1967) Validity of the single processor approach to achieving large scale computing capabilities. Proceedings of the April 18-20, 1967, spring joint computer conference, 483–485.
  • Barlaud et al. (2017) Barlaud M, Belhajali W, Combettes PL, Fillatre L (2017) Classification and regression using an outer approximation projection-gradient method. IEEE Transactions on Signal Processing 65(17):4635–4644.
  • Barman et al. (2013) Barman S, Liu X, Draper SC, Recht B (2013) Decomposition methods for large scale lp decoding. IEEE Transactions on Information Theory 59(12):7870–7886.
  • Bentley and McIlroy (1993) Bentley JL, McIlroy MD (1993) Engineering a sort function. Software: Practice and Experience 23(11):1249–1256.
  • Bioucas-Dias et al. (2012) Bioucas-Dias JM, Plaza A, Dobigeon N, Parente M, Du Q, Gader P, Chanussot J (2012) Hyperspectral unmixing overview: Geometrical, statistical, and sparse regression-based approaches. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 5(2):354–379.
  • Blondel et al. (2014) Blondel M, Fujino A, Ueda N (2014) Large-scale multiclass support vector machine training via euclidean projection onto the simplex. 2014 22nd International Conference on Pattern Recognition, 1289–1294.
  • Blum et al. (1973) Blum M, Floyd RW, Pratt VR, Rivest RL, Tarjan RE, et al. (1973) Time bounds for selection. J. Comput. Syst. Sci. 7(4):448–461.
  • Boyd and Vandenberghe (2004) Boyd S, Vandenberghe L (2004) Convex optimization (Cambridge University Press).
  • Brodie et al. (2009) Brodie J, Daubechies I, De Mol C, Giannone D, Loris I (2009) Sparse and stable markowitz portfolios. Proceedings of the National Academy of Sciences 106(30):12267–12272.
  • Candès et al. (2008) Candès EJ, Wakin MB, Boyd SP (2008) Enhancing sparsity by reweighted $\ell_{1}$ minimization. Journal of Fourier Analysis and Applications 14:877–905.
  • Center (1987) Center OS (1987) Ohio supercomputer center. URL http://osc.edu/ark:/19495/f5s1ph73.
  • Chang and Lin (2011) Chang CC, Lin CJ (2011) LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2:27:1–27:27.
  • Grinstead and Snell (2006) Grinstead CM, Snell JL (2006) Introduction to Probability (American Mathematical Society).
  • Chartrand and Yin (2008) Chartrand R, Yin W (2008) Iteratively reweighted algorithms for compressive sensing. 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, 3869–3872.
  • Chen et al. (1998) Chen SS, Donoho DL, Saunders MA (1998) Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing 20(1):33–61.
  • Chen and Zhou (2014) Chen X, Zhou W (2014) Convergence of the reweighted $\ell_{1}$ minimization algorithm for $\ell_{2}$-$\ell_{p}$ minimization. Computational Optimization and Applications 59:47–61.
  • Cominetti et al. (2014) Cominetti R, Mascarenhas WF, Silva PJS (2014) A Newton's method for the continuous quadratic knapsack problem. Math. Prog. Comp. 6:151–196.
  • Condat (2016) Condat L (2016) Fast projection onto the simplex and the $\ell_{1}$ ball. Math. Program. Ser. A 158(1):575–585.
  • Cormen et al. (2009) Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to Algorithms (The MIT Press), 3rd edition.
  • Dai et al. (2006) Dai YH, Fletcher R (2006) New algorithms for singly linearly constrained quadratic programs subject to lower and upper bounds. Math. Program. 106:403–421.
  • Dai and Chen (2023) Dai Y, Chen C (2023) Sparsity-Exploiting Distributed Projections onto a Simplex. URL http://dx.doi.org/10.1287/ijoc.2022.0328.cd, https://github.com/INFORMSJoC/2022.0328.
  • Donoho and Elad (2003) Donoho DL, Elad M (2003) Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization. Proceedings of the National Academy of Sciences 100(5):2197–2202.
  • Duchi et al. (2008) Duchi J, Shalev-Shwartz S, Singer Y, Chandra T (2008) Efficient projections onto the $\ell_{1}$-ball for learning in high dimensions. Proceedings of the 25th International Conference on Machine Learning (ICML), 272–279.
  • Fischer (2011) Fischer H (2011) A history of the central limit theorem. From classical to modern probability theory (Springer).
  • Gentle (2009) Gentle J (2009) Computational Statistics (Springer).
  • Greb and Zachmann (2006) Greb A, Zachmann G (2006) Gpu-abisort: optimal parallel sorting on stream architectures. Proceedings 20th IEEE International Parallel & Distributed Processing Symposium, 10 pp.–.
  • Held et al. (1974) Held M, Wolfe P, Crowder HP (1974) Validation of subgradient optimization. Math. Program. 6:62–88.
  • Iutzeler and Condat (2018) Iutzeler F, Condat L (2018) Distributed projection on the simplex and $\ell_{1}$ ball via ADMM and gossip. IEEE Signal Processing Letters 25(11):1650–1654.
  • Kiwiel (2008) Kiwiel KC (2008) Breakpoint searching algorithms for the continuous quadratic knapsack problem. Math. Program. 112:473–491.
  • Knuth (1998) Knuth D (1998) The Art of Computer Programming, volume 1 (Addison-Wesley), 3rd edition.
  • Ladner and Fischer (1980) Ladner RE, Fischer MJ (1980) Parallel prefix computation. J. ACM 27(4):831–838.
  • Lagarias (2013) Lagarias JC (2013) Euler’s constant: Euler’s work and modern developments. Bull. Amer. Math. Soc. 50:527–628.
  • Lellmann et al. (2009) Lellmann J, Kappes J, Yuan J, Becker F, Schnörr C (2009) Convex multi-class image labeling by simplex-constrained total variation. Scale Space and Variational Methods in Computer Vision, 150–162 (Springer Berlin Heidelberg).
  • Liu and Draper (2016) Liu X, Draper SC (2016) The admm penalized decoder for ldpc codes. IEEE Transactions on Information Theory 62(6):2966–2984.
  • Mahmoud (2000) Mahmoud HM (2000) Sorting: A distribution theory, volume 54 (John Wiley & Sons).
  • Michelot (1986) Michelot C (1986) A finite algorithm for finding the projection of a point onto the canonical simplex of $\mathbb{R}^{n}$. Journal of Optimization Theory and Applications 50:195–200.
  • Perez et al. (2020a) Perez G, Ament S, Gomes CP, Barlaud M (2020a) Efficient projection algorithms onto the weighted $\ell_{1}$ ball. CoRR abs/2009.02980.
  • Perez et al. (2020b) Perez G, Barlaud M, Fillatre L, Régin JC (2020b) A filtered bucket-clustering method for projection onto the simplex and the $\ell_{1}$ ball. Math. Program. Ser. A 182:445–464.
  • Robinson et al. (1992) Robinson AG, Jiang N, Lerme CS (1992) On the continuous quadratic knapsack problem. Mathematical Programming 55:99–108.
  • Tibshirani (1996) Tibshirani R (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58(1):267–288.
  • Tibshirani (1997) Tibshirani R (1997) The lasso method for variable selection in the cox model. Statistic in Medicine 16:385–395.
  • van den Berg (2020) van den Berg E (2020) A hybrid quasi-newton projected-gradient method with application to lasso and basis-pursuit denoising. Math. Program. Comp. 12:1–38.
  • van den Berg and Friedlander (2009) van den Berg E, Friedlander MP (2009) Probing the pareto frontier for basis pursuit solutions. SIAM Journal on Scientific Computing 31(2):890–912.
  • Wang et al. (2022) Wang G, Yu W, Liang X, Wu Y, Yu B (2022) An iterative reduction fista algorithm for large-scale lasso. SIAM Journal on Scientific Computing 44(4):A1989–A2017.
  • Wang et al. (2020) Wang S, Bai Y, Pekhimenko G (2020) Bppsa: Scaling back-propagation by parallel scan algorithm.
  • Wasson et al. (2019) Wasson M, Milicevic M, Draper SC, Gulak G (2019) Hardware-based linear program decoding with the alternating direction method of multipliers. IEEE Transactions on Signal Processing 67(19):4976–4991.
  • Wei et al. (2015) Wei H, Jiao X, Mu J (2015) Reduced-complexity linear programming decoding based on admm for ldpc codes. IEEE Communications Letters 19(6):909–912.
  • Xavier and Iyengar (1998) Xavier C, Iyengar SS (1998) Introduction to parallel algorithms, volume 1 (John Wiley & Sons).
  • Zhang et al. (2023) Zhang A, Lipton ZC, Li M, Smola AJ (2023) Dive into deep learning.
  • Zhang et al. (2013) Zhang G, Heusdens R, Kleijn WB (2013) Large scale lp decoding with low complexity. IEEE Communications Letters 17(11):2152–2155.
  • Zhang and Siegel (2013) Zhang X, Siegel PH (2013) Efficient iterative lp decoding of ldpc codes with alternating direction method of multipliers. 2013 IEEE International Symposium on Information Theory, 1501–1505.

Appendix A Mathematical Proofs

Corollary 1. For any $t_{1},t_{2}\in\mathbb{R}$ such that $t_{1}<\tau<t_{2}$, we have

$$f(t_{1})>f(\tau)=0>f(t_{2}).$$

Proof.

By definition,

$$f(t)=\frac{\sum_{i\in\mathcal{I}_{t}}d_{i}-b}{|\mathcal{I}_{t}|}-t=\frac{\sum_{i\in\mathcal{I}_{t}}(d_{i}-t)-b}{|\mathcal{I}_{t}|}=\frac{\sum_{i=1}^{n}\max\{d_{i}-t,0\}-b}{|\mathcal{I}_{t}|}.$$

Observe that $g(t):=\sum_{i=1}^{n}\max\{d_{i}-t,0\}-b$ is a strictly decreasing function for $t\leq\max_{i}\{d_{i}\}$, and $g(t)=-b$ for $t\geq\max_{i}\{d_{i}\}$. Furthermore, from Proposition 1, we have that $\tau$ is the unique value such that $g(\tau)=0$; moreover, since $b>0$ then $\tau<\max_{i}\{d_{i}\}$. Thus $g(t_{1})>g(\tau)=0>g(t_{2})$, which implies $f(t_{1})>0$, $f(\tau)=0$, and $f(t_{2})<0$. ∎

We assume uniformly distributed inputs, $d_{1},\dots,d_{n}$ $\mathrm{i.i.d.}\sim U[l,u]$, and we have

Lemma 1.

$$E\Big[\frac{\sum_{i\in\mathcal{I}}d_{i}-b}{|\mathcal{I}|}\Big]\rightarrow\frac{l+u}{2}\ \mathrm{as}\ n\rightarrow\infty$$

with a sublinear convergence rate, where $\mathcal{I}:=\{1,...,n\}$.

Proof.

Observe that $E[d_{i}]=\frac{l+u}{2}$. Since $d_{1},...,d_{n}$ are i.i.d., we have

$$E\Big[\frac{\sum_{i\in\mathcal{I}}d_{i}-b}{|\mathcal{I}|}\Big]=\frac{l+u}{2}-\frac{b}{n}. \qquad (7)$$

Thus $E[\frac{\sum_{i\in\mathcal{I}}d_{i}-b}{|\mathcal{I}|}]$ converges to $\frac{l+u}{2}$ sublinearly at the rate of

$$\lim_{n\rightarrow\infty}\Big[\frac{l+u}{2}-\frac{b}{n+1}\Big]\Big/\Big[\frac{l+u}{2}-\frac{b}{n}\Big]=1.$$

Let $X|t$ denote the conditional variable such that $P_{X|t}(x)=P(X=x\ |\ X>t)$.
We assume uniformly distributed inputs, $d_{1},\dots,d_{n}$ $\mathrm{i.i.d.}\sim U[l,u]$, and then

Proposition 2. Michelot's method has an average runtime of $O(n)$.

Proof.

Let $\delta_{i}$ be the number of elements that Algorithm 3 (from the main body) removes from the (candidate) active set $\mathcal{I}_{p}^{(i)}$ in iteration $i$ of the do-while loop, and let $T$ be the total number of iterations.

$$\sum_{i=1}^{T}\delta_{i}=n-|\mathcal{I}_{\tau}|, \qquad (8)$$

where $\mathcal{I}_{\tau}$ is the active index set of the projection. Let $p^{(i)}$ be the $i$th pivot (line 3 in Algorithm 3 of the main body) with

$$p^{(i)}=\frac{\sum_{j\in\mathcal{I}_{p}^{(i)}}d_{j}-b}{|\mathcal{I}_{p}^{(i)}|},$$

and define $p^{(0)}:=l$.

Now, from Proposition 9 we have that all $d_{j}$ with $j\in\mathcal{I}_{p}^{(i)}$ are i.i.d. $\sim U[p^{(i-1)},u]$. Thus $E[d_{j}\ |\ j\in\mathcal{I}_{p}^{(i)}]=\frac{p^{(i-1)}+u}{2}$, and so

$$E[p^{(i)}\ |\ \mathcal{I}_{p}^{(i)},p^{(i-1)}]=E\Big[\frac{\sum_{j\in\mathcal{I}_{p}^{(i)}}d_{j}-b}{|\mathcal{I}_{p}^{(i)}|}\ \Big|\ \mathcal{I}_{p}^{(i)},p^{(i-1)}\Big]=\frac{p^{(i-1)}+u}{2}-\frac{b}{|\mathcal{I}_{p}^{(i)}|}. \qquad (9)$$

In iteration $i$, all elements in the range $[p^{(i-1)},p^{(i)}]$ are removed, and again the i.i.d. uniform property is preserved by Proposition 9, so

$$E[\delta_{i}\ |\ \mathcal{I}_{p}^{(i)},p^{(i)},p^{(i-1)}]=|\mathcal{I}_{p}^{(i)}|\frac{p^{(i)}-p^{(i-1)}}{u-p^{(i-1)}}=|\mathcal{I}_{p}^{(i)}|\Big(\frac{1}{2}-\frac{(u+p^{(i-1)})/2-p^{(i)}}{u-p^{(i-1)}}\Big);$$

by the Law of Total Expectation,

$$E[\delta_{i}\ |\ \mathcal{I}_{p}^{(i)},p^{(i-1)}]=E\big[E[\delta_{i}\ |\ \mathcal{I}_{p}^{(i)},p^{(i)},p^{(i-1)}]\big]=|\mathcal{I}_{p}^{(i)}|\Big(\frac{1}{2}-\frac{(u+p^{(i-1)})/2-E[p^{(i)}\ |\ \mathcal{I}_{p}^{(i)},p^{(i-1)}]}{u-p^{(i-1)}}\Big).$$

Replacing the right-hand side expectation using Equation (9),

$$E[\delta_{i}\ |\ \mathcal{I}_{p}^{(i)},p^{(i-1)}]=|\mathcal{I}_{p}^{(i)}|\Big(\frac{1}{2}-\frac{(u+p^{(i-1)})/2-(u+p^{(i-1)})/2+b/|\mathcal{I}_{p}^{(i)}|}{u-p^{(i-1)}}\Big)=\frac{|\mathcal{I}_{p}^{(i)}|}{2}-\frac{b}{u-p^{(i-1)}}.$$

Using the Law of Total Expectation again,

$$E[\delta_{i}\ |\ p^{(i-1)}]=E\big[E[\delta_{i}\ |\ \mathcal{I}_{p}^{(i)},p^{(i-1)}]\big]=\frac{E[|\mathcal{I}_{p}^{(i)}|]}{2}-\frac{b}{u-p^{(i-1)}}. \qquad (10)$$

Let $\sigma:=\sqrt{\frac{2b}{u-l}}+1$. We will now consider cases. Observe that $|\mathcal{I}_{\tau}|\leq n$, i.e. the active set is a subset of the ground set. The following two cases exhaust the possibilities of where $\sigma\sqrt{n}$ lies: either $|\mathcal{I}_{\tau}|<\sigma\sqrt{n}<n$ (Case 1); otherwise, either $|\mathcal{I}_{\tau}|\leq n\leq\sigma\sqrt{n}$ or $\sigma\sqrt{n}\leq|\mathcal{I}_{\tau}|\leq n$ (Case 2).
Case 1: |τ|<σn<n|\mathcal{I}_{\tau}|<\sigma\cdot\sqrt{n}<n
Let ξ\xi be an iteration such that 1ξ<T1\leq\xi<T satisfies:

|p(ξ)|σn>|p(ξ+1)|>>|p(T)|=|τ|;|\mathcal{I}_{p}^{(\xi)}|\geq\sigma\cdot\sqrt{n}>|\mathcal{I}_{p}^{(\xi+1)}|>...>|\mathcal{I}_{p}^{(T)}|=|\mathcal{I}_{\tau}|;

Then,

E[i=1T|p(i)|]=E[i=1ξ|p(i)|]+E[i=ξT|p(i)|]E[\sum_{i=1}^{T}|\mathcal{I}_{p}^{(i)}|]=E[\sum_{i=1}^{\xi}|\mathcal{I}_{p}^{(i)}|]+E[\sum_{i=\xi}^{T}|\mathcal{I}_{p}^{(i)}|] (11)

Since $|\mathcal{I}_{\tau}|<\sigma\cdot\sqrt{n}$ and $|\mathcal{I}_{p}^{(\xi+1)}|<\sigma\cdot\sqrt{n}$, Michelot's method stops within at most $\sigma\cdot\sqrt{n}$ further iterations, as at least one element is removed in each iteration; thus $T-\xi<\sigma\cdot\sqrt{n}$. Since $|\mathcal{I}_{p}^{(i)}|\leq|\mathcal{I}_{p}^{(\xi+1)}|<\sigma\cdot\sqrt{n}$ for $i\geq\xi+1$,

E[i=ξT|p(i)|]E[i=ξTσn]<σ2nO(n).E[\sum_{i=\xi}^{T}|\mathcal{I}_{p}^{(i)}|]\leq E[\sum_{i=\xi}^{T}\sigma\cdot\sqrt{n}]<\sigma^{2}\cdot n\in O(n). (12)

From Equation (10),

E[i=1ξδi|p(1),,p(ξ),ξ]=E[i=1ξ|p(i)||ξ]2i=1ξbup(i1).E[\sum_{i=1}^{\xi}\delta_{i}\ |\ p^{(1)},...,p^{(\xi)},\xi]=\frac{E[\sum_{i=1}^{\xi}|\mathcal{I}_{p}^{(i)}|\ |\ \xi]}{2}-\sum_{i=1}^{\xi}\frac{b}{u-p^{(i-1)}}.

Then, from the Law of Total Expectation,

E[i=1ξδi]=\displaystyle E[\sum_{i=1}^{\xi}\delta_{i}]= E[E[i=1ξδi|p(1),,p(ξ),ξ]]\displaystyle E[E[\sum_{i=1}^{\xi}\delta_{i}\ |\ p^{(1)},...,p^{(\xi)},\xi]] (13)
=\displaystyle= E[i=1ξ|p(i)|]2E[i=1ξbup(i1)].\displaystyle\frac{E[\sum_{i=1}^{\xi}|\mathcal{I}_{p}^{(i)}|]}{2}-E[\sum_{i=1}^{\xi}\frac{b}{u-p^{(i-1)}}].

Michelot’s method always maintains a nonempty active index set (decreasing per iteration), and so i=1ξδi=n|p(ξ)|<n\sum_{i=1}^{\xi}\delta_{i}=n-|\mathcal{I}_{p}^{(\xi)}|<n. Then, continuing from (13),

E[i=1ξ|p(i)|]2E[i=1ξbup(i1)]n,\frac{E[\sum_{i=1}^{\xi}|\mathcal{I}_{p}^{(i)}|]}{2}-E[\sum_{i=1}^{\xi}\frac{b}{u-p^{(i-1)}}]\leq n,
E[i=1ξ|p(i)|]2\displaystyle\implies\frac{E[\sum_{i=1}^{\xi}|\mathcal{I}_{p}^{(i)}|]}{2}\leq n+E[i=1ξbup(i1)]\displaystyle n+E[\sum_{i=1}^{\xi}\frac{b}{u-p^{(i-1)}}] (14)
Claim.

E[i=1ξbup(i1)]O(n)E[\sum_{i=1}^{\xi}\frac{b}{u-p^{(i-1)}}]\in O(n).

Proof:

E[i=1ξbup(i1)|ξ]=\displaystyle E[\sum_{i=1}^{\xi}\frac{b}{u-p^{(i-1)}}|\xi]= i=1ξE[bup(i1)]\displaystyle\sum_{i=1}^{\xi}E[\frac{b}{u-p^{(i-1)}}] (15)
$\displaystyle\leq\sum_{i=1}^{\xi}\frac{b}{E[u-p^{(i-1)}]}\quad(\mbox{Jensen's Inequality}).$

Since p(i)={j|dj>p(i1)}\mathcal{I}_{p}^{(i)}=\{j\in\mathcal{I}\ |\ d_{j}>p^{(i-1)}\} and d1,,dnd_{1},...,d_{n} are i.i.d.U[l,u]i.i.d.\sim U[l,u],

E[|p(i)|]=nE[up(i1)]ul.E[|\mathcal{I}_{p}^{(i)}|]=n\cdot\frac{E[u-p^{(i-1)}]}{u-l}.

So for i=1,,ξi=1,...,\xi, since |p(i)|>|p(i+1)||\mathcal{I}_{p}^{(i)}|>|\mathcal{I}_{p}^{(i+1)}|,

nE[up(i1)]ul=E[|p(i)|]>E[|p(ξ)|]σn;n\cdot\frac{E[u-p^{(i-1)}]}{u-l}=E[|\mathcal{I}_{p}^{(i)}|]>E[|\mathcal{I}_{p}^{(\xi)}|]\geq\sigma\cdot\sqrt{n};
E[up(i1)]>σuln.\implies E[u-p^{(i-1)}]>\sigma\cdot\frac{u-l}{\sqrt{n}}. (16)

Substituting into Inequality (15),

E[i=1ξbup(i1)|ξ]i=1ξnbσ(ul)=ξnbσ(ul);E[\sum_{i=1}^{\xi}\frac{b}{u-p^{(i-1)}}|\xi]\leq\sum_{i=1}^{\xi}\frac{\sqrt{n}b}{\sigma\cdot(u-l)}=\frac{\xi\sqrt{n}b}{\sigma\cdot(u-l)};

using Law of Total Expectation,

E[i=1ξbup(i1)]=E[E[i=1ξbup(i1)|ξ]]E[ξ]nbσ(ul).E[\sum_{i=1}^{\xi}\frac{b}{u-p^{(i-1)}}]=E[E[\sum_{i=1}^{\xi}\frac{b}{u-p^{(i-1)}}|\xi]]\leq E[\xi]\frac{\sqrt{n}b}{\sigma\cdot(u-l)}. (17)

So then it remains to show E[ξ]O(n)E[\xi]\in O(\sqrt{n}). From Equation (9), for any iteration ii

uE[p(i)|p(i),p(i1)]=up(i1)+u2+b|p(i)|,u-E[p^{(i)}|\mathcal{I}_{p}^{(i)},p^{(i-1)}]=u-\frac{p^{(i-1)}+u}{2}+\frac{b}{|\mathcal{I}_{p}^{(i)}|},
E[up(i)|p(i),p(i1)]=up(i1)2+b|p(i)|;\implies E[u-p^{(i)}|\mathcal{I}_{p}^{(i)},p^{(i-1)}]=\frac{u-p^{(i-1)}}{2}+\frac{b}{|\mathcal{I}_{p}^{(i)}|};

thus, using the Law of Total Expectation,

E[up(i)]=E[up(i1)]2+E[b|p(i)|].E[u-p^{(i)}]=\frac{E[u-p^{(i-1)}]}{2}+E[\frac{b}{|\mathcal{I}_{p}^{(i)}|}]. (18)

Applying Equation (18) recursively, starting with the base case E[up(0)]=ulE[u-p^{(0)}]=u-l,

E[up(i)]=\displaystyle E[u-p^{(i)}]= ul2i+j=1iE[b2ij|p(j)|],\displaystyle\frac{u-l}{2^{i}}+\sum_{j=1}^{i}E[\frac{b}{2^{i-j}\cdot|\mathcal{I}_{p}^{(j)}|}],
E[p(i)]=\displaystyle\implies E[p^{(i)}]= uul2ij=1iE[b2ij|p(j)|],\displaystyle u-\frac{u-l}{2^{i}}-\sum_{j=1}^{i}E[\frac{b}{2^{i-j}\cdot|\mathcal{I}_{p}^{(j)}|}],
E[p(ξ1)|ξ]=\displaystyle\implies E[p^{(\xi-1)}|\xi]= uul2ξ1E[j=1ξ1b2ξ1j|p(j)||ξ].\displaystyle u-\frac{u-l}{2^{\xi-1}}-E[\sum_{j=1}^{\xi-1}\frac{b}{2^{\xi-1-j}\cdot|\mathcal{I}_{p}^{(j)}|}|\xi].

Using the Law of Total Expectation,

E[p(ξ1)]=E[E[p(ξ1)|ξ]]=uE[ul2ξ1]E[j=1ξ1b2ξ1j|p(j)|]E[p^{(\xi-1)}]=E[E[p^{(\xi-1)}|\xi]]=u-E[\frac{u-l}{2^{\xi-1}}]-E[\sum_{j=1}^{\xi-1}\frac{b}{2^{\xi-1-j}\cdot|\mathcal{I}_{p}^{(j)}|}]
E[p(ξ1)]\displaystyle\implies E[p^{(\xi-1)}]\geq uE[ul2ξ1]E[j=1ξ1b2ξ1jσn](since |p(i)|σnfor iξ)\displaystyle u-E[\frac{u-l}{2^{\xi-1}}]-E[\sum_{j=1}^{\xi-1}\frac{b}{2^{\xi-1-j}\sigma\cdot\sqrt{n}}]\quad(\mbox{since }|\mathcal{I}_{p}^{(i)}|\geq\sigma\cdot\sqrt{n}\ \mbox{for }i\leq\xi)
\displaystyle\geq uE[ul2ξ1]2bσn\displaystyle u-E[\frac{u-l}{2^{\xi-1}}]-\frac{2b}{\sigma\cdot\sqrt{n}}
\displaystyle\geq uulE[2ξ1]2bσn(Jensen’s inequality)\displaystyle u-\frac{u-l}{E[2^{\xi-1}]}-\frac{2b}{\sigma\cdot\sqrt{n}}\quad(\mbox{Jensen's inequality})
ulE[2ξ1]\displaystyle\implies\frac{u-l}{E[2^{\xi-1}]}\geq E[up(ξ1)]2bσn\displaystyle E[u-p^{(\xi-1)}]-\frac{2b}{\sigma\cdot\sqrt{n}}

From Inequality (16), E[up(ξ1)]>σulnE[u-p^{(\xi-1)}]>\sigma\cdot\frac{u-l}{\sqrt{n}}; thus

ulE[2ξ1]>σuln2bσn=1σn[σ2(ul)2b].\frac{u-l}{E[2^{\xi-1}]}>\sigma\cdot\frac{u-l}{\sqrt{n}}-\frac{2b}{\sigma\cdot\sqrt{n}}=\frac{1}{\sigma\cdot\sqrt{n}}[\sigma^{2}\cdot(u-l)-2b].

Since σ=2bul+1\sigma=\sqrt{\frac{2b}{u-l}}+1, σ>2bul\sigma>\sqrt{\frac{2b}{u-l}}; thus σ2(ul)2b>0\sigma^{2}\cdot(u-l)-2b>0; as a result,

E[2ξ1]<σn(ul)σ2(ul)2b;E[2^{\xi-1}]<\frac{\sigma\sqrt{n}(u-l)}{\sigma^{2}\cdot(u-l)-2b};
2E[ξ]1E[2ξ1]<σn(ul)σ2(ul)2b(Jensen’s Inequality);\implies 2^{E[\xi]-1}\leq E[2^{\xi-1}]<\frac{\sigma\sqrt{n}(u-l)}{\sigma^{2}\cdot(u-l)-2b}\quad(\mbox{Jensen's Inequality});
E[ξ]log(σn(ul)σ2(ul)2b)+1.\implies E[\xi]\leq\log(\frac{\sigma\sqrt{n}(u-l)}{\sigma^{2}\cdot(u-l)-2b})+1.

Thus E[ξ]O(n)E[\xi]\in O(\sqrt{n}). Substituting into Inequality (17)

E[i=1ξbup(i1)](log(σn(ul)σ2(ul)2b)+1)nbσ(ul)O(n).E[\sum_{i=1}^{\xi}\frac{b}{u-p^{(i-1)}}]\leq(\log(\frac{\sigma\sqrt{n}(u-l)}{\sigma^{2}\cdot(u-l)-2b})+1)\cdot\frac{\sqrt{n}b}{\sigma\cdot(u-l)}\in O(n).

\blacksquare

The Claim, together with Inequality (14) establish that

E[i=1ξ|p(i)|]O(n).E[\sum_{i=1}^{\xi}|\mathcal{I}_{p}^{(i)}|]\in O(n).

Altogether with Inequalities (11) and (12), we have E[i=1T|p(i)|]O(n)E[\sum_{i=1}^{T}|\mathcal{I}_{p}^{(i)}|]\in O(n) in Case 1.
Case 2: n|τ|σnn\geq|\mathcal{I}_{\tau}|\geq\sigma\cdot\sqrt{n}, or σnn|τ|\sigma\cdot\sqrt{n}\geq n\geq|\mathcal{I}_{\tau}|
If n|τ|σnn\geq|\mathcal{I}_{\tau}|\geq\sigma\cdot\sqrt{n}, let ξ:=T\xi:=T. Since

|p(ξ)|=|p(T)|=|τ|σn,|\mathcal{I}_{p}^{(\xi)}|=|\mathcal{I}_{p}^{(T)}|=|\mathcal{I}_{\tau}|\geq\sigma\cdot\sqrt{n},

the Claim from Case 1 and Equation (14) hold for ξ=T\xi=T; thus

E[i=1Tbup(i1)]O(n),E[\sum_{i=1}^{T}\frac{b}{u-p^{(i-1)}}]\in O(n),
E[i=1T|p(i)|]2n+E[i=1Tbup(i1)];\frac{E[\sum_{i=1}^{T}|\mathcal{I}_{p}^{(i)}|]}{2}\leq n+E[\sum_{i=1}^{T}\frac{b}{u-p^{(i-1)}}];
E[i=1T|p(i)|]O(n).\implies E[\sum_{i=1}^{T}|\mathcal{I}_{p}^{(i)}|]\in O(n).

If σnn|τ|\sigma\cdot\sqrt{n}\geq n\geq|\mathcal{I}_{\tau}|, then σn\sigma\geq\sqrt{n}, which implies nσ2n\leq\sigma^{2}. However, u,l,bu,l,b are given independently of nn and so σ\sigma is fixed. Thus for asymptotic analysis, nσ2n\leq\sigma^{2} does not hold for sufficiently large nn.

Combining Cases 1 and 2, $E[\sum_{i=1}^{T}|\mathcal{I}_{p}^{(i)}|]\in O(n)$. Hence $O(n)$ operations are used for scanning/prefix-sum. All other operations, i.e. assigning $\mathcal{I}$ and $\mathcal{I}_{p}$, are within a constant factor of the scanning operations. ∎
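
For reference, here is a minimal sketch of the pivot-and-discard loop analyzed above (our own reconstruction of a Michelot-style iteration, not the paper's Algorithm 3 verbatim): each pass averages the surviving entries and drops those at or below the pivot, and the fixed-point pivot is $\tau$.

```python
import numpy as np

def michelot(d, b):
    """Pivot-and-discard: average the candidate active set, drop entries <= pivot,
    repeat until the set is stable; the fixed-point pivot is tau."""
    active = np.asarray(d, dtype=float)
    while True:
        p = (active.sum() - b) / active.size
        survivors = active[active > p]
        if survivors.size == active.size:   # no entry removed: p = tau
            return np.maximum(np.asarray(d) - p, 0.0), p
        active = survivors

v, tau = michelot(np.random.uniform(0.0, 1.0, 10**6), b=1.0)
assert abs(v.sum() - 1.0) < 1e-6 and (v >= 0).all()
```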

We assume uniformly distributed inputs, d1,,dnd_{1},\dots,d_{n} are i.i.dU[l,u]\mathrm{i.i.d}\sim U[l,u], and we have

Lemma 2 Filter provides a pivot pp such that τpidib||\tau\geq p\geq\frac{\sum_{i\in\mathcal{I}}d_{i}-b}{|\mathcal{I}|}.

Proof.

The upper bound pτp\leq\tau is given by construction of pp (see (Condat 2016, Section 3, Paragraph 2)).

We can establish the lower bound on $p$ by considering the sequence $p^{(1)}\leq...\leq p^{(n)}\in\mathbb{R}$, which represents the initial and subsequent (intermediate) values of $p$ from the first outer for-loop on line 2 of Algorithm 4 (from the main body), and their corresponding index sets $\mathcal{I}_{p}^{(1)}\subseteq...\subseteq\mathcal{I}_{p}^{(n)}$. Filter initializes with $p^{(1)}:=d_{1}-b$ and $\mathcal{I}_{p}^{(1)}:=\{1\}$. For $i=2,...,n$, if $d_{i}>p^{(i-1)}$,

p(i):=p(i1)+(dip(i1))/(|p(i1)|+1),p(i):=p(i1){i};\displaystyle p^{(i)}:=p^{(i-1)}+(d_{i}-p^{(i-1)})/(|\mathcal{I}_{p}^{(i-1)}|+1),\ \mathcal{I}_{p}^{(i)}:=\mathcal{I}_{p}^{(i-1)}\cup\{i\};

otherwise p(i):=p(i1)p^{(i)}:=p^{(i-1)}, and p(i):=p(i1)\mathcal{I}_{p}^{(i)}:=\mathcal{I}_{p}^{(i-1)}. Then it can be shown that p(i)=(jp(i)djb)/|p(i)|p^{(i)}=(\sum_{j\in\mathcal{I}_{p}^{(i)}}d_{j}-b)/|\mathcal{I}_{p}^{(i)}| (see (Condat 2016, Section 3, Paragraph 2)), and pp(n)p\geq p^{(n)} (see (Condat 2016, Section 3, Paragraph 5)). Now in terms of p(n)p^{(n)} we may write

idib||=\displaystyle\frac{\sum_{i\in\mathcal{I}}d_{i}-b}{|\mathcal{I}|}= ip(n)dib+i\p(n)di||\displaystyle\frac{\sum_{i\in\mathcal{I}_{p}^{(n)}}d_{i}-b+\sum_{i\in\mathcal{I}\backslash\mathcal{I}_{p}^{(n)}}d_{i}}{|\mathcal{I}|}
=\displaystyle= (ip(n)dib)|p(n)||||p(n)|+i\p(n)di||\displaystyle\frac{(\sum_{i\in\mathcal{I}_{p}^{(n)}}d_{i}-b)|\mathcal{I}_{p}^{(n)}|}{|\mathcal{I}||\mathcal{I}_{p}^{(n)}|}+\frac{\sum_{i\in\mathcal{I}\backslash\mathcal{I}_{p}^{(n)}}d_{i}}{|\mathcal{I}|}
=\displaystyle= |p(n)|||p(n)+i\p(n)di||\displaystyle\frac{|\mathcal{I}_{p}^{(n)}|}{|\mathcal{I}|}p^{(n)}+\frac{\sum_{i\in\mathcal{I}\backslash\mathcal{I}_{p}^{(n)}}d_{i}}{|\mathcal{I}|}
=\displaystyle= p(n)+i\p(n)di(|||p(n)|)p(n)||\displaystyle p^{(n)}+\frac{\sum_{i\in\mathcal{I}\backslash\mathcal{I}_{p}^{(n)}}d_{i}-(|\mathcal{I}|-|\mathcal{I}_{p}^{(n)}|)p^{(n)}}{|\mathcal{I}|}
=\displaystyle= p(n)+i\p(n)(dip(n))||.\displaystyle p^{(n)}+\frac{\sum_{i\in\mathcal{I}\backslash\mathcal{I}_{p}^{(n)}}(d_{i}-p^{(n)})}{|\mathcal{I}|}.

For any i\p(n)i\in\mathcal{I}\backslash\mathcal{I}_{p}^{(n)}, since p(i)p(n)\mathcal{I}_{p}^{(i)}\subseteq\mathcal{I}_{p}^{(n)} then ip(i)i\not\in\mathcal{I}_{p}^{(i)}. By construction of p(i1)p^{(i-1)} and p(i)\mathcal{I}_{p}^{(i)}, we have dip(i1)p(n)d_{i}\leq p^{(i-1)}\leq p^{(n)}. Thus, i\p(n)(dip(n))0\sum_{i\in\mathcal{I}\backslash\mathcal{I}_{p}^{(n)}}(d_{i}-p^{(n)})\leq 0, and p(n)(idib)/||p^{(n)}\geq(\sum_{i\in\mathcal{I}}d_{i}-b)/|\mathcal{I}|. So pp(n)(idib)/||p\geq p^{(n)}\geq(\sum_{i\in\mathcal{I}}d_{i}-b)/|\mathcal{I}|. ∎
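
For intuition, the following sketch implements the single-pass construction used in the proof above (the simplified recursion without Filter's reset branch; the names filter_pass and kept are ours). It maintains the running pivot $p^{(i)}$ over the kept indices and checks the lower bound of Lemma 2 numerically.

```python
import numpy as np

def filter_pass(d, b):
    """Streaming pass: keep index i only if d_i exceeds the current pivot; the pivot
    always equals (sum of kept entries - b) / (number kept), as in the proof of Lemma 2."""
    kept = [0]
    p = d[0] - b
    for i in range(1, len(d)):
        if d[i] > p:
            p += (d[i] - p) / (len(kept) + 1)
            kept.append(i)
    return np.array(kept), p

d = np.random.uniform(0.0, 1.0, 10**5)
kept, p_n = filter_pass(d, b=1.0)
assert p_n >= (d.sum() - 1.0) / d.size   # lower bound established in Lemma 2
```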

Now we introduce some notation in order to compare subsequent iterations of Condat’s method with iterations of Michelot’s method. Let tCnt_{C}\leq n and tMnt_{M}\leq n be the total number of iterations taken by Condat’s method and Michelot’s method (respectively) on a given instance. Let 0C,,nC\mathcal{I}^{C}_{0},...,\mathcal{I}^{C}_{n} be the active index sets per iteration for Condat’s method with corresponding pivots p0C,,pnCp^{C}_{0},...,p^{C}_{n}. Likewise, we denote the index sets and pivots of Michelot’s method as 0M,,nM\mathcal{I}^{M}_{0},...,\mathcal{I}^{M}_{n} and p0M,,pnMp^{M}_{0},...,p^{M}_{n}, respectively. If tC<nt_{C}<n then we set ptCC=ptC+1C==pnC=τp^{C}_{t_{C}}=p^{C}_{t_{C}+1}=...=p^{C}_{n}=\tau and tCC=tC+1C==nC=τ\mathcal{I}^{C}_{t_{C}}=\mathcal{I}^{C}_{t_{C}+1}=...=\mathcal{I}^{C}_{n}=\mathcal{I}_{\tau}; likewise for Michelot’s algorithm.  
 
Lemma 3 iCiM\mathcal{I}_{i}^{C}\subseteq\mathcal{I}_{i}^{M}, and piCpiMp^{C}_{i}\geq p^{M}_{i} for i=0,,ni=0,...,n.

Proof.

We will prove this by induction. For the base case, 0C\mathcal{I}_{0}^{C} is obtained by Filter. So 0C=0M\mathcal{I}_{0}^{C}\subseteq\mathcal{I}=\mathcal{I}_{0}^{M}. Moreover, from Lemma 2, p0Cp0Mp_{0}^{C}\geq p_{0}^{M}.

Now for any iteration $i\geq 1$, suppose $\mathcal{I}^{C}_{i}\subseteq\mathcal{I}^{M}_{i}$ and $p_{i}^{C}\geq p_{i}^{M}$. From line 5 in Algorithm 3 (from the main body), $\mathcal{I}^{M}_{i+1}:=\{j\in\mathcal{I}^{M}_{i}:d_{j}>p_{i}^{M}\}$. From (Condat 2016, Section 3, Paragraph 3), Condat's method uses a dynamic pivot between $p^{C}_{i}$ and $p^{C}_{i+1}$ to remove inactive entries that would otherwise remain in Michelot's method. Therefore, $\mathcal{I}^{C}_{i+1}\subseteq\{j\in\mathcal{I}^{C}_{i}:d_{j}>p_{i}^{C}\}\subseteq\mathcal{I}^{M}_{i+1}$, and moreover for any $j\in\mathcal{I}^{M}_{i+1}\backslash\mathcal{I}^{C}_{i+1}$, we have that $d_{j}\leq p^{C}_{i+1}$. Now observe that

pi+1Cpi+1M\displaystyle p^{C}_{i+1}-p^{M}_{i+1}
=\displaystyle= ji+1Cdjb|i+1C|ji+1Mdjb|i+1M|\displaystyle\frac{\sum_{j\in\mathcal{I}^{C}_{i+1}}d_{j}-b}{|\mathcal{I}^{C}_{i+1}|}-\frac{\sum_{j\in\mathcal{I}^{M}_{i+1}}d_{j}-b}{|\mathcal{I}^{M}_{i+1}|}
=\displaystyle= ji+1Cdjb|i+1C|ji+1Cdj+ji+1M\i+1Cdjb|i+1C|+|i+1M\i+1C|\displaystyle\frac{\sum_{j\in\mathcal{I}^{C}_{i+1}}d_{j}-b}{|\mathcal{I}^{C}_{i+1}|}-\frac{\sum_{j\in\mathcal{I}^{C}_{i+1}}d_{j}+\sum_{j\in\mathcal{I}^{M}_{i+1}\backslash\mathcal{I}^{C}_{i+1}}d_{j}-b}{|\mathcal{I}^{C}_{i+1}|+|\mathcal{I}^{M}_{i+1}\backslash\mathcal{I}^{C}_{i+1}|}
=\displaystyle= ji+1M\i+1C(pi+1Cdj)|i+1C|+|i+1M\i+1C|0,\displaystyle\frac{\sum_{j\in\mathcal{I}_{i+1}^{M}\backslash\mathcal{I}_{i+1}^{C}}(p_{i+1}^{C}-d_{j})}{|\mathcal{I}_{i+1}^{C}|+|\mathcal{I}_{i+1}^{M}\backslash\mathcal{I}_{i+1}^{C}|}\geq 0,

and so pi+1Cpi+1Mp^{C}_{i+1}\geq p^{M}_{i+1}. ∎

Corollary 2 i=1tC|iC|i=1tM|iM|\sum_{i=1}^{t_{C}}|\mathcal{I}_{i}^{C}|\leq\sum_{i=1}^{t_{M}}|\mathcal{I}_{i}^{M}|.

Proof.

Observe that both algorithms remove elements (without replacement) from their candidate active sets iC,iM\mathcal{I}^{C}_{i},\mathcal{I}^{M}_{i} at every iteration; moreover, they terminate with the pivot value τ\tau and so tCC=tMM=τ\mathcal{I}^{C}_{t_{C}}=\mathcal{I}^{M}_{t_{M}}=\mathcal{I}_{\tau}. So, together with Lemma 3, we have for i=0,,ni=0,...,n that τiCiM\mathcal{I}_{\tau}\subseteq\mathcal{I}_{i}^{C}\subseteq\mathcal{I}_{i}^{M}. So tMM=τ\mathcal{I}_{t_{M}}^{M}=\mathcal{I}_{\tau} implies tMC=τ\mathcal{I}_{t_{M}}^{C}=\mathcal{I}_{\tau}, and so tCtMt_{C}\leq t_{M}. Therefore i=1tC|iC|i=1tM|iM|\sum_{i=1}^{t_{C}}|\mathcal{I}_{i}^{C}|\leq\sum_{i=1}^{t_{M}}|\mathcal{I}_{i}^{M}|. ∎

Lemma 4 The worst-case runtime of Filter is O(n)O(n).

Proof.

Since $\mathcal{I}_{w}\subseteq\mathcal{I}$ at any point, Filter scans at most $2|\mathcal{I}|$ entries, each requiring $O(1)$ operations to update $p$. ∎

We assume uniformly distributed inputs, d1,,dnd_{1},\dots,d_{n} are i.i.dU[l,u]\mathrm{i.i.d}\sim U[l,u], and we have
 
Proposition 3 Condat’s method has an average runtime of O(n)O(n).

Proof.

Filter takes O(n)O(n) operations from Lemma 4. From Corollary 2, the total operations spent on scanning in Condat’s method is less than (or equal to) the average O(n)O(n) operations for Michelot’s method (established in Proposition 2); hence Condat’s average runtime is O(n)O(n). ∎

We assume uniformly distributed inputs, d1,,dnd_{1},\dots,d_{n} are i.i.dU[l,u]\mathrm{i.i.d}\sim U[l,u], and we have
 
Theorem 1 E[|τ|]<2b(n+1)ul+14+12E[|\mathcal{I}_{\tau}|]<\sqrt{\frac{2b(n+1)}{u-l}+\frac{1}{4}}+\frac{1}{2}.

Proof.

Sort $d$ such that $d_{\pi_{1}}\geq d_{\pi_{2}}\geq...\geq d_{\pi_{n}}$. Then the expected value of the $i$th order statistic is (see e.g. (Gentle 2009, p. 63))

E[dπi]=uin+1(ul).E[d_{\pi_{i}}]=u-\frac{i}{n+1}(u-l).

Define N:=|τ|N:=|\mathcal{I}_{\tau}| for ease of presentation. From Corollary 1,

i=1NdπibN=τ;\frac{\sum_{i=1}^{N}d_{\pi_{i}}-b}{N}=\tau;

and, together with dπN+1τ<dπNd_{\pi_{N+1}}\leq\tau<d_{\pi_{N}} (by definition of τ\mathcal{I}_{\tau}), we have

NdπN+1i=1Ndπib<NdπN,N\cdot d_{\pi_{N+1}}\leq\sum_{i=1}^{N}d_{\pi_{i}}-b<N\cdot d_{\pi_{N}},
E[i=1Ndπib]<E[NdπN].\implies E[\sum_{i=1}^{N}d_{\pi_{i}}-b]<E[N\cdot d_{\pi_{N}}]. (19)

Furthermore, from the Law of Total Expectation,

E[NdπN]=uE[N]E[N2]uln+1,E[N\cdot d_{\pi_{N}}]=uE[N]-E[N^{2}]\frac{u-l}{n+1},
E[i=1Ndπib]=uE[N]E[N(N+1)]ul2(n+1)b.E[\sum_{i=1}^{N}d_{\pi_{i}}-b]=uE[N]-E[N(N+1)]\frac{u-l}{2(n+1)}-b.

Substituting into (19) yields

E[N2]E[N]<2b(n+1)ul,E[N^{2}]-E[N]<\frac{2b(n+1)}{u-l},

Since E2[N]E[N2]E^{2}[N]\leq E[N^{2}], then

E2[N]E[N]E[N2]E[N]<2b(n+1)ul,E^{2}[N]-E[N]\leq E[N^{2}]-E[N]<\frac{2b(n+1)}{u-l},
$\implies E[|\mathcal{I}_{\tau}|]=E[N]<\sqrt{\frac{2b(n+1)}{u-l}+\frac{1}{4}}+\frac{1}{2}\in O(\sqrt{n}).$ (20) ∎

Lemma 5 Suppose $d_{1},...,d_{n}$ are $\mathrm{i.i.d.}$ from an arbitrary distribution $X$, with PDF $f_{X}$ and CDF $F_{X}$. Let $\epsilon$ be such that $0<\epsilon<1$, and let $t\in\mathbb{R}$ be such that $1-F_{X}(t)=\epsilon$. Then i) $|\mathcal{I}_{t}|\rightarrow\infty$ as $n\rightarrow\infty$, and ii) $P(\frac{|\mathcal{I}_{t}|}{n}\leq\epsilon)\rightarrow 1$ as $n\rightarrow\infty$.

Proof.

Since $f_{X}$ is a density function, the CDF $F_{X}$ is absolutely continuous (see e.g. (Charles and J. Laurie 2006, Page 59, Definition 2.1)); thus for any $0<\epsilon<1$, there exists $t\in\mathbb{R}$ such that $F_{X}(t)=1-\epsilon$.

For i=1,,ni=1,...,n, define the indicator variable

δi:={1,ifdi>t0,otherwise\delta_{i}:=\begin{cases}1,&\mbox{if}\ d_{i}>t\\ 0,&\mathrm{otherwise}\end{cases}

and let Sn:=i=1nδiS_{n}:=\sum_{i=1}^{n}\delta_{i}. So Sn=|t|S_{n}=|\mathcal{I}_{t}|, and we can show that it is binomially distributed.

Claim.

SnB(n,ϵ)S_{n}\sim B(n,\epsilon).

Proof: Observe that P(δi=1)=P(di>t)=ϵP(\delta_{i}=1)=P(d_{i}>t)=\epsilon, and P(δi=0)=P(dit)=1ϵP(\delta_{i}=0)=P(d_{i}\leq t)=1-\epsilon; thus δiBernoulli(ϵ)\delta_{i}\sim\mbox{Bernoulli}(\epsilon). Moreover, d1,,dnd_{1},...,d_{n} are independent, and so δ1,,δn\delta_{1},...,\delta_{n} are i.i.d. Bernoulli(ϵ)\mbox{Bernoulli}(\epsilon); consequently, Sn:=i=1nδiB(n,ϵ)S_{n}:=\sum_{i=1}^{n}\delta_{i}\sim B(n,\epsilon). \blacksquare

So, as nn\rightarrow\infty, we can apply the Central Limit Theorem (see e.g. Fischer (2011)):

Sn:=Snnϵnϵ(1ϵ)N(0,1).S_{n}^{*}:=\frac{S_{n}-n\epsilon}{\sqrt{n\epsilon(1-\epsilon)}}\sim N(0,1). (21)

Denote Φ(q)=q12πex22𝑑x\Phi(q)=\int_{-\infty}^{q}\frac{1}{\sqrt{2\pi}}e^{-\frac{x^{2}}{2}}\,dx to be the CDF of the standard normal distribution. Consider (for any qq\in\mathbb{R}) the right-tail probability

P(Snnϵnϵ(1ϵ)q)=1Φ(q)=Φ(q),P(\frac{S_{n}-n\epsilon}{\sqrt{n\epsilon(1-\epsilon)}}\geq-q)=1-\Phi(-q)=\Phi(q), (22)
P(Snnϵqnϵ(1ϵ))=Φ(q),P(S_{n}\geq n\epsilon-q\sqrt{n\epsilon(1-\epsilon)})=\Phi(q),
P(|t|qnϵ(1ϵ)+nϵ)Φ(q),\implies P(|\mathcal{I}_{t}|\geq\lfloor-q\sqrt{n\epsilon(1-\epsilon)}+n\epsilon\rfloor)\geq\Phi(q), (23)

where .\lfloor.\rfloor is the floor function.

Setting q=2lognq=\sqrt{2\log n} yields

P(|t|2(nlogn)ϵ(1ϵ)+nϵ)Φ(2logn).P(|\mathcal{I}_{t}|\geq\lfloor-\sqrt{2(n\log n)\epsilon(1-\epsilon)}+n\epsilon\rfloor)\geq\Phi(\sqrt{2\log n}).

Consider the right-hand side, limnΦ(2logn)\lim_{n\rightarrow\infty}\Phi(\sqrt{2\log n}). Since Φ\Phi is a CDF, it is monotonically increasing, continuous, and converges to 11; thus limnΦ(2logn)=1\lim_{n\rightarrow\infty}\Phi(\sqrt{2\log n})=1. So as nn approaches infinity,

P(|t|2(nlogn)ϵ(1ϵ)+nϵ)=1,P(|\mathcal{I}_{t}|\geq\lfloor-\sqrt{2(n\log n)\epsilon(1-\epsilon)}+n\epsilon\rfloor)=1, (24)

which establishes condition (i).

Now consider the left-tail probability,

P(Snnϵnϵ(1ϵ)q)=Φ(q)P(\frac{S_{n}-n\epsilon}{\sqrt{n\epsilon(1-\epsilon)}}\leq q)=\Phi(q)
P(|t|qnϵ(1ϵ)+nϵ)Φ(q).\implies P(|\mathcal{I}_{t}|\leq\lceil q\sqrt{n\epsilon(1-\epsilon)}+n\epsilon\rceil)\geq\Phi(q).

Again setting q=2lognq=\sqrt{2\log n}, we have that limnqnϵ(1ϵ)+nϵ/n=ϵ\lim_{n\rightarrow\infty}\lceil q\sqrt{n\epsilon(1-\epsilon)}+n\epsilon\rceil/n=\epsilon. So as nn approaches infinity,

P(|t|nϵ)=1,P(\frac{|\mathcal{I}_{t}|}{n}\leq\epsilon)=1,

which establishes condition (ii). ∎

Theorem 2 Suppose $d_{1},...,d_{n}$ are $\mathrm{i.i.d.}$ from an arbitrary distribution $X$, with PDF $f_{X}$ and CDF $F_{X}$. Then, for any $\epsilon>0$, $P(\frac{|\mathcal{I}_{\tau}|}{n}\leq\epsilon)\rightarrow 1$ as $n\rightarrow\infty$.

Proof.

Case 1: ϵ1\epsilon\geq 1
Since τ\mathcal{I}_{\tau}\subseteq\mathcal{I} and ||=n|\mathcal{I}|=n, |τ|n1\frac{|\mathcal{I}_{\tau}|}{n}\leq 1. Hence, P(|τ|nϵ)=1P(\frac{|\mathcal{I}_{\tau}|}{n}\leq\epsilon)=1.
Case 2: 0<ϵ<10<\epsilon<1
Since $f_{X}$ is a density function, $F_{X}$ is absolutely continuous. So there exists $t\in\mathbb{R}$ such that $1-F_{X}(t)=\epsilon$. We shall first establish that $P(\tau>t)\rightarrow 1$ as $n\rightarrow\infty$.

From Corollary 1, τ>tf(t)>0\tau>t\iff f(t)>0, and so

P(τ>t)=\displaystyle P(\tau>t)= P(f(t)>0)\displaystyle P(f(t)>0)
=\displaystyle= P(itdib|t|>t)\displaystyle P(\frac{\sum_{i\in\mathcal{I}_{t}}d_{i}-b}{|\mathcal{I}_{t}|}>t)
=\displaystyle= 1P(itdi|t|t+b)\displaystyle 1-P(\sum_{i\in\mathcal{I}_{t}}d_{i}\leq|\mathcal{I}_{t}|\cdot t+b)
=\displaystyle= 1P(itdiE[itdi]|t|t+bE[itdi]).\displaystyle 1-P(\sum_{i\in\mathcal{I}_{t}}d_{i}-E[\sum_{i\in\mathcal{I}_{t}}d_{i}]\leq|\mathcal{I}_{t}|\cdot t+b-E[\sum_{i\in\mathcal{I}_{t}}d_{i}]). (25)

Now observe that, for any iti\in\mathcal{I}_{t}, did_{i} can be treated as a conditional variable: di|di>td_{i}|_{d_{i}>t}. Since all such did_{i} are i.i.d, we may denote the (shared) expected value μ:=E[di|di>t]\mu:=E[d_{i}|d_{i}>t] and variance as σ2:=Var[di|di>t]\sigma^{2}:=\mbox{Var}[d_{i}|d_{i}>t]. Moreover, by definition of t\mathcal{I}_{t} we have

E[di|it]=μ>t.E[d_{i}|i\in\mathcal{I}_{t}]=\mu>t. (26)

Together with condition (i) of Lemma 5, this implies that the right-hand side of the probability in (25) is negative as nn\rightarrow\infty:

|t|t+bE[itdi]=|t|(tμ)+b<0.|\mathcal{I}_{t}|\cdot t+b-E[\sum_{i\in\mathcal{I}_{t}}d_{i}]=|\mathcal{I}_{t}|\cdot(t-\mu)+b<0. (27)

It follows, continuing from Equation (25), that

P(f(t)>0)\displaystyle P(f(t)>0) 1P(|itdiE[itdi]|E[itdi]|t|tb)\displaystyle\geq 1-P(|\sum_{i\in\mathcal{I}_{t}}d_{i}-E[\sum_{i\in\mathcal{I}_{t}}d_{i}]|\geq E[\sum_{i\in\mathcal{I}_{t}}d_{i}]-|\mathcal{I}_{t}|\cdot t-b)
1Var(itdi)(E[itdi]|t|tb)2(Chebyshev’s inequality)\displaystyle\geq 1-\frac{\texttt{Var}(\sum_{i\in\mathcal{I}_{t}}d_{i})}{(E[\sum_{i\in\mathcal{I}_{t}}d_{i}]-|\mathcal{I}_{t}|\cdot t-b)^{2}}\ \ \mbox{(Chebyshev's inequality)}
=1σ2|t|(μ|t|t|t|b)2.\displaystyle=1-\frac{\sigma^{2}|\mathcal{I}_{t}|}{(\mu|\mathcal{I}_{t}|-t|\mathcal{I}_{t}|-b)^{2}}.

Thus we have the desired result:

P(τ>t)1σ2|t|(μ|t|t|t|b)21asn.P(\tau>t)\geq 1-\frac{\sigma^{2}|\mathcal{I}_{t}|}{(\mu|\mathcal{I}_{t}|-t|\mathcal{I}_{t}|-b)^{2}}\rightarrow 1\ \mbox{as}\ n\rightarrow\infty. (28)

Now observe that $\tau>t$ implies $\mathcal{I}_{\tau}\subseteq\mathcal{I}_{t}$, and subsequently $|\mathcal{I}_{\tau}|\leq|\mathcal{I}_{t}|$. Together with condition (ii) from Lemma 5 we have

$P(\frac{|\mathcal{I}_{\tau}|}{n}\leq\epsilon)\geq P(\frac{|\mathcal{I}_{t}|}{n}\leq\epsilon)\rightarrow 1\ \mbox{as}\ n\rightarrow\infty.$ ∎

Corollary 2 For projection of $d\in\mathbb{R}^{n}$ onto a simplex $\Delta_{b}$, if $b\in o(n)$, then the conclusion of Theorem 2 still holds.

Proof.

From Equation (24), $|\mathcal{I}_{t}|\in\Theta(n)$ with high probability. If $b\in o(n)$, then since Equation (26) implies $t-\mu<0$, Inequality (27) still holds for sufficiently large $n$. As a result, the conclusion of Theorem 2 still holds. ∎

Proposition 4 Let d^\hat{d} be a subvector of dd with mnm\leq n entries; moreover, without loss of generality suppose the subvector contains the first mm entries. Let v^\hat{v}^{*} be the projection of d^\hat{d} onto the simplex Δ^:={vm|i=1mvi=b,v0}\hat{\Delta}:=\{v\in\mathbb{R}^{m}\ |\ \sum_{i=1}^{m}v_{i}=b,v\geq 0\}, and τ^\hat{\tau} be the corresponding pivot value. Then, ττ^\tau\geq\hat{\tau}. Consequently, for 1im1\leq i\leq m we have that v^i=0vi=0\hat{v}_{i}^{*}=0\implies v_{i}^{*}=0.

Proof.

Define two index sets,

τ^:={i=1,,n|di>τ^,did};\mathcal{I}_{\hat{\tau}}:=\{i=1,...,n\ |\ d_{i}>\hat{\tau},d_{i}\in d\};
^τ^:={i=1,,m|di>τ^,did^}.\hat{\mathcal{I}}_{\hat{\tau}}:=\{i=1,...,m\ |\ d_{i}>\hat{\tau},d_{i}\in\hat{d}\}.

As d^\hat{d} is a subvector of dd, we have ^τ^τ^\hat{\mathcal{I}}_{\hat{\tau}}\subseteq\mathcal{I}_{\hat{\tau}}; thus,

iτ^(diτ^)i^τ^(diτ^)=b,\sum_{i\in\mathcal{I}_{\hat{\tau}}}(d_{i}-\hat{\tau})\geq\sum_{i\in\hat{\mathcal{I}}_{\hat{\tau}}}(d_{i}-\hat{\tau})=b,
iτ^dib|τ^|τ^.\implies\frac{\sum_{i\in\mathcal{I}_{\hat{\tau}}}d_{i}-b}{|\mathcal{I}_{\hat{\tau}}|}\geq\hat{\tau}.

From Corollary 1, $\tau\geq\hat{\tau}$; from Proposition 1 it thus follows that $\mathcal{I}_{\hat{\tau}}\supseteq\mathcal{I}_{\tau}$. Consequently, if $\hat{v}_{i}^{*}=0$ for some $1\leq i\leq m$, then $d_{i}\leq\hat{\tau}\leq\tau$ and so $v_{i}^{*}=\max\{d_{i}-\tau,0\}=0$. ∎

We assume uniformly distributed inputs, d1,,dnd_{1},\dots,d_{n} are i.i.dU[l,u]\mathrm{i.i.d}\sim U[l,u], and we have
 
Proposition 5 Parallel Pivot and Partition with either the median, random, or Michelot’s pivot rule, has an average runtime of O(nk+kn)O(\frac{n}{k}+\sqrt{kn}).

Proof.

The algorithm starts by distributing projections. Pivot and Partition has linear runtime on average with any of the stated pivot rules, and so for the iith core with input did^{i}, whose size is O(nk)O(\frac{n}{k}), we have an average runtime of O(nk)O(\frac{n}{k}). Now from Theorem 1, each core returns in expectation at most O(nk)O(\sqrt{\frac{n}{k}}) active terms. So the reduced input v^\hat{v} (line 3 of Algorithm 8 from the main body) will have at most O(kn)O(\sqrt{kn}) entries on average; thus the final projection will incur an expected number of operations in O(kn)O(\sqrt{kn}). ∎
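
The distributed structure analyzed here can be sketched as follows (a serial simulation of the parallel scheme, not Algorithm 8 verbatim; in a real implementation the chunk loop runs on $k$ cores and the local solver can be any pivot rule). Entries that are inactive in a local sub-projection are inactive globally by Proposition 4, so only the locally positive entries reach the final projection.

```python
import numpy as np

def simplex_proj(d, b):
    """Serial sort-and-scan projection, used here as the local solver."""
    s = np.sort(d)[::-1]
    running = (np.cumsum(s) - b) / np.arange(1, d.size + 1)
    K = np.max(np.nonzero(running < s)[0]) + 1
    return np.maximum(d - running[K - 1], 0.0)

def distributed_proj(d, b, k):
    """Split d into k chunks, project each locally, keep only locally active
    indices (valid by Proposition 4), then run one final projection on them."""
    kept = []
    for idx in np.array_split(np.arange(d.size), k):   # distributed step
        local = simplex_proj(d[idx], b)
        kept.append(idx[local > 0])
    kept = np.concatenate(kept)
    v = np.zeros_like(d)
    v[kept] = simplex_proj(d[kept], b)                 # reduced final projection
    return v

d = np.random.uniform(0.0, 1.0, 10**6)
v = distributed_proj(d, b=1.0, k=8)
assert abs(v.sum() - 1.0) < 1e-6 and (v >= 0).all()
```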

We assume uniformly distributed inputs, d1,,dnd_{1},\dots,d_{n} are i.i.dU[l,u]\mathrm{i.i.d}\sim U[l,u], and we have
 
Proposition 6 Let p\mathcal{I}_{p} be the output of Filter(d,b)\mathrm{Filter}(d,b). Then E[|p|]O(n23)E[|\mathcal{I}_{p}|]\in O(n^{\frac{2}{3}}).

Proof.

We assume that the if in line 5 from Algorithm 4 (from the main body) does not trigger; this is a conservative assumption as otherwise more elements would be removed from p\mathcal{I}_{p}, reducing the number of iterations.

Let p(j)p^{(j)} be the jjth pivot with p(1):=d1bp^{(1)}:=d_{1}-b (from line 1 in Algorithm 4 from the main body), and subsequent pivots corresponding to the for-loop iterations of line 2. Whenever Filter finds di>p(j)d_{i}>p^{(j)}, it updates pp as follows:

p(j+1):=p(j)+dip(j)j+1.p^{(j+1)}:=p^{(j)}+\frac{d_{i}-p^{(j)}}{j+1}. (29)

Moreover, when di>p(j)d_{i}>p^{(j)} is found, diU[p(j),u]d_{i}\sim U[p^{(j)},u]; thus, from the Law of Total Expectation, E[di]=E[E[di|p(j)]]=(u+E[p(j)])/2E[d_{i}]=E[E[d_{i}|p^{(j)}]]=(u+E[p^{(j)}])/2. Together with (29),

E[p(j+1)]=\displaystyle E[p^{(j+1)}]= E[p(j)+dip(j)j+1]\displaystyle E[p^{(j)}+\frac{d_{i}-p^{(j)}}{j+1}]
=\displaystyle= E[p(j)]+E[di]E[p(j)]j+1\displaystyle E[p^{(j)}]+\frac{E[d_{i}]-E[p^{(j)}]}{j+1}
=\displaystyle= E[p(j)]+(u+E[p(j)])/2E[p(j)]j+1\displaystyle E[p^{(j)}]+\frac{(u+E[p^{(j)}])/2-E[p^{(j)}]}{j+1}
=\displaystyle= (2j+1)E[p(j)]+u2j+2;\displaystyle\frac{(2j+1)E[p^{(j)}]+u}{2j+2};
E[p(j+1)]u=\displaystyle\implies E[p^{(j+1)}]-u= (E[p(j)]u)2j+12j+2.\displaystyle(E[p^{(j)}]-u)\frac{2j+1}{2j+2}.

Using the initial value E[p(1)]=E[E[p(1)|d1]]=E[d1]b=u+l2bE[p^{(1)}]=E[E[p^{(1)}|d_{1}]]=E[d_{1}]-b=\frac{u+l}{2}-b, we can obtain a closed-form representation for the recursive formula:

E[p(j)]u=2(b+ul2)i=0j12i+12i+2.E[p^{(j)}]-u=-2(b+\frac{u-l}{2})\prod_{i=0}^{j-1}\frac{2i+1}{2i+2}. (30)

Now let LjL_{j} denote the number of terms Filter scans after calculating p(j)p^{(j)} and before finding some di>p(j)d_{i}>p^{(j)}. Since Filter scans nn terms in total (from initialize and n1n-1 calls to line 4), then

j=1|p|1Ljn1<j=1|p|Lj;\sum_{j=1}^{|\mathcal{I}_{p}|-1}L_{j}\leq n-1<\sum_{j=1}^{|\mathcal{I}_{p}|}L_{j};
1+j=1|p|1E[Lj]n<1+j=1|p|E[Lj].\implies 1+\sum_{j=1}^{|\mathcal{I}_{p}|-1}E[L_{j}]\leq n<1+\sum_{j=1}^{|\mathcal{I}_{p}|}E[L_{j}]. (31)

We can show LjL_{j} has a geometric distribution as follows.

Claim.

LjGeo(up(j)ul)L_{j}\sim\mbox{Geo}(\frac{u-p^{(j)}}{u-l}).

Proof: For each term diU[l,u]d_{i}\sim U[l,u], we have

P(di>p(j))=1(p(j)l)/(ul)=(up(j))/(ul).P(d_{i}>p^{(j)})=1-(p^{(j)}-l)/(u-l)=(u-p^{(j)})/(u-l).

So di>p(j)d_{i}>p^{(j)} can be interpreted as a Bernoulli trial. Hence LjL_{j} is distributed with Geo(up(j)ul)\mbox{Geo}(\frac{u-p^{(j)}}{u-l}). \blacksquare

Now, applying Jensen’s Inequality to E[Lj]E[L_{j}],

E[Lj]=E[ulup(j)]uluE[p(j)],E[L_{j}]=E[\frac{u-l}{u-p^{(j)}}]\leq\frac{u-l}{u-E[p^{(j)}]},
ln(E[Lj])E[ln(Lj)]=ln(ul)ln(uE[p(j)]),\implies\ln(E[L_{j}])\leq E[\ln(L_{j})]=\ln(u-l)-\ln(u-E[p^{(j)}]),

and together with (30) we have

ln(E[Lj])\displaystyle\ln(E[L_{j}])\leq ln(ul2b+ul)+i=0j1ln(2i+22i+1)\displaystyle\ln(\frac{u-l}{2b+u-l})+\sum_{i=0}^{j-1}\ln(\frac{2i+2}{2i+1}) (32)
=\displaystyle= ln(ul2b+ul)+i=1jln(2i2i1)\displaystyle\ln(\frac{u-l}{2b+u-l})+\sum_{i=1}^{j}\ln(\frac{2i}{2i-1})

Now observe that

limiln(2i2i1)1i=limi2i12i4i24i(2i1)21i2=limi2i2(2i1)(2i)=12,\lim_{i\rightarrow\infty}\frac{\ln(\frac{2i}{2i-1})}{\frac{1}{i}}=\lim_{i\rightarrow\infty}\frac{\frac{2i-1}{2i}\frac{4i-2-4i}{(2i-1)^{2}}}{-\frac{1}{i^{2}}}=\lim_{i\rightarrow\infty}\frac{2i^{2}}{(2i-1)(2i)}=\frac{1}{2},

where the first equality is from L’Hôpital’s rule. Thus E[Lj]Θ(exp(i=1j12i))E[L_{j}]\in\Theta(\mbox{exp}(\sum_{i=1}^{j}\frac{1}{2i})). Furthermore, we have the classical bound on the harmonic series:

$\sum_{i=1}^{j}\frac{1}{i}=\ln(j)+\gamma+O(\tfrac{1}{j})\leq\ln(j)+1,$

where γ\gamma is the Euler-Mascheroni constant, see e.g. Lagarias (2013); thus,

$e^{\sum_{i=1}^{j}\frac{1}{2i}}\leq e^{\frac{1}{2}(\ln(j)+1)}=\sqrt{j}\cdot e^{\frac{1}{2}},$

which implies E[Lj]O(j)E[L_{j}]\in O(\sqrt{j}). Moreover (see e.g. (Knuth 1998, Section 1.2.7)),

j=1|p|j=23|p||p|+32+o(|p|),\sum_{j=1}^{|\mathcal{I}_{p}|}\sqrt{j}=\frac{2}{3}|\mathcal{I}_{p}|\sqrt{|\mathcal{I}_{p}|+\frac{3}{2}}+o(\sqrt{|\mathcal{I}_{p}|}),

which implies 1+j=1|p|1E[Lj]O(|p|32)1+\sum_{j=1}^{|\mathcal{I}_{p}|-1}E[L_{j}]\in O(|\mathcal{I}_{p}|^{\frac{3}{2}}). From (31) it follows that |p|O(n23)|\mathcal{I}_{p}|\in O(n^{\frac{2}{3}}). ∎

We assume uniformly distributed inputs, d1,,dnd_{1},\dots,d_{n} are i.i.dU[l,u]\mathrm{i.i.d}\sim U[l,u], and we have
 
Proposition 7 Parallel Condat’s method has an average complexity O(nk+kn23)O(\frac{n}{k}+\sqrt[3]{kn^{2}}).

Proof.

In Distributed Filter (Algorithm 9 from the main body), each core is given input i\mathcal{I}_{i} with |i|O(nk)|\mathcal{I}_{i}|\in O(\frac{n}{k}). (serial) Filter has linear runtime from Lemma 4, and the for loop in line 5 of Algorithm 9 (from the main body) will scan at most |i||\mathcal{I}_{i}| terms. Thus distributed Filter runtime is in O(nk)O(\frac{n}{k}).

From Proposition 6, E[|i|]O((n/k)23)E[|\mathcal{I}_{i}|]\in O((n/k)^{\frac{2}{3}}). Since the output of Distributed Filter is p=i=1ki\mathcal{I}_{p}=\cup_{i=1}^{k}\mathcal{I}_{i}, we have that (given dd is i.i.d.) E[|p|]O(kn23)E[|\mathcal{I}_{p}|]\in O(\sqrt[3]{kn^{2}}).

Parallel Condat’s method takes the input from Distributed Filter and applies serial Condat’s method (Algorithm 10 from the main body, lines 2-7), excluding the serial Filter. From Proposition 3, Condat’s method has average linear runtime, so this application of serial Condat’s method has average complexity O(kn23)O(\sqrt[3]{kn^{2}}). ∎

Lemma 6 If XU[l,u]X\sim U[l,u] and l<t<ul<t<u, then X|tU[t,u]X|t\sim U[t,u].

Proof.

The CDF of XX is FX(x)=P(Xx)=(xl)/(ul)F_{X}(x)=P(X\leq x)=(x-l)/(u-l). Then,

FX|t(x)=P(Xx|X>t)=P(Xx,X>t)P(X>t)=(xtul)/(utul)=xtut,F_{X|t}(x)=P(X\leq x|X>t)=\frac{P(X\leq x,X>t)}{P(X>t)}=(\frac{x-t}{u-l})/(\frac{u-t}{u-l})=\frac{x-t}{u-t},

which implies X|tU[t,u]X|t\sim U[t,u]. ∎

Lemma 7 If X,YX,Y are independent random variables, then X|t,Y|tX|t,Y|t are independent.

Proof.

X,YX,Y are independent and so P(Xx,Yy)=P(Xx)P(Yy)P(X\leq x,Y\leq y)=P(X\leq x)P(Y\leq y) for any x,yx,y\in\mathbb{R}. So considering the joint conditional probability,

P(Xx,Yy|X>t,Y>t)\displaystyle P(X\leq x,Y\leq y|X>t,Y>t)
=P(Xx,Yy,X>t,Y>t)P(X>t,Y>t)\displaystyle=\frac{P(X\leq x,Y\leq y,X>t,Y>t)}{P(X>t,Y>t)}
=P(Xx,X>t)P(X>t)P(Yy,Y>t)P(Y>t)\displaystyle=\frac{P(X\leq x,X>t)}{P(X>t)}\frac{P(Y\leq y,Y>t)}{P(Y>t)}
$\displaystyle=P(X\leq x|X>t)\,P(Y\leq y|Y>t).$ ∎

Proposition 9 If d1,,dnd_{1},...,d_{n} i.i.d. U[l,u]\sim U[l,u] and l<t<ul<t<u, then {di|it}\{d_{i}\ |\ i\in\mathcal{I}_{t}\} i.i.d. U[t,u]\sim U[t,u].

Proof.

From Lemma 6, for any iti\in\mathcal{I}_{t}, diU[t,u]d_{i}\sim U[t,u]. From Lemma 7, for any i,jti,j\in\mathcal{I}_{t}, iji\neq j, di,djd_{i},d_{j} are conditionally independent. So, {di|it}\{d_{i}\ |\ i\in\mathcal{I}_{t}\} i.i.d. U[t,u]\sim U[t,u]. ∎

Appendix B Algorithm Descriptions

B.1 Bucket Method

Pivot and Partition selects one pivot in each iteration to partition dd and applies this recursively in order to create sub-partitions in the manner of a binary search. The Bucket Method, developed by Perez et al. (2020b), can be interpreted as a modification that uses multiple pivots and partitions (buckets) per iteration.

Input: vector d=(d1,,dn)d=(d_{1},\cdots,d_{n}), scaling factor bb, bucket number cc, maximum number of iterations TT.
Output: projection vv^{*}.
1 Set ={1,,n}\mathcal{I}=\{1,...,n\}, τ=\mathcal{I}_{\tau}=\emptyset;
2 for t=1:Tt=1:T do
3       Set 1,,c:=\mathcal{I}_{1},...,\mathcal{I}_{c}:=\emptyset;
4       for j=1:cj=1:c do
5             Set pj:=(maxi{di}mini{di})(cj)/c+mini{di}p_{j}:=(\max_{i\in\mathcal{I}}\{d_{i}\}-\min_{i\in\mathcal{I}}\{d_{i}\})\cdot(c-j)/c+\min_{i\in\mathcal{I}}\{d_{i}\};
6             for i:dipji\in\mathcal{I}:d_{i}\geq p_{j} do
7                   Set j:=j{i}\mathcal{I}_{j}:=\mathcal{I}_{j}\cup\{i\}, :=\{i}\mathcal{I}:=\mathcal{I}\backslash\{i\}
8             end for
9            
10       end for
11      for j=1:cj=1:c do
12             Set p=iτdi+ijdib|τ|+|j|p=\frac{\sum_{i\in\mathcal{I}_{\tau}}d_{i}+\sum_{i\in\mathcal{I}_{j}}d_{i}-b}{|\mathcal{I}_{\tau}|+|\mathcal{I}_{j}|};
13             if ppjp\geq p_{j} then
14                   Set :=j\mathcal{I}:=\mathcal{I}_{j};
15                   Break the inner loop;
16                  
17            else if j<cj<c &\& p>maxij+1{di}p>\max_{i\in\mathcal{I}_{j+1}}\{d_{i}\} then
18                   Set τ:=τj\mathcal{I}_{\tau}:=\mathcal{I}_{\tau}\cup\mathcal{I}_{j};
19                   Break the outer loop;
20                  
21            else
22                   τ:=τj\mathcal{I}_{\tau}:=\mathcal{I}_{\tau}\cup\mathcal{I}_{j};
23                  
24             end if
25            
26       end for
27      
28 end for
29Set τ:=iτdib|τ|\tau:=\frac{\sum_{i\in\mathcal{I}_{\tau}}d_{i}-b}{|\mathcal{I}_{\tau}|};
30 Set vi:=max{diτ,0}v_{i}^{*}:=\mbox{max}\{d_{i}-\tau,0\} for all 1in1\leq i\leq n;
return v:=(v1,,vn)v^{*}:=(v_{1}^{*},\cdots,v_{n}^{*}).
Algorithm 11 Bucket method

The algorithm, presented as Algorithm 11 is initialized with tuning parameters TT, the maximum number of iterations, and cc, the number of buckets with which to subdivide the data. In each iteration the algorithm partitions the problem into the buckets j\mathcal{I}_{j} with the inner for loop of line 4, and then calculates corresponding pivot values in the inner for loop of line 10.

The tuning parameters can be determined as follows. Suppose we want the algorithm to find a (final) pivot τ¯\bar{\tau} within some absolute numerical tolerance DD of the true pivot τ\tau, i.e. such that |τ¯τ|D|\bar{\tau}-\tau|\leq D. This can be ensured (see Perez et al. (2020b)) by setting

T=logcRD,T=\log_{c}\frac{R}{D},

where $R:=\max_{i\in\mathcal{I}}\{d_{i}\}-\min_{i\in\mathcal{I}}\{d_{i}\}$ denotes the range of $d$. Perez et al. (2020b) prove the worst-case complexity is $O((n+c)\log_{c}(R/D))$.
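
For instance, under this formula, an input with range $R=1$, target tolerance $D=10^{-6}$, and $c=10$ buckets requires $T=\log_{10}(10^{6})=6$ outer iterations.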

We assume uniformly distributed inputs, d1,,dnd_{1},\dots,d_{n} are i.i.dU[l,u]\mathrm{i.i.d}\sim U[l,u], and we have
 
Proposition 8 The Bucket method has an average runtime of O(cn)O(cn).

Proof.

Let (t)\mathcal{I}^{(t)} denote the index set \mathcal{I} at the start of iteration tt in the outer for loop (line 2), and j(t)\mathcal{I}_{j}^{(t)} denote the index set of the jjth bucket, j\mathcal{I}_{j}, at the end of the first inner for loop (line 4).

For a given outer for loop iteration $t$ (line 2), the first inner for loop (line 4) uses $O(c|\mathcal{I}^{(t)}|)$ operations. Note that the $\max$ and $\min$ on line 5 can be reused across iterations, and the nested for loop on line 6 has $|\mathcal{I}^{(t)}|$ iterations. The second inner for loop (line 10) also uses $O(c|\mathcal{I}^{(t)}|)$ operations. In line 11, the first sum $\sum_{i\in\mathcal{I}_{\tau}}d_{i}$ can be updated dynamically (in the manner of a scan) as a cumulative sum as $\mathcal{I}_{\tau}$ is updated in line 16 or 19, thus requiring a constant number of operations per iteration $j$. The second sum $\sum_{i\in\mathcal{I}_{j}}d_{i}$ is bounded above by $O(|\mathcal{I}^{(t)}|)$ since $\mathcal{I}_{j}\subseteq\mathcal{I}^{(t)}$. Thus each iteration $t$ of the outer for loop uses $O(c|\mathcal{I}^{(t)}|)$ operations.

Since d1,,dnd_{1},...,d_{n} are i.i.d U[l,u]\sim U[l,u], then from Proposition 9, the terms from each sub-partition are also i.i.d uniform. So for any t=1,,Tt=1,...,T and j=1,,cj=1,...,c, E[|j(t)|]=E[|(t)|]/cE[|\mathcal{I}_{j}^{(t)}|]=E[|\mathcal{I}^{(t)}|]/c. From line 13, E[|(t+1)|]=E[|(t)|]/cE[|\mathcal{I}^{(t+1)}|]=E[|\mathcal{I}^{(t)}|]/c. Since E[|(1)|]=nE[|\mathcal{I}^{(1)}|]=n then E[|(t)|]=n/ct1E[|\mathcal{I}^{(t)}|]=n/c^{t-1}; thus

E[t=1T|(t)|]=t=1logc(R/D)nct1=cc1n(1DR).\displaystyle E[\sum_{t=1}^{T}|\mathcal{I}^{(t)}|]=\sum_{t=1}^{\log_{c}(R/D)}\frac{n}{c^{t-1}}=\frac{c}{c-1}n(1-\frac{D}{R}).

Therefore, E[t=1Tc|(t)|]O(cn)E[\sum_{t=1}^{T}c\cdot|\mathcal{I}^{(t)}|]\in O(cn). ∎

B.2 Projection onto a Weighted Simplex and a Weighted 1\ell_{1} Ball

The weighted simplex and the weighted 1\ell_{1} ball are

Δw,b:={vn|i=1nwivi=b,v0},\Delta_{w,b}:=\{v\in\mathbb{R}^{n}\ |\ \sum_{i=1}^{n}w_{i}v_{i}=b,v\geq 0\},
w,b:={vn|i=1nwi|vi|b},\mathcal{B}_{w,b}:=\{v\in\mathbb{R}^{n}\ |\ \sum_{i=1}^{n}w_{i}|v_{i}|\leq b\},

where $w>0$ is a weight vector, and $b>0$ is a scaling factor. Perez et al. (2020a) show there is a unique $\tau\in\mathbb{R}$ such that $v_{i}^{*}=\max\{d_{i}-w_{i}\tau,0\}$ for all $i=1,\cdots,n$, where $v^{*}=\mbox{proj}_{\Delta_{w,b}}(d)\in\mathbb{R}^{n}$. Thus pivot-based methods for the unweighted simplex extend to the weighted simplex in a straightforward manner. We present weighted Michelot's method as Algorithm 13 (Appendix D), and weighted Filter as Algorithm 14 (Appendix D).

Our parallelization depends on the choice of serial method for projection onto the simplex, since projection onto the weighted simplex requires direct modification rather than oracle calls to methods for the unweighted case. Sort and Scan for weighted simplex projection can be implemented with a parallel merge sort, as in Algorithm 15 (Appendix D). Our distributed structure can be applied to the Michelot and Condat methods in a similar manner as in the unweighted case; these are presented as Algorithms 16 and 17 (Appendix D), respectively. We note that projection onto $\mathcal{B}_{w,b}$ is linear-time reducible to projection onto $\Delta_{w,b}$ (Perez et al. 2020a, Equation (4)).
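
As an illustration of the weighted pivot characterization above, here is a serial sketch in the spirit of the weighted Sort and Scan (Algorithm 15); the function name and the strict-inequality test are our own choices, not the paper's exact pseudocode.

```python
import numpy as np

def weighted_simplex_proj(d, w, b):
    """Sort by d_i / w_i and take the longest prefix whose weighted pivot
    (sum w_i d_i - b) / (sum w_i^2) stays below the current ratio d_i / w_i."""
    order = np.argsort(d / w)[::-1]
    ds, ws = d[order], w[order]
    pivots = (np.cumsum(ws * ds) - b) / np.cumsum(ws * ws)
    K = np.max(np.nonzero(pivots < ds / ws)[0]) + 1
    tau = pivots[K - 1]
    return np.maximum(d - w * tau, 0.0)

rng = np.random.default_rng(1)
d = rng.normal(0.0, 1.0, 10**5)
w = rng.uniform(0.1, 1.0, 10**5)
v = weighted_simplex_proj(d, w, b=1.0)
assert abs(np.dot(w, v) - 1.0) < 1e-6 and (v >= 0).all()
```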

Appendix C Distribution Examples

Here we apply Theorem 2 to three examples:
(A) Let diU[0,1]d_{i}\sim U[0,1], n1=105n_{1}=10^{5}, n2=106n_{2}=10^{6}, t=0.95t=0.95.
(B) Let diN[0,1]d_{i}\sim N[0,1], n1=105n_{1}=10^{5}, n2=106n_{2}=10^{6}, t=1.65t=1.65.
(C) Let diN[0,103]d_{i}\sim N[0,10^{-3}], n1=105n_{1}=10^{5}, n2=106n_{2}=10^{6}, t=1.65×103=0.05218t=1.65\times\sqrt{10^{-3}}=0.05218.
Observe that

$\displaystyle P(\tau>t)\geq P(\tau>t,\ |\mathcal{I}_{t}|\geq\lfloor-\sqrt{2(n\log n)p(1-p)}+np\rfloor)$ (33a)
$\displaystyle=P(\tau>t\ |\ |\mathcal{I}_{t}|\geq\lfloor-\sqrt{2(n\log n)p(1-p)}+np\rfloor)$ (33b)
$\displaystyle\ \times P(|\mathcal{I}_{t}|\geq\lfloor-\sqrt{2(n\log n)p(1-p)}+np\rfloor)$ (33c)

(33b) can be calculated by (28), and (33c) can be calculated by (23).

For (A), $p=1-F_{d}(t)=0.05$. From $n_{1}p=5000$ and $n_{2}p=50000$, we have $\sqrt{n_{1}p(1-p)}=68.9$ and $\sqrt{n_{2}p(1-p)}=217.9$. So,

P(|t1|[4793,5207])=P(|t2|[49347,50654])=0.9973.P(|\mathcal{I}_{t}^{1}|\in[4793,5207])=P(|\mathcal{I}_{t}^{2}|\in[49347,50654])=0.9973.

Now, the conditional density of $d_{i}$ given $d_{i}>t$ is $f_{d}^{*}(x)=\frac{1}{0.05}=20$ for $x\in[0.95,1]$, so the conditional mean and variance are $E=0.975$ and $V=\frac{1}{4800}$. Applying Theorem 2 yields:

P(τ1>t)0.99723,|t1|[4793,5207],P(\tau_{1}>t)\geq 0.99723,\ \forall\ |\mathcal{I}_{t}^{1}|\in[4793,5207],
P(τ2>t)0.99729,|t2|[49347,50654],P(\tau_{2}>t)\geq 0.99729,\ \forall\ |\mathcal{I}_{t}^{2}|\in[49347,50654],

which implies the number of active elements in the projection should be less than 5%5\% of n1n_{1} or n2n_{2} with high probability.
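
The arithmetic behind these figures for example (A) can be reproduced with the short script below (a sketch; the interval uses the two-sided three-sigma rule, which gives the stated probability 0.9973, and the function and variable names are ours).

```python
import math

def bound_example_A(n, t=0.95, b=1.0, l=0.0, u=1.0):
    """Combine the normal-approximation interval for |I_t| with the Chebyshev bound (28)."""
    p = (u - t) / (u - l)                                      # P(d_i > t) = 0.05
    mu = (t + u) / 2.0                                         # conditional mean E
    var = (u - t) ** 2 / 12.0                                  # conditional variance V = 1/4800
    p_interval = math.erf(3.0 / math.sqrt(2.0))                # P(|Z| <= 3) ~ 0.9973
    m = math.floor(n * p - 3.0 * math.sqrt(n * p * (1 - p)))   # smallest |I_t| in the interval
    cheby = 1.0 - var * m / (mu * m - t * m - b) ** 2          # bound (28)
    return p_interval * cheby

print(bound_example_A(10**5), bound_example_A(10**6))          # ~0.99723, ~0.99729
```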

For (B), p=1Fd(t)=0.05p=1-F_{d}(t)=0.05; similar to the first example,

P(|t1|[4793,5207])=P(|t2|[49347,50654])=0.9973.P(|\mathcal{I}_{t}^{1}|\in[4793,5207])=P(|\mathcal{I}_{t}^{2}|\in[49347,50654])=0.9973.

Together with

fd=10.0512πex22=202πex22,forx[1.65,+),f_{d}^{*}=\frac{1}{0.05}\frac{1}{\sqrt{2\pi}}e^{-\frac{x^{2}}{2}}=\frac{20}{\sqrt{2\pi}}e^{-\frac{x^{2}}{2}},\ \mbox{for}\ x\in[1.65,+\infty),

we can calculate EE and VV:

E=202π1.65xex22𝑑x=202πe1.6522=2.045,E=\frac{20}{\sqrt{2\pi}}\int_{1.65}^{\infty}xe^{-\frac{x^{2}}{2}}\,dx=\frac{20}{\sqrt{2\pi}}e^{-\frac{1.65^{2}}{2}}=2.045,
E(x2)=202π1.65x2ex22𝑑x=4.375,E(x^{2})=\frac{20}{\sqrt{2\pi}}\int_{1.65}^{\infty}x^{2}e^{-\frac{x^{2}}{2}}\,dx=4.375,
V=E(x2)E2=0.192.\implies V=E(x^{2})-E^{2}=0.192.

Applying Theorem 2,

P(τ1>t)0.99704,|t1|[4793,5207],P(\tau_{1}>t)\geq 0.99704,\ \forall\ |\mathcal{I}_{t}^{1}|\in[4793,5207],
$P(\tau_{2}>t)\geq 0.99728,\ \forall\ |\mathcal{I}_{t}^{2}|\in[49347,50654],$

which imply that at most $5\%$ of the terms are active after projection, with probability $>99\%$.

For (C), p=1Fd(t)=0.05p=1-F_{d}(t)=0.05. Similar to the previous examples,

P(|t1|[4793,5207])=P(|t2|[49347,50654])=0.9973.P(|\mathcal{I}_{t}^{1}|\in[4793,5207])=P(|\mathcal{I}_{t}^{2}|\in[49347,50654])=0.9973.

Together with

fd=202π103ex22×103,x[0.05218,+),f_{d}^{*}=\frac{20}{\sqrt{2\pi 10^{-3}}}e^{-\frac{x^{2}}{2\times 10^{-3}}},\ \forall\ x\in[0.05218,+\infty),

we can calculate EE and VV as follows,

E=202π1031.65×103xex22×103𝑑x=0.03234,E=\frac{20}{\sqrt{2\pi 10^{-3}}}\int_{1.65\times\sqrt{10^{-3}}}^{\infty}xe^{-\frac{x^{2}}{2\times 10^{-3}}}\,dx=0.03234,
E(x2)=202π1031.65×103x2ex22×103𝑑x=0.002187,E(x^{2})=\frac{20}{\sqrt{2\pi 10^{-3}}}\int_{1.65\times\sqrt{10^{-3}}}^{\infty}x^{2}e^{-\frac{x^{2}}{2\times 10^{-3}}}\,dx=0.002187,
V=E(x2)E2=0.001142.\implies V=E(x^{2})-E^{2}=0.001142.

Applying Theorem 2,

P(τ1>t)0.99671,|t1|[4793,5207],P(\tau_{1}>t)\geq 0.99671,\ \forall\ |\mathcal{I}_{t}^{1}|\in[4793,5207],
P(τ2>t)0.99724,|t2|[49347,50654],P(\tau_{2}>t)\geq 0.99724,\ \forall\ |\mathcal{I}_{t}^{2}|\in[49347,50654],

and so at most $5\%$ of the terms are active after projection, with probability $>99\%$.

Appendix D Algorithm Pseudocode

Input: vector d=(d1,,dn)d=(d_{1},\cdots,d_{n})
Output: projection vv^{*}.
1 for i=1:ni=1:n do
2       fi={1,ifdi00,otherwisef_{i}=\begin{cases}1,&\mbox{if}\ d_{i}\geq 0\\ 0,&\mbox{otherwise}\end{cases};
3      
4 end for
5if 1Tf1^{T}f is even then
6       Set i=argmini1:n|di|i^{*}=\mbox{argmin}_{i\in 1:n}|d_{i}|;
7       Update fi=1fif_{i^{*}}=1-f_{i^{*}};
8      
9 end if
10for i=1:ni=1:n do
11       Set vi=di(1)fiv_{i}=d_{i}(-1)^{f_{i}};
12      
13 end for
14if 1Tproj[12,12]n(v)1n21^{T}\mathrm{proj}_{[-\frac{1}{2},\frac{1}{2}]^{n}}(v)\geq 1-\frac{n}{2} then
15       return v=proj[12,12]n(d)v^{*}=\mbox{proj}_{[-\frac{1}{2},\frac{1}{2}]^{n}}(d)
16else
17       Set v=projΔ12(d)v^{*}=\mbox{proj}_{\Delta-\frac{1}{2}}(d);
18       for i=1:ni=1:n do
19             Update vi=vi(1)fiv^{*}_{i}=v_{i}^{*}(-1)^{f_{i}};
20            
21       end for
22      return vv^{*}
23 end if
Algorithm 12 Centered Parity Polytope Projection
Input: vector d=(d1,,dn)d=(d_{1},\cdots,d_{n}), scaling factor bb, weight w={w1,,wn}w=\{w_{1},...,w_{n}\}.
Output: projection vv^{*}.
1 Set p:={1,,n}\mathcal{I}_{p}:=\{1,...,n\}, :=\mathcal{I}:=\emptyset;
2 do
3       Set $\mathcal{I}:=\mathcal{I}_{p}$;
4       Set p:=ipwidibipwi2p:=\frac{\sum_{i\in\mathcal{I}_{p}}w_{i}d_{i}-b}{\sum_{i\in\mathcal{I}_{p}}w_{i}^{2}};
5       Set $\mathcal{I}_{p}:=\{i\in\mathcal{I}_{p}\ |\ \frac{d_{i}}{w_{i}}>p\}$;
6      
7while ||>|p||\mathcal{I}|>|\mathcal{I}_{p}|;
8Set vi:=max{diwip,0}v_{i}^{*}:=\mbox{max}\{d_{i}-w_{i}p,0\} for all 1in1\leq i\leq n;
return v:=(v1,,vn)v^{*}:=(v_{1}^{*},\cdots,v_{n}^{*}).
Algorithm 13 Weighted Michelot’s method
Input: vector d=(d1,,dn)d=(d_{1},\cdots,d_{n}), scaling factor bb, weight ww.
Output: Index set p\mathcal{I}_{p}.
1 Set $\mathcal{I}_{p}:=\{1\}$, $\mathcal{I}_{w}:=\emptyset$, $p:=\frac{w_{1}d_{1}-b}{w_{1}^{2}}$;
2 for i=2:ni=2:n do
3       if diwi>p\frac{d_{i}}{w_{i}}>p then
4             Set $p:=\frac{w_{i}d_{i}+\sum_{j\in\mathcal{I}_{p}}w_{j}d_{j}-b}{w_{i}^{2}+\sum_{j\in\mathcal{I}_{p}}w_{j}^{2}}$;
5             if p>widibwi2p>\frac{w_{i}d_{i}-b}{w_{i}^{2}} then
6                   Set $\mathcal{I}_{p}:=\mathcal{I}_{p}\cup\{i\}$;
7                  
8            else
9                   Set w:=wp\mathcal{I}_{w}:=\mathcal{I}_{w}\cup\mathcal{I}_{p};
10                   Set p={i}\mathcal{I}_{p}=\{i\}, p:=widibwi2p:=\frac{w_{i}d_{i}-b}{w_{i}^{2}};
11                  
12             end if
13            
14       end if
15      
16 end for
17if |w|0|\mathcal{I}_{w}|\neq 0 then
18       for iw:diwi>pi\in\mathcal{I}_{w}:\frac{d_{i}}{w_{i}}>p do
19             Set p:=p{i}\mathcal{I}_{p}:=\mathcal{I}_{p}\cup\{i\};
20             Set $p:=\frac{w_{i}d_{i}+\sum_{j\in\mathcal{I}_{p}}w_{j}d_{j}-b}{w_{i}^{2}+\sum_{j\in\mathcal{I}_{p}}w_{j}^{2}}$;
21            
22       end for
23      
24 end if
return p\mathcal{I}_{p}.
Algorithm 14 Weighted Filter Perez et al. (2020a)
Input: vector d=(d1,,dn)d=(d_{1},\cdots,d_{n}), scaling factor bb, weight ww.
Output: projection vv^{*}.
1 Set z:={diwi}z:=\{\frac{d_{i}}{w_{i}}\};
2 Parallel sort zz as z(1)z(n)z_{(1)}\geq\cdots\geq z_{(n)}, and apply this order to dd and ww ;
3 Find κ:=maxk=1,,n{i=1kwidibi=1kwi2zk}\kappa:=\mathrm{max}_{k=1,\cdots,n}\{\frac{\sum_{i=1}^{k}w_{i}d_{i}-b}{\sum_{i=1}^{k}w_{i}^{2}}\leq z_{k}\};
4 Set τ=i=1κwidibi=1κwi2\tau=\frac{\sum_{i=1}^{\kappa}w_{i}d_{i}-b}{\sum_{i=1}^{\kappa}w_{i}^{2}};
5 Parallel set vi:=max{diwiτ,0}v_{i}^{*}:=\mbox{max}\{d_{i}-w_{i}\tau,0\} for all 1in1\leq i\leq n;
return v=(v1,,vn)v^{*}=(v_{1}^{*},\cdots,v_{n}^{*}).
Algorithm 15 Weighted Parallel Sort and Parallel Scan
Input: vector d=(d1,,dn)d=(d_{1},\cdots,d_{n}), scaling factor bb, weight ww, kk cores.
Output: projection vv^{*}.
1 Partition dd into subvectors d1,,dkd^{1},...,d^{k} of dimension nk\leq\frac{n}{k};
2 Set $v^{i}:=\texttt{Weighted\_Pivot\_Project}(d^{i},w,b)$ (distributed across cores $i=1,...,k$);
3 Set v^:=i=1k{vji|vji>0}\hat{v}:=\cup_{i=1}^{k}\{v_{j}^{i}\ |\ v_{j}^{i}>0\};
4 Set v:=Weighted_Pivot_Project(v^,w,b)v^{*}:=\texttt{Weighted\_Pivot\_Project}(\hat{v},w,b);
return vv^{*}.
Algorithm 16 Distributed Weighted Pivot and Project
Input: vector d=(d1,,dn)d=(d_{1},\cdots,d_{n}), scaling factor bb, weight ww, kk cores.
Output: projection vv^{*}.
1 Partition dd into subvectors d1,,dkd^{1},...,d^{k} of dimension nk\leq\frac{n}{k};
2 Set $v^{i}:=\texttt{Weighted\_Condat\_Project}(d^{i},w,b)$ (distributed across cores $i=1,...,k$);
3 Set p:=i=1k{j|vji>0}\mathcal{I}_{p}:=\cup_{i=1}^{k}\{j\ |\ v^{i}_{j}>0\};
4 Set p:=ipwidibipwi2p:=\frac{\sum_{i\in\mathcal{I}_{p}}w_{i}d_{i}-b}{\sum_{i\in\mathcal{I}_{p}}w_{i}^{2}}, :=\mathcal{I}:=\emptyset;
5 do
6       Set =p\mathcal{I}=\mathcal{I}_{p};
7       for ii\in\mathcal{I} do
8             if diwip\frac{d_{i}}{w_{i}}\leq p then
9                   Set $\mathcal{I}_{p}:=\mathcal{I}_{p}\backslash\{i\}$, $p:=\frac{\sum_{j\in\mathcal{I}_{p}}w_{j}d_{j}-b}{\sum_{j\in\mathcal{I}_{p}}w_{j}^{2}}$;
10                  
11             end if
12            
13       end for
14      
15while ||>|p||\mathcal{I}|>|\mathcal{I}_{p}|;
16Set vi=max{diwip,0}v_{i}^{*}=\mbox{max}\{d_{i}-w_{i}p,0\} for all 1in1\leq i\leq n;
return v=(v1,,vn)v^{*}=(v_{1}^{*},\cdots,v_{n}^{*}).
Algorithm 17 Distributed Weighted Condat


Appendix E Additional Experiments

All code and data can be found on GitHub (https://github.com/foreverdyz/Parallel_Projection) or in the IJOC repository Dai and Chen (2023).

E.1 Testing Theoretical Bounds

For Theorem 1, we calculate the average number of active elements in projecting a vector $d$, drawn i.i.d. from $U[0,1]$, onto the simplex $\Delta_{1}$, with $n$ between $10^{6}$ and $10^{7}$ and 10 trials per size. This empirical result is compared against the corresponding asymptotic bound of $\sqrt{2n}$ given by Theorem 1. Results are shown in Figure 10, and demonstrate that the asymptotic bound is rather accurate even at these finite values of $n$.

Similarly, for Proposition 6, we conduct the same experiments and compare the results against the function (2.2n)23(2.2n)^{\frac{2}{3}} in Figure 11, where the constant was found empirically.

For Lemma 1, we run Algorithm 3 (in the main paper) on i.i.d. $U[0,1]$ inputs $d_{i}$ with scaling factor $b=1$, size $n=10^{6}$, and 100 trials. We compare the average number of elements remaining after each iteration of Michelot's method against the geometric series with ratio $\frac{1}{2}$ starting from $10^{6}$. As shown in Figure 12, the average number of remaining terms after each loop of Michelot's method is close to the corresponding value of this geometric series. So the conclusion from Lemma 1, namely that Michelot's method discards approximately half of the vector in each loop when the input is large, is accurate.
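
The Theorem 1 experiment can be reproduced with a short sketch such as the following (our own reduced-scale reproduction; trial count and sizes can be adjusted).

```python
import numpy as np

def active_count(d, b=1.0):
    """Number of positive entries in the projection of d onto Delta_b (sort-and-scan)."""
    s = np.sort(d)[::-1]
    running = (np.cumsum(s) - b) / np.arange(1, d.size + 1)
    return int(np.max(np.nonzero(running < s)[0]) + 1)

rng = np.random.default_rng(0)
for n in [10**5, 10**6, 10**7]:
    counts = [active_count(rng.uniform(0.0, 1.0, n)) for _ in range(10)]
    print(n, np.mean(counts), np.sqrt(2 * n))   # observed active terms vs. ~sqrt(2n)
```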

Refer to caption
Figure 10: Observed number of active terms vs bound value of 2n\sqrt{2n}
Refer to caption
Figure 11: Observed number of remaining terms after applying Filter vs bound value of (2.2n)23(2.2n)^{\frac{2}{3}}
Refer to caption
Figure 12: Observed number of remaining terms after each Michelot iteration vs bound value of geometric series with a ratio of 12\frac{1}{2}

E.2 Robustness Test

We created examples with a wide range of scaling factors b=1,10,102,103b=1,10,10^{2},10^{3}, 104,105,10610^{4},10^{5},10^{6} and drawing diN(0,1)d_{i}\sim N(0,1) and N(0,100)N(0,100) with a size of n=108n=10^{8} for serial methods and parallel methods with 4040 cores. Results are given in Figure 13 (diN(0,1)d_{i}\sim N(0,1)) and Figure 14 (diN(0,100)d_{i}\sim N(0,100)).

Refer to caption
Refer to caption
Figure 13: Speedup vs b in Simplex Projection. Each line represents a different projection method, and the input distribution is N(0,1)N(0,1).
Refer to caption
Refer to caption
Figure 14: Speedup vs b in Simplex Projection. Each line represents a different projection method, and the input distribution is N(0,100)N(0,100).

E.3 Weighted simplex, weighted 1\ell_{1} ball

We conduct weighted simplex and weighted $\ell_{1}$ ball projection experiments with various methods on a problem of size $n=10^{8}-1$. Inputs $d_{i}$ are drawn i.i.d. from $N(0,1)$ and weights $w_{i}$ are drawn i.i.d. from $U[0,1]$. Algorithms were implemented as described in Section B.2. Results for the weighted simplex are shown in Figure 15 and results for weighted $\ell_{1}$ ball projection are shown in Figure 16. Slightly more modest speedups across all algorithms are observed for the weighted simplex compared to the unweighted experiments; nonetheless, the general pattern still holds, with parallel Condat attaining superior performance with up to 14x absolute speedup. In the weighted $\ell_{1}$ ball projection, we observe that sort and scan has relative speedups similar to our parallel Condat's method; however, the underlying serial Condat's method is considerably faster.

Refer to caption
Refer to caption
Figure 15: Speedup vs cores in weighted simplex projection. Each line represents a different projection method.
Refer to caption
Refer to caption
Figure 16: Speedup vs cores in weighted 1\ell_{1} ball projection. Each line represents a different projection method.

E.4 Runtime Fairness Test

We provide a runtime fairness test for serial methods (the Sort and Scan method, Michelot's method, and Condat's method) and their respective parallel implementations. We restrict both serial and parallel methods to use only one core to solve a problem of size $n=10^{8}$ with scaling factor $b=1$ and inputs $d_{i}$ drawn i.i.d. from $U[0,1]$. Results are provided in Table 3.

Method Runtime
Sort + Scan 1.037e+01\mathrm{1.037e+01}
(P)Sort + Scan 1.571e+01\mathrm{1.571e+01}
(P)Sort + Partial Scan 1.505e+01\mathrm{1.505e+01}
Michelot 4.030\mathrm{4.030}
(P) Michelot 4.229\mathrm{4.229}
Condat 2.429e01\mathrm{2.429e-01}
(P) Condat 2.497e01\mathrm{2.497e-01}
Table 3: Runtime (s) for projection onto a simplex in fairness test

E.5 Discussion on dense projections

To project a vector $d\in\mathbb{R}^{n}$ onto the $\ell_{1}$ ball $\mathcal{B}_{b}$, we first check whether $\sum_{i=1}^{n}|d_{i}|\leq b$. If this condition holds, then $d$ is already within $\mathcal{B}_{b}$. However, if $\sum_{i=1}^{n}|d_{i}|>b$, we project $|d|:=(|d_{1}|,...,|d_{n}|)$ onto the simplex with scaling factor $b$. As noted earlier, we have $\tau\geq\frac{\sum_{i=1}^{n}|d_{i}|-b}{n}>0$, which means that all zero terms in $d$ are inactive in the projection of $|d|$ onto the simplex. Therefore, we only need to project the subvector $(|d_{i}|)_{i:d_{i}\neq 0}$ onto the simplex. This is why, when $d$ is sparse, it is typically better to use serial projection methods instead of their parallel counterparts.
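
This reduction can be sketched as follows (a serial sketch with sort-and-scan as the simplex solver; only the nonzero coordinates of a sparse input are passed on, and the function names are ours).

```python
import numpy as np

def simplex_proj(d, b):
    """Sort-and-scan projection onto the scaled simplex Delta_b."""
    s = np.sort(d)[::-1]
    running = (np.cumsum(s) - b) / np.arange(1, d.size + 1)
    K = np.max(np.nonzero(running < s)[0]) + 1
    return np.maximum(d - running[K - 1], 0.0)

def l1_ball_proj(d, b):
    """Project d onto the l1 ball of radius b. If d is outside the ball, tau > 0,
    so zero entries of d stay zero and only nonzeros are sent to the simplex solver."""
    if np.abs(d).sum() <= b:
        return d.copy()
    nz = np.flatnonzero(d)
    v = np.zeros_like(d)
    v[nz] = np.sign(d[nz]) * simplex_proj(np.abs(d[nz]), b)
    return v

d = np.zeros(10**6)
d[:10**4] = np.random.normal(0.0, 1.0, 10**4)   # sparse input
v = l1_ball_proj(d, b=1.0)
assert abs(np.abs(v).sum() - 1.0) < 1e-6
```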