Sparsity-Exploiting Distributed Projections onto a Simplex
Abstract
Projecting a vector onto a simplex is a well-studied problem that arises in a wide range of optimization problems. Numerous algorithms have been proposed for determining the projection; however, the primary focus of the literature has been on serial algorithms. We present a parallel method that decomposes the input vector and distributes it across multiple processors for local projection. Our method is especially effective when the resulting projection is highly sparse, which is the case, for instance, in large-scale problems with i.i.d. entries. Moreover, the method can be adapted to parallelize a broad range of serial algorithms from the literature. We fill in theoretical gaps in serial algorithm analysis and develop similar results for our parallel analogues. Numerical experiments conducted on a wide range of large-scale instances, both real-world and simulated, demonstrate the practical effectiveness of the method.
1 Introduction
Given a vector $v \in \mathbb{R}^n$, we consider the following projection of $v$:
$$\mathrm{proj}_{\Delta_b}(v) := \operatorname*{arg\,min}_{w \in \Delta_b} \|w - v\|_2^2, \qquad (1)$$
where $\Delta_b$ is a scaled standard simplex parameterized by some scaling factor $b > 0$,
$$\Delta_b := \Big\{ x \in \mathbb{R}^n : \textstyle\sum_{i=1}^n x_i = b,\ x \ge 0 \Big\}.$$
1.1 Applications
Projection onto a simplex can be leveraged as a subroutine to determine projections onto more complex polyhedra. Such projections arise in numerous settings such as: image processing, e.g. labeling (Lellmann et al. 2009) or multispectral unmixing (Bioucas-Dias et al. 2012); portfolio optimization (Brodie et al. 2009); and machine learning (Blondel et al. 2014). As a particular example, projection onto a simplex can be used to project onto the parity polytope (Wasson et al. 2019):
$$\mathrm{proj}_{\mathbb{PP}_d}(v) := \operatorname*{arg\,min}_{w \in \mathbb{PP}_d} \|w - v\|_2^2, \qquad (2)$$
where $\mathbb{PP}_d$ is the $d$-dimensional parity polytope, i.e., the convex hull of the binary vectors with an even number of ones:
$$\mathbb{PP}_d := \mathrm{conv}\Big\{ x \in \{0,1\}^d : \textstyle\sum_{i=1}^d x_i \text{ is even} \Big\}.$$
Projection onto the parity polytope arises in linear programming (LP) decoding (Liu and Draper 2016, Barman et al. 2013, Zhang and Siegel 2013, Zhang et al. 2013, Wei et al. 2015), which is used for signal processing.
Another example is projection onto an $\ell_1$ ball:
$$B_b := \{ x \in \mathbb{R}^n : \|x\|_1 \le b \}. \qquad (3)$$
Duchi et al. (2008) demonstrate that the solution to this problem can be easily recovered from projection onto a simplex. Furthermore, projection onto an $\ell_1$ ball can, in turn, be used as a subroutine in gradient-projection methods (see e.g. van den Berg (2020)) for a variety of machine learning problems that use an $\ell_1$ penalty, such as: Lasso (Tibshirani 1996); basis-pursuit denoising (Chen et al. 1998, van den Berg and Friedlander 2009, van den Berg 2020); sparse representation in dictionaries (Donoho and Elad 2003); variable selection (Tibshirani 1997); and classification (Barlaud et al. 2017).
Finally, we note that methods for projection onto the scaled standard simplex and the $\ell_1$ ball can be extended to projection onto the weighted simplex and weighted $\ell_1$ ball (Perez et al. 2020a) (see Section B.2). Projection onto the weighted simplex can, in turn, be used to solve the continuous quadratic knapsack problem (Robinson et al. 1992). Moreover, reweighted sparsity-inducing regularization schemes can be handled by iteratively solving weighted projections (Candès et al. 2008, Chartrand and Yin 2008, Chen and Zhou 2014).
1.2 Contributions
This paper presents a method to decompose the projection problem and distribute work across (up to $n$) processors. The key insight to our approach is captured by Proposition 3.4: the projection of any subvector of $v$ onto a simplex (in the corresponding lower-dimensional space, with the same scaling factor $b$) will have zero-valued entries only if the full-dimension projection has corresponding zero-valued entries. The method can be interpreted as a sparsity-exploiting distributed preprocessing method, and thus it can be adapted to parallelize a broad range of serial projection algorithms. We furthermore provide extensive theoretical and empirical analyses of several such adaptations. We also fill in gaps in the literature on serial algorithm complexity. Our computational results demonstrate significant speedups from our distributed method compared to the state of the art over a wide range of large-scale problems involving both real-world and simulated data.
Our paper contributes to the limited literature on parallel computation for large-scale projection onto a simplex. Most algorithms for projection onto a simplex are designed for serial computing. Indeed, to our knowledge, there is only one published parallel method and one distributed method for the projection problem (1). Wasson et al. (2019) parallelize a basic sort and scan (specifically prefix sum) approach, a method that we use as a benchmark in our experiments; we also develop a modest but practically significant enhancement to their approach. Iutzeler and Condat (2018) propose a gossip-based distributed ADMM algorithm for projection onto a simplex. In this gossip-based setup, one entry of $v$ is given to each agent (e.g. processor), and communication is restricted according to a particular network topology. This differs fundamentally from our approach both in context and intended use, as we aim to solve large-scale problems and, moreover, our method can accommodate any number of processors up to $n$.
The remainder of the paper is organized as follows. Section 2 describes serial algorithms from the literature and develops new complexity results to fill in gaps in the literature. Section 3 develops parallel analogues of the aforementioned algorithms using our novel distributed method. Section 4 extends these parallelized algorithms to various applications of projection onto a simplex. Section 5 describes computational experiments. Section 6 concludes. Note that all appendices, mathematical proofs, as well as our code and data can be found in the online supplement.
2 Background and Serial Algorithms
This section begins with a presentation of some fundamental results regarding projection onto a simplex, followed by analysis of serial algorithms for the problem, filling in various gaps in the literature. The final subsection, Section 2.5, provides a summary. Note that, for the purposes of average-case analysis, we assume uniformly distributed inputs, i.e., the entries of $v$ are i.i.d. uniform, a typical choice of distribution in the literature (e.g. Condat (2016)).
2.1 Properties of Simplex Projection
KKT conditions characterize the unique optimal solution to problem (1):
Proposition 2.1 (Held et al. (1974)).
For a vector $v \in \mathbb{R}^n$ and a scaled standard simplex $\Delta_b$, there exists a unique $\tau \in \mathbb{R}$ such that
$$w_i = \max(v_i - \tau,\ 0) \quad \text{for all } i = 1, \dots, n, \qquad (4)$$
where $w = \mathrm{proj}_{\Delta_b}(v)$.
Hence, (1) can be reduced to a univariate search problem for the optimal pivot $\tau$. Note that the nonzero (positive) entries of $w$ correspond to entries where $v_i > \tau$. So, for a given value $t$ and the index set $[n] := \{1, \dots, n\}$, we denote the active index set
$$\mathcal{I}(t) := \{ i \in [n] : v_i > t \}$$
as the set containing all indices of active elements, i.e., those with $v_i > t$. Now consider the following function, which will be used to provide an alternate characterization of $\tau$:
$$\varphi(t) := \sum_{i \in \mathcal{I}(t)} (v_i - t) - b. \qquad (5)$$
Corollary 2.2.
For any $t$ such that $\mathcal{I}(t) \neq \emptyset$, we have that $\varphi(t) > 0$ if and only if $t < \tau$, and $\varphi(t) < 0$ if and only if $t > \tau$.
The sign of $\varphi$ only changes once, and $\tau$ is its unique root. These results can be leveraged to develop search algorithms for $\tau$, which are presented next. This corollary and the use of $\varphi$ are our own contribution, as we have found it a convenient organizing principle for the sake of exposition. We note, however, that the root-finding framework has been in use in the more general constrained quadratic programming literature (see (Cominetti et al. 2014, Equation 5) and (Dai et al. 2006, Section 2, Paragraph 2)).
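As a small illustration of $\varphi$ and $\tau$ (our own example, not taken from the cited references), take $v = (0.3,\ 0.7,\ 0.2)$ and $b = 1$:
$$\varphi(0) = (0.3 + 0.7 + 0.2) - 1 = 0.2 > 0 \;\Rightarrow\; \tau > 0, \qquad \varphi(0.1) = (0.2 + 0.6 + 0.1) - 1 = -0.1 < 0 \;\Rightarrow\; \tau < 0.1,$$
and indeed $\tau = \tfrac{1}{15} \approx 0.0667$ gives $w = (0.3 - \tau,\ 0.7 - \tau,\ 0.2 - \tau) \approx (0.233,\ 0.633,\ 0.133)$ with $\sum_i w_i = 1$.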
2.2 Sort and Scan
Observe that only the greatest terms of $v$ are indexed in $\mathcal{I}(\tau)$. Now suppose we sort $v$ into $u$ in non-increasing order:
$$u_1 \ge u_2 \ge \dots \ge u_n.$$
For each $k$, let $\tau_k := \big(\sum_{i=1}^{k} u_i - b\big)/k$ denote the candidate pivot obtained by assuming exactly the $k$ largest entries are active. We can sequentially test these candidate values, $k = 1, 2, \dots$, to determine $\tau$. In particular, from Corollary 2.2 we know there exists some $K$ such that $u_k - \tau_k > 0$ for all $k \le K$ and $u_k - \tau_k \le 0$ for all $k > K$. Thus the projection must have $K$ active elements, and since $\varphi(\tau_K) = 0$, we have $\tau = \tau_K$. We also note that, rather than recalculating the sum $\sum_{i=1}^{k} u_i$ at each iteration, one can keep a running cumulative/prefix sum or scan of $u$ as $k$ increments.
The bottleneck is sorting, as all other operations are linear time; for instance, QuickSort executes the sort with average complexity $O(n \log n)$ and worst case $O(n^2)$, while MergeSort has worst case $O(n \log n)$ (see, e.g., (Bentley and McIlroy 1993)). Moreover, non-comparison sorting methods can achieve $O(n)$ (see, e.g., (Mahmoud 2000)), albeit typically with a high constant factor as well as dependence on the bit-size of the input entries.
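To make the procedure concrete, the following is a minimal serial sort-and-scan sketch in Julia (the language used for our implementation); the function name and structure are our own illustration and do not reproduce the paper's pseudocode.

```julia
# Minimal sort-and-scan projection of v onto the simplex {x ≥ 0 : sum(x) = b}.
# O(n log n) overall, dominated by the sort; the scan is a running prefix sum.
function project_simplex_sortscan(v::Vector{Float64}, b::Float64)
    u = sort(v, rev=true)            # sort in non-increasing order
    prefix = 0.0
    τ = 0.0                          # overwritten in the first pass (valid when b > 0)
    for k in eachindex(u)
        prefix += u[k]               # prefix sum of the k largest entries
        t = (prefix - b) / k         # candidate pivot τ_k
        if u[k] - t > 0
            τ = t                    # u[k] is still active under τ_k; keep going
        else
            break                    # the sign changes only once (Corollary 2.2)
        end
    end
    return max.(v .- τ, 0.0)         # w_i = max(v_i - τ, 0)
end

# Example: project_simplex_sortscan([0.3, 0.7, 0.2], 1.0) ≈ [0.2333, 0.6333, 0.1333]
```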
2.3 Pivot and Partition
Sort and Scan begins by sorting all elements of $v$, but only the greatest $K$ terms are actually needed to calculate $\tau$. Pivot and Partition, proposed by Duchi et al. (2008), can be interpreted as a hybrid sort-and-scan that attempts to avoid sorting all elements. We present as Algorithm 2 a variant of this approach given by Condat (2016).
The algorithm selects a pivot $p$, which is intended as a candidate value for $\tau$; the corresponding value $\varphi(p)$ is calculated. From Corollary 2.2, if $\varphi(p) > 0$, then $p < \tau$ and so $\mathcal{I}(\tau) \subseteq \mathcal{I}(p)$; consequently, a new pivot is chosen in the (tighter) interval above $p$, which is known to contain $\tau$. Otherwise, if $\varphi(p) < 0$, then $p > \tau$, and so we can find a new pivot among the entries in the complement set $\mathcal{I}(p)^{\mathsf{c}}$. Repeatedly selecting new pivots and creating partitions in this manner results in a binary search to determine the correct active set $\mathcal{I}(\tau)$, and consequently $\tau$.
Several strategies have been proposed for selecting a pivot within a given interval. Duchi et al. (2008) choose a random value in the interval, while Kiwiel (2008) uses the median value. The classical approach of Michelot (1986) can be interpreted as a special case that sets the initial pivot as $p = (\sum_{i=1}^{n} v_i - b)/n$, and subsequently $p = (\sum_{i \in \mathcal{I}(p)} v_i - b)/|\mathcal{I}(p)|$. This ensures that $\varphi(p) \ge 0$, which avoids extraneous re-evaluation of sums in the if condition. Note that this rule generates an increasing sequence of pivots converging to $\tau$ (Condat 2016, Page 579, Paragraph 2). Michelot’s algorithm is presented separately as Algorithm 3.
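The following Julia sketch illustrates the Michelot-style iteration just described; it is our own simplified rendering for intuition and does not reproduce the paper's Algorithm 3.

```julia
# Michelot-style projection onto the simplex {x ≥ 0 : sum(x) = b}.
# The pivot is re-evaluated over the surviving candidate set and increases
# monotonically toward τ; entries at or below the pivot are discarded each pass.
function project_simplex_michelot(v::Vector{Float64}, b::Float64)
    active = copy(v)
    τ = (sum(active) - b) / length(active)          # initial pivot over all entries
    while true
        kept = active[active .> τ]                  # drop entries that cannot be active
        length(kept) == length(active) && break     # nothing removed: τ is the optimal pivot
        active = kept
        τ = (sum(active) - b) / length(active)      # pivot over the reduced candidate set
    end
    return max.(v .- τ, 0.0)
end
```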
Condat (2016) provided worst-case runtimes for each of the aforementioned pivot rules, as well as average-case complexity (over the uniform distribution) for the random pivot rule (see Table 1). We fill in the gaps here and establish runtimes for the median rule as well as Michelot’s method. We note that the median pivot method is a linear-time algorithm, but it relies on a median-of-medians subroutine (Blum et al. 1973), which has a high constant factor. For Michelot’s method, we assume uniformly distributed inputs, i.e., the entries of $v$ are i.i.d. uniform, and we have
Proposition 2.3.
Michelot’s method has an average runtime of $O(n)$.
The same argument holds for the median pivot rule, as half the elements are guaranteed to be removed each iteration, and the operations per iteration are within a constant factor of Michelot’s; we omit a formal proof of its average runtime for brevity.
2.4 Condat’s Method
Condat’s method (Condat 2016), presented as Algorithm 5, can be seen as a modification of Michelot’s method in two ways. First, Condat replaces the initial scan with a Filter to find an initial pivot, presented as Algorithm 4. Lemma 2 (see Appendix A) establishes that the Filter provides a greater (or equal) initial starting pivot compared to Michelot’s initialization; furthermore, since Michelot’s method approaches $\tau$ from below, this results in fewer iterations (see the proof of Proposition 2.4). Second, Condat’s method dynamically updates the pivot value whenever an inactive entry is removed from the candidate set, whereas Michelot’s method updates the pivot once per iteration by summing over all remaining entries.
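For intuition, here is a Julia sketch of a Filter pass in the spirit of (Condat 2016, Algorithm 1); it is our own rendering, and the paper's Algorithm 4 may differ in its details. The returned pivot serves as the improved starting value discussed above, and the returned candidate list is the reduced set on which the Michelot-style iterations then run.

```julia
# Condat-style filter: a single forward pass maintains a candidate list and a running
# pivot ρ = (sum(candidates) - b) / length(candidates); entries that cannot coexist with
# the current candidates are set aside and revisited once at the end.
function condat_filter(v::Vector{Float64}, b::Float64)
    candidates = [v[1]]
    waste = Float64[]
    ρ = v[1] - b
    for i in 2:length(v)
        y = v[i]
        if y > ρ                                      # y could be active under the current pivot
            ρ += (y - ρ) / (length(candidates) + 1)   # pivot value if y were appended
            if ρ > y - b
                push!(candidates, y)
            else                                      # earlier candidates cannot all stay active
                append!(waste, candidates)            # set them aside for the second look
                candidates = [y]
                ρ = y - b
            end
        end
    end
    for y in waste                                    # second look: recover set-aside entries
        if y > ρ
            push!(candidates, y)
            ρ += (y - ρ) / length(candidates)
        end
    end
    return candidates, ρ    # ρ is at least Michelot's initial pivot and at most τ (Lemma 2)
end
```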
Condat (2016) supplies a worst-case complexity of $O(n^2)$. We supplement this with average-case analysis under uniformly distributed inputs, i.e., entries drawn i.i.d. from a uniform distribution.
Proposition 2.4.
Condat’s method has an average runtime of $O(n)$.
2.5 Summary of Results
| Pivot Rule | Worst Case | Average Case |
|---|---|---|
| (Quick)Sort and Scan | $O(n^2)$ | $O(n \log n)$ |
| Michelot’s method | $O(n^2)$ | $O(n)$ |
| Pivot and Partition (Median) | $O(n)$ | $O(n)$ |
| Pivot and Partition (Random) | $O(n^2)$ | $O(n)$ |
| Condat’s method | $O(n^2)$ | $O(n)$ |
| Bucket method | | $O(n)$ |
Table 1 summarizes these results; in particular, the pivot-based methods all attain $O(n)$ performance on average given uniformly i.i.d. inputs. The methods are ordered by publication date, starting from the oldest result. As described in Section 2.2, Sort and Scan can be implemented with non-comparison sorting to achieve $O(n)$ worst-case performance. However, as with the linear-time median pivot rule, there are tradeoffs: increased memory, overhead, dependence on factors such as input bit-size, etc.
Both sorting and scanning are (separately) well-studied in parallel algorithm design, so the Sort and Scan idea lends itself to a natural decomposition for parallelism (discussed in Section 3.1). The other methods integrate sorting and scanning within each iteration, and it is no longer clear how best to exploit parallelism directly. In the next section we develop a distributed preprocessing scheme that works around this issue in the case of sparse projections. Note that the table includes the Bucket method; details on the algorithm are provided in Appendix B.1.
3 Parallel Algorithms
In Section 3.1 we consider the parallel method proposed by Wasson et al. (2019) and propose a modification. In Section 3.2 we develop a novel distributed scheme that can be used to preprocess and reduce the input vector $v$. The remainder of this section analyzes how our method can be used to enhance Pivot and Partition, as well as Condat’s method via parallelization of the Filter method. Results are summarized in Section 3.5. We note that the parallel time complexities presented are all unaffected by the underlying PRAM model (e.g. EREW vs CRCW; see (Xavier and Iyengar 1998, Chapter 1.4) for further exposition). This is well known for parallel mergesort and parallel scan; moreover, our distributed scheme (see Section 3.2) distributes work for the projection such that the memory reads/writes of each core are exclusive to that core’s partition of $v$.
3.1 Parallel Sort and Parallel Scan
Wasson et al. (2019) parallelize Sort and Scan in a natural way: first applying a parallel merge sort (see e.g. (Cormen et al. 2009, p. 797)) and then a parallel scan (Ladner and Fischer 1980) on the input vector. However, their scan computes prefix sums over the entire sorted vector, while only the prefix covering the $K$ active elements is needed to calculate $\tau$. We modify the algorithm accordingly, presented as Algorithm 6: checks are added (lines 7 and 14) in the for-loops to allow for possible early termination of the scans. As we are adding a constant number of operations per loop, Algorithm 6 has the same complexity as the original Parallel Sort and Scan. We combine this with parallel mergesort in Algorithm 7 and empirically benchmark this method against the original (parallel) version in Section 5.
3.2 Sparsity-Exploiting Distributed Projections
Our main idea is motivated by the following two theorems, which establish that projections with i.i.d. inputs and a fixed right-hand side $b$ become increasingly sparse as the problem size increases.
Theorem 3.1.
.
Theorem 3.1 establishes that, for i.i.d. uniformly distributed inputs, the expected number of active entries in the projection grows only sublinearly in $n$, and thus the projection exhibits considerable sparsity as $n$ grows; we also show this in the computational experiments of Appendix E.1.
Theorem 3.2.
Suppose the entries of $v$ are i.i.d. from an arbitrary distribution $\mathcal{D}$ with PDF $f$ and CDF $F$. Then, for any $\epsilon > 0$, $\Pr\big[\,|\mathcal{I}(\tau)| \le \epsilon n\,\big] \to 1$ as $n \to \infty$.
Theorem 3.2 establishes arbitrarily sparse projections over arbitrary i.i.d. distributions given fixed $b$. Note that if $b$ is sufficiently large with respect to $n$ (rather than fixed), then the resulting projection could be too dense to attain the theorem result. However, sparsity can be assured provided $b$ does not grow too quickly with respect to $n$, namely:
Corollary 3.3.
Theorem 3.2 holds true if .
We apply Theorem 3.2 to example distributions in Appendix C, and test the bounds empirically in Appendix E.1.
Proposition 3.4.
Let $\tilde v$ be a subvector of $v$ with $m \le n$ entries; moreover, without loss of generality suppose the subvector contains the first $m$ entries. Let $\tilde w$ be the projection of $\tilde v$ onto the simplex $\Delta_b \subset \mathbb{R}^m$, and let $\tilde\tau$ be the corresponding pivot value. Then, $\tilde\tau \le \tau$. Consequently, for $i \le m$ we have that $\tilde w_i = 0 \implies w_i = 0$.
Proposition 3.4 tells us that if we project a subvector of some length $m$ onto the same $b$-scaled simplex in the corresponding $m$-dimensional space, the zero entries in the projected subvector must also be zero entries in the projected full vector.
Our idea is to partition the vector $v$ and distribute it across $p$ cores (broadcast); have each core find the projection of its subvector (local projection); combine the nonzero entries from all local projections to form a reduced vector $\hat v$ (reduce); and apply a final (global) projection to $\hat v$. The method is outlined in Figure 1. Provided the projection is sufficiently sparse, which (for instance) we have established is the case for large-scale problems with i.i.d. entries, we can expect $\hat v$ to have far fewer than $n$ entries. We demonstrate the practical advantages of this procedure with various computational experiments in Section 5.
[Figure 1: Overview of the distributed projection method: broadcast subvectors to cores, project locally, reduce the surviving (nonzero) entries, and perform a final global projection.]
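A minimal shared-memory sketch of this broadcast / local-projection / reduce scheme in Julia is given below, using threads in the role of cores and reusing the serial Michelot sketch from Section 2.3; it illustrates the idea of Figure 1 rather than reproducing the paper's Algorithm 8.

```julia
using Base.Threads

# Distributed (multithreaded) projection onto the simplex {x ≥ 0 : sum(x) = b}.
# Each thread projects its chunk onto the b-scaled simplex of the chunk's own dimension.
# By Proposition 3.4, any entry that is zero in a local projection is also zero in the
# full projection, so only the local "survivors" enter the final global projection.
# Assumes p ≤ length(v) so that every chunk is nonempty.
function project_simplex_distributed(v::Vector{Float64}, b::Float64, p::Int)
    n = length(v)
    bounds = round.(Int, range(0, n; length = p + 1))       # chunk boundaries (broadcast)
    survivors = Vector{Vector{Int}}(undef, p)
    @threads for t in 1:p
        lo, hi = bounds[t] + 1, bounds[t + 1]
        w_local = project_simplex_michelot(v[lo:hi], b)     # local projection on this core
        survivors[t] = (lo:hi)[w_local .> 0]                # indices surviving the local projection
    end
    idx = reduce(vcat, survivors)                           # reduce: gather surviving indices
    w = zeros(n)
    w[idx] = project_simplex_michelot(v[idx], b)            # final (global) projection on the reduced vector
    return w
end
```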
3.3 Parallel Pivot and Partition
The distributed method outlined in Figure 1 can be applied directly to Pivot and Partition, as described in Algorithm 8. Note that, as presented, the output is a sparse vector: entries not processed in the final projection step are set to zero (recall Proposition 3.4).
We assume uniformly distributed inputs, i.e., the entries of $v$ are i.i.d. uniform, and we have
Proposition 3.5.
Parallel Pivot and Partition with either the median, random, or Michelot’s pivot rule, has an average runtime of .
In the worst case we may assume the distributed projections are ineffective (no entries are eliminated), and so the final projection is bounded above by $O(n^2)$ with random pivots and Michelot’s method, and by $O(n)$ with the median pivot rule.
3.4 Parallel Condat’s Method
We could apply the distributed sparsity idea as a preprocessing step for Condat’s method. However, due to Proposition 3.6 (and as confirmed via computational experiments), we have found that the Filter itself already discards many non-active elements. Therefore, we propose instead to apply our distributed method to parallelize the Filter itself. Our Distributed Filter is presented in Algorithm 9: we partition $v$ and broadcast it to the cores, and each core applies the (serial) Filter to its subvector. Condat’s method with the Distributed Filter is presented as Algorithm 10.
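One plausible realization of this idea in Julia (reusing condat_filter from the sketch in Section 2.4 and Base.Threads; names are our own) is shown below; the paper's Algorithm 9 may merge the local results differently.

```julia
# Distributed Filter sketch: run the serial filter on each chunk, keep every local
# survivor, and start the global pivot from the largest local pivot. Each local pivot
# is a lower bound on its local τ, which in turn is a lower bound on the global τ
# (Proposition 3.4), so the returned pivot and candidate list can seed the
# Michelot-style iterations of Condat's method.
function distributed_filter(v::Vector{Float64}, b::Float64, p::Int)
    n = length(v)
    bounds = round.(Int, range(0, n; length = p + 1))
    parts = Vector{Tuple{Vector{Float64}, Float64}}(undef, p)
    @threads for t in 1:p
        parts[t] = condat_filter(v[(bounds[t] + 1):bounds[t + 1]], b)
    end
    candidates = reduce(vcat, first.(parts))      # union of local survivor lists
    ρ = maximum(last.(parts))                     # best (largest) local lower bound on τ
    return candidates, ρ
end
```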
We assume uniformly distributed inputs, i.e., the entries of $v$ are i.i.d. uniform, and we have
Proposition 3.6.
Let be the output of . Then .
Under the same assumption (uniformly distributed inputs) as in Proposition 3.6, we have
Proposition 3.7.
Parallel Condat’s method has an average complexity .
In the worst case we can assume the Distributed Filter is ineffective (no entries are eliminated), and so the complexity of Parallel Condat’s method is $O(n^2)$, the same as the serial method.
3.5 Summary of Results
| Method | Worst case complexity | Average complexity |
|---|---|---|
| Quicksort + Scan | $O(n^2)$ | $O(n \log n)$ |
| (P)Mergesort + Scan | | |
| (P)Mergesort + Partial Scan | | |
| Michelot | $O(n^2)$ | $O(n)$ |
| (P)Michelot | $O(n^2)$ | |
| Condat | $O(n^2)$ | $O(n)$ |
| (P)Condat | $O(n^2)$ | |
Complexity results for the parallel algorithms developed throughout this section, as well as their serial counterparts, are presented in Table 2. Parallelized Sort and Scan benefits from the fact that sorting and scanning are very well-studied problems from the perspective of parallel computing: Parallel (Bitonic) Merge Sort (Greb and Zachmann 2006) and Parallel Scan (Wang et al. 2020) both parallelize efficiently in the average case and the worst case alike, and running parallel sort followed by parallel scan therefore parallelizes the entire procedure. Now, Michelot’s and Condat’s serial methods are observed to have favorable practical performance in our computational experiments; this is expected, as these more modern approaches were explicitly developed to gain practical advantages in, e.g., the constant runtime factor. Conversely, our distributed method does not improve upon the worst-case complexity of Michelot’s and Condat’s methods, but it does attain a parallel speedup factor in average complexity whenever the number of cores is small relative to the problem size, which is the case on large-scale instances, i.e. for all practical purposes. The average-case analyses were conducted under the admittedly limited (typical) assumption of uniform i.i.d. entries, but our computational experiments over other distributions and real-world data confirm favorable practical speedups from our parallel algorithms.
4 Parallelization for Extensions of Projection onto a Simplex
This section develops extensions involving projection onto a simplex, to be used for experiments in Section 5.
4.1 Projection onto the $\ell_1$ Ball
Consider projection onto an $\ell_1$ ball:
$$\mathrm{proj}_{B_b}(v) := \operatorname*{arg\,min}_{w \in B_b} \|w - v\|_2^2, \qquad (6)$$
where $B_b$ is given by Equation (3). Duchi et al. (2008) show that Problem (6) is linear-time reducible to projection onto a simplex (see (Duchi et al. 2008, Section 4)). Hence, any parallel method for projection onto a simplex can be applied to Problem (6).
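The reduction is simple enough to state in a few lines. The following Julia sketch (our own illustration, reusing the Michelot sketch from Section 2.3, though any simplex projection routine could be substituted) applies it.

```julia
# Projection onto the ℓ1 ball {x : ||x||_1 ≤ b} via simplex projection (Duchi et al. 2008):
# if v is already inside the ball, return it unchanged; otherwise project |v| onto the
# b-scaled simplex and restore the original signs.
function proj_l1ball(v::Vector{Float64}, b::Float64)
    sum(abs, v) <= b && return copy(v)
    w = project_simplex_michelot(abs.(v), b)
    return sign.(v) .* w
end
```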
As mentioned in Section 1, Problem (6) can itself be used as a subroutine in solving the Lasso problem, via (e.g.) Projected Gradient Descent (PGD) (see e.g. (Boyd and Vandenberghe 2004, Exercise 10.2)). To handle large-scale datasets, we instead use the mini-batch gradient descent method (Zhang et al. 2023, Sec. 12.5) within PGD in Section 5.
4.2 Centered Parity Polytope Projection
Leveraging the solution to Problem (1), Wasson et al. (2019, Algorithm 2) develop a method to project a vector onto the centered parity polytope (recall Problem (2)); we present a slightly modified version as Algorithm 2 in Appendix D. The modification reorders two steps so that we first determine whether a simplex projection is required, thereby avoiding unnecessary operations.
5 Numerical Experiments
All algorithms were implemented in Julia 1.5.3 and run on a single node of the Ohio Supercomputer Center (Center 1987). This node includes 4 sockets, each containing a 20-core Intel Xeon Gold 6148 CPU; thus there are 80 cores on this node. The node has 3 TB of memory and runs 64-bit Red Hat Enterprise Linux (kernel 3.10.0). The code and data are available on GitHub (https://github.com/foreverdyz/Parallel_Projection) and in the IJOC repository (Dai and Chen 2023).
5.1 Testing Algorithms
In this subsection, we compare runtime results for serial methods and their parallel versions with two measures: absolute speedup and relative speedup. Absolute speedup is the fastest serial method’s runtime divided by the parallel method’s runtime, e.g. serial Condat time divided by parallel Sort and Scan time; relative speedup is the runtime of the parallel method’s serial equivalent divided by the parallel method’s runtime, e.g. serial Sort and Scan time divided by parallel Sort and Scan time. We test parallel implementations using varying numbers of cores (up to 80). Note that we have verified that each parallel algorithm run on a single core is slower than its serial equivalent (see Appendix E.4).
5.1.1 Projection onto Simplex
Instances in Figures 2 and 3 are generated with a fixed problem size and scaling factor. Inputs are drawn i.i.d. from three benchmark distributions, a common setup used in previous works, e.g. (Duchi et al. 2008, Condat 2016). Serial Condat is the benchmark serial algorithm (i.e. the one with the fastest performance), with the dotted line representing a speedup of 1. Our parallel Condat algorithm achieves up to 25x absolute speedup over this state-of-the-art serial method. Parallel Sort and Scan ran slower than serial Condat’s method, due to the dramatic relative slowdown of its serial counterpart. In terms of relative speedup, our method offers superior performance compared to the Sort and Scan approach. Although not visible on the absolute speedup graph, the relative speedup results show that our partial Scan technique offers some modest improvements over the standard parallel Sort and Scan.
Instances in Figures 4 and 5 have varying input sizes, with entries drawn i.i.d. from the same distribution. This demonstrates that the speedup per core is a function of problem size. At the smallest size, parallel Condat tails off in absolute speedup around 40 cores (where communication costs become marginally higher than overall gains), while consistently increasing speedups are observed up to 80 cores on the largest instances. Similar patterns are observed for all algorithms in the relative speedups. For a fixed number of cores, larger instances yield larger partition sizes; hence the subvector projection problem given to each core tends to remove more of the original vector, producing the observed effect for our parallel methods. More severe tailoff effects are observed in the Sort and Scan algorithms, which use an entirely different parallelization scheme.
For additional experiments varying the scaling factor, please see Appendix E.2.
[Figures 2-3: Absolute and relative speedups for projection onto the simplex over the benchmark distributions. Figures 4-5: Absolute and relative speedups for varying problem sizes.]
5.1.2 $\ell_1$ Ball
We conduct $\ell_1$ ball projection experiments on large-scale instances, with inputs drawn i.i.d. from a benchmark distribution. Algorithms were implemented as described in Section 4.1. Results are shown in Figure 6. Similar to the standard projection onto the simplex, our parallel Condat implementation attains considerably superior results over the benchmark, with nearly 50x speedup.
[Figure 6: Absolute and relative speedups for $\ell_1$ ball projection.]
5.1.3 Weighted Simplex and Weighted $\ell_1$ Ball
We have conducted additional experiments using the weighted versions of simplex and $\ell_1$ ball projections. These have been placed in the online supplement, as the results are similar. A description of the algorithms is given in Appendix B.2; pseudocode in Appendix D; and experimental results in Appendix E.3.
5.1.4 Parity Polytope
We conduct parity polytope projection experiments on large-scale instances, with inputs drawn i.i.d. from a benchmark distribution. Algorithms were implemented as described in Section 4.2. Results are shown in Figure 7. Our parallel Condat has worse relative speedup on the projections compared to parallel Michelot, but overall this still results in higher absolute speedups, since the baseline serial Condat runs quickly. With up to around 20x absolute speedup in the simplex projection subroutines from parallel Condat’s method, we report an overall absolute speedup of up to around 2.75x for parity polytope projections. We note that the overall absolute speedup tails off more quickly; this is an expected effect from Amdahl’s law (Amdahl 1967): diminishing returns due to partial parallelization of the algorithm.
[Figure 7: Absolute and relative speedups for centered parity polytope projection.]
5.1.5 Lasso on Real-World Data
We selected a dataset from a paper implementing a Lasso method (Wang et al. 2022): kdd2010 (named kdda in the cited paper), and its updated version kdd2012; both can be found in the LIBSVM data sets (Chang and Lin 2011). kdd2012 has more than twice as many features as kdd2010.
We implemented PGD with mini-batch gradient descent (Zhang et al. 2023, Sec. 12.5), and we measure the runtime of the subroutine that projects the iterate onto the $\ell_1$ ball. The initial point is drawn i.i.d. from a sparse distribution, the design matrix contains one row per sample and one column per feature, and the label vector contains one label per sample; a fixed number of samples is used in each mini-batch iteration. We measure the runtime of the $\ell_1$ ball projections over the early iterations of PGD; in later iterations the projected vector becomes dense, at which point it is better to use serial projection methods (see Appendix E.5). Runtimes are measured with Julia's time_ns() function, and the absolute and relative speedups are reported in Figure 8 (for kdd2010) and Figure 9 (for kdd2012). Considerably more speedup is obtained for kdd2012, which may be due to problem size, since the projection input vectors are more than twice the size of those from kdd2010.
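As an illustration of the projected step whose projection subroutine we time, a mini-batch PGD update for a Lasso-type least-squares objective might look as follows in Julia; the names (X, y, batch, η, proj_l1ball) are our own and the actual experimental code may differ.

```julia
# One projected mini-batch gradient step for  min_w (1/2m)‖Xw − y‖²  s.t.  ‖w‖₁ ≤ b,
# where X holds one row per sample. The ℓ1-ball projection is the timed subroutine.
function minibatch_pgd_step!(w::Vector{Float64}, X, y, batch::Vector{Int}, η::Float64, b::Float64)
    Xb = X[batch, :]                                          # mini-batch rows of the design matrix
    g = Vector(Xb' * (Xb * w - y[batch])) ./ length(batch)    # mini-batch gradient of the loss
    w .= proj_l1ball(w .- η .* g, b)                          # projected gradient step onto the ℓ1 ball
    return w
end
```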
[Figures 8-9: Absolute and relative speedups for the $\ell_1$ ball projection subroutine within Lasso PGD on kdd2010 (Figure 8) and kdd2012 (Figure 9).]
5.1.6 Discussion
We observed consistent patterns of performance across a wide range of test instances, varying in size, distribution, and underlying projection. In terms of relative speedups, our parallelization scheme was, surprisingly, at least as effective as that of Sort and Scan. Parallel sort and parallel scan are well-studied problems in parallel computing, so we expected a priori that such speedups would be a strict upper bound on what our method could achieve. Thus our method is not simply benefiting from its compatibility with more advanced serial projection algorithms; the distributed method itself appears to be highly effective in practice.
6 Conclusion
We proposed a distributed preprocessing method for projection onto a simplex. Our method distributes subvector projection problems across processors in order to reduce the candidate index set on large-scale instances. One advantage of our method is that it is compatible with all major serial projection algorithms. Empirical results demonstrate, across a wide range of simulated distributions and real-world instances, that our parallelizations of well-known serial algorithms are comparable and at times superior in relative speedup to highly developed and well-studied parallelization schemes for sorting and scanning. Moreover, the sort-and-scan serial approach involves substantially more work than e.g. Condat’s method; hence our parallelization scheme provides considerable absolute speedups versus the state of the art in our experiments.
The effectiveness depends on the sparsity of the projection; in the case of large-scale problems with i.i.d. inputs and fixed $b$, we can expect high levels of sparsity. A wide range of large-scale computational experiments demonstrates the consistent benefits of our method, which can be combined with any serial projection algorithm. Our experiments on real-world data suggest that significant sparsity can be exploited even when such distributional assumptions may be violated. We also note that, due to Proposition 2.1 and Corollary 2.2, highly dense projections occur when $\tau$ has a low value relative to the entries of $v$. Now, iterative (serial) methods such as Michelot’s and Condat’s can be interpreted as starting with a pivot value that is a lower bound on $\tau$ and increasing the pivot iteratively until the true value is attained. Hence, when our distributed method performs poorly due to density of the projection, the problem can simply be solved in serial using a small number of iterations, and vice versa. This might be expected, for instance, when the input vector itself is sparse (see Appendix E.5).
7 Acknowledgements
This work was funded in part by the Office of Naval Research under grant N00014-23-1-2632. We also thank an anonymous reviewer for various code optimization suggestions.
References
- Amdahl (1967) Amdahl GM (1967) Validity of the single processor approach to achieving large scale computing capabilities. Proceedings of the April 18-20, 1967, spring joint computer conference, 483–485.
- Barlaud et al. (2017) Barlaud M, Belhajali W, Combettes PL, Fillatre L (2017) Classification and regression using an outer approximation projection-gradient method. IEEE Transactions on Signal Processing 65(17):4635–4644.
- Barman et al. (2013) Barman S, Liu X, Draper SC, Recht B (2013) Decomposition methods for large scale lp decoding. IEEE Transactions on Information Theory 59(12):7870–7886.
- Bentley and McIlroy (1993) Bentley JL, McIlroy MD (1993) Engineering a sort function. Software: Practice and Experience 23(11):1249–1256.
- Bioucas-Dias et al. (2012) Bioucas-Dias JM, Plaza A, Dobigeon N, Parente M, Du Q, Gader P, Chanussot J (2012) Hyperspectral unmixing overview: Geometrical, statistical, and sparse regression-based approaches. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 5(2):354–379.
- Blondel et al. (2014) Blondel M, Fujino A, Ueda N (2014) Large-scale multiclass support vector machine training via euclidean projection onto the simplex. 2014 22nd International Conference on Pattern Recognition, 1289–1294.
- Blum et al. (1973) Blum M, Floyd RW, Pratt VR, Rivest RL, Tarjan RE, et al. (1973) Time bounds for selection. J. Comput. Syst. Sci. 7(4):448–461.
- Boyd and Vandenberghe (2004) Boyd S, Vandenberghe L (2004) Convex optimization (Cambridge University Press).
- Brodie et al. (2009) Brodie J, Daubechies I, De Mol C, Giannone D, Loris I (2009) Sparse and stable markowitz portfolios. Proceedings of the National Academy of Sciences 106(30):12267–12272.
- Candès et al. (2008) Candès EJ, Wakin MB, Boyd SP (2008) Enhancing sparsity by reweighted $\ell_1$ minimization. Journal of Fourier Analysis and Applications 14:877–905.
- Center (1987) Center OS (1987) Ohio supercomputer center. URL http://osc.edu/ark:/19495/f5s1ph73.
- Chang and Lin (2011) Chang CC, Lin CJ (2011) LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2:27:1–27:27.
- Charles and J. Laurie (2006) Charles MG, J Laurie S (2006) Introduction to probability (American Mathematical Society).
- Chartrand and Yin (2008) Chartrand R, Yin W (2008) Iteratively reweighted algorithms for compressive sensing. 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, 3869–3872.
- Chen et al. (1998) Chen SS, Donoho DL, Saunders MA (1998) Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing 20(1):33–61.
- Chen and Zhou (2014) Chen X, Zhou W (2014) Convergence of the reweighted $\ell_1$ minimization algorithm for $\ell_2$–$\ell_p$ minimization. Computational Optimization and Applications 59:47–61.
- Cominetti et al. (2014) Cominetti R, Mascarenhas WF, Silva PJS (2014) A Newton's method for the continuous quadratic knapsack problem. Math. Prog. Comp. 6:151–196.
- Condat (2016) Condat L (2016) Fast projection onto the simplex and the $\ell_1$ ball. Math. Program. Ser. A 158(1):575–585.
- Cormen et al. (2009) Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to Algorithms (The MIT Press), 3rd edition.
- Dai et al. (2006) Dai YH, Fletcher R (2006) New algorithms for singly linearly constrained quadratic programs subject to lower and upper bounds. Math. Program. 106:403–421.
- Dai and Chen (2023) Dai Y, Chen C (2023) Sparsity-Exploiting Distributed Projections onto a Simplex. URL http://dx.doi.org/10.1287/ijoc.2022.0328.cd, https://github.com/INFORMSJoC/2022.0328.
- Donoho and Elad (2003) Donoho DL, Elad M (2003) Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization. Proceedings of the National Academy of Sciences 100(5):2197–2202.
- Duchi et al. (2008) Duchi J, Shalev-Shwartz S, Singer Y, Chandra T (2008) Efficient projections onto the $\ell_1$-ball for learning in high dimensions. Proceedings of the 25th International Conference on Machine Learning (ICML), 272–279.
- Fischer (2011) Fischer H (2011) A history of the central limit theorem. From classical to modern probability theory (Springer).
- Gentle (2009) Gentle J (2009) Computational Statistics (Springer).
- Greb and Zachmann (2006) Greb A, Zachmann G (2006) GPU-ABiSort: optimal parallel sorting on stream architectures. Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium, 10 pp.
- Held et al. (1974) Held M, Wolfe P, Crowder HP (1974) Validation of subgradient optimization. Math. Program. 6:62–88.
- Iutzeler and Condat (2018) Iutzeler F, Condat L (2018) Distributed projection on the simplex and ball via admm and gossip. IEEE Signal Processing Letters 25(11):1650–1654.
- Kiwiel (2008) Kiwiel KC (2008) Breakpoint searching algorithms for the continuous quadratic knapsack problem. Math. Program. 112:473–491.
- Knuth (1998) Knuth D (1998) The Art of Computer Programming, volume 1 (Addison-Wesley), 3rd edition.
- Ladner and Fischer (1980) Ladner RE, Fischer MJ (1980) Parallel prefix computation. J. ACM 27(4):831–838.
- Lagarias (2013) Lagarias JC (2013) Euler’s constant: Euler’s work and modern developments. Bull. Amer. Math. Soc. 50:527–628.
- Lellmann et al. (2009) Lellmann J, Kappes J, Yuan J, Becker F, Schnörr C (2009) Convex multi-class image labeling by simplex-constrained total variation. Scale Space and Variational Methods in Computer Vision, 150–162 (Springer Berlin Heidelberg).
- Liu and Draper (2016) Liu X, Draper SC (2016) The admm penalized decoder for ldpc codes. IEEE Transactions on Information Theory 62(6):2966–2984.
- Mahmoud (2000) Mahmoud HM (2000) Sorting: A distribution theory, volume 54 (John Wiley & Sons).
- Michelot (1986) Michelot C (1986) A finite algorithm for finding the projection of a point onto the canonical simplex of $\mathbb{R}^n$. Journal of Optimization Theory and Applications 50:195–200.
- Perez et al. (2020a) Perez G, Ament S, Gomes CP, Barlaud M (2020a) Efficient projection algorithms onto the weighted $\ell_1$ ball. CoRR abs/2009.02980.
- Perez et al. (2020b) Perez G, Barlaud M, Fillatre L, Régin JC (2020b) A filtered bucket-clustering method for projection onto the simplex and the $\ell_1$ ball. Math. Program. Ser. A 182:445–464.
- Robinson et al. (1992) Robinson AG, Jiang N, Lerme CS (1992) On the continuous quadratic knapsack problem. Mathematical Programming 55:99–108.
- Tibshirani (1996) Tibshirani R (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58(1):267–288.
- Tibshirani (1997) Tibshirani R (1997) The lasso method for variable selection in the cox model. Statistic in Medicine 16:385–395.
- van den Berg (2020) van den Berg E (2020) A hybrid quasi-newton projected-gradient method with application to lasso and basis-pursuit denoising. Math. Program. Comp. 12:1–38.
- van den Berg and Friedlander (2009) van den Berg E, Friedlander MP (2009) Probing the pareto frontier for basis pursuit solutions. SIAM Journal on Scientific Computing 31(2):890–912.
- Wang et al. (2022) Wang G, Yu W, Liang X, Wu Y, Yu B (2022) An iterative reduction fista algorithm for large-scale lasso. SIAM Journal on Scientific Computing 44(4):A1989–A2017.
- Wang et al. (2020) Wang S, Bai Y, Pekhimenko G (2020) Bppsa: Scaling back-propagation by parallel scan algorithm.
- Wasson et al. (2019) Wasson M, Milicevic M, Draper SC, Gulak G (2019) Hardware-based linear program decoding with the alternating direction method of multipliers. IEEE Transactions on Signal Processing 67(19):4976–4991.
- Wei et al. (2015) Wei H, Jiao X, Mu J (2015) Reduced-complexity linear programming decoding based on admm for ldpc codes. IEEE Communications Letters 19(6):909–912.
- Xavier and Iyengar (1998) Xavier C, Iyengar SS (1998) Introduction to parallel algorithms, volume 1 (John Wiley & Sons).
- Zhang et al. (2023) Zhang A, Lipton ZC, Li M, Smola AJ (2023) Dive into deep learning.
- Zhang et al. (2013) Zhang G, Heusdens R, Kleijn WB (2013) Large scale lp decoding with low complexity. IEEE Communications Letters 17(11):2152–2155.
- Zhang and Siegel (2013) Zhang X, Siegel PH (2013) Efficient iterative lp decoding of ldpc codes with alternating direction method of multipliers. 2013 IEEE International Symposium on Information Theory, 1501–1505.
Appendix A Mathematical Proofs
Corollary 1 For any such that , we have
Proof.
By definition,
Observe that is a strictly decreasing function for , and for . Furthermore, from Proposition 1, we have that is the unique value such that ; moreover, since then . Thus , which implies , and . ∎
We assume uniformly distributed inputs, i.e., the entries of $v$ are i.i.d. uniform, and we have
Lemma 1
with a sublinear convergence rate, where .
Proof.
Observe that . Since are i.i.d, then we have
(7) |
Thus converges to sublinearly at the rate of
∎
Let denote the conditional variable such that .
We assume uniformly distributed inputs, i.e., the entries of $v$ are i.i.d. uniform, and then
Proposition 2 Michelot’s method has an average runtime of .
Proof.
Let be the number of elements that Algorithm 3 (from the main body) removes from the (candidate) active set in iteration of the do-while loop, and let be the total number of iterations.
(8) |
where is the active index set of the projection. Let be the th pivot (line 3 in Algorithm 3 of the main body) with
and define .
Now, from Proposition 9 we have that all with are i.i.d.. Thus , and so
(9) |
In iteration , all elements in the range are removed, and again the i.i.d. uniform property is preserved by Proposition 9, so
by Law of Total Expectation,
Replacing the right-hand side expectation using Equation (9),
Using the Law of Total Expectation again,
(10) |
Let . We will now consider cases. Observe that , i.e. the active set is a subset of the ground set. The following two cases exhaust the possibilities of where lies: either (Case 1); otherwise, either or (Case 2).
Case 1:
Let be an iteration such that satisfies:
Then,
(11) |
Since , then from Michelot’s Method will stop within at most iterations since at least one element is removed in each iteration; thus . Since for ,
(12) |
From Equation (10),
Then, from the Law of Total Expectation,
(13) | ||||
Michelot’s method always maintains a nonempty active index set (decreasing per iteration), and so . Then, continuing from (13),
(14) |
Claim.
.
Proof:
(15) | ||||
Since and are ,
So for , since ,
(16) |
Substituting into Inequality (15),
using Law of Total Expectation,
(17) |
So then it remains to show . From Equation (9), for any iteration
thus, using the Law of Total Expectation,
(18) |
Applying Equation (18) recursively, starting with the base case ,
Using the Law of Total Expectation,
From Inequality (16), ; thus
Since , ; thus ; as a result,
Thus . Substituting into Inequality (17)
The Claim, together with Inequality (14) establish that
Altogether with Inequalities (11) and (12), we have in Case 1.
Case 2: , or
If , let . Since
the Claim from Case 1 and Equation (14) hold for ; thus
If , then , which implies . However, are given independently of and so is fixed. Thus for asymptotic analysis, does not hold for sufficiently large .
Together with Case 1 and Case 2, . Hence operations are used for scanning/prefix-sum. All other operations, i.e. assigning and , are within a constant factor of the scanning operations. ∎
We assume uniformly distributed inputs, i.e., the entries of $v$ are i.i.d. uniform, and we have
Lemma 2
Filter provides a pivot such that .
Proof.
The upper bound is given by construction of (see (Condat 2016, Section 3, Paragraph 2)).
We can establish the lower bound on by considering the sequence , which represents the initial as well as subsequent (intermediate) values of from the first outer for-loop on line 2 of presented as Algorithm 4 (from the main body), and their corresponding index sets . Filter initializes with and . For , if ,
otherwise , and . Then it can be shown that (see (Condat 2016, Section 3, Paragraph 2)), and (see (Condat 2016, Section 3, Paragraph 5)). Now in terms of we may write
For any , since then . By construction of and , we have . Thus, , and . So . ∎
Now we introduce some notation in order to compare subsequent iterations of Condat’s method with iterations of Michelot’s method. Let and be the total number of iterations taken by Condat’s method and Michelot’s method (respectively) on a given instance. Let be the active index sets per iteration for Condat’s method with corresponding pivots . Likewise, we denote the index sets and pivots of Michelot’s method as and , respectively. If then we set and ; likewise for Michelot’s algorithm.
Lemma 3
, and for .
Proof.
We will prove this by induction. For the base case, is obtained by Filter. So . Moreover, from Lemma 2, .
Now for any iteration , suppose , and . From line 5 in Algorithm 3 (from the main body), . From Condat (Condat 2016, Section 3, Paragraph 3), Condat’s method uses a dynamic pivot between to to remove inactive entries that would otherwise remain in Michelot’s method. Therefore, , and moreover for any , we have that . Now observe that
and so . ∎
Corollary 2 .
Proof.
Observe that both algorithms remove elements (without replacement) from their candidate active sets at every iteration; moreover, they terminate with the pivot value and so . So, together with Lemma 3, we have for that . So implies , and so . Therefore . ∎
Lemma 4 The worst-case runtime of Filter is .
Proof.
Since at any iteration, Filter will scan at most entries; including operations to update . ∎
We assume uniformly distributed inputs, i.e., the entries of $v$ are i.i.d. uniform, and we have
Proposition 3
Condat’s method has an average runtime of .
Proof.
Filter takes operations from Lemma 4. From Corollary 2, the total operations spent on scanning in Condat’s method is less than (or equal to) the average operations for Michelot’s method (established in Proposition 2); hence Condat’s average runtime is . ∎
We assume uniformly distributed inputs, i.e., the entries of $v$ are i.i.d. uniform, and we have
Theorem 1
.
Proof.
Lemma 5 Suppose are from an arbitrary distribution , with PDF and CDF . Let such that be some positive number, and be such that . Then i) as and ii) .
Proof.
Since is a density function, the CDF is absolutely continuous (see e.g. (Charles and J. Laurie 2006, Page 59, Definition 2.1)); thus for any , there exists such that .
For , define the indicator variable
and let . So , and we can show that it is binomially distributed.
Claim.
.
Proof: Observe that , and ; thus . Moreover, are independent, and so are i.i.d. ; consequently, .
So, as , we can apply the Central Limit Theorem (see e.g. Fischer (2011)):
(21) |
Denote to be the CDF of the standard normal distribution. Consider (for any ) the right-tail probability
(22) |
(23) |
where is the floor function.
Setting yields
Consider the right-hand side, . Since is a CDF, it is monotonically increasing, continuous, and converges to ; thus . So as approaches infinity,
(24) |
which establishes condition (i).
Now consider the left-tail probability,
Again setting , we have that . So as approaches infinity,
which establishes condition (ii). ∎
Theorem 2 Suppose are from an arbitrary distribution , with PDF and CDF . Then, for any , as .
Proof.
Case 1:
Since and , . Hence, .
Case 2:
Since is a density function, is absolutely continuous. So there exists be such that . We shall first establish that as .
From Corollary 1, , and so
(25) |
Now observe that, for any , can be treated as a conditional variable: . Since all such are i.i.d, we may denote the (shared) expected value and variance as . Moreover, by definition of we have
(26) |
Together with condition (i) of Lemma 5, this implies that the right-hand side of the probability in (25) is negative as :
(27) |
It follows, continuing from Equation (25), that
Thus we have the desired result:
(28) |
Now observe that implies , and subsequently . Together with condition (ii) from Lemma 5 we have
∎
Corollary 2 For projection of onto a simplex , if , the conclusion from Theorem 2 keeps true.
Proof.
Proposition 4 Let be a subvector of with entries; moreover, without loss of generality suppose the subvector contains the first entries. Let be the projection of onto the simplex , and be the corresponding pivot value. Then, . Consequently, for we have that .
Proof.
Define two index sets,
As is a subvector of , we have ; thus,
From Corollary 1, ; from Proposition 1 it thus follows that . ∎
We assume uniformly distributed inputs, i.e., the entries of $v$ are i.i.d. uniform, and we have
Proposition 5
Parallel Pivot and Partition with either the median, random, or Michelot’s pivot rule, has an average runtime of .
Proof.
The algorithm starts by distributing projections. Pivot and Partition has linear runtime on average with any of the stated pivot rules, and so for the th core with input , whose size is , we have an average runtime of . Now from Theorem 1, each core returns in expectation at most active terms. So the reduced input (line 3 of Algorithm 8 from the main body) will have at most entries on average; thus the final projection will incur an expected number of operations in . ∎
We assume uniformly distributed inputs, i.e., the entries of $v$ are i.i.d. uniform, and we have
Proposition 6
Let be the output of . Then .
Proof.
We assume that the if in line 5 from Algorithm 4 (from the main body) does not trigger; this is a conservative assumption as otherwise more elements would be removed from , reducing the number of iterations.
Let be the th pivot with (from line 1 in Algorithm 4 from the main body), and subsequent pivots corresponding to the for-loop iterations of line 2. Whenever Filter finds , it updates as follows:
(29) |
Moreover, when is found, ; thus, from the Law of Total Expectation, . Together with (29),
Using the initial value , we can obtain a closed-form representation for the recursive formula:
(30) |
Now let denote the number of terms Filter scans after calculating and before finding some . Since Filter scans terms in total (from initialize and calls to line 4), then
(31) |
We can show has a geometric distribution as follows.
Claim.
.
Proof: For each term , we have
So can be interpreted as a Bernoulli trial. Hence is distributed with .
Now, applying Jensen’s Inequality to ,
and together with (30) we have
(32) | ||||
Now observe that
where the first equality is from L’Hôpital’s rule. Thus . Furthermore, we have the classical bound on the harmonic series:
where is the Euler-Mascheroni constant, see e.g. Lagarias (2013); thus,
which implies . Moreover (see e.g. (Knuth 1998, Section 1.2.7)),
which implies . From (31) it follows that . ∎
We assume uniformly distributed inputs, i.e., the entries of $v$ are i.i.d. uniform, and we have
Proposition 7
Parallel Condat’s method has an average complexity .
Proof.
In Distributed Filter (Algorithm 9 from the main body), each core is given input with . (serial) Filter has linear runtime from Lemma 4, and the for loop in line 5 of Algorithm 9 (from the main body) will scan at most terms. Thus distributed Filter runtime is in .
From Proposition 6, . Since the output of Distributed Filter is , we have that (given is i.i.d.) .
Parallel Condat’s method takes the input from Distributed Filter and applies serial Condat’s method (Algorithm 10 from the main body, lines 2-7), excluding the serial Filter. From Proposition 3, Condat’s method has average linear runtime, so this application of serial Condat’s method has average complexity . ∎
Lemma 6 If and , then .
Proof.
The CDF of is . Then,
which implies . ∎
Lemma 7 If are independent random variables, then are independent.
Proof.
are independent and so for any . So considering the joint conditional probability,
∎
Proposition 9 If i.i.d. and , then i.i.d. .
Proof.
From Lemma 6, for any , . From Lemma 7, for any , , are conditionally independent. So, i.i.d. . ∎
Appendix B Algorithm Descriptions
B.1 Bucket Method
Pivot and Partition selects one pivot in each iteration to partition and applies this recursively in order to create sub-partitions in the manner of a binary search. The Bucket Method, developed by Perez et al. (2020b), can be interpreted as a modification that uses multiple pivots and partitions (buckets) per iteration.
The algorithm, presented as Algorithm 11, is initialized with two tuning parameters: the maximum number of iterations, and the number of buckets with which to subdivide the data. In each iteration the algorithm partitions the problem into the buckets with the inner for loop of line 4, and then calculates corresponding pivot values in the inner for loop of line 10.
The tuning parameters can be determined as follows. Suppose we want the algorithm to find a (final) pivot within some absolute numerical tolerance of the true pivot , i.e. such that . This can be ensured (see Perez et al. (2020b)) by setting
where the bound depends on the range of the entries of $v$. Perez et al. (2020b) prove the corresponding worst-case complexity.
We assume uniformly distributed inputs, i.e., the entries of $v$ are i.i.d. uniform, and we have
Proposition 8
The Bucket method has an average runtime of .
Proof.
Let denote the index set at the start of iteration in the outer for loop (line 2), and denote the index set of the th bucket, , at the end of the first inner for loop (line 4).
For a given outer for loop iteration (line 2), the first inner for loop (line 4) uses operations. Note that the and on line 5 can be reused in each iteration, and the nested for loop on line 6 has iterations. The second inner for loop (line 10) also uses operations. In line 11, the first sum can be updated dynamically (in the manner of a scan) as a cumulative sum as is updated in line 16 or 19, thus requiring a constant number of operations per iteration . The second sum is bounded above by since . Thus each iteration of the outer for loop uses operations.
Since are i.i.d , then from Proposition 9, the terms from each sub-partition are also i.i.d uniform. So for any and , . From line 13, . Since then ; thus
Therefore, . ∎
B.2 Projection onto a Weighted Simplex and a Weighted Ball
The weighted simplex and the weighted $\ell_1$ ball are
$$\Delta_{a,b} := \Big\{ x \in \mathbb{R}^n : \textstyle\sum_{i=1}^n a_i x_i = b,\ x \ge 0 \Big\}, \qquad B_{a,b} := \Big\{ x \in \mathbb{R}^n : \textstyle\sum_{i=1}^n a_i |x_i| \le b \Big\},$$
where $a > 0$ is a weight vector and $b > 0$ is a scaling factor. Perez et al. (2020a) show there is a unique pivot $\tau$ such that the projection of $v$ onto $\Delta_{a,b}$ satisfies $w_i = \max(v_i - a_i \tau,\ 0)$. Thus pivot-based methods for the unweighted simplex extend to the weighted simplex in a straightforward manner; a sketch of the weighted pivot update follows below. We present weighted Michelot's method as Algorithm 3 (Appendix D), and the weighted Filter as Algorithm 4 (Appendix D).
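To illustrate how little changes, the Julia sketch below adapts the Michelot sketch from Section 2.3 to the weighted pivot update; it is our own illustration (assuming strictly positive weights) and does not reproduce Algorithm 3 of Appendix D.

```julia
# Michelot-style projection onto the weighted simplex {x ≥ 0 : sum(a .* x) = b},
# assuming strictly positive weights a. Active entries satisfy v_i − a_i*τ > 0,
# i.e. v_i / a_i > τ, and on the active set I the pivot solves
#   sum_{i ∈ I} a_i * (v_i − a_i*τ) = b.
function project_wsimplex_michelot(v::Vector{Float64}, a::Vector{Float64}, b::Float64)
    ratio = v ./ a
    active = trues(length(v))
    τ = (sum(a .* v) - b) / sum(abs2, a)
    while true
        new_active = ratio .> τ                     # candidates that can still be active
        count(new_active) == count(active) && break # nothing removed: τ is the optimal pivot
        active = new_active
        τ = (sum(a[active] .* v[active]) - b) / sum(abs2, a[active])
    end
    return max.(v .- a .* τ, 0.0)
end
```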
Our parallelization depends on the choice of serial method for projection onto the simplex, since projection onto the weighted simplex requires direct modification rather than oracle calls to methods for the unweighted case. Sort and Scan for weighted simplex projection can be implemented with a parallel merge sort algorithm as in Algorithm 5 (Appendix D). Our distributed structure can be applied to the Michelot and Condat methods in a similar manner as in the unweighted case; these are presented respectively as Algorithms 6 and 7 (Appendix D). We note that projection onto $B_{a,b}$ is linear-time reducible to projection onto $\Delta_{a,b}$ (Perez et al. 2020a, Equation (4)).
Appendix C Distribution Examples
Here we apply Theorem 2 to three examples:
(A) Let , , , .
(B) Let , , , .
(C) Let , , , .
Observe that
(33a) | ||||
(33b) | ||||
(33c) |
(33b) can be calculated by (28), and (33c) can be calculated by (23).
For (A), . From and , we have and . So,
Now, since , we have and . Applying Theorem 2 yields:
which implies the number of active elements in the projection should be less than of or with high probability.
For (B), ; similar to the first example,
Together with
we can calculate and :
Applying Theorem 2,
which imply of terms are active after projection with probability .
For (C), . Similar to the previous examples,
Together with
we can calculate and as follows,
Applying Theorem 2,
and so of terms are active after projection with probability .
Appendix D Algorithm Pseudocode
Appendix E Additional Experiments
All code and data can be found on GitHub (https://github.com/foreverdyz/Parallel_Projection) or in the IJOC repository (Dai and Chen 2023).
E.1 Testing Theoretical Bounds
For Theorem 1, we calculate the average number of active elements when projecting vectors with i.i.d. uniform entries onto the simplex, over a range of problem sizes with 10 trials per size. This empirical result is compared against the corresponding asymptotic bound given by Theorem 1. Results are shown in Figure 10, and demonstrate that our asymptotic bound is rather accurate even for small problem sizes.
Similarly, for Proposition 6, we conduct the same experiments and compare the results against the corresponding bound in Figure 11, where the constant factor was found empirically.
For Lemma 1, we run Algorithm 3 (in the main paper) on i.i.d. uniformly distributed inputs over 100 trials. We compare the (average) remaining number of elements after each iteration of Michelot's method against a geometric series with ratio 1/2. We find the average number of remaining terms after each loop of Michelot's method is close to the corresponding value of this geometric series. So the conclusion from Lemma 1, which claims that Michelot's method approximately discards half of the vector in each loop when the size is large, is accurate.
[Figures 10 and 11, plus the Lemma 1 comparison plot: empirical tests of the bounds from Theorem 1, Proposition 6, and Lemma 1.]
E.2 Robustness Test
We created examples with a wide range of scaling factors, drawing inputs i.i.d. from benchmark distributions, for both serial methods and parallel methods. Results are given in Figures 13 and 14.
[Figures 13-14: Speedups across a wide range of scaling factors.]
E.3 Weighted simplex, weighted ball
We conduct weighted simplex and weighted $\ell_1$ ball projection experiments with various methods on large-scale instances; inputs and weights are drawn i.i.d. from benchmark distributions. Algorithms were implemented as described in Section B.2. Results for the weighted simplex are shown in Figure 15 and results for the weighted $\ell_1$ ball projection are shown in Figure 16. Slightly more modest speedups across all algorithms are observed in the weighted simplex compared to the unweighted experiments; nonetheless, the general pattern still holds, with parallel Condat attaining superior performance with up to 14x absolute speedup. In the weighted $\ell_1$ ball projection, we observe that Sort and Scan has relative speedups similar to our parallel Condat's method; however, the underlying serial Condat's method is considerably faster.
[Figure 15: Weighted simplex projection speedups. Figure 16: Weighted $\ell_1$ ball projection speedups.]
E.4 Runtime Fairness Test
We provide a runtime fairness test for the serial methods (Sort and Scan, Michelot's method, Condat's method) and their respective parallel implementations. We restrict both serial and parallel methods to use only one core on a single large-scale instance with inputs drawn i.i.d. from a benchmark distribution. Results are provided in Table 3.
Method | Runtime |
---|---|
Sort + Scan | |
(P)Sort + Scan | |
(P)Sort + Partial Scan | |
Michelot | |
(P) Michelot | |
Condat | |
(P) Condat |
E.5 Discussion on dense projections
To project a vector $v$ onto the $\ell_1$ ball $B_b$, we first check whether $\|v\|_1 \le b$. If this condition holds, then $v$ is already within the ball. However, if $\|v\|_1 > b$, we project $|v|$ onto the simplex with a scaling factor of $b$. As noted earlier, in this case we have $\tau > 0$, which means that all zero entries of $v$ are inactive in the projection of $|v|$ onto the simplex. Therefore, we only need to project the subvector of nonzero entries onto the simplex. This is why, when $v$ is sparse, it is probably better to use serial projection methods instead of their parallel counterparts.