
Searching and Sorting With $O(n^{2})$ Processors in $O(1)$ Time

Taeyoung An and A. Yavuz Oruç
Department of Electrical Engineering
University of Maryland
College Park, MD 20742
Abstract

The proliferation of processing elements (PEs) in parallel computer systems, along with the use of more extensive parallelization of algorithms, causes interprocessor communication to dominate VLSI chip space. This paper proposes a new architecture to overcome this issue by using simple crosspoint switches to pair PEs instead of a complex interconnection network. Based on the cyclic permutation wiring idea described in [1], this pairing leads to a linear crosspoint array of $n(n-1)/2$ processing elements and as many crosspoints. We demonstrate the versatility of this new parallel architecture by designing fast searching and sorting algorithms for it. In particular, we show that finding the minimum, finding the maximum, and searching a list of $n$ elements can all be performed in $O(1)$ time with elementary logic gates with $O(n)$ fan-in, and in $O(\lg n)$ time with $O(1)$ fan-in. We further show that sorting a list of $n$ elements can also be carried out in $O(1)$ time using elementary logic gates with $O(n)$ fan-in and threshold logic gates. The sorting time increases to $O(\lg n\lg\lg n)$ if only elementary logic gates with $O(1)$ fan-in are used. In addition, we show how other fundamental queries can be handled within the same order of time complexities.

I Introduction

With the emerging big data problems in distributed and cloud-based computing systems, designing efficient on-chip networks for fast searching and sorting of extremely large sets of data has become a critical task. To be sure, there exist myriad searching and sorting algorithms [2, 3] that can be applied to processing big data in distributed and cloud-based systems, but many of these algorithms are either not sufficiently fast to deal with such large data sets or they require parallel processing elements (PEs) with overly complex interconnection networks. In this paper, we introduce a tightly coupled linear array processing model with $O(n^{2})$ simple processing elements in which every pair of processing elements is connected by a direct link (an on-and-off switch). Unlike a conventional linear array of $O(n)$ processing elements [4, 5, 6, 7, 8], this linear array model, called a 1D-Crosspoint Array, uses $O(n^{2})$ processing elements, each of which serves as a simple compare-and-exchange operator, together with some minimal combinational circuit functions. We provide counting-based searching and sorting algorithms on this 1D-Crosspoint Array. Our search algorithm takes $O(1)$ time and our sorting algorithm takes $O(\lg n\lg\lg n)$ time with constant fan-in elementary operations, and $O(1)$ time with threshold gates, making them highly competitive with parallel searching and sorting architectures that have previously been reported in the literature.

For comparison, we provide a brief survey of such architectures here. The earliest results on parallel sorting appeared in [9, 10, 11, 12, 13, 14, 15, 16]. Batcher's sorters are non-adaptive, or oblivious, in that they are constructed from a set of $2\times 2$ switches connected together in stages to compare and exchange keys [9]. Batcher's odd-even and bitonic sorters use $O(n\lg^{2}n)$ compare-and-exchange switches and have $O(\lg^{2}n)$ sorting time. Another non-adaptive sorting network is the AKS network, which can sort a set of $n$ keys in $O(\lg n)$ time using $O(n\lg n)$ comparator/exchange switches. The main issue with the AKS network is the large constants in its hardware and sorting time complexities. Adaptive techniques are also used in sorting, as described in [17], for sorting binary keys in $O(\lg n)$ gate delays with $O(n)$ constant fan-in and fan-out gates. Several other parallel realizations of sorting algorithms have also been reported in the literature. Many of these rely on a mesh-connected parallel computer model [13, 15], or on more generic SIMD processor models [10, 11]. In general, the studies on mesh-connected parallel computer models establish that $n$ keys can be sorted on a $\sqrt{n}\times\sqrt{n}$ mesh in $O(\sqrt{n})$ time, with constants in the order of time complexity that range between 3 and 6. A more realistic linear array sorting model was introduced in [4] to sort a list of $n$ keys in $2n$ time using $O(n)$ PEs and $O(n)$ memory space. More recent results on sorting on parallel computers with a mesh topology make stronger claims on the time complexity of sorting, effectively reducing it to $O(\lg n\lg\lg n)$ with $n$ PEs. Examples of such results include those that appeared in [5, 6, 7, 8, 18]. These efforts rely on a reconfigurable pipelined bus system, called the LARPBS (Linear Array with a Reconfigurable Pipelined Bus System).
The work in [8] reduced the time complexity to $O(\lg^{2}\lg n)$ using $n^{1+\epsilon}$ PEs, where $0<\epsilon<1$, and the result in [18] provided an $O(\lg n)$-time sorting algorithm on the LARPBS model with $O(n)$ PEs. The time complexity of sorting was reduced further to $O(1)$ in [19] using a mesh topology by assuming that PEs in such a topology may communicate with each other in $O(1)$ time. We find this assumption impractical as it ignores wire delays by assuming that bus transmissions between PEs take $O(1)$ time.

Independent from these results, Valiant reported an abstract parallel processor model and studied the parallel time complexity of merging and sorting problems without specifying a particular topology [12]. He presented both lower and upper bounds on the parallel time complexity of merging and sorting on this abstract model. Valiant's work is important in that it guides what is feasible theoretically as the number of processing elements is varied from two PEs to $n(n-1)/2$ PEs, even though it is not at all obvious how his merging and sorting algorithms can be mapped to an actual parallel processor architecture. In this paper, we attempt to fill this void in one particular case, namely when Valiant's abstract model assumes $n(n-1)/2$ processing elements. We introduce a realistic model for interprocessor communications by placing direct edges, or on-and-off switches, only between physically adjacent PEs. More precisely, our 1D linear array architecture with $\frac{n(n-1)}{2}+1$ PEs provides an $O(1)$ time sorting algorithm, matching the time complexity of sorting a list of elements on Valiant's parallel processing model [12] when $O(n^{2})$ PEs are used. It should be pointed out that Valiant uses an underlying topology with $O(n)$ fan-in and fan-out to sort a list of $n$ elements. The underlying topology of our 1D-Crosspoint Array assumes a fan-in and fan-out of 2, but to accomplish $O(1)$ sorting time, we employ a threshold logic circuit, which by definition requires $O(n)$ fan-in. We also employ a distribution network with $O(n)$ fan-out in our $O(1)$ time sorting algorithm. The same assumptions hold for finding the minimum and maximum of a list of elements as in Valiant's parallel processor model.

The rest of the paper is organized as follows. In the next section, we describe the 1D-Crosspoint Array and prove the minimum number of PEs needed to accomplish our searching and sorting time complexity results. In Section III, we explain how the new architecture can be constructed for any number of PEs. Next, in Section IV, we introduce the parallel sorting algorithm suited for the new architecture and analyze its complexity. We also extend this algorithm to finding the minimum and maximum of a list of elements as well as searching for an element. The paper is concluded in Section V with a discussion that includes a comparison of our sorting algorithm with some of the well-known parallel sorting algorithms.

II The 1D-Crosspoint Array

A one-dimensional (1D) crosspoint array is the simplest form of PE pairing that establishes a baseline of layout and wiring complexity, as depicted in Figure 1 for a 5-PE network.

Figure 1: 1D-Crosspoint Array for $n=5$

We say that a crosspoint array has a 1D-layout if (a) each PE is paired with at most two other PEs and (b) the PEs are placed on a straight path. As seen in the figure, all $\binom{5}{2}=10$ pairings of five PEs are realizable in this layout of a 10-crosspoint array in which a PE is replicated no more than three times. Each of the five PEs is replicated twice, where the copies will be referred to as replicates. Replicates are created to distribute the workload to multiple PEs and attain parallelism. They are given the same input dataset and carry out the same computation with their neighboring PEs (replicates). Thus, replicates are similar to PEs, and when there is no ambiguity, they will be viewed as PEs. Each PE and its replicates form a class, so that PEs and their replicates are divided into non-overlapping sets. Two classes will be called adjacent if they have two elements (PEs or replicates) that share a crosspoint. Our goal here is to have a pair of elements, i.e., PEs or replicates, one from each of the $\binom{n}{2}$ pairs of $n$ classes, share a crosspoint in a 1D-Crosspoint Array, using a minimum number of crosspoints and replicates and under the assumption that no PE (replicate) is adjacent to more than two PEs (replicates).

To this end, we have the following results.

Proposition 1.

A 1D-Crosspoint Array in which every pair of $n$ classes is adjacent requires at least $\binom{n}{2}$ crosspoints and $\binom{n}{2}+1$ PEs in all $n$ classes.

Proof.

It is obvious that each pair of PEs requires a crosspoint to be adjacent, and hence $\binom{n}{2}$ pairs of classes require $\binom{n}{2}$ crosspoints. It is also obvious that $\binom{n}{2}$ crosspoints require $\binom{n}{2}+1$ PEs, given that no PE is connected to more than two PEs.∎

Proposition 2.

In a 1D-Crosspoint Array with $n$ classes, if every pair of $n$ classes is adjacent then each class must have no fewer than $\left\lceil\frac{n-1}{2}\right\rceil$ replicates.

Proof.

A PE on a 1D-Crosspoint Array can be adjacent to at most two PEs. Thus, any given class needs at least $\frac{n-1}{2}$ replicates to be adjacent to the PEs of all other $n-1$ classes. Since the number of replicates can only be an integer, $\frac{n-1}{2}$ rounds up to the next larger integer, i.e., $\left\lceil\frac{n-1}{2}\right\rceil$.∎

The last proposition can be strengthened for the two PEs at the two ends of the 1D-Crosspoint Array:

Proposition 3.

If the two PEs at the two ends of the 1D-Crosspoint Array belong to the same class, then that class must contain at least $\left\lceil\frac{n+1}{2}\right\rceil$ PEs in order to be adjacent to $n-1$ PEs from $n-1$ different classes.

Proof.

Let us suppose that the two PEs at the two ends belong to class $\alpha$. These two PEs can only be adjacent to two PEs in total. Therefore, for class $\alpha$ to be adjacent to the $n-1$ other classes, it must be adjacent to some $(n-1)-2=n-3$ classes if we exclude its adjacencies through the two PEs at the two ends. As in Proposition 2, we can deduce that class $\alpha$ needs to have at least $\frac{n-3}{2}$ PEs. Hence, including the PEs at the ends, class $\alpha$ needs at least $\frac{n-3}{2}+2=\frac{n+1}{2}$ PEs in total. Again, since the number of replicates can only be an integer, it rounds up to $\left\lceil\frac{n+1}{2}\right\rceil$. ∎

Proposition 4.

If the two PEs at the two ends of the 1D-Crosspoint Array belong to different classes, then those two classes must each contain at least $\left\lceil\frac{n}{2}\right\rceil$ PEs in order to be adjacent to $n-1$ PEs from $n-1$ classes.

Proof.

This time, suppose that the PEs at the two ends of the array belong to classes $\alpha$ and $\beta$. Class $\alpha$ can be adjacent to only one class via the PE at its end, and similarly class $\beta$ can be adjacent to only one class at the other end. Therefore, for class $\alpha$ to be adjacent to $n-1$ classes, it must be adjacent to $n-2$ classes via its remaining PEs, and the same holds for class $\beta$. This implies that classes $\alpha$ and $\beta$ must each have at least $\frac{n-2}{2}=\frac{n}{2}-1$ PEs. Hence, including the PE at its end, class $\alpha$ must contain at least $\left\lceil\frac{n}{2}-1+1\right\rceil=\left\lceil\frac{n}{2}\right\rceil$ PEs. The same holds for class $\beta$, and hence the statement. ∎

Combining Propositions 2, 3, and 4, the following holds:

Theorem 5.

A 1D-Crosspoint Array in which all $n$ classes are adjacent requires at least $\frac{n(n-1)}{2}+1$ PEs (replicates) for odd $n$, and $\frac{n^{2}}{2}$ PEs (replicates) for even $n$.

Proof.

When the PEs at the two ends of the array belong to the same class, by Proposition 3, that class would need to contain $\left\lceil\frac{n+1}{2}\right\rceil$ PEs. By Proposition 2, each of the other $n-1$ classes would need to have $\left\lceil\frac{n-1}{2}\right\rceil$ PEs. Therefore, for the 1D-Crosspoint Array to have all $\binom{n}{2}$ pairs of $n$ different classes, it would require at least

\[
\left\lceil\frac{n+1}{2}\right\rceil\times 1+\left\lceil\frac{n-1}{2}\right\rceil\times(n-1)\ \text{PEs.}\tag{1}
\]

For odd $n$, Eqn. 1 becomes $\frac{n(n-1)}{2}+1$, and for even $n$ it reduces to $\frac{n^{2}}{2}+1$.
When the PEs at the two ends of the array belong to different classes, by Proposition 4, those classes would need to contain $\left\lceil\frac{n}{2}\right\rceil$ PEs each. By Proposition 2, each of the other $n-2$ classes would need to have $\left\lceil\frac{n-1}{2}\right\rceil$ PEs. Therefore, for all $\binom{n}{2}$ pairs of $n$ different classes of PEs to be adjacent, the 1D-Crosspoint Array would require at least

\[
\left\lceil\frac{n}{2}\right\rceil\times 2+\left\lceil\frac{n-1}{2}\right\rceil\times(n-2)\ \text{PEs.}\tag{2}
\]

If $n$ is an odd number, Eqn. 2 reduces to $\frac{n(n-1)}{2}+2$, and if $n$ is an even number, it becomes $\frac{n^{2}}{2}$.
Therefore, for all $n$ classes to be adjacent on the 1D-Crosspoint Array, there must be $\frac{n(n-1)}{2}+1$ or more PEs in the odd $n$ case, and $\frac{n^{2}}{2}$ or more PEs in the even $n$ case.∎
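For concreteness, evaluating Eqn. 1 at the odd value $n=5$ and Eqn. 2 at the even value $n=12$ recovers the two forms of the bound:

\[
n=5:\ \left\lceil\tfrac{6}{2}\right\rceil+4\left\lceil\tfrac{4}{2}\right\rceil=3+8=11=\tfrac{5\cdot 4}{2}+1,
\qquad
n=12:\ 2\left\lceil\tfrac{12}{2}\right\rceil+10\left\lceil\tfrac{11}{2}\right\rceil=12+60=72=\tfrac{12^{2}}{2}.
\]

The latter is exactly the number of PEs used by the construction of the next section for $n=12$.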

III Construction of 1D-Crosspoint Array

We now present a construction of a 1D-Crosspoint Array for any number $n$ of classes using a number of PEs that matches the lower bound given in Theorem 5. Our construction works differently for even and odd $n$, as described below.

III-A Even $n$

We borrow ideas from cyclic permutation groups as used for constructing the one-sided binary tree-crossbar switch in [20, 1]. Let $p=(0\,1\,2\,3\,\cdots\,n-1)$ be a permutation of $n$ classes, where $p(i)=i+1\bmod n$, $0\leq i\leq n-1$ (throughout the rest of the paper, $p$ will be fixed to this permutation). We will use $p$ as the representation of the layout of the PEs belonging to the $n$ classes, where classes whose ids are adjacent in the cycle representation of powers of $p$ will also be adjacent in the 1D-Crosspoint Array. The cyclic group of permutations generated by $p$ consists of $n$ permutations $p,p^{2},p^{3},\cdots,p^{n}$, where $p^{n}$ is the identity permutation. The permutation $p^{j}$, where $1\leq j\leq n-1$, specifies the right neighbor of class $i$, which is given by $(i+j)\bmod{n}$. It is shown in [20, 1] that $p,p^{2},\cdots,p^{n-1}$ map every element to a distinct element, which can be interpreted as every PE having a distinct right neighbor in the corresponding 1D-Crosspoint Array. It is also shown that $p^{n-j}$, $1\leq j\leq\frac{n}{2}-1$, is $p^{j}$ written in reverse and represents the inverse permutation of $p^{j}$. The PEs are identified with the elements in $p^{j}$, which are generally expressed as a product of disjoint cycles. For example, $p$ consists of one long cycle, i.e., a cycle of $n$ elements, $p^{2}$ may or may not be a long cycle depending upon whether $n$ is odd or even, and so on. The elements that are adjacent in the cycles of each $p^{j}$ determine the crosspoints between the PEs in some unique way. More specifically, two PEs will have a crosspoint between them if the elements that identify the two PEs are adjacent in a cycle of $p^{j}$ for some $j$, $1\leq j\leq n-1$. Hence, the PEs are connected together using crosspoints, and in [20] PEs need not be physically adjacent when crosspoints are placed in-between them. However, in our construction of the 1D-Crosspoint Array, we restrict the placement of crosspoints to PEs that are physically adjacent. In addition, each PE is connected to exactly one other PE in [20], whereas, in our work, each PE is connected to up to two other PEs. The last distinction we need to draw between [20] and our work is that we only use $p,p^{2},\cdots,p^{\frac{n}{2}}$ in the construction of the 1D-Crosspoint Array, even though we will make use of $p^{\frac{n}{2}+1},p^{\frac{n}{2}+2},\cdots,p^{n-1}$ in some of our proofs. With these ideas in mind, we now describe some preliminary facts that are used for the construction.

Proposition 6.

The number of cycles in the permutation $p^{j}$ is equal to $\gcd(n,j)$, i.e., the greatest common divisor of $n$ and $j$.

Proof.

Suppose a cycle starts with $i$. Then the following elements of the cycle are obtained by repeatedly adding $j$ and then applying modulo $n$. The cycle ends when the result of the modulo function equals $i$ again, which is when $(i+mj)\bmod n=i$, where $m$ is the number of elements in the cycle. This implies that $mj$ is the least common multiple of $n$ and $j$, or $\textup{lcm}(n,j)$. Then, from the relation between $\textup{lcm}(n,j)$ and $\gcd(n,j)$, namely $mj=\textup{lcm}(n,j)=\frac{nj}{\gcd(n,j)}$, we get $\gcd(n,j)=\frac{n}{m}$, which is the number of cycles in $p^{j}$. ∎
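The cycle structure of the powers of $p$ is easy to check computationally. The short Python sketch below (ours, included only as an illustration) decomposes $p^{j}$ into disjoint cycles and verifies Proposition 6 for $n=12$:

from math import gcd

def cycles_of_power(n, j):
    """Decompose p^j, where p(i) = (i+1) mod n, into disjoint cycles."""
    seen, cycles = set(), []
    for start in range(n):
        if start in seen:
            continue
        cyc, cur = [], start
        while cur not in seen:
            seen.add(cur)
            cyc.append(cur)
            cur = (cur + j) % n
        cycles.append(cyc)
    return cycles

# Proposition 6: the number of cycles of p^j equals gcd(n, j).
n = 12
for j in range(1, n):
    assert len(cycles_of_power(n, j)) == gcd(n, j)

print(cycles_of_power(12, 3))   # [[0, 3, 6, 9], [1, 4, 7, 10], [2, 5, 8, 11]]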

Proposition 7.

The elements of a cycle in $p^{j}$ are a multiple of $\gcd(n,j)$ apart from each other.

Proof.

Let $k=\gcd(n,j)$ and let $u$ be an element in a cycle of the permutation $p^{j}$, $1\leq j\leq\frac{n}{2}$. Further suppose $v$ is another element that belongs to the same cycle to which $u$ belongs. By the property of the cyclic permutation group, we can express $v$ as $v=(u+xj)\bmod n$, $0\leq x\leq\frac{n}{k}-1$, which implies $u+xj=an+v$, where $a$ is a non-negative integer. Since $j$ and $n$ are multiples of $k$, letting $j=bk$ and $n=ck$, where $b$ and $c$ are positive integers, we have $u+x(bk)=a(ck)+v$, or $v=u+(bx-ac)k$. Now, since $v\neq u$, and $b,x,a$, and $c$ are all integers, $(bx-ac)$ is a non-zero integer. Hence, $v$ can only be a multiple of $k$ apart from $u$, such as $u\pm k,u\pm 2k$, etc. Therefore, every element of a cycle in $p^{j}$ is always a multiple of $\gcd(n,j)$ apart from every other element of that cycle. ∎

Proposition 8.

The smallest element of any cycle in any of $p^{j}$, $1\leq j\leq\frac{n}{2}$, is less than or equal to $\frac{n}{2}-1$.

Proof.

By Proposition 6, the permutation $p^{j}$, $1\leq j\leq\frac{n}{2}$, has $\gcd(n,j)$ cycles. Suppose $p^{j}$ has a cycle $g_{i}$, where $0\leq i\leq\gcd(n,j)-1$. Further suppose that $g_{i}$ contains an element $m$ that is less than or equal to $\gcd(n,j)-1$. Then, by Proposition 7, $m$ would be the smallest element in that cycle, since $m-\gcd(n,j)<0$. Furthermore, no other element in $g_{i}$ will be less than $m+\gcd(n,j)$. Since this holds for any $m$ that is less than $\gcd(n,j)$, all the elements less than $\gcd(n,j)$ must be distributed to distinct cycles, each being the smallest element in its cycle. Now, when $n$ is even and $j<n$, the maximum value of $\gcd(n,j)$ is $\frac{n}{2}$. Therefore, the smallest element in any cycle must be less than or equal to $\frac{n}{2}-1$. ∎

The construction algorithm for the 1D-Crosspoint Array is shown in Algorithm 1, and Figure 2 illustrates this algorithm for $n=12$.

Algorithm 1 Construction algorithm for even $n$.

Step 1. Suppose that the cycles in $p,p^{2},\cdots,p^{\frac{n}{2}}$ are written so that the first element is the smallest in every one of them (this is done for notational convenience only). Suppose that an empty frame of a 1D-Crosspoint Array with $n^{2}/2$ PEs and a crosspoint between each pair of adjacent PEs is provided, without the actual assignment of class ids to the PEs.
Step 2. Partition all the cycles in all of $p,p^{2},\cdots,p^{\frac{n}{2}}$ into $\frac{n}{2}$ sets $Q_{i}$, $0\leq i\leq\frac{n}{2}-1$, where $Q_{i}$ is the set of cycles in which the first element is $i$, $0\leq i\leq n/2-1$. (Proposition 9 establishes that such a partition into $n/2$ sets exists.) Thus, $Q_{0}$ consists of all cycles that begin with 0, $Q_{1}$ consists of all cycles that begin with 1, and so on, and $Q_{\frac{n}{2}-1}$ consists of all cycles that begin with $\frac{n}{2}-1$. Set $i=0$.
Step 3. Choose any cycle in $Q_{i}$ that consists of more than two elements. Then place a PE that belongs to class $i$ into the leftmost PE space available in the 1D-Crosspoint Array. Next, suppose the next element in that cycle is $\alpha_{i}$. Place a PE that belongs to class $\alpha_{i}$ on the right-hand side of the previously placed PE. Repeat placing PEs in the same manner from left to right into the 1D-Crosspoint Array until all the elements in that cycle are used. We will refer to this process as placing a cycle on the 1D-Crosspoint Array. (See Figure 2(d).)
Step 4. Choose a cycle in $Q_{i}$ that has not yet been chosen. Then place the newly chosen cycle into the 1D-Crosspoint Array in the same way as in Step 3. The first PE associated with the new cycle should be placed on the right-hand side of the last PE from Step 3 without skipping any PE spaces.
Step 5. Repeat Step 4 until all the cycles consisting of more than two elements have been processed. Then choose the cycle in $Q_{i}$ that consists of only two elements (every $Q_{i}$ has exactly one such cycle, as implied by Proposition 10 and the definition of $Q_{i}$), and place it into the 1D-Crosspoint Array as before.
Step 6. Set $i=i+1$ and repeat Steps 3, 4, and 5 until $i=\frac{n}{2}-1$. Again, PE spaces must not be skipped between the steps.

$p = (0,1,2,3,4,5,6,7,8,9,10,11)$
$p^{2} = (0,2,4,6,8,10)\;(1,3,5,7,9,11)$
$p^{3} = (0,3,6,9)\;(1,4,7,10)\;(2,5,8,11)$
$p^{4} = (0,4,8)\;(1,5,9)\;(2,6,10)\;(3,7,11)$
$p^{5} = (0,5,10,3,8,1,6,11,4,9,2,7)$
$p^{6} = (0,6)\;(1,7)\;(2,8)\;(3,9)\;(4,10)\;(5,11)$
$p^{7} = (0,7,2,9,4,11,6,1,8,3,10,5)$
$p^{8} = (0,8,4)\;(1,9,5)\;(2,10,6)\;(3,11,7)$
$p^{9} = (0,9,6,3)\;(1,10,7,4)\;(2,11,8,5)$
$p^{10} = (0,10,8,6,4,2)\;(1,11,9,7,5,3)$
$p^{11} = (0,11,10,9,8,7,6,5,4,3,2,1)$
(a) Step 1: The cyclic permutation group, with the first element of each cycle being the smallest.

$Q_{0} = \big\{(0,1,2,3,4,5,6,7,8,9,10,11),\ (0,2,4,6,8,10),\ (0,3,6,9),\ (0,4,8),\ (0,5,10,3,8,1,6,11,4,9,2,7),\ (0,6)\big\}$
$Q_{1} = \big\{(1,3,5,7,9,11),\ (1,4,7,10),\ (1,5,9),\ (1,7)\big\}$
$Q_{2} = \big\{(2,5,8,11),\ (2,6,10),\ (2,8)\big\}$
$Q_{3} = \big\{(3,7,11),\ (3,9)\big\}$
$Q_{4} = \big\{(4,10)\big\}$
$Q_{5} = \big\{(5,11)\big\}$
(b) Step 2: All of the cycles in all of $p^{j}$, where $1\leq j\leq 6$, have been partitioned into $Q_{i}$, $0\leq i\leq\frac{n}{2}-1$.

(c) Step 3: The 1D-Crosspoint Array after placing the first cycle of $Q_{0}$. It is assumed that $(0,1,2,3,4,5,6,7,8,9,10,11)$ was chosen.

(d) Step 4: The 1D-Crosspoint Array after placing the second cycle of $Q_{0}$. It is assumed that $(0,2,4,6,8,10)$ was chosen.

(e) Step 5: The 1D-Crosspoint Array after placing all the cycles of $Q_{0}$, with the cycle $(0,6)$ from $p^{\frac{n}{2}}$ placed last.

(f) Step 6: The 1D-Crosspoint Array after placing all the cycles of $Q_{1}$, with the cycle $(1,7)$ from $p^{\frac{n}{2}}$ placed last. Note that some of the PEs are abbreviated by ‘$\cdots$’ due to the lack of space.

(g) Step 6: The 1D-Crosspoint Array after placing all the cycles of every $Q_{i}$, where $0\leq i\leq\frac{n}{2}-1$. Note that some of the PEs are abbreviated by ‘$\cdots$’ once again due to the lack of space.

Figure 2: An illustration of the construction algorithm for an example case of $n=12$.

The following are the propositions mentioned in Step 2 and Step 5.

Proposition 9.

The cycles in all of $p,p^{2},\cdots,p^{\frac{n}{2}}$ can be partitioned into $n/2$ subsets $Q_{i}$, $0\leq i\leq\frac{n}{2}-1$, where $Q_{i}$ is defined as in Algorithm 1.

Proof.

By Proposition 8, the first element of any cycle in any of $p^{j}$, $1\leq j\leq\frac{n}{2}$, is less than or equal to $\frac{n}{2}-1$. Therefore, the cycles in all of $p^{j}$ can be partitioned into $\frac{n}{2}$ subsets $Q_{i}$ as defined in Algorithm 1. ∎

Proposition 10.

For an even $n$, the permutation $p^{\frac{n}{2}}$ is the only permutation among $p^{j}$, $1\leq j\leq\frac{n}{2}$, whose $\frac{n}{2}$ cycles are composed of two elements each.

Proof.

By Proposition 6, the number of cycles in $p^{j}$ is $\gcd(n,j)$. When $n$ is even and $j<n$, the maximum value of $\gcd(n,j)$ is $\frac{n}{2}$, which occurs only if $j=\frac{n}{2}$; in that case each of the $\frac{n}{2}$ cycles has $n/\gcd(n,j)=2$ elements. ∎

We next establish that Steps 3 through 6 construct a 1D-Crosspoint Array in which all $\binom{n}{2}$ pairs of classes of PEs are connected together by crosspoints. Suppose that $p^{j}=g_{1}g_{2}\cdots g_{k}$, where each $g_{i}$ represents a cycle, and further suppose that $g_{i}=(b_{i,0}\,b_{i,1}\cdots b_{i,r_{i}-1})$, where $r_{i}$ is the length of $g_{i}$. Then $p^{n-j}=g_{1}^{-1}g_{2}^{-1}\cdots g_{k}^{-1}$, and we write $g_{i}^{-1}$ as $g_{i}^{-1}=(b_{i,0}\,b_{i,r_{i}-1}\,b_{i,r_{i}-2}\cdots b_{i,1})$. Then $g_{i}$ represents the pairs $(b_{i,0},b_{i,1}),(b_{i,1},b_{i,2}),(b_{i,2},b_{i,3}),(b_{i,3},b_{i,4}),\cdots,(b_{i,r_{i}-3},b_{i,r_{i}-2}),(b_{i,r_{i}-2},b_{i,r_{i}-1})$ in the 1D-Crosspoint Array. If $r_{i}$ is an even number, we note that of these $r_{i}-1$ pairs, $(b_{i,0},b_{i,1}),(b_{i,2},b_{i,3}),\cdots,(b_{i,r_{i}-2},b_{i,r_{i}-1})$ will be pairs in $p^{j}$, and $(b_{i,1},b_{i,2}),(b_{i,3},b_{i,4}),\cdots,(b_{i,r_{i}-3},b_{i,r_{i}-2})$ will be pairs in $p^{n-j}$ in the one-sided binary tree-crossbar switch that is described in [20]. Thus, there is a one-to-one correspondence between the pairs that the cyclic permutation group forms in the 1D-Crosspoint Array and in the one-sided binary tree-crossbar switch. However, the pair $(b_{i,r_{i}-1},b_{i,0})$ that is formed in $p^{n-j}$ in the one-sided binary tree-crossbar switch is left out in this one-to-one correspondence. This pair is accounted for by Step 4 in Algorithm 1, which inserts a crosspoint between the last PE placed by the previously chosen cycle and the very first PE placed by the newly chosen cycle, since every cycle in the same $Q_{i}$, $0\leq i\leq\frac{n}{2}-1$, starts with the same $b_{i,0}$ by Steps 1 and 2. Step 4 excludes choosing a cycle that consists of only two elements, and saves it for Step 5. This is because for cycles consisting of only two elements, no pair will be omitted in Steps 3 and 4, since $(b_{i,r_{i}-1},b_{i,0})=(b_{i,1},b_{i,0})$, which eliminates the need for placing the PE belonging to the same class as the starting PE in the same $Q_{i}$.
Now, it was proven in [20] that every one of the $\binom{n}{2}$ pairs is formed by $p,p^{2},\cdots,p^{n-1}$ in the one-sided binary tree-crossbar switch. Therefore, following Steps 3 through 6, and keeping in mind the one-to-one correspondence that has been described, all $\binom{n}{2}$ pairs that are extracted from the first $\frac{n}{2}$ permutations, $p,p^{2},\cdots,p^{\frac{n}{2}}$, are placed in the 1D-Crosspoint Array.

In the case that $r_{i}$ is odd, $g_{i}$ represents the pairs $(b_{i,0},b_{i,1}),(b_{i,1},b_{i,2}),\cdots,(b_{i,r_{i}-3},b_{i,r_{i}-2}),(b_{i,r_{i}-2},b_{i,r_{i}-1})$ in the 1D-Crosspoint Array. Of these $r_{i}-1$ pairs, $(b_{i,0},b_{i,1}),(b_{i,2},b_{i,3}),\cdots,(b_{i,r_{i}-3},b_{i,r_{i}-2})$ will be pairs in $p^{j}$, and $(b_{i,1},b_{i,2}),(b_{i,3},b_{i,4}),\cdots,(b_{i,r_{i}-2},b_{i,r_{i}-1})$ will be pairs in $p^{n-j}$ in the one-sided binary tree-crossbar switch. The omitted pair is the same $(b_{i,r_{i}-1},b_{i,0})$ pair, and therefore Steps 3 through 6 ensure that the cycles in all of $p,p^{2},\cdots,p^{\frac{n}{2}}$ place all $\binom{n}{2}$ pairs in a 1D-Crosspoint Array with crosspoints in-between them for odd $r_{i}$ as well. Note that the number of elements in $p,p^{2},\cdots,p^{\frac{n}{2}}$ is $n\times\frac{n}{2}=\frac{n^{2}}{2}$, which matches the lower bound in Theorem 5. Therefore, the 1D-Crosspoint Array described in the algorithm is optimal with respect to the number of PEs used. However, it is important to note that the lower bound in Theorem 5 for even $n$ is greater than the minimum number of PEs in Proposition 1. More specifically, it is greater by $\frac{n}{2}-1$, which indicates the existence of $\frac{n}{2}-1$ redundant adjacencies on the 1D-Crosspoint Array. In fact, the pairs made of the last PE placed in Step 5 and the first PE placed in the next iteration of Step 3, i.e., the pairs made by elements from two different $Q_{i}$'s, are already created elsewhere in the 1D-Crosspoint Array. By Proposition 9, there are $\frac{n}{2}$ of the $Q_{i}$'s, and therefore there will be $\frac{n}{2}-1$ pairs between those $Q_{i}$'s, which confirms the $\frac{n}{2}-1$ redundant adjacencies. We state this formally in the following remark, because it will play a key role in the odd $n$ 1D-Crosspoint Array construction and in the sorting algorithm that is presented in the next section.

Remark 1a.

For all even $n$, there are $\frac{n}{2}-1$ redundant adjacencies on the 1D-Crosspoint Array. ∎
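Algorithm 1 and Remark 1a can also be checked with a small Python sketch (a simulation aid only, not part of the architecture). It builds the left-to-right sequence of class ids that Algorithm 1 assigns to the PEs for an even $n$, and verifies that all $\binom{n}{2}$ class pairs are adjacent, that $n^{2}/2$ PEs are used, and that exactly $\frac{n}{2}-1$ adjacencies are redundant:

from itertools import combinations

def build_1d_crosspoint_array(n):
    """Return the left-to-right sequence of class ids produced by
    Algorithm 1 for an even number n of classes."""
    assert n % 2 == 0
    # Step 1: cycles of p^j, 1 <= j <= n/2, each written with its
    # smallest element first.
    def cycles(j):
        seen, out = set(), []
        for s in range(n):
            if s in seen:
                continue
            cyc, cur = [], s
            while cur not in seen:
                seen.add(cur)
                cyc.append(cur)
                cur = (cur + j) % n
            out.append(cyc)          # cyc already starts with its minimum
        return out
    all_cycles = [c for j in range(1, n // 2 + 1) for c in cycles(j)]
    # Step 2: partition the cycles into Q_0, ..., Q_{n/2-1} by first element.
    Q = {i: [c for c in all_cycles if c[0] == i] for i in range(n // 2)}
    # Steps 3-6: place the cycles of each Q_i, cycles of length > 2 first
    # and the unique 2-element cycle of Q_i last.
    layout = []
    for i in range(n // 2):
        for c in [c for c in Q[i] if len(c) > 2] + [c for c in Q[i] if len(c) == 2]:
            layout.extend(c)
    return layout

n = 12
layout = build_1d_crosspoint_array(n)
assert len(layout) == n * n // 2                          # Theorem 5
adjacent = {frozenset(p) for p in zip(layout, layout[1:])}
assert all(frozenset(c) in adjacent for c in combinations(range(n), 2))
assert len(layout) - 1 - n * (n - 1) // 2 == n // 2 - 1   # Remark 1a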

III-B Odd $n$

For an odd $n$, we first construct a 1D-Crosspoint Array for $n-1$ classes using Algorithm 1, except that in Step 1 we start with an empty frame with $\frac{n(n-1)}{2}+1$ PEs, and in Step 6 we skip a PE space between the placements of consecutive $Q_{i}$'s, as shown in Figure 3(c). Once the 1D-Crosspoint Array for $n-1$ classes is constructed, we place the PEs of the $n$-th class, or class $n-1$, in the skipped PE spaces after Step 6, as shown in Figure 3(d). There will always be two empty PE spaces at the end, and we place a PE of class $n-1$ in the first empty PE space and a PE of class 0 in the second one, as shown in Figure 3(e).
It still remains to establish that all $\binom{n}{2}$ adjacencies of the $n$ classes are included in the 1D-Crosspoint Array. By constructing a 1D-Crosspoint Array for $n-1$ classes, we capture every adjacency between the $n-1$ classes, and we only need to capture the adjacencies between class $n-1$ and the other classes. Previously, in Remark 1a, we pointed out the existence of redundant pairs in the 1D-Crosspoint Array, which were made by elements from two different $Q_{i}$'s. By inserting class $n-1$ in between the classes that form such pairs, we capture the adjacencies between class $n-1$ and the other classes without destroying any of the existing pairs. In fact, this procedure ensures that the 1D-Crosspoint Array has the adjacencies between class $n-1$ and classes $1,2,\cdots,n-3$. This is because the PE on the righthand side of a skipped PE space will be the first PE of the cycles in $Q_{i}$, $1\leq i\leq\frac{n-1}{2}-1$, i.e., a PE of class $1,2,\cdots,\frac{n-1}{2}-1$, and the PE on the lefthand side of a skipped PE space will be the second PE of a cycle in $p^{\frac{n-1}{2}}$, except for the one cycle belonging to $Q_{\frac{n-1}{2}-1}$, i.e., a PE of class $\frac{n-1}{2},\frac{n-1}{2}+1,\cdots,n-3$. Therefore, by placing the PEs of class $n-1$ in the skipped PE spaces, we capture the adjacencies between class $n-1$ and classes $1,2,\cdots,n-3$. The only adjacencies that are not yet captured on the 1D-Crosspoint Array are those of the pairs $(n-1,0)$ and $(n-1,n-2)$. These two adjacencies can be captured by placing a PE of class $n-1$ and a PE of class 0 at the end of the 1D-Crosspoint Array. An example for the case $n=7$ is illustrated in Figure 3.

$p = (0,1,2,3,4,5)$
$p^{2} = (0,2,4)\;(1,3,5)$
$p^{3} = (0,3)\;(1,4)\;(2,5)$
(a) The cyclic permutation group for the $n-1=6$ case.

$Q_{0} = \big\{(0,1,2,3,4,5),\ (0,2,4),\ (0,3)\big\}$
$Q_{1} = \big\{(1,3,5),\ (1,4)\big\}$
$Q_{2} = \big\{(2,5)\big\}$
(b) All the cycles have been partitioned into $Q_{i}$, $0\leq i\leq\frac{n-1}{2}-1$.

(c) 1D-Crosspoint Array for $n-1=6$. Note that a PE space was skipped between consecutive $Q_{i}$'s.

(d) The PEs of class $n-1$ are placed in the skipped PE spaces.

(e) The PEs of classes $n-1$ and 0 are placed at the end.

Figure 3: Construction of the 1D-Crosspoint Array for the $n=7$ case.

Note that the number of PEs we have placed after constructing the 1D-Crosspoint Array for $n-1$ classes is $\frac{n-1}{2}-1+2=\frac{n+1}{2}$. Therefore, the total number of PEs is $\frac{n+1}{2}+\frac{(n-1)^{2}}{2}=\frac{n(n-1)}{2}+1$, which matches the minimum number of PEs needed for odd $n$ from Theorem 5 and Proposition 1. Therefore, we can construct a 1D-Crosspoint Array for odd $n$ in which every pair of $n$ classes is adjacent only once. We state this formally in the following remark, as it will play a key role in the memory requirements of the sorting algorithm that is presented in the next section.

Remark 1b.

For all odd $n$, each class of PEs is adjacent to every other class of PEs exactly once.∎
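As a concrete instance of this count, the $n=7$ construction of Figure 3 uses

\[
\frac{(n-1)^{2}}{2}+\frac{n+1}{2}\;=\;\frac{6^{2}}{2}+\frac{8}{2}\;=\;18+4\;=\;22\;=\;\frac{7\cdot 6}{2}+1\ \text{ PEs.}
\]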

IV Parallel Enumeration Sort

In this section, we show how the proposed 1D-Crosspoint Array can be used to carry out a parallel enumeration sort. Sorting is a fundamental problem in data processing, and a fast sorting algorithm is therefore important, especially for large sets of data. To this end, many parallel algorithms have been introduced with a myriad of processor topologies. Here, we will show how our proposed topology sorts $n$ elements in $O(\lg n\lg\lg n)$ time, using $O(n^{2})$ PEs. Furthermore, we will describe a way to reduce the time complexity to $O(1)$ using threshold logic gates and an encoder. Among numerous algorithms and topologies, no $O(1)$ time sorting algorithm has been reported except those in [21, 22, 23], where strong assumptions about path delays in a reconfigurable mesh were made. In particular, it is assumed that a signal takes $O(1)$ time to travel through any path on such a topology, regardless of the distance between PEs. This assumption is based on the YUPPIE system described in [24]. However, in [24], Maresca et al. explain that the YUPPIE system needs a special clocking scheme to overcome the propagation delay, which is proportional to the distance between PEs. Another strong assumption would be that the overhead of inter-PE communications as compared to the amount of computation time used by PEs is negligible, but this defeats the purpose of having a mesh topology to interconnect PEs together. Unlike these constant time algorithms, our proposed algorithm on the 1D-Crosspoint Array does not need to make such strong assumptions, as all communications take place between physically adjacent neighboring PEs.
The execution of the algorithm is illustrated in Figure 4 for $n=4$ with an array $A=[6,7,8,5]$ of input values, and in Figure 5 for $n=5$ with an array $A=[8,6,9,5,7]$ of input values. The proposed algorithm is a variant of enumeration sort. In general, enumeration sort has two tasks. First, it compares each element with every other element. It should be stated that we assume that the comparison of any two elements in $A$ takes $O(1)$ time; otherwise, this should be factored into the overall time complexity of our sorting algorithm. Second, when all comparisons are completed, it counts, for each element, the number of elements that are less than that element, which then gives its rank. The proposed algorithm follows this format, as we now explain.
The nested parFor loop in line 8 loads the $i$-th element of the input array $A$ to PE $C_{i,0}$ of every class $i$, $0\leq i\leq n-1$. It also initializes all elements of the array $T$ to 0. Each PE takes two steps to complete these two tasks. The latter task is completed in a single step, as each PE $C_{i,0}$ has access to $T[i][\,]$, $0\leq i\leq n-1$, without any contention by other PEs. Thus, PE $C_{i,0}$ can issue a master clear to all the $n$ flip-flops in row $i$, $0\leq i\leq n-1$. Alternatively, we can assign two flip-flops to each PE in class $i$ to clear a pair of flip-flops in two steps. In both cases, it takes $O(1)$ time to clear all the flip-flops in all of $T[i]$, $0\leq i\leq n-1$. Loading the $i$-th element of $A$ to all PEs $C_{i,0}$, $0\leq i\leq n-1$, in $O(1)$ steps is a little trickier. We assume that there is a backbone bus structure that facilitates this concurrent read operation by all the $n$ PEs. This can be viewed as a parallel load operation, much like setting or clearing the bits of flip-flops in a register. The $i$-th element that is read from a memory where the $n$ elements of $A$ are stored is placed on a data bus, which then becomes available to all PEs $C_{i,0}$, $0\leq i\leq n-1$. Thus, loading $A$ into $C_{i,0}$, $0\leq i\leq n-1$, can be carried out in $O(1)$ time. Therefore, the execution of this parFor loop takes $O(1)$ time to complete overall.

1: // $A$ is a vector which stores a set of $n$ elements
2: // $C_{i,j}$ represents the $j$-th PE (or replicate) of class $i$,
3: // where $0\leq i\leq n-1$, $0\leq j\leq\frac{n}{2}-1$.
4: // $C_{i_{r},j_{r}}$, $C_{i_{l},j_{l}}$ denote the right and left neighbors of $C_{i,j}$.
5: // $T$ is an $n\times n$ matrix.
6: // $R$ is a vector which stores the ranks of the $n$ elements.
7: // Load $A$ and initialize $T_{i}$, $0\leq i\leq n-1$, in parallel
8: parFor ($0\leq i\leq n-1$)
9:    Load $A[i]$ to $C_{i,0}$
10:   parFor $0\leq k\leq n-1$
11:      $T[i][k]\leftarrow 0$
12:   end parFor
13: end parFor
14: // Compare elements in parallel
15: parFor ($0\leq i\leq n-1$, $0\leq j\leq\frac{n}{2}-1$)
16:   if ($i_{l}<i$) then
17:      Send $A[i]$ to left neighbor
18:      if (Signal from left $==1$) then
19:         $T[i][i_{l}]\leftarrow 1$
20:      end if
21:   else
22:      Receive $A[i_{l}]$ from left neighbor
23:      if $(A[i_{l}]<A[i])~||~\big((A[i_{l}]=A[i])~\&\&~(i_{l}<i)\big)$ then
24:         $T[i][i_{l}]\leftarrow 1$
25:         Send 0 to left neighbor
26:      else if $(A[i_{l}]>A[i])~||~\big((A[i_{l}]=A[i])~\&\&~(i_{l}\geq i)\big)$ then
27:         Send 1 to left neighbor
28:      end if
29:   end if
30:   if ($i_{r}<i$) then
31:      Send $A[i]$ to right neighbor
32:      if (Signal from right $==1$) then
33:         $T[i][i_{r}]\leftarrow 1$
34:      end if
35:   else
36:      Receive $A[i_{r}]$ from right neighbor
37:      if $(A[i_{r}]<A[i])~||~\big((A[i_{r}]=A[i])~\&\&~(i_{r}<i)\big)$ then
38:         $T[i][i_{r}]\leftarrow 1$
39:         Send 0 to right neighbor
40:      else if $(A[i_{r}]>A[i])~||~\big((A[i_{r}]=A[i])~\&\&~(i_{r}\geq i)\big)$ then
41:         Send 1 to right neighbor
42:      end if
43:   end if
44: end parFor
45: // Store the rank
46: parFor $0\leq i\leq n-1$
47:    $R[i]=\sum_{k=0}^{n-1}T[i][k]$
48: end parFor
Algorithm 2: Sorting using a 1D-Crosspoint Array
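The comparison and counting work that Algorithm 2 distributes over the crosspoints can be mimicked sequentially in software. The Python sketch below is a functional model only; the parallel loading, neighbor signaling, and threshold counting of the hardware are abstracted away, but it reproduces the $T$ matrix and the ranks shown in Figures 4 and 5:

def enumeration_sort(A):
    """Sequential model of Algorithm 2: each pair of elements is compared
    once (as the crosspoints do in parallel), T[i][k] = 1 records that A[k]
    ranks below A[i], and the row sums give the ranks R."""
    n = len(A)
    T = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            if k == i:
                continue
            # ties are broken by index, so the ranks form a permutation
            if A[k] < A[i] or (A[k] == A[i] and k < i):
                T[i][k] = 1
    R = [sum(row) for row in T]
    out = [None] * n
    for i in range(n):
        out[R[i]] = A[i]
    return T, R, out

T, R, sorted_A = enumeration_sort([6, 7, 8, 5])
print(R)          # [1, 2, 3, 0], as in Figure 4
print(sorted_A)   # [5, 6, 7, 8]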
[Figure 4 is a step-by-step table; for each PE $C_{i,j}$ and its left and right neighbors $C_{i_{l},j_{l}}$ and $C_{i_{r},j_{r}}$, it lists the actions taken in the parFor statements in lines 16 and 30, the comparisons in lines 23 and 37, the contents of the array $T$ after line 44, and the ranks computed by the parFor loop in line 46. The comparisons produce $T=\begin{bmatrix}0&0&0&1\\ 1&0&0&1\\ 1&1&0&1\\ 0&0&0&0\end{bmatrix}$ and the ranks $R[0]=1$, $R[1]=2$, $R[2]=3$, $R[3]=0$. The entry $T[2][1]$ is written twice, once by $C_{2,0}$ and once by $C_{2,1}$, illustrating the redundant adjacency discussed in the text.]
Figure 4: Illustration of sorting $n=4$ elements using the 1D-Crosspoint Array, with input data $A=\{6,7,8,5\}$, showing what is done in each parFor loop.
[Figure 5 is the analogous step-by-step table for $n=5$, with the same columns as Figure 4. The comparisons produce $T=\begin{bmatrix}0&1&0&1&1\\ 0&0&0&1&0\\ 1&1&0&1&1\\ 0&0&0&0&0\\ 0&1&0&1&0\end{bmatrix}$ and the ranks $R[0]=3$, $R[1]=1$, $R[2]=4$, $R[3]=0$, $R[4]=2$; here all writes go to distinct locations of $T$.]
Figure 5: Illustration of sorting $n=5$ elements using the 1D-Crosspoint Array, with input data $A=\{8,6,9,5,7\}$, showing what is done in each parFor loop.

The second parFor loop, starting at line 15, performs a parallel enumeration sort, comparing every element with every other element. As the 1D-Crosspoint Array provides a crosspoint between any two PE classes, this task is carried out in a distributed manner. Each pair of neighboring PEs essentially compares the input elements that they were loaded with in the previous parFor loop, as follows. Between any two neighboring PEs, the PE with the greater class number sends its element to the PE with the smaller class number (see the third column in Figures 4 and 5). Then the PE with the smaller class number carries out the comparison between the received element and the element that it was assigned. This is done by the if-else statements beginning in lines 16 and 30. The two if-else statements are identical except that one of them is for communicating with the righthand side PE, and the other is for communicating with the lefthand side PE. During each of these comparisons, if the PE that carried out the comparison has the greater element, it writes a 1 to $T[i][i_{l}]$ (or $T[i][i_{r}]$), then sends a 0 to the neighboring PE with the smaller element to indicate that the comparison is done. When the PE that carries out the comparison has the smaller element, it sends a 1 to the neighboring PE with the greater element, notifying it that a 1 has to be written to $T[i_{l}][i]$ (or $T[i_{r}][i]$). At the end of this step, the array $T$ is set in such a way that $T[i][i_{l}]=1$ implies that $A[i_{l}]<A[i]$ (see the fourth column in Figures 4 and 5). It may appear to the reader that this step requires a memory that supports a concurrent write in order to run in $O(1)$ time.

Figure 6: Computing the index of the minimum and maximum elements in an array of n elements.

However, we note that the parFor loop in line 15 does not compare the same two elements more than once. By Remark 1b, when n is odd, the 1D-Crosspoint Array has only one adjacency between any two classes of PEs. Therefore, the locations to which the PEs write the comparison results are all distinct, as seen in Figure 5, and a concurrent-write memory is avoided. On the other hand, when n is even, by Remark 1a, there are n/2 - 1 redundant adjacencies in the 1D-Crosspoint Array. For example, it is seen in Figure 4 that the comparison between PE classes 1 and 2 is done twice, and therefore the PEs C_{2,0} and C_{2,1} within the same class, i.e., class 2, attempt to write to location T[2][1] at the same time (see the underlined entries of T in the fifth column). To avoid multiple writes when n is even, another PE can be added so as to switch to a 1D-Crosspoint Array with an odd number of PE classes. Thus, assuming an odd number of PE classes, and given that each PE in any given class executes at most two if-else statements, this parFor loop also takes only O(1) time to complete.

Finally, the last parFor loop in line 46 computes the rank of each element, which corresponds to the second task of enumeration sort. This step is executed by each class of PEs separately, where up to n/2 PEs in each class carry out the additions, so accessing the entries in T[i][] cannot cause any contention across the PE classes. We further note that each 1 in T[i][] indicates an element less than A[i], and hence counting the number of 1's, or summing the entries, in T[i][] gives the rank of A[i]. In a way, the vectors stored in T[i][], 0 ≤ i ≤ n-1, may be viewed as encoded representations of the ranks of the elements in A, and no further operations may be needed. This makes the time complexity of sorting a set of n elements on an O(n^2)-PE 1D-Crosspoint Array O(1), provided the ranks of the elements in A are not needed in decimal or some other preferred number representation. A sequential sketch of these comparison and counting steps is given below.
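To make these steps concrete, the following sequential Python sketch (our simulation, not the parallel implementation itself; the function names are ours) builds the comparison matrix T for an input list A and recovers the ranks by counting the 1's in each row, mirroring the roles of the second and last parFor loops.

def build_comparison_matrix(A):
    # Simulate the second parFor loop: T[i][j] = 1 iff A[j] < A[i].
    # On the 1D-Crosspoint Array each pair (i, j) is handled by the crosspoint
    # joining PE classes i and j; visiting each unordered pair once mirrors the
    # odd-n case, so no entry of T is written twice.
    n = len(A)
    T = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if A[j] < A[i]:
                T[i][j] = 1      # the PE holding A[i] records the result
            elif A[i] < A[j]:
                T[j][i] = 1      # otherwise it notifies the neighbor to record it
    return T

def ranks_from_matrix(T):
    # Simulate the last parFor loop: the rank of A[i] is the number of 1's in row i.
    return [sum(row) for row in T]

A = [8, 6, 9, 5, 7]              # the input of Figure 5
T = build_comparison_matrix(A)
print(ranks_from_matrix(T))      # prints [3, 1, 4, 0, 2]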
Besides sorting, a significant utility of this encoded form of sorting is the ease with which the minimum or maximum of the elements in A can be determined. Both tasks can be completed in O(1) time using simple logic gates and an encoder consisting of OR gates with a fan-in of n. To see why, suppose that the index of the minimum element of A is i_min. Then T[i_min][] must be [0,0,...,0,0]. Thus, if we apply a NOR function to each row as shown in Figure 6(a), only the row corresponding to the minimum element outputs 1. Therefore, we can compute i_min by cascading n n-input NOR gates with an n-input, lg n-output encoder in O(1) time. The time complexity increases to O(lg n) if the fan-in of the OR gates within the encoder, or of the NOR gates applied to the rows of T, is constant. Similarly, suppose that the index of the maximum element of A is i_max. Then T[i_max][] must be all 1's except T[i_max][i_max]. Therefore, if we apply an AND function to each row after complementing the diagonal elements of T, only the row corresponding to the maximum element outputs 1. Again, an encoder may be used to obtain the index of the maximum element in A in O(1) time with unlimited fan-in gates, and in O(lg n) time with constant fan-in gates, as shown in Figure 6(b). In both cases, we only use O(lg n) logic gates with fan-in n for the encoder and O(n) NOR or AND gates for the first stage in Figure 6.

This querying procedure can be extended to other queries as well. For example, we can determine whether any particular element in A has a rank of j or higher, 1 ≤ j ≤ n. The case j=1 is trivial, as it only requires an OR gate with n inputs. If j=2, we can add the bits in the corresponding row of T in pairs to determine, with high probability and in O(1) time, whether that element has a rank of 2 or higher. To see why, note that exactly three of the four binary patterns 00, 01, 10, 11 sum to zero or one, and hence the number of n-bit patterns in which all n/2 pairs sum to less than two is 3^{n/2}. Dividing this by the total number of n-bit patterns, we find that the probability that this pairwise test gives an incorrect answer tends to 0, as the following computation shows

\frac{3^{n/2}}{2^{n}}=\frac{1}{2^{n(1-0.5\lg 3)}}=\frac{1}{2^{0.2075n}}\rightarrow 0\text{ as }n\rightarrow\infty.

Thus n/2 two-input AND gates, together with an n/2-input OR gate, suffice to determine with high probability whether the rank of a given element is at least 2. In general, when the bits are added in groups of k, k ≥ 2, the probability that the test incorrectly assesses whether the rank of an element is at least j, 1 ≤ j ≤ k, is bounded by

\frac{\big\{\sum_{i=0}^{j-1}{k\choose i}\big\}^{n/k}}{2^{n}}\rightarrow 0\text{ as }n\rightarrow\infty.
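As a quick numerical check of these estimates (our exhaustive enumeration over all n-bit rows for small n, not part of the original analysis), the script below counts the rows in which no group of k bits contains j or more 1's and compares the fraction with the closed form above.

from itertools import product
from math import comb

def miss_fraction(n, k, j):
    # Fraction of n-bit rows in which no block of k bits has >= j ones,
    # i.e., the chance the grouped test fails to certify rank >= j.
    misses = 0
    for bits in product((0, 1), repeat=n):
        blocks = [bits[t:t + k] for t in range(0, n, k)]
        if all(sum(b) < j for b in blocks):
            misses += 1
    return misses / 2 ** n

def closed_form(n, k, j):
    return sum(comb(k, i) for i in range(j)) ** (n // k) / 2 ** n

for n in (4, 8, 12):
    print(n, miss_fraction(n, 2, 2), closed_form(n, 2, 2))
# prints 0.5625, 0.3164..., 0.1779..., matching 3^{n/2}/2^n in each case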

We note that it is always possible to determine the rank of any particular element in A exactly by summing the entries in the corresponding row of T with a binary tree adder in O(lg lg n) time. Querying the i-th largest element is only a little more involved. Suppose that the index of the i-th largest element is i_x. Then the sum of the 1's in T[i_x][] must equal i; moreover, the sum of the 1's in no other row equals i. Therefore, if we subtract i from the sum of the entries in every row of T, the row that produces a zero identifies the i-th largest element. Once identified, the index of this row can be captured as before using an n-input, lg n-output encoder in O(1) or O(lg n) time as in the prior cases; all we need to do is flip the bits generated by the subtraction operations before feeding them to the encoder. The only additional cost is that of a lg n-bit subtraction. This can be carried out by the first n PEs, C_{i,0}, 0 ≤ i ≤ n-1, in O(1) time using O(n lg^2 n) logic gates with O(n) fan-in, or in O(lg lg n) gate-level time using a lg n-bit prefix adder [25] with O(n lg n) constant fan-in logic gates.

Another possible query is to search for a specific number in a given array. This can be accomplished with an encoder and a slightly modified Algorithm 2. Since each element only needs to be compared with the number being searched for, rather than with n-1 other elements, a single replicate from each class suffices to check, in the if statements in lines 23 and 37, whether its element equals the searched number. The else blocks in lines 26 and 40 are not needed, and the array T reduces to a vector of size n × 1. Each entry of T is then fed into the n-input, lg n-output encoder, which outputs the index of the searched number. Because this search procedure is a cut-down version of Algorithm 2 and requires only an n-input, lg n-output encoder, searching among n elements takes O(1) time on the 1D-Crosspoint Array. A software analogue of these queries is sketched below.
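The sketch below (our illustrative Python, reusing build_comparison_matrix from the earlier sketch; it stands in for the gate-level NOR/AND stages and the n-input, lg n-output encoder of Figure 6, not for the circuits themselves) mimics the minimum, maximum, and search queries in software.

def index_of_min(T):
    # A row of all 0's means no element is smaller: the NOR stage of Figure 6(a).
    for i, row in enumerate(T):
        if not any(row):
            return i                 # the encoder would output this index

def index_of_max(T):
    # All 1's off the diagonal means every other element is smaller: Figure 6(b).
    n = len(T)
    for i, row in enumerate(T):
        if all(row[j] == 1 for j in range(n) if j != i):
            return i

def search(A, key):
    # Search query: compare the key against one replicate per class; the
    # resulting n x 1 vector of hits is what would be fed to the encoder.
    hits = [1 if a == key else 0 for a in A]
    return hits.index(1) if 1 in hits else None

A = [8, 6, 9, 5, 7]
T = build_comparison_matrix(A)
print(index_of_min(T), index_of_max(T), search(A, 9))   # prints 3 2 2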

Figure 7: Computing the ranks from the T matrices.

Finally, if the ranks are desired in decimal or some other numerical notation, the binary entries in T[i][], 0 ≤ i ≤ n-1, can be summed in O(lg n lg lg n) time using a binary tree of n Brent-Kung adders [25], each with two lg n-bit operands and consisting of O(lg n) constant fan-in logic gates, to compute the ranks of all elements on n PEs in parallel, where C_{i,0}, 0 ≤ i ≤ n-1, computes the rank of A[i]. This increases the overall gate-level time complexity of sorting a set of n elements from O(1) to O(lg n lg lg n), using O(n lg n) additional constant fan-in logic gates per PE. We note that it is impossible to reduce the time complexity to O(1) using a polynomial number of AND, OR, and NOT gates, even with unbounded fan-in [26, 27, 28, 29]. Instead, we use threshold logic as shown in Figure 7. The diagram in Figure 7(a) depicts the overall layout of our architecture. The i-th 1's counter sets its e_m output to 1 when its input string T[i][] has exactly m 1's, 0 ≤ i ≤ n-1. The n-input, lg n-output encoder then generates the rank for T[i][], 0 ≤ i ≤ n-1. The 1's counters (part (c)) are constructed from Δ(m) circuits, each of which is built from a pair of threshold gates; each such gate is assumed to have O(1) delay, which is the standard assumption in threshold logic circuits, even though practical physical constraints may make this assumption overly optimistic [28]. The bipartite graph G(n, n^2) replicates each of its n inputs and connects each copy to one of the n inputs of each Δ(m) circuit, as shown in the figure. The upper threshold gate in a Δ(m) circuit (part (b)), together with an inverter, produces a 1 when fewer than m+1 of its binary-valued inputs are equal to 1, whereas the lower threshold gate produces a 1 when m or more of its binary-valued inputs are equal to 1. Combining the outputs of the two threshold gates with an AND gate thus detects whether the number of 1's in the input string T[i][] is equal to m. Given that the path from an input to the output of a Δ(m) circuit traverses a threshold gate, an inverter, and an AND gate, its output is computed in O(1) time. Therefore, each output of a 1's counter is also computed in O(1) time.
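A small Python model of this counter (ours; each threshold gate is idealized as a unit-delay comparison, in keeping with the assumption above) shows how a Δ(m) block and a 1's counter recover the exact number of 1's in a row of T.

def threshold_gate(inputs, t):
    # Idealized threshold gate: fires iff at least t of its inputs are 1.
    return 1 if sum(inputs) >= t else 0

def delta_m(inputs, m):
    # Delta(m) block of Figure 7(b): the upper gate (threshold m+1) inverted,
    # ANDed with the lower gate (threshold m), detects exactly m ones.
    upper = 1 - threshold_gate(inputs, m + 1)   # fewer than m+1 ones
    lower = threshold_gate(inputs, m)           # at least m ones
    return upper & lower

def ones_counter(row):
    # 1's counter of Figure 7(c): one Delta(m) block per possible count m;
    # exactly one output e_m is 1, and the encoder stage reads off m.
    outputs = [delta_m(row, m) for m in range(len(row) + 1)]
    return outputs.index(1)

row = [0, 1, 0, 1, 1]        # a row of T containing three 1's
print(ones_counter(row))     # prints 3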

Architecture | Hardware Complexity | Time Complexity | Comments
Valiant's generic model | k PEs, where 4 ≤ 2n ≤ k ≤ n(n-1)/2 | (lg n - lg(k/n)) (lg lg n - lg lg(k/n)) | Capable of comparing arbitrary disjoint pairs with no fan-in or fan-out restriction [12].
Preparata's algorithm | O(n lg n) | O(lg n) | A variant of enumeration sort that uses Valiant's sorting algorithm in the process [30].
Bitonic sorting network | O(n lg^2 n) | O(lg^2 n) | Practical, but its time complexity grows faster than O(lg n) [31].
AKS network | O(n lg n) | O(lg n) | Very large constant in the time complexity [32].
Reconfigurable mesh | O(n^2) | O(1) | n×n mesh with switches at the crosspoints; requires a strong assumption on path delays [21, 22].
1D-Crosspoint Array | O(n^2) | O(lg n lg lg n), or O(1) with threshold gates | n distinct PEs, and O(n/2) replicates.

TABLE I: Comparison of parallel sorting architectures.

Given that the ranks are computed by a cascade of a 1's counter and an n-input, lg n-output encoder, the ranks are computed in O(1) time as well. Hence, sorting of n numbers is completed in O(1) time using O(n^2) threshold gates. It should be pointed out that sorting is known to be in the class TC^0 [33], the class of problems solvable by Boolean circuits of O(1) depth built from AND, OR, and threshold gates whose size grows as a polynomial function of the problem size n. Our rank computation architecture provides an actual circuit, with threshold gates and other combinational logic, that completes our O(1)-time sorting algorithm with O(n^2) gates, matching both the depth and circuit complexity bounds. An end-to-end software sketch of this rank-based sorting is given below.
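Putting the pieces together, the final placement step is immediate once the ranks are available. A minimal sketch (ours, reusing build_comparison_matrix and ones_counter from the earlier sketches):

def sort_by_ranks(A):
    T = build_comparison_matrix(A)          # pairwise comparison step
    R = [ones_counter(row) for row in T]    # rank computation (1's counters)
    out = [None] * len(A)
    for i, r in enumerate(R):
        out[r] = A[i]                       # PE C_{i,0} places A[i] in slot r
    return out

print(sort_by_ranks([8, 6, 9, 5, 7]))       # prints [5, 6, 7, 8, 9]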

V Discussion

In this section, we compare our work with earlier results on comparison and sorting problems. Table I lists these results with their hardware and time complexities for sorting [12, 30, 32, 31, 21]. A key contribution is due to Valiant, who devised a general solution and its time complexity analysis for the maximum-finding and sorting problems using a generic model of computation [12]. The model assumes k PEs that can compare arbitrary disjoint pairs at any given time. In addition, the PEs can share comparison results without any fan-in or fan-out restriction, or in other words, can send or receive comparison results without any time overhead. Valiant showed that, for such a model, when 4 ≤ 2n ≤ k ≤ n(n-1)/2, where n is the problem size, the time complexity of finding a maximum is lg lg n - lg lg(k/n) (see [12, p. 350]). Even though the proposed 1D-Crosspoint Array requires an extra PE, the actual comparisons are done on at most n(n-1)/2 PEs, and hence the 1D-Crosspoint Array can be viewed as an extreme case of Valiant's solution. Substituting k = n(n-1)/2, the time complexity of finding the maximum becomes lg lg n - lg lg((n-1)/2) ≈ O(1). We recall that the time complexity of finding the maximum on the 1D-Crosspoint Array without fan-in restriction is also O(1), which indicates that we have presented a tangible model of computation, rather than an abstract model with an impractical assumption about the fan-in and fan-out of PEs, while matching Valiant's time complexity.

For sorting, Valiant proposed a parallel variation of merge sort with an upper bound of 2(lg n - lg(k/n))(lg lg n - lg lg(k/n)) for 4 ≤ 2n ≤ k ≤ n(n-1)/2 (see [12, p. 355]). As in finding a maximum, he assumes unbounded fan-in and fan-out in this case as well, which allows the O(lg k) cost of distributing workloads between consecutive stages of merging sublists to be left out of the time complexity. Substituting k = n(n-1)/2 yields 2(lg n - lg((n-1)/2))(lg lg n - lg lg((n-1)/2)) ≈ O(1). On the other hand, the 1D-Crosspoint Array has a sorting time complexity of O(lg n lg lg n) if we restrict the fan-in of the gates in our solution to O(1). Without such a restriction, our 1D-Crosspoint Array, together with O(n^2) threshold gates and n-input, lg n-output encoders, sorts in O(1) time as well. Just as the assumption of O(1)-delay threshold gates may be overly optimistic, as mentioned in Section IV, Valiant also admits that an O(lg n) overhead remains to be overcome in sorting problems (see [12, p. 349]). Here, however, we have presented a specific architecture and a sorting algorithm well suited to run on it with the same time complexity as Valiant's parallel sorting algorithm when k = n(n-1)/2. It should also be noted that the 1D-Crosspoint Array is a computation model with essentially no communication conflict or delay, since its structure is the simplest form of PE pairing. A more quantitative comparison of its performance with Valiant's parallel sorting algorithm is difficult, as there are steps in his algorithm whose time complexities are not easily quantifiable.
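For completeness, here is a short check of the ≈ O(1) claims above (our derivation, approximating (n-1)/2 by n/2):

\lg\lg n-\lg\lg\frac{n-1}{2}\;\approx\;\lg\lg n-\lg(\lg n-1)\;=\;\lg\frac{\lg n}{\lg n-1}\;\rightarrow\;0\text{ as }n\rightarrow\infty,

so the maximum-finding bound tends to 0, and since lg n - lg((n-1)/2) = lg(2n/(n-1)) → 1, both factors of the sorting bound are bounded by constants when k = n(n-1)/2.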
Preparata utilized Valiant's merging algorithm as the first step of a variation of enumeration sort [30]. Preparata's algorithm uses O(n lg n) PEs and sorts n elements in O(lg n) time. However, since it makes crucial use of Valiant's algorithm, it follows the same assumptions and likewise overlooks the O(lg k) time overhead in the merging step.
In another direction, sorting networks provide an alternative to sorting algorithms. A sorting network is an architecture composed of wires and comparators that compare their inputs and order them, where each element of the array to be sorted can be compared with one other element at a time. The sorting network proposed by Ajtai et al. [32], commonly known as the AKS network after its authors, has a hardware complexity of O(n lg n) and a time complexity of O(lg n), which is smaller than that of the 1D-Crosspoint Array by a factor of O(lg lg n) under the constant fan-in assumption. It is also faster than Valiant's parallel sorting algorithm by the same factor when k = n lg n. It is worth noting that the AKS sorting network uses a more restrictive model even though it sorts faster than Valiant's algorithm. However, the constant in the O(lg n) term of [32] is 6100 in the best case [34], which indicates that the 1D-Crosspoint Array (or Valiant's algorithm) will be faster for lg lg n < 6100, i.e., n < 2^{2^{6100}}. In other words, for any practical n, the 1D-Crosspoint Array will be faster in practice, even though its time complexity grows faster than that of the AKS network. It should also be added that the AKS network uses O(n lg n) comparators, as compared to O(n^2) comparators in the 1D-Crosspoint Array.
Batcher's odd-even and bitonic sorting networks [31] provide a more practical alternative to parallel sorting algorithms such as those given by Valiant, Preparata, and the one we presented here. Odd-even and bitonic sorting networks both have a hardware complexity of O(n lg^2 n) and carry out sorting in O(lg^2 n) time. Valiant's upper bound on sorting with k = n lg^2 n PEs is O(lg n lg lg n), which suggests that Batcher's sorting networks are not optimal. Nonetheless, it must be emphasized that Batcher's sorting networks do not assume unbounded fan-in or fan-out in their comparators.
Besides sorting networks, a number of reconfigurable array architectures have been used for sorting in the literature [19, 21, 23, 22, 5, 6, 7, 8, 18]. An example we would like to mention here is the reconfigurable mesh architecture proposed by Olariu et al. in [21]. This architecture has a hardware complexity of O(n^2) and a time complexity of O(1). Its hardware complexity matches that of the 1D-Crosspoint Array, and its time complexity matches Valiant's upper bound. However, as explained in Section IV of this paper, Olariu et al. assume that the path delays in the reconfigurable mesh are O(1). This is an impractical assumption, and if it is relaxed, the time complexity of the sorting algorithm on the reconfigurable mesh becomes O(n).

For future research, it will be worthwhile to obtain practical architectures with fewer than O(n^2) processing elements that match the time complexity bounds for the minimum-finding, maximum-finding, searching, and sorting problems under Valiant's parallel processor model.

References

  • [1] A. Y. Oruc, “A self-routing on-chip network,” IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 5, pp. 1229–1239, 2016.
  • [2] D. E. Knuth, “The art of computer programming, volume 3: Sorting and searching,” Addison-Wesley Publishing Company, Reading, MA, 1973.
  • [3] S. G. Akl, Parallel sorting algorithms, vol. 12. Academic press, 2014.
  • [4] L. Yen-Chun, “On balancing sorting on a linear array,” IEEE Transactions on Parallel and Distributed Systems, vol. 4, no. 5, pp. 566–571, 1993.
  • [5] Y. Pan, M. Hamdi, and K. Li, “Efficient and scalable quicksort on a linear array with a reconfigurable pipelined bus system,” Future Generation Computer Systems, vol. 13, no. 6, pp. 501–513, 1998.
  • [6] Y. Pan, “Basic data movement operations on the larpbs model,” in Parallel Computing Using Optical Interconnections, pp. 227–247, Springer, 1998.
  • [7] Y. Pan and K. Li, “Linear array with a reconfigurable pipelined bus system—concepts and applications,” Information Sciences, vol. 106, no. 3-4, pp. 237–258, 1998.
  • [8] A. Datta, S. Soundaralakshmi, and R. Owens, “Fast sorting algorithms on a linear array with a reconfigurable pipelined bus system,” IEEE Transactions on Parallel and Distributed Systems, vol. 13, no. 3, pp. 212–222, 2002.
  • [9] K. E. Batcher, “Sorting networks and their applications,” in Proceedings of the April 30–May 2, 1968, spring joint computer conference, pp. 307–314, 1968.
  • [10] A. Habermann, “Parallel neighbor sort,” Computer Science Report, Carnegie-Mellon University, Pittsburgh, 1972.
  • [11] G. Baudet, D. Stevenson, et al., “Optimal sorting algorithms for parallel computers,” 1975.
  • [12] L. G. Valiant, “Parallelism in comparison problems,” SIAM Journal on Computing, vol. 4, no. 3, pp. 348–355, 1975.
  • [13] C. D. Thompson and H. T. Kung, “Sorting on a mesh-connected parallel computer,” Communications of the ACM, vol. 20, no. 4, pp. 263–271, 1977.
  • [14] F. P. Preparata, “New parallel-sorting schemes,” IEEE transactions on Computers, no. 7, pp. 669–673, 1978.
  • [15] D. Nassimi and S. Sahni, “Bitonic sort on a mesh-connected parallel computer,” IEEE Transactions on Computers, no. 1, pp. 2–7, 1979.
  • [16] C.-P. Schnorr and A. Shamir, “An optimal sorting algorithm for mesh connected computers,” in Proceedings of the eighteenth annual ACM symposium on Theory of computing, pp. 255–263, 1986.
  • [17] M. V. Chien and A. Y. Oruc, “Adaptive binary sorting schemes and associated interconnection networks,” IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 6, pp. 561–572, 1994.
  • [18] M. He, X. Wu, and S. Q. Zheng, “An optimal and processor efficient parallel sorting algorithm on a linear array with a reconfigurable pipelined bus system,” Computers & Electrical Engineering, vol. 35, no. 6, pp. 951–965, 2009.
  • [19] R. Lin, S. Olariu, J. Schwing, and J. Zhang, “Sorting in O(1) time on an n×n reconfigurable mesh,” in Parallel Computing: From Theory to Sound Practice, Proceedings of the European Workshop on Parallel Computing, pp. 16–27, 1992.
  • [20] A. Yavuz Oruç, “One-sided binary tree-crossbar switching for on-chip networks,” in 2015 49th Annual Conference on Information Sciences and Systems (CISS), pp. 1–5, March 2015.
  • [21] S. Olariu, J. L. Schwing, and J. Zhang, “Integer sorting in O(1) time on an n×n reconfigurable mesh,” in Eleventh Annual International Phoenix Conference on Computers and Communications, pp. 480–484, April 1992.
  • [22] J. Jang and V. Prasanna, “An optimal sorting algorithm on reconfigurable mesh,” vol. 25, pp. 31 – 41, 1995.
  • [23] Y. Chen and W. Chen, “Constant time sorting on reconfigurable meshes,” IEEE Transactions on Computers, vol. 43, pp. 749–751, June 1994.
  • [24] M. Maresca and H. Li, “Connection autonomy in simd computers: a vlsi implementation,” Journal of Parallel and Distributed Computing, vol. 7, no. 2, pp. 302–320, 1989.
  • [25] R. P. Brent and H. T. Kung, “A regular layout for parallel adders,” IEEE Transactions on Computers, vol. C-31, pp. 260–264, March 1982.
  • [26] M. Furst, J. B. Saxe, and M. Sipser, “Parity, circuits, and the polynomial-time hierarchy,” Mathematical systems theory, vol. 17, no. 1, pp. 13–27, 1984.
  • [27] I. Parberry and G. Schnitger, “Parallel computation with threshold functions,” Journal of Computer and System Sciences, vol. 36, no. 3, pp. 278–302, 1988.
  • [28] K.-Y. Siu, V. P. Roychowdhury, and T. Kailath, “Depth-size tradeoffs for neural computation,” IEEE Transactions on Computers, no. 12, pp. 1402–1412, 1991.
  • [29] A. Hajnal, W. Maass, P. Pudlák, M. Szegedy, and G. Turán, “Threshold circuits of bounded depth,” Journal of Computer and System Sciences, vol. 46, no. 2, pp. 129–154, 1993.
  • [30] F. P. Preparata, “New parallel-sorting schemes,” IEEE transactions on Computers, no. 7, pp. 669–673, 1978.
  • [31] K. E. Batcher, “Sorting networks and their applications,” in Proceedings of the April 30–May 2, 1968, spring joint computer conference, pp. 307–314, 1968.
  • [32] M. Ajtai, J. Komlós, and E. Szemerédi, “An O(n log n) sorting network,” in Proceedings of the Fifteenth Annual ACM Symposium on Theory of Computing, STOC ’83, (New York, NY, USA), pp. 1–9, Association for Computing Machinery, 1983.
  • [33] A. K. Chandra, L. Stockmeyer, and U. Vishkin, “Constant depth reducibility,” SIAM Journal on Computing, vol. 13, no. 2, pp. 423–439, 1984.
  • [34] M. S. Paterson, “Improved sorting networks with O(log n) depth,” Algorithmica, vol. 5, no. 1–4, pp. 75–92, 1990.