next up previous
Next: Performance Evaluation Up: A Randomized Parallel Sorting Algorithm With an Experimental Study Previous: The Parallel Computation Model

A New Sample Sort Algorithm

Consider the problem of sorting n elements equally distributed amongst p processors, where we assume without loss of generality that p divides n evenly. The idea behind sample sort is to find a set of p - 1 splitters to partition the n input elements into p groups indexed from 1 up to p such that every element in the tex2html_wrap_inline1860 group is less than or equal to each of the elements in the tex2html_wrap_inline1862 group, for tex2html_wrap_inline1864 . Then the task of sorting each of the p groups can be turned over to the correspondingly indexed processor, after which the n elements will be arranged in sorted order. The efficiency of this algorithm obviously depends on how evenly we divide the input, and this in turn depends on how well we choose the splitters. One way to choose the splitters is by randomly sampling the input elements at each processor - hence the name sample sort.

Previous versions of sample sort [21, 10, 17, 15] have randomly chosen s samples from the tex2html_wrap_inline1872 elements at each processor, routed these ps samples to a single processor, sorted them at that processor, and then selected every tex2html_wrap_inline1876 element as a splitter. Each processor tex2html_wrap_inline1878 then performs a binary search on these splitters for each of its input values and then uses the results to route the values to the appropriate destination, after which local sorting is done to complete the sorting process. The first difficulty with this approach is the work involved in gathering and sorting the samples. A larger value of s results in better load balancing, but it also increases the overhead. The second difficulty is that no matter how the routing is scheduled, there exist inputs that give rise to large variations in the number of elements destined for different processors, and this in turn results in an inefficient use of the communication bandwidth. Moreover, such an irregular communication scheme cannot take advantage of the regular communication primitives proposed under the MPI standard [26]. The final difficulty with the original approach is that duplicate values are accommodated by tagging each item with a unique value [10]. This, of course, doubles the cost of both memory access and interprocessor communication.

In our solution, we incur no overhead in obtaining tex2html_wrap_inline1882 samples from each processor and in sorting these samples to identify the splitters. Because of this very high oversampling, we are able to replace the irregular routing with exactly two calls to our transpose primitive, and, in addition, we are able to efficiently accommodate the presence of duplicates without resorting to tagging.

The pseudocode for our algorithm is as follows:

We can establish the complexity of this algorithm with high probability - that is with probability tex2html_wrap_inline1994 for some positive constant tex2html_wrap_inline1996 . But before doing this, we need to establish the results of the following lemmas.

Lemma 1: At the completion of Step (1), the number of elements in each bucket is at most tex2html_wrap_inline1892 with high probability, for any tex2html_wrap_inline2000 and tex2html_wrap_inline2002 .

Proof: The probability that exactly tex2html_wrap_inline1892 elements are placed in a particular bucket in Step (1) is given by the binomial distribution

equation244

where tex2html_wrap_inline2006 , tex2html_wrap_inline2008 , and tex2html_wrap_inline2010 . Using the following Chernoff bound [12] for estimating the tail of a binomial distribution

equation257

the probability that a particular bucket will contain at least tex2html_wrap_inline1892 elements can be bounded by

equation264

Hence, the probability that any of the tex2html_wrap_inline2014 buckets contains at least tex2html_wrap_inline1892 elements can be bounded by

equation270

and Lemma 1 follows.

Lemma 2: At the completion of Step (2), the total number of elements received by processor tex2html_wrap_inline1914 , which comprise the set of samples from which the splitters are chosen, is at most tex2html_wrap_inline2020 with high probability, for any tex2html_wrap_inline2022 and tex2html_wrap_inline2002 .

Proof: The probability that processor tex2html_wrap_inline1914 receives exactly tex2html_wrap_inline2020 elements is given by the binomial distribution tex2html_wrap_inline2030 . Using the Chernoff bound for estimating the tail of a binomial distribution, the probability that processor tex2html_wrap_inline1914 receives at least tex2html_wrap_inline2020 elements can be bounded by tex2html_wrap_inline2036 and Lemma 2 follows.

Lemma 3: For each tex2html_wrap_inline1918 , where tex2html_wrap_inline2040 , let tex2html_wrap_inline2042 and tex2html_wrap_inline2044 be respectively the sets of input elements and samples that are both equal in value to tex2html_wrap_inline1918 , and let tex2html_wrap_inline2048 . Then, with high probability, no tex2html_wrap_inline2042 will contain more than tex2html_wrap_inline2052 elements, where

equation306

Proof: The set of input elements tex2html_wrap_inline2042 = tex2html_wrap_inline2058 can have more than tex2html_wrap_inline2052 members only if tex2html_wrap_inline2062 or less members are selected to be samples from the set tex2html_wrap_inline2064 = , which is the set composed of the first tex2html_wrap_inline2052 members in tex2html_wrap_inline2042 . However, since each element of tex2html_wrap_inline2064 is independently chosen to be a sample with probability tex2html_wrap_inline2076 , the probability of this event occurring is given by

equation330

Using the following ``Chernoff'' type bound [18] for estimating the head of a binomial distribution

equation339

where tex2html_wrap_inline2078 , tex2html_wrap_inline2080 , and tex2html_wrap_inline2010 , it follows that the probability that a set tex2html_wrap_inline2042 among the p sets of input elements has more than tex2html_wrap_inline2052 is bounded by

equation353

Using the fact that tex2html_wrap_inline2002 , it is easy to show that the above sum can be bounded by tex2html_wrap_inline2092 , for some tex2html_wrap_inline2094 and

equation306

The bound of Lemma 3 will also hold if we include the subsets of elements and samples whose values fall strictly between two consecutive splitters.

Lemma 4: At the completion of Step (7), the number of elements received by each processor is at most tex2html_wrap_inline1990 with high probability, for any tex2html_wrap_inline2098 ( tex2html_wrap_inline2100 without duplicates) and tex2html_wrap_inline2002 .

Proof: Let Q be the set of input elements to be sorted by our algorithm, let R be the set of samples of Step (4) at processor tex2html_wrap_inline1914 with cardinality tex2html_wrap_inline2020 , and and let S be the subset of R associated with tex2html_wrap_inline1918 , which we define to be the samples in R with indices tex2html_wrap_inline2122 through tex2html_wrap_inline1938 , inclusively. Let tex2html_wrap_inline2126 , tex2html_wrap_inline2128 , and tex2html_wrap_inline2130 be respectively the number of elements in Q, R, and S with value equal to tex2html_wrap_inline1932 , let tex2html_wrap_inline2140 , tex2html_wrap_inline2142 , and tex2html_wrap_inline2144 be respectively the number of elements in Q, R, and S with values greater than tex2html_wrap_inline1932 but less than tex2html_wrap_inline1918 , and let tex2html_wrap_inline2156 , tex2html_wrap_inline2158 , and tex2html_wrap_inline2160 be respectively the number of elements in Q, R, and S with value equal to tex2html_wrap_inline1918 .

According to Step (6) of our algorithm, processor tex2html_wrap_inline1900 will receive

equation407

elements. To compute the upper bound tex2html_wrap_inline1990 on this expression, we first use Lemma 3 to bound each tex2html_wrap_inline2174 , giving us

equation426

Rearranging this expression, we get:

eqnarray440

Clearly, this expression is maximized for tex2html_wrap_inline2176 and tex2html_wrap_inline2178 . Substituting these values and rearranging once again, we get:

equation460

Since tex2html_wrap_inline2180 , this expression is maximized for tex2html_wrap_inline2182 Since Lemma 2 guarantees that with high probability tex2html_wrap_inline2022 , Lemma 4 follows with high probability for tex2html_wrap_inline2098 . Alternatively, if there are no duplicates, we can show that the bound follows with high probability for tex2html_wrap_inline2100 .

Lemma 5: If the set of input elements is arbitrarily partitioned into at most 2p subsets, each of size tex2html_wrap_inline2192 tex2html_wrap_inline2194 , with high probability at the conclusion of Step (2) no processor will receive more than tex2html_wrap_inline2196 elements from any particular subset, for tex2html_wrap_inline2198 and tex2html_wrap_inline2002 .

Proof: The probability that exactly tex2html_wrap_inline2196 elements are sent to a particular processor by the conclusion of Step (2) is given by the binomial distribution tex2html_wrap_inline2204 . Using the Chernoff bound for estimating the tail of a binomial distribution, the probability that from M possible subsets any processor will receive at least tex2html_wrap_inline2196 elements can be bounded by

equation495

and Lemma 5 follows for tex2html_wrap_inline2210 .

Lemma 6: The number of elements exchanged by any two processors in Step (7) is at most tex2html_wrap_inline1978 with high probability, for any tex2html_wrap_inline2214 ( tex2html_wrap_inline2216 without duplicates) and tex2html_wrap_inline2002 .

Proof: Let U be the set of input elements to be sorted by our algorithm, let V be the set of elements held by intermediate processor tex2html_wrap_inline1878 after Step (2), and let W be the set of elements held by destination processor tex2html_wrap_inline1900 after Step (7). Let tex2html_wrap_inline2230 , tex2html_wrap_inline2232 , and tex2html_wrap_inline2234 be respectively the number of elements in U, V, and W with values equal to tex2html_wrap_inline1932 , let tex2html_wrap_inline2244 , tex2html_wrap_inline2246 , and tex2html_wrap_inline2248 be respectively the number of elements in U, V, and W with values greater than tex2html_wrap_inline1932 but less than tex2html_wrap_inline1918 , and let tex2html_wrap_inline2260 , tex2html_wrap_inline2262 , and tex2html_wrap_inline2264 be respectively the number of elements in U, V, and W with values equal to tex2html_wrap_inline1918 .

According to Step (6) of our algorithm, intermediate processor tex2html_wrap_inline1878 will send

equation536

elements to processor tex2html_wrap_inline1900 . To compute the upper bound tex2html_wrap_inline1978 on this expression, we first use Lemma 5 to bound each tex2html_wrap_inline2280 , giving us:

equation547

Notice that since destination processor tex2html_wrap_inline1900 receives respectively tex2html_wrap_inline1926 and tex2html_wrap_inline1928 of the elements at each intermediate processor with values equal to tex2html_wrap_inline1932 and tex2html_wrap_inline1918 , it follows that tex2html_wrap_inline2292 and tex2html_wrap_inline2294 . Hence, we can rewrite the expression above as

equation568

Rearranging this expression, we get:

equation579

Clearly, this expression is maximized for tex2html_wrap_inline2296 and tex2html_wrap_inline2298 . Substituting these values and rearranging, we get:

equation588

Since tex2html_wrap_inline2300 , this expression is maximized for tex2html_wrap_inline2302 Since Lemma 4 guarantees that with high probability tex2html_wrap_inline2098 , Lemma 6 follows with high probability for tex2html_wrap_inline2306 . Alternatively, if there are no duplicates, we can show that the bound follows with high probability for tex2html_wrap_inline2216 .

With these bounds on the values of tex2html_wrap_inline1894 , tex2html_wrap_inline1992 , and tex2html_wrap_inline1980 , the analysis of our sample sort algorithm is as follows. Steps (1), (3), (4), (6), and (8) involve no communication and are dominated by the cost of the sequential sorting in Step (3) and the merging in Step (8). Sorting integers using radix sort requires tex2html_wrap_inline2316 time, whereas sorting floating point numbers using merge sort requires tex2html_wrap_inline2318 time. Step (8) requires tex2html_wrap_inline2320 time if we merge the sorted subsequences in a binary tree fashion. Steps (2), (5), and (7) call the communication primitives transpose, bcast, and transpose, respectively. The analysis of these primitives in [6] shows that with high probability these three steps require tex2html_wrap_inline2322 , tex2html_wrap_inline2324 , and tex2html_wrap_inline2326 , respectively. Hence, with high probability, the overall complexity of our sample sort algorithm is given (for floating point numbers) by

eqnarray629

for tex2html_wrap_inline2328 .

Clearly, our algorithm is asymptotically optimal with very small coefficients. But a theoretical comparison of our running time with previous sorting algorithms is difficult, since there is no consensus on how to model the cost of the irregular communication used by the most efficient algorithms. Hence, it is very important to perform an empirical evaluation of an algorithm using a wide variety of benchmarks, as we will do next.


next up previous
Next: Performance Evaluation Up: A Randomized Parallel Sorting Algorithm With an Experimental Study Previous: The Parallel Computation Model

David R. Helman
helman@umiacs.umd.edu