next up previous
Next: Performance Evaluation Up: A New Deterministic Parallel Sorting Algorithm With an Experimental Evaluation Previous: The Parallel Computation Model

A New Sorting Algorithm by Regular Sampling

Consider the problem of sorting n elements equally distributed amongst p processors, where we assume without loss of generality that p divides n evenly. The idea behind sorting by regular sampling is to find a set of p - 1 splitters to partition the n input elements into p groups indexed from 1 up to p such that every element in the tex2html_wrap_inline1806 group is less than or equal to each of the elements in the tex2html_wrap_inline1808 group, for tex2html_wrap_inline1810 . Then the task of sorting each of the p groups can be turned over to the correspondingly indexed processor, after which the n elements will be arranged in sorted order. The efficiency of this algorithm obviously depends on how well we divide the input, and this in turn depends on how evenly we choose the splitters. One way to choose the splitters is by regularly sampling the sorted input elements at each processor - hence the name Sorting by Regular Sampling.

A previous version of regular sample sort [18, 16], known as Parallel Sorting by Regular Sampling (PSRS) , first sorts the tex2html_wrap_inline1816 elements at each processor and then selects every tex2html_wrap_inline1818 element as a sample. These samples are then routed to a single processor, where they are sorted and every tex2html_wrap_inline1822 sample is selected as a splitter. Each processor then uses these splitters to partition the sorted input values and then routes the resulting subsequences to the appropriate destinations, after which local merging of these subsequences is done to complete the sorting process. The first difficulty with this approach is the load balance. There exist inputs for which at least one processor will be left with as many as tex2html_wrap_inline1824 elements at the completion of sorting. This could be reduced by choosing more samples, but this would also increase the overhead. And no matter how many samples are chosen, previous studies have shown that the load balance would still deteriorate linearly with the number of duplicates [16]. One could, of course, tag each item with a unique value, but this would also double the cost of both memory access and interprocessor communication. The other difficulty is that no matter how the routing is scheduled, there exist inputs that give rise to large variations in the number of elements destined for different processors, and this in turn results in an inefficient use of the communication bandwidth. Moreover, such an irregular communication scheme cannot take advantage of the regular communication primitives proposed under the MPI standard [17].

In our algorithm, which is parameterized by a sampling ratio s tex2html_wrap_inline1828 , we guarantee that, at the completion of sorting, each processor will have at most tex2html_wrap_inline1830 elements, while incurring no overhead in gathering the set of samples used to identify the splitters. This bound holds regardless of the number of duplicate elements present in the input. Moreover, we are able to replace the irregular routing with exactly two calls to our transpose primitive.

The pseudocode for our algorithm is as follows:

Before establishing the complexity of this algorithm, we need to establish the following theorem.

Theorem 1: The number of elements sent by processor tex2html_wrap_inline1832 to processor tex2html_wrap_inline1850 in Step (7) is at most tex2html_wrap_inline1944 for i = p and tex2html_wrap_inline1928 for i < p. Consequently, at the completion of the algorithm, no processor holds more than tex2html_wrap_inline1830 elements, for tex2html_wrap_inline1954 and tex2html_wrap_inline1828 .

Proof: Let tex2html_wrap_inline1958 be the set of samples from the sorted array of samples in Step (4) with indices (js - s + 1) through (js), inclusively. Let tex2html_wrap_inline1964 be the subset of samples in tex2html_wrap_inline1958 originating from processor tex2html_wrap_inline1832 , and let tex2html_wrap_inline1970 be the cardinality of tex2html_wrap_inline1964 . Let tex2html_wrap_inline1974 , let tex2html_wrap_inline1976 be the number of samples in tex2html_wrap_inline1964 with value less than tex2html_wrap_inline1980 , and let tex2html_wrap_inline1982 be the number of samples in tex2html_wrap_inline1964 with value tex2html_wrap_inline1980 . Let tex2html_wrap_inline1988 , let tex2html_wrap_inline1990 be the cardinality of tex2html_wrap_inline1992 , let tex2html_wrap_inline1994 be the number of samples in tex2html_wrap_inline1992 with value less than tex2html_wrap_inline1980 , and let tex2html_wrap_inline2000 be the number of samples in tex2html_wrap_inline1992 with value tex2html_wrap_inline1980 . Obviously, tex2html_wrap_inline2006 will only be nonzero if tex2html_wrap_inline2008 . Finally, for simplicity of discussion but without loss of generality, we assume that n is a constant multiple of tex2html_wrap_inline2012 .

Clearly, each sample can be mapped in a one-to-one fashion to the sorted input generated during Step (1) (but before being distributed amongst the bins). For example, the first sample in tex2html_wrap_inline1964 maps to the tex2html_wrap_inline2016 element in the sorted input at processor tex2html_wrap_inline1832 , the second sample in tex2html_wrap_inline1964 maps to the tex2html_wrap_inline2022 element in the sorted input at processor tex2html_wrap_inline1832 , and so forth up to the tex2html_wrap_inline2026 element which maps to the tex2html_wrap_inline2028 element in the sorted input at processor tex2html_wrap_inline1832 . Hence, it follows that tex2html_wrap_inline2032 elements in the sorted input of Step (1) at processor tex2html_wrap_inline1832 will be less than tex2html_wrap_inline1980 , where tex2html_wrap_inline2038 . It is also true that at least tex2html_wrap_inline2040 elements in the sorted input of Step (1) will be less than or equal to tex2html_wrap_inline1980 , where tex2html_wrap_inline2044 .

The shuffling of Step (1) together with the transpose of Step (2) maps the tex2html_wrap_inline2046 element at processor tex2html_wrap_inline1832 into the tex2html_wrap_inline2050 position of the tex2html_wrap_inline1806 subarray at processor tex2html_wrap_inline2054 , a subarray which we will denote as tex2html_wrap_inline2056 . Now, tex2html_wrap_inline2058 elements in tex2html_wrap_inline2060 will be less than tex2html_wrap_inline1980 and will unequivocally route to processors tex2html_wrap_inline2064 through tex2html_wrap_inline1850 , where:

eqnarray318

Furthermore, at least tex2html_wrap_inline2068 elements in tex2html_wrap_inline2060 will be less than or equal to tex2html_wrap_inline1980 , where tex2html_wrap_inline2074 . This means that the p subarrays at processor tex2html_wrap_inline2078 collectively have at least

eqnarray348

elements which are greater than or equal to tex2html_wrap_inline1980 . Furthermore,

eqnarray361

where tex2html_wrap_inline1914 is the number of elements equal to tex2html_wrap_inline1980 which the algorithm in Step (6) will seek to route to processor tex2html_wrap_inline1850 and tex2html_wrap_inline2088 is the number of elements equal to tex2html_wrap_inline1980 which the algorithm in Step (6) will seek to route to processors tex2html_wrap_inline2064 through tex2html_wrap_inline2094 . From this it follows that the algorithm will always be able to route a minimum of tex2html_wrap_inline2096 elements to processors tex2html_wrap_inline2064 through tex2html_wrap_inline1850 . On the other hand, the maximum number of elements that will be routed by this algorithm to these processors is:

eqnarray406

Hence, the maximum number of elements send by processor tex2html_wrap_inline2078 to processor tex2html_wrap_inline1850 is:

displaymath1786

and Theorem 1 follows.

With the results of Theorem 1, the analysis of our algorithm for sorting by regular sampling is as follows. Steps (3), (4), (6), (8), and (9) require O(sp), tex2html_wrap_inline2108 , tex2html_wrap_inline2110 , tex2html_wrap_inline2112 , and tex2html_wrap_inline2114 time, respectively. The cost of sequential sorting in Step (1) depends on the data type - sorting integers using radix sort requires tex2html_wrap_inline2116 time, whereas sorting floating point numbers using merge sort requires tex2html_wrap_inline2118 time. Steps (2), (5), and (7) call the communication primitives transpose, bcast, and transpose, respectively. The analysis of these primitives in [6] shows that these three steps require tex2html_wrap_inline2120 , tex2html_wrap_inline2122 , and tex2html_wrap_inline2124 , respectively. Hence, with high probability, the overall complexity of our sorting algorithm is given (for floating point numbers) by

eqnarray509

for tex2html_wrap_inline1954 and tex2html_wrap_inline1828 .

Clearly, our algorithm is asymptotically optimal with very small coefficients. But a theoretical comparison of our running time with previous sorting algorithms is difficult, since there is no consensus on how to model the cost of the irregular communication used by the most efficient algorithms. Hence, it is very important to perform an empirical evaluation of an algorithm using a wide variety of benchmarks, as we will do next.


next up previous
Next: Performance Evaluation Up: A New Deterministic Parallel Sorting Algorithm With an Experimental Evaluation Previous: The Parallel Computation Model

David R. Helman
helman@umiacs.umd.edu