A New Sorting Algorithm by Regular Sampling

Next: An Efficient Radix Sort Up: Sorting Previous: A New Sample Sort Algorithm

A New Sorting Algorithm by Regular Sampling

A disadvantage of our random sample sort algorithm is that the performance bounds and the memory requirements can only be guaranteed with high probability. The alternative to this is to choose the samples by regular sampling. A previous version of regular sample sort [23, 19], known as Parallel Sorting by Regular Sampling (PSRS), first sorts the elements at each processor and then selects every element as a sample. These samples are then routed to a single processor, where they are sorted and every sample is selected as a splitter. Each processor then uses these splitters to partition the sorted input values and then routes the resulting subsequences to the appropriate destinations, after which local merging is done to complete the sorting process. The first difficulty with this approach is the load balance. There exist inputs for which at least one
processor will be left at the completion of sorting with as many as elements. This could be reduced by choosing more splitters, but this would also increase the overhead. And no matter what is done, previous workers have observed that the load balance would still deteriorate linearly with the number of duplicates [19]. The other difficulty is that no matter how the routing is scheduled, there exist inputs that give rise to large variations in the number of elements destined for different processors, and this in turn results in an inefficient use of the communication bandwidth. Moreover, such an irregular communication scheme cannot take advantage of the regular communication primitives proposed under the MPI standard [20].

In our algorithm, which is parameterized by a sampling ratio s , we guarantee that at the completion of sorting, each processor will have at most elements, while incurring no overhead in gathering the samples to identify the splitters. This bound holds regardless of the number of duplicates present in the input. Moreover, we are able to replace the irregular routing with exactly two calls to our transpose primitive.

The pseudo code for our algorithm is as follows:

Step (1): Each processor sorts each of its input values using an appropriate sequential sorting algorithm. For integers we use the radix sort algorithm, whereas for floating point numbers we use the merge sort algorithm. The sorted data is then ``dealt'' into p bins so that the item in the sorted order is placed into the position of the bin.
Step (2): Each processor routes the contents of bin j to processor , for , which is equivalent to performing a transpose operation with block size .
Step (3): From each of the p sorted subsequences received in Step (2), processor selects each element as a sample, for , and a given value of s, for .
Step (4): Processor merges the p sorted subsequences of samples and then selects each sample as a splitter, for . By default, the last splitter is the largest value allowed by the data type used. Additionally, binary search is used to compute for the set of samples with indices (k-1)s through (ks-1) the number of samples which share the same value as . Note that is an upper bound on the total number of duplicates amongst the samples in and the elements which proceed each sample in the sorted order of Step (1). After Step (2), the input is redistributed so so that each processor holds between and of those duplicates associated with the set in Step (1).
Step (5): Processor broadcasts the p splitters and their est values to the other p - 1 processors.
Step (6): Each processor uses binary search to define for each of the p sorted sequences received in Step (2) and each of the p splitters a subsequence . The p subsequences associated with all contain values which are greater than or equal to and and less then or equal to , and collectively include at most elements with the same value as .
Step (7): Each processor routes the p subsequences associated with to processor , for . Since no two processors will exchange more than elements, this is equivalent to performing a transpose operation with block size .
Step (8): Each processor ``unshuffles'' all those subsequences sharing a common origin in Step (2). The p consolidated subsequences are then merged to produce the column of the sorted array.

Before establishing the complexity of this algorithm, we need the results of the following theorem, whose proof has been omitted for brevity [15].

Theorem 2: The number of elements exchanged by any two processors in Step (7) is at most . Consequently, at the completion of Step (7), no processor receives more than elements, for .

Hence, the analysis of our regular sample sort algorithm is similar to that of our sample sort algorithm and is given (for floating point numbers) by

for and .

Figure 4: Scalability with respect to problem size of regular sorting integers from the [U] benchmark, for differing numbers of processors, on the Cray T3D and the IBM SP-2-WN.

Like our random sample sort algorithm, our regular sample sort algorithm is asymptotically optimal with very small coefficients. Once again, our algorithm was implemented and tested on the nine benchmarks. Table iv displays the performance of our regular sort as a function of input distribution for a variety of input sizes. It shows that the performance is essentially independent of the input distribution.

table606
Table iv: Total execution time (in seconds) for regular sorting a variety of benchmarks on a 64 node Cray T3D.

Table v examines the scalability of our regular sort as a function of machine size. It shows that for a given input size n the execution time scales almost inversely with the number of processors p.

table629
Table v: Total execution time (in seconds) for regular sorting 8M integers from the [WR] benchmark on a variety of machines and processors. A hyphen indicates that that particular platform was unavailable to us.

Finally, Figure 4 examines the scalability of our sample sort as a function of problem size, for differing numbers of processors. They show that for a fixed number of processors there is an almost linear dependence between the execution time and the total number of elements n. See [15] for additional performance data and comparisons with other published results.

Next: An Efficient Radix Sort Up: Sorting Previous: A New Sample Sort Algorithm

David R. Helman
helman@umiacs.umd.edu