A disadvantage of our random sample sort algorithm is that the
performance bounds and the memory requirements can only be guaranteed
with high probability. An alternative is to choose the
samples by regular sampling. A previous version of regular sample
sort [23, 19], known as Parallel Sorting by Regular Sampling
(PSRS), first sorts the $\frac{n}{p}$ elements at each processor and
then selects every $\left(\frac{n}{p^2}\right)^{\text{th}}$ element as a
sample. These $p^2$ samples are then routed to a single processor,
where they are sorted and every $p^{\text{th}}$ sample is selected as a
splitter. Each processor then uses these $p-1$ splitters to
partition the sorted input values and then routes the resulting
subsequences to the appropriate destinations, after which local
merging is done to complete the sorting process. The first difficulty
with this approach is the load balance. There exist inputs for which
at least one processor will be left at the completion of sorting with
as many as $\left(2\frac{n}{p} - \frac{n}{p^2} - p + 1\right)$ elements.
This could be reduced by choosing more samples per processor, but
doing so would also increase the overhead. Moreover, previous workers
have observed that, no matter what is done, the load balance
deteriorates linearly with the number of duplicates [19].
The other
difficulty is that no matter how the routing is scheduled, there exist
inputs that give rise to large variations in the number of elements
destined for different processors, and this in turn results in an
inefficient use of the communication bandwidth. Moreover, such an
irregular communication scheme cannot take advantage of the regular
communication primitives proposed under the MPI standard [20].
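To make the sampling scheme concrete, the following C sketch shows the two selection steps for integer keys; the function names and layout are ours for illustration, not the code of [23, 19]:

    #include <stdlib.h>

    /* Comparison function for qsort over integer keys. */
    static int cmp_int(const void *a, const void *b) {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    /* PSRS-style sampling: from a processor's sorted block of n/p
     * elements, keep every (n/p^2)-th element, i.e., p samples. */
    void select_regular_samples(const int *sorted_block, long n_over_p,
                                int p, int *samples) {
        long stride = n_over_p / p;   /* samples are n/p^2 apart */
        for (int j = 0; j < p; j++)
            samples[j] = sorted_block[j * stride];
    }

    /* Splitter selection: sort the p^2 gathered samples and keep
     * every p-th one, yielding the p-1 splitters. */
    void select_splitters(int *all_samples, int p, int *splitters) {
        qsort(all_samples, (size_t)p * p, sizeof(int), cmp_int);
        for (int j = 1; j < p; j++)
            splitters[j - 1] = all_samples[j * p];
    }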
In our algorithm, which is parameterized by a sampling ratio $s$, we guarantee that, at the completion of sorting, each processor will have at most $\left(\frac{n}{p} + \frac{n}{s} - p\right)$ elements, while incurring no overhead in gathering the samples to identify the splitters. This bound holds regardless of the number of duplicates present in the input. Moreover, we are able to replace the irregular routing with exactly two calls to our transpose primitive.
The detailed pseudo code for our algorithm appears in [15]; in outline, the algorithm proceeds as follows (the step numbering below is ours and is chosen so that Step (7) is the element routing referenced in Theorem 2):
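Step (1): Each processor sorts its $\frac{n}{p}$ input elements.
Step (2): Each processor selects $s$ evenly spaced samples from its sorted subsequence.
Step (3): The $ps$ samples are redistributed using the first call to the transpose primitive.
Step (4): Each processor sorts the samples it receives.
Step (5): The $p-1$ splitters are identified from the regularly spaced positions of the combined samples.
Step (6): Each processor partitions its sorted input into $p$ subsequences using the splitters.
Step (7): The subsequences are routed to their destination processors using the second call to the transpose primitive.
Step (8): Each processor merges the $p$ sorted subsequences it receives, completing the sort.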
Before establishing the complexity of this algorithm, we need the results of the following theorem, whose proof has been omitted for brevity [15].
Theorem 2: The number of elements exchanged by any two processors in Step (7) is at most $\left(\frac{n}{p^2} + \frac{n}{ps} - 1\right)$. Consequently, at the completion of Step (7), no processor receives more than $\left(\frac{n}{p} + \frac{n}{s} - p\right)$ elements, for $p \le s \le \frac{n}{p^2}$.
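For example, with $n = 8\text{M}$, $p = 64$, and the illustrative choice $s = 1024$ (a value of our choosing, within the stated range), the pairwise bound is $\frac{n}{p^2} + \frac{n}{ps} - 1 = 2048 + 128 - 1 = 2175$ elements, and the per-processor bound is $\frac{n}{p} + \frac{n}{s} - p = 131072 + 8192 - 64 = 139200$ elements, within roughly six percent of the ideal $\frac{n}{p} = 131072$.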
Hence, the analysis of our regular sample sort algorithm is similar to that of our sample sort algorithm and is given (for floating point numbers) by $T(n, p) = O\!\left(\frac{n \log n}{p} + \tau + \sigma\,\frac{n}{p}\right)$, where $\tau$ is the communication latency and $\sigma$ is the inverse of the communication bandwidth, for $n \ge p^3$ and $p \le s \le \frac{n}{p^2}$.
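On message passing platforms, the Step (7) routing maps onto the all-to-all primitives of the MPI standard [20]. The following C sketch is our illustration rather than the transpose primitive of [15] (which uses a fixed block size, made possible by the pairwise bound of Theorem 2); it realizes the exchange with MPI_Alltoallv:

    #include <stdlib.h>
    #include <mpi.h>

    /* Route the p partitions produced by the splitters to their
     * destination processors.  sendcounts[j] is the number of keys
     * destined for processor j; sendbuf holds them contiguously. */
    void route_partitions(const int *sendbuf, const int *sendcounts,
                          int *recvbuf, int p, MPI_Comm comm) {
        int *recvcounts = malloc(p * sizeof(int));
        int *sdispls    = malloc(p * sizeof(int));
        int *rdispls    = malloc(p * sizeof(int));

        /* Each processor learns how many keys it receives from each peer. */
        MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, comm);

        sdispls[0] = 0;
        rdispls[0] = 0;
        for (int j = 1; j < p; j++) {
            sdispls[j] = sdispls[j - 1] + sendcounts[j - 1];
            rdispls[j] = rdispls[j - 1] + recvcounts[j - 1];
        }

        /* One exchange whose volume Theorem 2 keeps balanced. */
        MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
                      recvbuf, recvcounts, rdispls, MPI_INT, comm);

        free(recvcounts);
        free(sdispls);
        free(rdispls);
    }

Each processor then merges the $p$ sorted runs it receives to complete the sort.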
Figure 4: Scalability with respect to problem size of regular sorting integers
from the [U] benchmark, for differing numbers of processors, on the Cray T3D and the IBM SP-2-WN.
Like our random sample sort algorithm, our regular sample sort algorithm is asymptotically optimal with very small coefficients. Once again, our algorithm was implemented and tested on the nine benchmarks. Table IV displays the performance of our regular sort as a function of input distribution for a variety of input sizes. It shows that the performance is essentially independent of the input distribution.
Table IV: Total execution time (in seconds) for regular sorting a variety of
benchmarks on a 64-node Cray T3D.
Table V examines the scalability of our regular sort as a function of machine size. It shows that for a given input size n the execution time scales almost inversely with the number of processors p.
Table V: Total execution time (in seconds) for regular sorting 8M integers
from the [WR] benchmark
on a variety of machines and processors. A hyphen indicates
that the corresponding platform was unavailable to us.
Finally, Figure 4 examines the scalability of our regular sample sort as a function of problem size, for differing numbers of processors. It shows that for a fixed number of processors there is an almost linear dependence between the execution time and the total number of elements n. See [15] for additional performance data and comparisons with other published results.