Our nine sorting benchmarks are defined as follows, in which MAX is
for integers and approximately
for doubles:
We selected these nine benchmarks for a variety of reasons. Previous
researchers have used the Uniform, Gaussian, and
Zero benchmarks, and so we too included them for purposes of
comparison. But benchmarks should be designed to elicit the worst
case behavior from an algorithm, and in this sense the Uniform
benchmark is not appropriate. For example, for , one would
expect that the optimal choice of the splitters in the
Uniform benchmark would be those which partition the range of
possible values into equal intervals. Thus, algorithms which try to
guess the splitters might perform misleadingly well on such an
input. In this respect, the Gaussian benchmark is more telling.
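To illustrate why the Uniform benchmark can flatter splitter-guessing algorithms, here is a minimal sketch; the generator, the function names, and the parameter values are our own choices for illustration, not part of the benchmark definitions. Splitters placed at equal intervals of the value range already partition a uniform input into nearly equal buckets.

```python
import random

def uniform_input(n, max_value, seed=0):
    # Hypothetical generator: n independent draws from [0, max_value).
    rng = random.Random(seed)
    return [rng.randrange(max_value) for _ in range(n)]

def bucket_sizes(data, p, max_value):
    # Splitters at equal intervals of the VALUE range (not sampled from
    # the data): bucket j receives values in [j*width, (j+1)*width).
    width = max_value // p
    sizes = [0] * p
    for x in data:
        sizes[min(x // width, p - 1)] += 1
    return sizes

p, max_value = 8, 1 << 20
data = uniform_input(16384, max_value)
sizes = bucket_sizes(data, p, max_value)
# For uniform input, every bucket lands close to the ideal n/p = 2048,
# so these "guessed" splitters are already near-optimal.
print(min(sizes), max(sizes))
```

On a Gaussian input the same equal-interval splitters would produce badly skewed buckets, which is why that benchmark is the more telling test of splitter selection.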
But we also wanted to find benchmarks which would evaluate the cost of
irregular communication. Thus, we wanted to include benchmarks for
which an algorithm which uses a single phase of routing would find
contention difficult or even impossible to avoid. A naive approach to
rearranging the data would perform poorly on the Bucket Sorted
benchmark. Here, every processor would try to route data to the same
processor at the same time, resulting in poor utilization of
communication bandwidth. This problem might be avoided by an algorithm
in which at each processor the elements are first grouped by
destination and then routed according to the specifications of a
sequence of p destination permutations. Perhaps the most
straightforward way to do this is by iterating over the possible
communication strides. But such a strategy would perform poorly with
the g-Group benchmark, for a suitably chosen value of g. In
this case, using stride iteration, those processors which belong to a
particular group all route data to the same subset of g destination
processors. This subset of destinations is selected so that, when the
g processors route to this subset, they choose the processors in
exactly the same order, producing contention and possibly stalling.
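The contention just described can be shown with a toy simulation. The group-to-destination mapping below is an assumption chosen for illustration, not the exact g-Group definition: every processor in a group follows the same ordered list of g destinations in lockstep.

```python
from collections import Counter

def contention_per_step(p, g):
    # Assumes p is divisible by g. Group k's shared destination order is
    # a hypothetical mapping; what matters is that all g members of a
    # group use the SAME list in the SAME order.
    dest_order = {k: [(k * g + j) % p for j in range(g)] for k in range(p // g)}
    worst = []
    for t in range(g):
        # At step t, processor i sends to the t-th destination of its group.
        targets = Counter(dest_order[i // g][t] for i in range(p))
        worst.append(max(targets.values()))
    return worst

# Every step funnels all g senders of a group into a single destination.
print(contention_per_step(p=16, g=4))
```

The maximum in-degree at each step equals g, the contention factor the text describes; staggering or synchronizing the members of a group removes the collisions only at the cost of bandwidth.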
Alternatively, one can synchronize the processors after each
permutation, but this in turn will reduce the communication bandwidth
by a factor of
. In the worst-case scenario, each
processor needs to send data to a single processor a unique stride
away. This is the case of the Staggered benchmark, and the
result is a reduction of the communication bandwidth by a factor of
p. Of course, one can correctly object that both the g-Group
benchmark and the Staggered benchmark have been tailored to
thwart a routing scheme which iterates over the possible strides, and
that another sequence of permutations might be found which performs
better. This is possible, but at the same time we are unaware of any
single-phase deterministic algorithm which could avoid an equivalent
challenge. The Worst Case Regular benchmark was included to
empirically evaluate both the worst-case running time expected
for our regular sorting algorithm and the
effect of the sampling rate on this performance. Finally, the Randomized Duplicates
benchmark was included to assess the performance of the algorithms
in the presence of duplicate values.
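Returning to the routing argument above, the Staggered scenario (each processor sending all of its data to a single destination a unique stride away) can be modeled in a few lines. The stride assignment below is our own assumption; any assignment of p distinct strides behaves the same way under stride iteration.

```python
def active_senders_per_stride(p):
    # Hypothetical assignment of distinct strides: processor i sends
    # everything to (i + stride[i]) mod p, with stride[i] = i.
    stride = {i: i for i in range(p)}
    rounds = []
    for s in range(p):
        # A router that iterates over strides handles stride s in round s;
        # here exactly one processor has data to move in each round.
        senders = [i for i in range(p) if stride[i] == s]
        rounds.append(len(senders))
    return rounds

print(active_senders_per_stride(8))
```

With one active sender per round, the p rounds together use only a 1/p fraction of the available bandwidth, which is the reduction factor the Staggered benchmark is designed to expose.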