next up previous
Next: Experimental Results Up: Performance Evaluation Previous: Performance Evaluation

Sorting Benchmarks

Our eight sorting benchmarks are defined as follows, in which n and p are assumed for simplicity to be powers of two and tex2html_wrap_inline2338 , the maximum value allowed for doubles, is approximately tex2html_wrap_inline2342 .

  1. Uniform [U], a uniformly distributed random input, obtained by calling the C library random number generator random(). This function, which returns integers in the range 0 to tex2html_wrap_inline2348 , is seeded by each processor tex2html_wrap_inline1878 with the value (21 + 1001i). For the double data type, we ``normalize'' the integer benchmark values by first subtracting the value tex2html_wrap_inline2356 and then scaling the result by tex2html_wrap_inline2358 .
  2. Gaussian [G], a Gaussian distributed random input, approximated by adding four calls to random() and then dividing the result by four. For the double data type, we normalize the integer benchmark values in the manner described for [U].
  3. Zero [Z], a zero entropy input, created by setting every value to a constant such as zero.
  4. Bucket Sorted [B], an input that is sorted into p buckets, obtained by setting the first tex2html_wrap_inline1882 elements at each processor to be random numbers between 0 and tex2html_wrap_inline2370 , the second tex2html_wrap_inline1882 elements at each processor to be random numbers between tex2html_wrap_inline2374 and tex2html_wrap_inline2376 , and so forth. For the double data type, we normalize the integer benchmark values in the manner described for [U].
  5. g-Group [g-G], an input created by first dividing the processors into groups of consecutive processors of size g, where g can be any integer which partitions p evenly. If we index these groups in consecutive order from 1 up to tex2html_wrap_inline2392 , then for group j we set the first tex2html_wrap_inline2396 elements to be random numbers between tex2html_wrap_inline2398 and tex2html_wrap_inline2400 , the second tex2html_wrap_inline2396 elements at each processor to be random numbers between
    tex2html_wrap_inline2404 and tex2html_wrap_inline2406 , and so forth. For the double data type, we normalize the integer benchmark values in the manner described for [U].
  6. Staggered [S], created as follows: if the processor index i is less than or equal to tex2html_wrap_inline2412 , then we set all tex2html_wrap_inline1872 elements at that processor to be random numbers between tex2html_wrap_inline2416 and tex2html_wrap_inline2418 . Otherwise, we set all tex2html_wrap_inline1872 elements to be random numbers between tex2html_wrap_inline2422 and tex2html_wrap_inline2424 . For the double data type, we normalize the integer benchmark values in the manner described for [U].
  7. Deterministic Duplicates [DD], an input of duplicates in which we set all tex2html_wrap_inline1872 elements at each of the first tex2html_wrap_inline2412 processors to be tex2html_wrap_inline2432 , all tex2html_wrap_inline1872 elements at each of the next tex2html_wrap_inline2436 processors to be tex2html_wrap_inline2438 , and so forth. At processor tex2html_wrap_inline2440 , we set the first tex2html_wrap_inline2442 elements to be tex2html_wrap_inline2444 , the next tex2html_wrap_inline2446 elements to be tex2html_wrap_inline2448 , and so forth.
  8. Randomized Duplicates [RD], an input of duplicates in which each processor fills an array T with some constant number range (range is 32 for our work) of random values between 0 and (range - 1) whose sum is S. The first tex2html_wrap_inline2462 values of the input are then set to a random value between 0 and (range - 1), the next tex2html_wrap_inline2468 values of the input are then set to another random value between 0 and (range - 1), and so forth.

We selected these eight benchmarks for a variety of reasons. Previous researchers have used the Uniform, Gaussian, and Zero benchmarks, and so we too included them for purposes of comparison. But benchmarks should be designed to illicit the worst case behavior from an algorithm, and in this sense the Uniform benchmark is not appropriate. For example, for tex2html_wrap_inline2474 , one would expect that the optimal choice of the splitters in the Uniform benchmark would be those which partition the range of possible values into equal intervals. Thus, algorithms which try to guess the splitters might perform misleadingly well on such an input. In this respect, the Gaussian benchmark is more telling. But we also wanted to find benchmarks which would evaluate the cost of irregular communication. Thus, we wanted to include benchmarks for which an algorithm which uses a single phase of routing would find contention difficult or even impossible to avoid. A naive approach to rearranging the data would perform poorly on the Bucket Sorted benchmark. Here, every processor would try to route data to the same processor at the same time, resulting in poor utilization of communication bandwidth. This problem might be avoided by an algorithm in which at each processor the elements are first grouped by destination and then routed according to the specifications of a sequence of p destination permutations. Perhaps the most straightforward way to do this is by iterating over the possible communication strides. But such a strategy would perform poorly with the g-Group benchmark, for a suitably chosen value of g. In this case, using stride iteration, those processors which belong to a particular group all route data to the same subset of g destination processors. This subset of destinations is selected so that, when the g processors route to this subset, they choose the processors in exactly the same order, producing contention and possibly stalling. Alternatively, one can synchronize the processors after each permutation, but this in turn will reduce the communication bandwidth by a factor of tex2html_wrap_inline2392 . In the worst case scenario, each processor needs to send data to a single processor a unique stride away. This is the case of the Staggered benchmark, and the result is a reduction of the communication bandwidth by a factor of p. Of course, one can correctly object that both the g-Group benchmark and the Staggered benchmark have been tailored to thwart a routing scheme which iterates over the possible strides, and that another sequences of permutations might be found which performs better. This is possible, but at the same time we are unaware of any single phase deterministic algorithm which could avoid an equivalent challenge. Finally, the Deterministic Duplicates and the Randomized Duplicates benchmarks were included to assess the performance of the algorithms in the presence of duplicate values.


next up previous
Next: Experimental Results Up: Performance Evaluation Previous: Performance Evaluation

David R. Helman
helman@umiacs.umd.edu