next up previous
Next: Comparison with Previous Results Up: Performance Evaluation Previous: Sorting Benchmarks

Experimental Results

For each experiment, the input is evenly distributed amongst the processors. The output consists of the elements in non-descending order arranged amongst the processors so that the elements at each processor are in sorted order and no element at processor tex2html_wrap_inline1832 is greater than any element at processor tex2html_wrap_inline1850 , for all i < j.

Two variations were allowed in our experiments. First, radix sort was used to sequentially sort integers, whereas merge sort was used to sort double precision floating point numbers (doubles). Second, different implementations of the communication primitives were allowed for each machine. Wherever possible, we tried to use the vendor supplied implementations. In fact, IBM does provide all of our communication primitives as part of its machine specific Collective Communication Library (CCL) [7] and MPI. As one might expect, they were faster than the high level SPLIT-C implementation.

   table718
Table i: Optimal number of samples s for sorting the [WR] integer benchmark on the Cray T3D, for a variety of processors and input sizes.

   table748
Table ii: Optimal number of samples s for sorting the [WR] integer benchmark on the IBM SP-2-WN, for a variety of processors and input sizes.

Tables i and ii examine the preliminary question of the optimal number of samples s for sorting on the Cray T3D and the IBM SP-2-WN. They show the value of s which achieved the best performance on the Worst-Load Regular [WR] benchmark, as a function of both the number of processors p and the number of keys per processor tex2html_wrap_inline1816 . The results suggest that a good rule for choosing s is to set it to tex2html_wrap_inline2408 , which is what we do for the remainder of this discussion. To compare this choice for s with the theoretical expectation, we recall that the complexity of Step (3) is tex2html_wrap_inline2108 , whereas the complexity of Step (9) is tex2html_wrap_inline2114 . Hence, the first term is an increasing function of s, whereas the second term is a decreasing function of s. It is easy to verify that the expression for the sum of these two complexities is minimized for s = tex2html_wrap_inline2424 , and, hence, the theoretical expectation for the optimal value of s agrees with what we observe experimentally.

   table795
Table iii: Total execution time (in seconds) required to sort a variety of integer benchmarks on a 64-node Cray T3D.

   table818
Table iv: Total execution time (in seconds) required to sort a variety of integer benchmarks on a 64-node IBM SP-2-WN.

   table841
Table v: Total execution time (in seconds) required to sort a variety of double benchmarks on a 64-node Cray T3D.

   table864
Table vi: Total execution time (in seconds) required to sort a variety of double benchmarks on a 64-node IBM SP-2-WN.

Tables iii, iv, v, and vi display the performance of our sample sort as a function of input distribution for a variety of input sizes. In each case, the performance is essentially independent of the input distribution. These figures present results obtained on a 64 node Cray T3D and a 64 node IBM SP-2; results obtained from other platforms validate this claim as well. Because of this independence, the remainder of this section will only discuss the performance of our sample sort on the Worst-Load Regular benchmark [WR].

The results in Tables vii and viii together with their graphs in Figure 1 examine the scalability of our sample sort as a function of machine size. Results are shown for the T3D, the SP-2-WN, and the CM-5. Bearing in mind that these graphs are log-log plots, they show that, for a given input size n, the execution time scales inversely with the number of processors p for tex2html_wrap_inline2440 . While this is certainly the expectation of our analytical model for doubles, it might at first appear to exceed our prediction of an tex2html_wrap_inline2444 computational complexity for integers. However, the appearance of an inverse relationship is still quite reasonable when we note that, for values of p between 8 and 64, tex2html_wrap_inline2450 varies by only a factor of two. Moreover, this tex2html_wrap_inline2444 complexity is entirely due to the merging in Step (9), and in practice, Step (9) never accounts for more than tex2html_wrap_inline2454 of the observed execution time. Note that the complexity of Step (9) could be reduced to tex2html_wrap_inline2116 for integers using radix sort, but the resulting execution time would, in most cases, be slower.

   table905
Table vii: Total execution time (in seconds) required to sort 8M integers on a variety of machines and processors using the [WR] benchmark. A hyphen indicates that particular platform was unavailable to us.

   table928
Table viii: Total execution time (in seconds) required to sort 8M doubles on a variety of machines and processors using the [WR] benchmark. A hyphen indicates that particular platform was unavailable to us.

   figure951
Figure 1: Scalability of sorting integers and doubles with respect to machine size.

   table963
Table ix: Time required (in seconds) for each step of sorting 8M integers on the Cray T3D using the [WR] benchmark.

However, the results in Tables vii and viii together with their graphs in Figure 1 also show that for p greater than 64, the inverse relationship between the execution time and the number of processors begins to deteriorate. Table ix explains these results with a step by step breakdown of the execution times reported for the sorting of integers on the T3D. Step (1) clearly displays the tex2html_wrap_inline2116 complexity expected for radix sort, and it dominates the total execution time for small values of p. The transpose operation in Step (2) displays the tex2html_wrap_inline2478 complexity we originally suggested. The dependence of tex2html_wrap_inline1742 on p simply becomes more pronounced as p increases and tex2html_wrap_inline1816 decreases. Step (3) exhibits the O(sp) complexity we anticipated, since for tex2html_wrap_inline2490 , s is halved every other time p is doubled. Steps (6) and (9) display the expected tex2html_wrap_inline2110 and tex2html_wrap_inline2498 tex2html_wrap_inline2500 for tex2html_wrap_inline2502 complexity, respectively. Steps (7) and (8) exhibit the most complicated behavior. The reason for this is that in Step (7), each processor must exchange p subsequences with every other processor and must include with each subsequence a record consisting of four integer values which will allow the unshuffling in Step (8) to be performed efficiently. Hence, the tex2html_wrap_inline2508 transpose block size in the case of 128 processors is nearly half that of the the case of 64 processors (1280 vs. 2816). This, together with the fact that tex2html_wrap_inline1742 increases as a function of p, explains why the time required for Step (7) actually increases for 128 processors. Step (8) would also be expected to exhibit tex2html_wrap_inline2514 tex2html_wrap_inline2516 for tex2html_wrap_inline2502 complexity. But the scheme chosen for unshuffling also involves an O(p) amount of overhead for each group of p subsequences to assess their relationship so that they can be efficiently unshuffled. For sufficiently large values of p, this overhead begins to dominate the complexity. While the data of Table ix was collected for sorting integers on the T3D, the data from the SP-2-WN and the T3D support the same analysis for sorting both integers and doubles.

The graphs in Figure 2 examine the scalability of our regular sample sort as a function of keys per processor tex2html_wrap_inline2532 , for differing numbers of processors. They show that for a fixed number of up to 64 processors there is an almost linear dependence between the execution time and tex2html_wrap_inline1816 . While this is certainly the expectation of our analytic model for integers, it might at first appear to exceed our prediction of a tex2html_wrap_inline2538 computational complexity for floating point values. However, this appearance of a linear relationship is still quite reasonable when we consider that for the range of values shown tex2html_wrap_inline2334 differs by only a factor of 1.2. For p > 64, the relationship between the execution time and and tex2html_wrap_inline1816 is no longer linear. But based on our discussion of the data in Table ix, for large p and relatively small n we would expect a sizeable contribution from those steps which exhibit tex2html_wrap_inline2110 , tex2html_wrap_inline2554 , and tex2html_wrap_inline2556 complexity, which would explain this loss of linearity.

   figure1060
Figure 2: Scalability of sorting integers with respect to the number of keys per processor tex2html_wrap_inline2560 , for differing numbers of processors.

   figure1082
Figure 3: Distribution of execution time amongst the nine steps of regular sample sort. Times are obtained for both a 64 node T3D and a 64 node SP-2-WN using both the integer and the double versions of the [WR] benchmark.

Finally, the graphs in Figure 3 examine the relative costs of the nine steps in our regular sample sort algorithm. Results are shown for both a 64 node T3D and a 64 node SP-2-WN, using both the integer and the double versions of the [WR] benchmark. Notice that for n = 64M integers, the sequential sorting, unshuffling, and merging performed in Steps (1), (8), and (9) consume approximately tex2html_wrap_inline2572 of the execution time on the T3D and approximately tex2html_wrap_inline2574 of the execution time on the SP-2. By contrast, the two transpose operations in Steps (2) and (7) together consume only about tex2html_wrap_inline2576 of the execution time on the T3D and about tex2html_wrap_inline2578 of the execution time on the SP-2. The difference in the distribution between these two platforms is likely due in part to the fact that an integer is 64 bits on the T3D while only 32 bits on the SP-2. By contrast, doubles are 64 bits on both platforms. For n = 64M doubles, the sequential sorting, unshuffling, and merging performed in Steps (3), (8), and (9) consume approximately tex2html_wrap_inline2582 of the execution time on both platforms, whereas the two transpose operations in Steps (2) and (7) together consume only about tex2html_wrap_inline2584 of the execution time. Together, these results show that our algorithm is extremely efficient in its communication performance.


next up previous
Next: Comparison with Previous Results Up: Performance Evaluation Previous: Sorting Benchmarks

David R. Helman
helman@umiacs.umd.edu