Figure 1: Final distribution of the keys corresponding to our input data sets
We examine the performance of our h-relation algorithm on various configurations of the IBM SP-2 and Cray T3D, using four values of h: , , , and . More results are given in [4]. The data sets used in these experiments are defined as follows. Our input of size n is initially distributed cyclically across the p processors such that each processor initially holds keys, for . For the input consists of keys labelled for , followed by keys labelled for , and so forth, (with keys labelled for ), with the last keys labelled for . Note that this results in the same data movement as the transpose primitive. For , instead of an equal number of elements destined for each processor, the function , for ( ), is characterized by
The result of this data movement, shown in Figure 1, is that processor 0 receives the largest imbalance of elements, i.e. h, while other processors receive varying block sizes ranging from 0 to at most h. For , approximately processors receive no elements, and hence this represents an extremely unbalanced case.
Figure 2: Performance of personalized communication ( ) with respect to machine and problem size.
As shown in Figure 2, the time to route an h-relation personalized communication for a given input size on a varying number of processors (p) scales inversely with p whenever n is large compared with p. However, for inputs which are small compared with the machine size, the communication time is dominated by O as shown in the case of the 128-processor Cray T3D with n = 256K.