
An h-Relation Personalized Communication

 

For ease of presentation, we first describe the personalized communication algorithm for the case when the input is initially evenly distributed amongst the processors, and return to the general case in Section 3.3. Consider a set of $n$ elements evenly distributed amongst $p$ processors in such a manner that no processor holds more than $\frac{n}{p}$ elements. Each element consists of a pair $(data, dest)$, where $dest$ is the location to which the data is to be routed. There are no assumptions made about the pattern of data redistribution, except that no processor is the destination of more than $h$ elements. We assume for simplicity (and without loss of generality) that $h$ is an integral multiple of $p$.
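A useful consequence of this setup, noted here for later reference: since the $n$ elements are spread over only $p$ destinations, some processor must receive at least $\lceil \frac{n}{p} \rceil$ of them, so any valid bound $h$ satisfies

$$h \;\ge\; \left\lceil \frac{n}{p} \right\rceil.$$

In particular, $h$ dominates $\frac{n}{p}$ in the bounds that follow.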

A straightforward solution to this problem might attempt to sort the elements by destination and then route those elements with a given destination directly to the correct processor. No matter how the messages are scheduled, there exist cases that give rise to large variations in message sizes, and hence result in an inefficient use of the communication bandwidth. Moreover, such a scheme cannot take advantage of the regular communication primitives proposed under the MPI standard. The standard does provide the MPI_Alltoallv primitive for the restricted case when the elements are already locally sorted by destination and the user supplies, for each local array, a vector of indices of the first element bound for each destination.
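For concreteness, a call in this restricted case might look like the following sketch; the identifiers are illustrative, and the receive-side counts and displacements must be obtained beforehand (for instance, by an MPI_Alltoall of the send counts), overhead that the fixed block sizes of the two-phase scheme below sidestep.

```c
#include <mpi.h>

/* Restricted case mentioned above: sendbuf already holds the local
 * elements grouped by destination; sendcounts[j] and sdispls[j]
 * describe the run bound for processor j, and the receive-side
 * vectors must be known in advance.  All identifiers are ours. */
void exchange_presorted(const long *sendbuf, const int *sendcounts,
                        const int *sdispls, long *recvbuf,
                        const int *recvcounts, const int *rdispls)
{
    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_LONG,
                  recvbuf, recvcounts, rdispls, MPI_LONG,
                  MPI_COMM_WORLD);
}
```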

In our solution, we use two rounds of the transpose collective communication primitive. In the first round, each element is routed to an intermediate destination, and during the second round, it is routed to its final destination.

The pseudocode for our algorithm is as follows (the step numbering matches the references in the correctness argument below):

Step (1): Each processor $i$ sorts its $\frac{n}{p}$ elements by destination and deals them into $p$ local bins, placing its $k$th element with destination $j$ into bin $\left((i + j + k - 1) \bmod p\right)$.

Step (2): Each processor routes the contents of its $j$th bin to processor $j$ using the transpose primitive; each bin holds at most $\frac{n}{p^2} + \frac{p-1}{2}$ elements.

Step (3): Each processor collects the elements received in Step (2) into $p$ bins, with bin $j$ holding exactly those elements whose final destination is $j$.

Step (4): Each processor routes the contents of its $j$th bin to processor $j$ using the transpose primitive; each bin holds at most $\frac{h}{p} + \frac{p-1}{2}$ elements.
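The following C fragment is a minimal sketch of Steps (1) and (2). The element type, the padding convention ($dest = -1$), and the use of MPI_Alltoall in place of the transpose primitive analyzed in [7] are our illustrative choices; Steps (3) and (4) repeat the same pattern with block size $\frac{h}{p} + \frac{p-1}{2}$.

```c
#include <mpi.h>
#include <stdlib.h>

typedef struct { int dest; long data; } element_t;   /* illustrative */

/* Step (1): deal m local elements (pre-sorted by dest) into p bins
 * of capacity cap each, starting destination j's run at bin
 * (rank + j) mod p and padding unused slots with dest = -1 so that
 * every block has the same size.  cap must be at least the bin-size
 * bound proved in the correctness argument. */
static void deal(const element_t *e, int m, int p, int rank, int cap,
                 element_t *bins)
{
    int *fill = calloc(p, sizeof *fill);
    int k = 0, prev = -1;
    for (int t = 0; t < m; t++) {
        if (e[t].dest != prev) { prev = e[t].dest; k = 0; }
        int b = (rank + e[t].dest + k++) % p;
        bins[b * cap + fill[b]++] = e[t];
    }
    for (int j = 0; j < p; j++)
        while (fill[j] < cap) bins[j * cap + fill[j]++].dest = -1;
    free(fill);
}

/* Step (2): the transpose -- every processor sends its j-th block of
 * cap elements to processor j.  Raw bytes suffice on a homogeneous
 * machine; a committed MPI datatype would be more portable. */
static void transpose(element_t *bins, element_t *recv, int cap)
{
    MPI_Alltoall(bins, cap * (int)sizeof(element_t), MPI_BYTE,
                 recv, cap * (int)sizeof(element_t), MPI_BYTE,
                 MPI_COMM_WORLD);
}
```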

Correctness
To prove the correctness of our algorithm, we need to establish the bounds on the bin sizes claimed in Steps (2) and (4). To establish the bound on the size of each bin in Step (2), we note that the assignment process of Step (1) is equivalent to sorting all the elements held in processor $i$ by destination and then assigning all those elements with a common destination $j$ one by one to successive bins (cyclically, so that bin $0$ follows bin $p-1$), beginning with bin $\left((i+j) \bmod p\right)$. Thus, the $k$th element with destination $j$ goes to bin $\left((i+j+k-1) \bmod p\right)$. Let $d_j$ be the number of elements a processor initially has with destination $j$. Notice that with this placement scheme, each bin will have at least $\left\lfloor \frac{d_j}{p} \right\rfloor$ elements with destination $j$, corresponding to the number of complete passes made around the bins, with $(d_j \bmod p)$ consecutive bins having one additional element for $j$. Moreover, this run of additional elements will begin at the bin in which we originally started placing the elements with destination $j$. This means that if bin $l$ holds an additional element with destination $j$, the $\left((l-i-j) \bmod p\right)$ preceding bins will also hold an additional element with destination $j$. Further, note that if bin $l$ holds exactly $q$ such additional elements, each element in this set will have a distinct destination. Since for each destination the run of additional elements originates at a distinct bin, for each of the $q$ additional elements in bin $l$ a different number of consecutive bins preceding $l$ must also hold an additional element with the same destination. Consequently, if bin $l$ holds exactly $q$ additional elements, there must be a minimum of $0 + 1 + \cdots + (q-1) = \frac{q(q-1)}{2}$ additional elements in the bins preceding bin $l$, for a minimum total of $\frac{q(q+1)}{2}$ additional elements distributed amongst the $p$ bins.
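As a small illustrative example (ours, not part of the original analysis), let $p = 3$ and suppose processor $0$ holds $d_0 = 4$, $d_1 = 2$, and $d_2 = 1$ elements. Destination $0$'s elements go to bins $0, 1, 2, 0$; destination $1$'s to bins $1, 2$; and destination $2$'s to bin $2$. Bin $2$ then holds $q = 2$ additional elements (one each for destinations $1$ and $2$), whose runs begin at bins $1$ and $2$ respectively, so bin $1$ must also hold an additional element, for a total of at least $\frac{q(q+1)}{2} = 3$ additional elements, exactly as the argument predicts.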

Consider the largest bin, which holds $B$ of the evenly placed elements and $q$ of the additional elements, and let its size be $S = B + q$. Recall that if a bin holds $q$ additional elements, then there must be at least $\frac{q(q+1)}{2}$ additional elements somehow distributed amongst the $p$ bins. Since every bin holds the same number $B$ of evenly placed elements and a processor holds $\frac{n}{p}$ elements in all,

$$pB + \frac{q(q+1)}{2} \;\le\; \frac{n}{p}.$$

Rearranging, we get

$$B \;\le\; \frac{n}{p^2} - \frac{q(q+1)}{2p}.$$

Thus, we have that

$$S \;=\; B + q \;\le\; \frac{n}{p^2} + q - \frac{q(q+1)}{2p}.$$

Since the right-hand side of this inequality is maximized over the integers when $q = p-1$ (or, equivalently, $q = p$), it follows that

$$S \;\le\; \frac{n}{p^2} + \frac{p-1}{2}.$$
One can show that this bound is tight as there are cases for which the upper bound is achieved.

To bound the bin size in Step (4), recall that the number of elements in bin $j$ at processor $i$ is simply the number of elements which arrive at processor $i$ in Step (2) that are bound for destination $j$. Since the elements which arrive at processor $i$ in Step (2) are simply the contents of the $i$th bins formed at the end of Step (1) in processors $0$ through $p-1$, bounding the bin size in Step (4) is simply the task of bounding the number of elements marked for destination $j$ which are put in any of these $p$ bins in Step (1). For our purposes, then, we can think of the concatenation of these $p$ bins as one superbin, and we can view Step (1) as a process in which each processor deals its set of elements bound for destination $j$ into $p$ possible superbins, each processor beginning with a distinct superbin (processor $i$ begins at superbin $\left((i+j) \bmod p\right)$). This is precisely the situation considered in our analysis of the first phase, except that now the total number of elements to be distributed is at most $h$. By the previous argument, the bin size for the second phase is bounded by

$$\frac{h}{p} + \frac{p-1}{2}.$$
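The superbin argument is easy to check empirically. The following self-contained simulation (our illustration, using the dealing rule as reconstructed above and arbitrary parameters) deals a random input and verifies both bin-size bounds:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define P   8          /* number of processors (illustrative)        */
#define PER 64         /* n/p: elements initially held per processor */

/* Sequential sanity check of both bin-size bounds: processor i
 * places its k-th element with destination j (k = 0, 1, ...) into
 * bin (i + j + k) mod p.  Written for this presentation only. */
int main(void)
{
    static int d[P][P];     /* d[i][j]: elements at i bound for j    */
    static int bin1[P][P];  /* bin1[i][l]: Step (1) bin loads        */
    static int bin2[P][P];  /* bin2[l][j]: Step (3) bin loads        */
    int n = P * PER, h = 0;

    srand(12345);
    for (int i = 0; i < P; i++)
        for (int e = 0; e < PER; e++)
            d[i][rand() % P]++;

    /* Deal every element and record where it lands in both phases. */
    for (int i = 0; i < P; i++)
        for (int j = 0; j < P; j++)
            for (int k = 0; k < d[i][j]; k++) {
                int l = (i + j + k) % P;  /* intermediate processor  */
                bin1[i][l]++;
                bin2[l][j]++;             /* grouped by final dest   */
            }

    /* h = the maximum number of elements any destination receives. */
    for (int j = 0; j < P; j++) {
        int recv = 0;
        for (int i = 0; i < P; i++) recv += d[i][j];
        if (recv > h) h = recv;
    }

    /* Check the two bounds established in the text. */
    double b1 = (double)n / (P * P) + (P - 1) / 2.0;
    double b2 = (double)h / P + (P - 1) / 2.0;
    for (int a = 0; a < P; a++)
        for (int b = 0; b < P; b++) {
            assert(bin1[a][b] <= b1);
            assert(bin2[a][b] <= b2);
        }
    printf("bounds hold: block sizes within %.1f and %.1f\n", b1, b2);
    return 0;
}
```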

Overall Complexity of the Algorithm
Clearly, all computation in this algorithm can be performed in $O\!\left(\frac{n}{p} + h\right)$ time. The transpose primitive, whose analysis is given in [7], takes $\tau + \sigma(p-1)\left(\frac{n}{p^2} + \frac{p-1}{2}\right)$ time in the second step, and $\tau + \sigma(p-1)\left(\frac{h}{p} + \frac{p-1}{2}\right)$ time in the last step. Thus, the overall complexity of our algorithm is given by

$$T(n,p) \;=\; O\!\left(\frac{n}{p} + h\right) \;+\; \left(2\tau + \sigma\left(\frac{n}{p} + h + (p-1)^2\right)\right)$$

for $h \ge p^2$. Clearly, the local computation bound is asymptotically optimal. As for the communication bound, $\tau + \sigma\max\left(\frac{n}{p}, h\right)$ is a lower bound, as in the worst case a processor sends $\frac{n}{p}$ elements and receives $h$ elements.
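To see why the condition on $h$ matters: when $h \ge p^2$ the additive $(p-1)^2$ term is $O(h)$, so the communication cost simplifies to $2\tau + O\!\left(\sigma\left(\frac{n}{p} + h\right)\right) = 2\tau + O(\sigma h)$ (using $h \ge \frac{n}{p}$), within a small constant factor of the lower bound.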

Related Work
The overall two-stage approach was independently described by Kaufmann et al. [29] and Ranka et al. [35] around the same time our earlier draft ([4]) appeared on our Web page. However, our algorithm is simpler, has less overhead, and has a tighter bound on the block size of the transpose than the algorithms described in the related work.





