For ease of presentation, we first describe the personalized
communication algorithm for the case when the input is initially
evenly distributed amongst the processors, and return to the general
case in Section 3.3. Consider a set of n elements evenly
distributed amongst p processors in such a manner that no processor
holds more than $\frac{n}{p}$ elements. Each element consists of a
pair $(data, dest)$, where $dest$ is the location to which
the data is to be routed. There are no assumptions made about the
pattern of data redistribution, except that no processor is the
destination of more than h elements. We assume for simplicity (and
without loss of generality) that h is an integral multiple of p.
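For concreteness, an element of this kind can be represented in C as a small record carrying the payload together with its routing tag; the type and field names below are illustrative only.

/* One element of the h-relation: a payload together with the index of the
 * processor to which it must be delivered.  The names are illustrative.  */
typedef struct {
    int    dest;   /* destination processor, 0 <= dest < p */
    double data;   /* payload routed along with the tag    */
} element_t;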
A straightforward solution to this problem might attempt to sort the elements by destination and then route those elements with a given destination directly to the correct processor. No matter how the messages are scheduled, there exist cases that give rise to large variations in message sizes and hence result in an inefficient use of the communication bandwidth. Moreover, such a scheme cannot take advantage of the regular communication primitives proposed under the MPI standard. The standard does provide the MPI_Alltoallv primitive for the restricted case in which the elements are already locally sorted by destination and a vector of indices of the first element for each destination in each local array is provided by the user.
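As an illustration of this restricted case, the following sketch (in C with MPI) routes elements that have already been grouped by destination; the buffer names and the derived datatype elem_type describing an element are assumptions, and the counts and displacements must have been prepared by the caller.

#include <mpi.h>

/* Sketch: one-shot routing of elements that are ALREADY grouped by
 * destination.  sendcounts[j] is the number of local elements bound for
 * processor j, and sdispls[j] the index of the first of them in sendbuf;
 * both are supplied by the user, as required by MPI_Alltoallv.           */
void route_presorted(void *sendbuf, int *sendcounts, int *sdispls,
                     void *recvbuf, int *recvcounts, int *rdispls,
                     MPI_Datatype elem_type, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);

    /* The receiver must learn how many elements arrive from each
     * processor, so the counts are exchanged first.                      */
    MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, comm);

    rdispls[0] = 0;
    for (int i = 1; i < p; i++)
        rdispls[i] = rdispls[i - 1] + recvcounts[i - 1];

    MPI_Alltoallv(sendbuf, sendcounts, sdispls, elem_type,
                  recvbuf, recvcounts, rdispls, elem_type, comm);
}

Note that the message sizes in such a call are exactly the per-destination counts, which is precisely where the large variations mentioned above arise.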
In our solution, we use two rounds of the transpose collective communication primitive. In the first round, each element is routed to an intermediate destination, and during the second round, it is routed to its final destination.
The algorithm consists of the following steps. In Step (1), each processor deals its elements into p local bins, placing the elements bound for each destination j into the bins in a round-robin fashion so that distinct destinations begin at distinct bins. In Step (2), the transpose primitive routes the contents of the i-th bin at every processor to processor i. In Step (3), each processor collects the elements it has received into p bins according to their final destinations, and in Step (4), a second transpose routes the j-th bin at every processor to processor j, delivering every element to its final destination.
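A sketch of the dealing performed in Step (1) is given below; the starting offset (i + j) mod p used at processor i for the elements bound for destination j is one concrete choice that gives distinct destinations distinct starting bins at a processor (and, for a fixed destination, distinct starting bins across processors), and the bin layout is illustrative. The contents of bin b would then be shipped to processor b by the transpose (e.g., an MPI_Alltoall over equally padded blocks).

#include <stdlib.h>

/* Step (1) sketch for processor `myrank`: deal the local elements into p
 * bins, round-robin per destination, starting the elements bound for
 * destination j at bin (myrank + j) mod p.  The caller zeroes bin_sizes[]
 * and allocates each bins[b] with sufficient capacity; element_t is the
 * record type sketched earlier.                                          */
void deal_into_bins(const element_t *elems, int n_local, int p, int myrank,
                    element_t **bins, int *bin_sizes)
{
    int *next = malloc(p * sizeof(int));    /* next bin for each destination      */
    for (int j = 0; j < p; j++)
        next[j] = (myrank + j) % p;         /* distinct start bin per destination */

    for (int k = 0; k < n_local; k++) {
        int j = elems[k].dest;              /* final destination of this element  */
        int b = next[j];
        bins[b][bin_sizes[b]++] = elems[k]; /* deal it into bin b                 */
        next[j] = (b + 1) % p;              /* advance the round-robin pointer    */
    }
    free(next);
}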
To bound the size of the bins formed in Step (1), consider a single processor and, for each destination j, let $d_j$ denote the number of its elements bound for destination j; write $d_j = q_j p + r_j$ with $0 \le r_j < p$. The round-robin dealing places $q_j$ of these elements in every bin (the evenly placed elements) and the remaining $r_j$ additional elements into $r_j$ consecutive bins beginning with the starting bin for destination j. Consider the largest bin, which holds $\sum_{j} q_j$ of the evenly placed elements and some number $k$ of the additional elements, and let its size be
\[
s = \sum_{j=0}^{p-1} q_j + k .
\]
Recall that if a bin holds $k$ additional elements, then there must be at least $1 + 2 + \cdots + k = \frac{k(k+1)}{2}$ additional elements somehow distributed amongst the p bins, since the destinations contributing them begin their runs of consecutive bins at distinct bins. Thus, because the processor holds at most $\frac{n}{p}$ elements in all,
\[
\frac{k(k+1)}{2} \le \sum_{j=0}^{p-1} r_j \le \frac{n}{p} - p \sum_{j=0}^{p-1} q_j .
\]
Rearranging, we get
\[
\sum_{j=0}^{p-1} q_j \le \frac{1}{p} \left( \frac{n}{p} - \frac{k(k+1)}{2} \right) .
\]
Thus, we have that
\[
s \le \frac{n}{p^2} + k - \frac{k(k+1)}{2p} .
\]
Since the right-hand side of this inequality is maximized over k when $k = p-1$, it follows that
\[
s \le \frac{n}{p^2} + \frac{p-1}{2} .
\]
One can show that this bound is tight as there are cases for which the upper bound is achieved.
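As a small illustration (assuming the elements bound for destination j are dealt beginning with bin j at the processor considered), take p = 4 and a processor holding $\frac{n}{p} = 6$ elements: three bound for destination 0, two for destination 1, and one for destination 2. The dealing then places one element in bin 0, two in bin 1, three in bin 2, and none in bin 3, so bin 2 attains the bound $\frac{n}{p^2} + \frac{p-1}{2} = \frac{6}{4} + \frac{3}{2} = 3$.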
To bound the bin size in Step (4), recall that the number of
elements in bin j at processor i is simply the number of elements
which arrive at processor i in Step (2) which are bound for
destination j. Since the elements which arrive at processor i in
Step (2) are simply the contents of the i-th bins formed at the end of Step (1) in processors 0 through p-1, bounding the bin size in Step (4) reduces to bounding the number of elements marked for destination j which are placed into any of these p bins in Step (1). For our purposes, then, we can
think of the concatenation of these p
bins as being
one superbin, and we can view Step (1) as a process in which
each processor deals its set of elements bound for destination
j into the p possible superbins, with each processor beginning its
dealing at a distinct superbin. This is precisely the situation considered
in our analysis of the first phase, except now the total number of
elements to be distributed is at most h. By the previous argument,
the bin size for the second phase is bounded by $\frac{h}{p} + \frac{p-1}{2}$
for each destination j. Each processor therefore performs
$O\!\left(\frac{n}{p} + h + p^2\right)$ local work over the two phases,
which is $O\!\left(\frac{n}{p} + h\right)$ whenever $h \ge p^2$.
Clearly, the local computation bound is asymptotically optimal.
As for the communication bound, the cost of sending $\frac{n}{p}$
elements and receiving h elements is a lower bound, since in the worst
case a processor must do both.
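As a rough accounting (assuming each transpose ships p blocks whose common size is the corresponding bin bound, with shorter bins padded), the first round moves at most
\[
p \left( \frac{n}{p^2} + \frac{p-1}{2} \right) = \frac{n}{p} + \frac{p(p-1)}{2}
\]
elements into and out of each processor, and the second round at most
\[
p \left( \frac{h}{p} + \frac{p-1}{2} \right) = h + \frac{p(p-1)}{2}
\]
elements, i.e., within an additive $\frac{p(p-1)}{2}$ term per round of the $\frac{n}{p}$ and h figures appearing in the lower bound.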