Balancing load among processors is very important since poor balance of load generally causes poor processor utilization [19]. The load balancing problem is also important in developing fast solutions for basic combinatorial problems such as sorting, selection, list ranking, and graph problems [12,21].
This problem can be defined as follows.
The load in each processor is given by an array
, where
represents the number of useful elements in
such that
and
.
We are supposed to redistribute the n
elements over the p processors such that
elements are stored
in each processor, where we have assumed without loss of generality that
p divides n.
In this section, we develop
a simple and efficient load balancing algorithm for the BDM model.
The corresponding communication time is given by
. The overall strategy is
described next.
Let and
,
for
. Then, the overall strategy of the load balancing
algorithm can be described as follows: First, the load balancing problem of
the
elements stored in the p arrays
,
, is considered and hence an
output array
of
is generated.
The array
may have km or
useful
elements, where
(steps 2 and 3).
Next, processors with
useful elements read m elements
from appropriate processors (Step 4). The details are given in the next
algorithm.
We assume for simplicity that all of n, p, and m are powers of two.
Algorithm Load_Balancing
Input : Each processor contains an input array
.
Output : The elements are redistributed in such a way that
elements are stored in the output array
of
, for
.
begin
2.1 ¯ for ¯ j=0 to p-1 doRemark: The index t is chosen in such a way that, for i<t, processor{
;
;}
2.2 Compute the prefix sums
of
, and
compute
;
2.3 if
then
{ ¯
{
};
{
};
}
else
{
{
};
{
};
}
Remark:
This step needs a special attention since there are cases when
a set of consecutive processors read their elements from one processor,
say . Assume that h processors,
,
have to read some appropriate elements from
.
Notice that
.
Then this step can be divided into two substeps as follows:
In the first substep, h-1 processors,
, read their elements from each such
; this substep can be done in
communication time by applying Corollary 3.1.
In the second substep, the rest of the routing is performed by using
a sequence of read prefetch operations since the remaining elements in each
processor are accessed only by a single processor.
Hence the total communication time required by this step is
.
Remark:
This step can be completed in communication time
since each processor reads its m elements from
at most m processors, and these reads can be prefetched.
Thus, one can show the following theorem.