Balancing the load among processors is very important, since a poor load balance generally causes poor processor utilization [19]. The load balancing problem is also important in developing fast solutions for basic combinatorial problems such as sorting, selection, list ranking, and graph problems [12,21].
This problem can be defined as follows. The load in each processor is given by an array $D = (d_0, d_1, \ldots, d_{p-1})$, where $d_i$ represents the number of useful elements in processor $P_i$, such that $0 \le d_i \le n$ and $\sum_{i=0}^{p-1} d_i = n$. We are supposed to redistribute the $n$ elements over the $p$ processors such that $m = n/p$ elements are stored in each processor, where we have assumed without loss of generality that $p$ divides $n$.
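To make the definitions concrete, the following Python sketch sets up a small problem instance; the particular values of $n$, $p$, and the load array are hypothetical, chosen only to satisfy the stated constraints.

```python
# Illustrative instance of the load balancing problem; the values of
# n, p, and d are hypothetical, chosen only to satisfy the constraints.
n, p = 16, 4              # n elements distributed over p processors
d = [7, 1, 3, 5]          # d[i] = number of useful elements in processor P_i

assert sum(d) == n                      # the loads account for all n elements
assert all(0 <= di <= n for di in d)    # each load is between 0 and n

m = n // p                # target: m = n/p elements per processor (p divides n)
print(m)                  # prints 4
```

The goal is to move elements so that every processor ends up holding exactly $m$ of them.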
In this section, we develop a simple and efficient load balancing algorithm for the BDM model. The corresponding communication time is $O(\tau + \frac{n}{p})$. The overall strategy is described next.
Let $m = n/p$, and let $A_j$ denote the input array of processor $P_j$, holding $d_j$ useful elements, for $0 \le j \le p-1$. Then, the overall strategy of the load balancing algorithm can be described as follows. First, the load balancing problem for the elements stored in the $p$ arrays $A_j$, $0 \le j \le p-1$, is considered, and an output array $B_i$ of $P_i$ is generated; at the end of this stage, the number of useful elements in each $B_i$ is a multiple of $m$ (Steps 2 and 3). Next, the processors whose output arrays hold fewer than $m$ useful elements read $m$ elements from appropriate processors (Step 4). The details are given in the next algorithm. We assume for simplicity that all of $n$, $p$, and $m$ are powers of two.
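One way to picture the first stage is to split each load $d_j$ into full blocks of size $m$ plus a remainder; because the loads sum to $pm$, the remainders necessarily sum to a multiple of $m$. The Python sketch below illustrates this bookkeeping on hypothetical data (it is not the authors' code):

```python
# Split each load d_j into full blocks of size m and a remainder
# (hypothetical bookkeeping for the first stage; not the paper's code).
n, p = 16, 4
m = n // p                       # m = n/p = 4
d = [7, 1, 3, 5]                 # example loads with sum(d) == n

k = [dj // m for dj in d]        # number of full blocks of size m in A_j
r = [dj % m for dj in d]         # leftover elements of A_j

# Because sum(d) = p*m, the remainders always sum to a multiple of m:
assert sum(r) % m == 0
print(k, r)                      # prints [1, 0, 0, 1] [3, 1, 3, 1]
```

In this instance the remainders sum to $2m$, so after the first stage two processors can end up with an extra full block while the others hold a smaller multiple of $m$.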
Algorithm Load_Balancing
Input: Each processor $P_i$ contains an input array $A_i$ with $d_i$ useful elements, for $0 \le i \le p-1$.
Output: The elements are redistributed in such a way that $m = n/p$ elements are stored in the output array $B_i$ of $P_i$, for $0 \le i \le p-1$.
begin
1. for j = 0 to p-1 do
       { compute $k_j = \lfloor d_j/m \rfloor$ and $r_j = d_j - k_j m$ };
2. Compute the prefix sums of $r_0, r_1, \ldots, r_{p-1}$, and compute the index $t$;
   Remark: The index $t$ is chosen in such a way that, for $i < t$, processor $P_i$ will read one element more than each processor $P_i$ with $i \ge t$ does. The indices computed in Steps 1 and 2 will be used in the next step to determine the locations of the elements that will be moved to each $B_i$. Notice that this step takes $O(p)$ computation time.
3. if $i < t$ then
       { $P_i$ reads the larger number of elements from the locations determined in Step 2 };
   else
       { $P_i$ reads the smaller number of elements from the locations determined in Step 2 };
   Remark: This step needs special attention, since there are cases when a set of consecutive processors read their elements from one processor, say $P_j$. Assume that $h$ processors, $P_i, P_{i+1}, \ldots, P_{i+h-1}$, have to read some appropriate elements from $P_j$. Notice that $d_j > (h-2)m$. Then this step can be divided into two substeps as follows. In the first substep, $h-1$ of these processors read their elements from each such $P_j$; this substep can be done in $O(\tau + \frac{n}{p})$ communication time by applying Corollary 3.1. In the second substep, the rest of the routing is performed by using a sequence of read prefetch operations, since the remaining elements in each processor are accessed only by a single processor. Hence the total communication time required by this step is $O(\tau + \frac{n}{p})$.
4. Each processor whose output array $B_i$ holds fewer than $m$ useful elements reads its $m$ elements from the appropriate processors;
   Remark: This step can be completed in $O(\tau + \frac{n}{p})$ communication time, since each processor reads its $m$ elements from at most $m$ processors, and these reads can be prefetched.
end
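The read pattern behind these remarks can be simulated sequentially: using the prefix sums of the loads, processor $P_i$ locates the elements of global ranks $im$ through $(i+1)m-1$ and reads each maximal run lying in a single source processor as one block. The following Python sketch is a sequential simulation on hypothetical data, not BDM code:

```python
from itertools import accumulate

# Sequential simulation of the redistribution: processor P_i gathers the
# elements of global ranks i*m .. (i+1)*m - 1 using the prefix sums of
# the loads (hypothetical example data; not the paper's implementation).
n, p = 16, 4
m = n // p
d = [7, 1, 3, 5]
s = [0] + list(accumulate(d))   # s[j] = global rank of P_j's first element

def sources(i):
    """Return (source processor, element count) pairs read by P_i."""
    reads, lo, hi = [], i * m, (i + 1) * m
    for j in range(p):
        # overlap of [lo, hi) with P_j's rank interval [s[j], s[j+1])
        cnt = min(hi, s[j + 1]) - max(lo, s[j])
        if cnt > 0:
            reads.append((j, cnt))
    return reads

plan = [sources(i) for i in range(p)]
assert all(sum(c for _, c in plan[i]) == m for i in range(p))  # each gets m
print(plan)
```

In this instance $d_0 = 7 > m$, so processors $P_0$ and $P_1$ both read from $P_0$, which is exactly the consecutive-readers situation singled out in the remarks above.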
Thus, one can show the following theorem.