Next: SortingFFT, and Up: The Block Distributed Memory Previous: Basic Algorithms for

Load Balancing

Balancing load among processors is very important since poor balance of load generally causes poor processor utilization [19]. The load balancing problem is also important in developing fast solutions for basic combinatorial problems such as sorting, selection, list ranking, and graph problems [12,21].

This problem can be defined as follows. The load in each processor is given by an array , where represents the number of useful elements in such that and . We are supposed to redistribute the n elements over the p processors such that elements are stored in each processor, where we have assumed without loss of generality that p divides n.

In this section, we develop a simple and efficient load balancing algorithm for the BDM model. The corresponding communication time is given by . The overall strategy is described next.

Let and , for . Then, the overall strategy of the load balancing algorithm can be described as follows: First, the load balancing problem of the elements stored in the p arrays , , is considered and hence an output array of is generated. The array may have km or useful elements, where (steps 2 and 3). Next, processors with useful elements read m elements from appropriate processors (Step 4). The details are given in the next algorithm. We assume for simplicity that all of n, p, and m are powers of two.

Algorithm Load_Balancing

Input : Each processor contains an input array .
Output : The elements are redistributed in such a way that elements are stored in the output array of , for .

begin

[Step 1]

Each processor

reads the p-1 values

held in the remaining processors. This step can be performed in at most

communication time by Lemma 3.2.

[Step 2]

Processor

performs the following local computations:

 2.1 ¯ for ¯ j=0  to p-1  do

       				 { ;;}

    2.2  Compute the prefix sums  of 
           , and

       		 compute ;

    2.3   if   then

       				 { ¯  {};

       						     {};

       				 }

       		  else

       				 {  {};

       						     {};

       				 }

Remark: The index t is chosen in such a way that, for i<t, processor

will read

elements, and for

will read

elements. The indices

and

will be used in the next step to determine the locations of the

elements that will be moved to

. Notice that this step takes

computation time.

[Step 3]

Processor

reads

, and reads appropriate numbers of elements from

and

respectively.

Remark: This step needs a special attention since there are cases when a set of consecutive processors read their elements from one processor, say . Assume that h processors, , have to read some appropriate elements from . Notice that . Then this step can be divided into two substeps as follows: In the first substep, h-1 processors, , read their elements from each such ; this substep can be done in communication time by applying Corollary 3.1. In the second substep, the rest of the routing is performed by using a sequence of read prefetch operations since the remaining elements in each processor are accessed only by a single processor. Hence the total communication time required by this step is .

[Step 4]

Processor

, reads the remaining m elements from the appropriate processors; the corresponding indices

and

can be computed locally as in Step 2.

Remark: This step can be completed in communication time since each processor reads its m elements from at most m processors, and these reads can be prefetched.

end

Thus, one can show the following theorem.

Next: SortingFFT, and Up: The Block Distributed Memory Previous: Basic Algorithms for

joseph@umiacs.umd.edu