next up previous
Next: Experimental Results for Up: Block Distributed Memory Previous: Experimental Results for

Analysis for the Broadcasting Algorithm

An efficient algorithm to broadcast q elements from a single processor to p processors is based on matrix transposition, where q is assumed to be larger than p. Processor 0 holds the q elements to be broadcast in the first column of matrix A. We compute the matrix transpose of A, thus, giving every processor elements. Each processor then locally rearranges the data so that an additional matrix transpose will result in each processor holding a copy of all the q elements in its column of A [21].

The following algorithm runs on processor i:

 

Notice that at the end of Step 2, only the first elements in each column are valid. Because of this, we specialize the matrix transpose in Step 3 to prefetch only this first slot from every other processor.

The analysis of this broadcasting algorithm is simple. Since this algorithm just performs two matrix transpositions, the complexities of the broadcasting algorithm are

 



next up previous
Next: Experimental Results for Up: Block Distributed Memory Previous: Experimental Results for



David A. Bader
dbader@umiacs.umd.edu