An efficient algorithm to broadcast q elements from a single
processor to p processors is based on matrix transposition, where
q is assumed to be larger than p. Processor 0 holds the q
elements to be broadcast in the first column of matrix A. We
compute the matrix transpose of A, thus, giving every processor
elements. Each processor then locally rearranges the
data so that an additional matrix transpose will result in each
processor holding a copy of all the q elements in its column of A
[21].
The following algorithm runs on processor i:
Notice that at the end of Step 2, only the first
elements in each column are valid. Because of this, we specialize the
matrix transpose in Step 3 to prefetch only this first slot from every
other processor.
The analysis of this broadcasting algorithm is simple. Since this algorithm just performs two matrix transpositions, the complexities of the broadcasting algorithm are