Another useful data movement primitive is broadcasting. An efficient BDM algorithm is given [4], [25] which takes q elements on a single processor and broadcasts them to the other p-1 processors using just two matrix transpositions.
An efficient algorithm to broadcast q elements from a single
processor to p processors is based on matrix transposition, where
q is assumed to be larger than p. Processor 0 holds the q
elements to be broadcast in the first column of matrix A. We
compute the matrix transpose of A, thus, giving every processor
elements. Each processor then locally rearranges the
data so that an additional matrix transpose will result in each
processor holding a copy of all the q elements in its column of A
[25].
The analysis of this broadcasting algorithm is simple. Since this algorithm just performs two matrix transpositions, the complexities of the broadcasting algorithm are
Performance analysis given in [4] reflects the execution times
from implementations on the CM-5, SP-2, and CS-2, each with p = 32
parallel processing nodes. The algorithms are written in SPLIT-C , a
parallel extension of the C programming language, primarily intended
for distributed memory multiprocessors. SPLIT-C can express the
capabilities of the BMD model and provides a shared global address
space, constructs to express data layout, and split-phase
assignments. The split-phase assignment operator, ,
prefetches data from the specified remote location into local memory.
Computation can be overlapped with the remote request, and the
sync() function allows each processor to stall until all data
prefetching is complete. The SPLIT-C language also supplies a
barrier() function for the global synchronization of the processors.