next up previous
Next: Image (Data) Layout Up: Block Distributed Memory Previous: Matrix Transposition

Broadcasting

Another useful data movement primitive is broadcasting. An efficient BDM algorithm is given [4], [25] which takes q elements on a single processor and broadcasts them to the other p-1 processors using just two matrix transpositions.

An efficient algorithm to broadcast q elements from a single processor to p processors is based on matrix transposition, where q is assumed to be larger than p. Processor 0 holds the q elements to be broadcast in the first column of matrix A. We compute the matrix transpose of A, thus, giving every processor elements. Each processor then locally rearranges the data so that an additional matrix transpose will result in each processor holding a copy of all the q elements in its column of A [25].

The analysis of this broadcasting algorithm is simple. Since this algorithm just performs two matrix transpositions, the complexities of the broadcasting algorithm are

 

Performance analysis given in [4] reflects the execution times from implementations on the CM-5, SP-2, and CS-2, each with p = 32 parallel processing nodes. The algorithms are written in SPLIT-C , a parallel extension of the C programming language, primarily intended for distributed memory multiprocessors. SPLIT-C can express the capabilities of the BMD model and provides a shared global address space, constructs to express data layout, and split-phase assignments. The split-phase assignment operator, , prefetches data from the specified remote location into local memory. Computation can be overlapped with the remote request, and the sync() function allows each processor to stall until all data prefetching is complete. The SPLIT-C language also supplies a barrier() function for the global synchronization of the processors.



next up previous
Next: Image (Data) Layout Up: Block Distributed Memory Previous: Matrix Transposition



David A. Bader
dbader@umiacs.umd.edu