We use the Block Distributed Memory (BDM) Model ([21], [22]) as a computation model for developing and analyzing our parallel algorithms on distributed memory machines. This model allows the design of algorithms using a single address space and does not assume any particular interconnection topology. The model captures performance by incorporating a cost measure for interprocessor communication induced by remote memory accesses. The cost measure includes parameters reflecting memory latency, communication bandwidth, and spatial locality. This model allows the initial placement of data and prefetching.
The complexity of parallel algorithms will be evaluated in terms of two measures: the computation time T_comp(n, p), and the communication time T_comm(n, p). The measure T_comp(n, p) refers to the maximum of the local computations performed on any processor, as measured on the standard sequential RAM model. The communication time T_comm(n, p) refers to the total amount of communication time spent by the overall algorithm in accessing remote data. Using the BDM model, an access operation to a remote location takes τ + σ time, and l prefetch read operations can be executed in τ + lσ time, where τ is the normalized maximum latency of any message sent in the communications network. No processor can send or receive more than one word at a time.
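To make the cost measure concrete, the following C sketch (the function names and parameter values are ours, chosen for illustration) contrasts the BDM cost τ + lσ of l prefetch reads issued together with the cost of l blocking reads issued one at a time, which pay the latency τ on every access:

```c
#include <assert.h>

/* BDM communication-time estimates (a sketch; tau is the normalized
   maximum message latency, sigma the per-word cost). */

/* l prefetch reads issued together: the latency is paid once. */
double t_prefetch(double tau, double sigma, int l) {
    return tau + l * sigma;
}

/* l blocking reads issued one at a time: the latency is paid l times. */
double t_blocking(double tau, double sigma, int l) {
    return l * (tau + sigma);
}
```

For example, with τ = 10, σ = 1, and l = 8, prefetching costs 18 time units while blocking reads cost 88, which is why the algorithms below batch their remote accesses.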
Two useful data movement patterns, matrix transposition and broadcasting, are discussed next, and their analyses will be included as primitives in the algorithms that follow.
Given a q × p matrix on a p-processor machine, where p divides q, the matrix transposition consists of rearranging the data such that the first q/p rows of elements are destined to the first processor, the second q/p rows to the second processor, and so on, with the last q/p rows of the matrix destined to the last processor. An efficient matrix transposition algorithm consists of p iterations such that, during iteration i (0 ≤ i ≤ p − 1), each processor P_j prefetches the appropriate block of q/p elements from processor P_{(j+i) mod p}.
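The schedule above is contention-free, which is what makes it efficient. The following C sketch (p = 8 is an illustrative choice; the function is ours, not the paper's code) checks the two properties the algorithm relies on: every processor fetches exactly one block from every processor, and within a single iteration no two processors read from the same source:

```c
#include <assert.h>
#include <stdbool.h>

#define P 8  /* number of processors; illustrative value */

/* In iteration i, processor j prefetches a block from processor
   (j + i) mod p.  Verify that the schedule covers all source blocks
   and that no source is read twice in the same iteration. */
bool schedule_is_valid(void) {
    bool fetched[P][P] = {{false}};
    for (int i = 0; i < P; i++) {
        bool source_used[P] = {false};
        for (int j = 0; j < P; j++) {
            int src = (j + i) % P;
            if (source_used[src]) return false;  /* contention */
            source_used[src] = true;
            fetched[j][src] = true;
        }
    }
    for (int j = 0; j < P; j++)
        for (int s = 0; s < P; s++)
            if (!fetched[j][s]) return false;    /* missing block */
    return true;
}
```

Because each iteration is a permutation of the processors, the p prefetches of an iteration can all proceed in parallel without violating the one-word-at-a-time constraint.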
Next, an efficient BDM algorithm is given which takes q elements on a single processor and broadcasts them to the other p-1 processors using just two matrix transpositions.
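A minimal simulation of this two-transposition broadcast, with local arrays standing in for processor memories (p = 4 and q = 16 are illustrative values; the routine is a sketch of the data movement, not the paper's code):

```c
#include <assert.h>
#include <string.h>

#define P 4        /* processors; illustrative */
#define Q 16       /* elements to broadcast; P divides Q */
#define B (Q / P)  /* block size q/p */

/* Step 1: a first transposition scatters block j of the source vector
   (held entirely by one processor) to processor j.  Step 2: a second
   transposition delivers block i of every processor to all processors,
   so each one ends up holding all q elements. */
void broadcast(const int src[Q], int dst[P][Q]) {
    int block[P][B];
    /* first transposition: q/p elements to each processor */
    for (int j = 0; j < P; j++)
        memcpy(block[j], src + j * B, B * sizeof(int));
    /* second transposition: processor j collects block i from
       processor i, for every i */
    for (int j = 0; j < P; j++)
        for (int i = 0; i < P; i++)
            memcpy(dst[j] + i * B, block[i], B * sizeof(int));
}
```

Each transposition moves O(q) words per processor, so the broadcast avoids the p-fold serialization at the source that a naive one-to-all scheme would incur.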
The performance analysis given will reflect the execution times from
implementations on the CM-5, SP-2, and CS-2, each with p = 32
parallel processing nodes. The algorithms are written in SPLIT-C, a
parallel extension of the C programming language, primarily intended
for distributed memory multiprocessors. SPLIT-C can express the
capabilities of the BDM model and provides a shared global address
space, constructs to express data layout, and split-phase
assignments. The split-phase assignment operator, :=,
prefetches data from the specified remote location into local memory.
Computation can be overlapped with the remote request, and the
sync() function allows each processor to stall until all data
prefetching is complete. The SPLIT-C language also supplies a
barrier() function for the global synchronization of the processors.
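A schematic Split-C fragment (a sketch; the array names and the bound L are hypothetical) showing the split-phase idiom just described:

```
int *global remote;          /* pointer into a remote processor's memory */
int local[L];

for (i = 0; i < L; i++)
    local[i] := remote[i];   /* split-phase: issue the prefetch, do not wait */

/* ... independent local computation overlaps the remote requests ... */

sync();                      /* stall until all prefetches have completed */
barrier();                   /* global synchronization of the processors */
```

The loop issues all L prefetches back to back, so under the BDM model the whole batch costs τ + Lσ rather than L(τ + σ).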