A fundamental problem in parallel computing is to design high-level, architecture-independent algorithms that execute efficiently on general-purpose parallel machines. The aim is to achieve portability and high performance simultaneously. Note that it is considerably easier to achieve either portability alone (say, by using PVM) or high performance alone (say, by having expert programmers fine-tune the algorithm to a specific machine). Two current developments make this fundamental problem more tractable. The first is the emergence of a dominant parallel architecture consisting of a number of powerful microprocessors interconnected by either a proprietary or a standard off-the-shelf interconnect. The second is the emergence of standards, such as the message-passing standard MPI [20], for which machine builders and software developers will try to provide efficient support. Our work builds on these two developments by presenting a theoretical and experimental framework for designing parallel algorithms. In this abstract, we sketch our contributions to two important problems: personalized communication and sorting. We start with a brief outline of the computation model.
We view a parallel algorithm as a sequence of local computations interleaved with communication steps, and we allow computation and communication to overlap. We account for communication costs as follows. Assuming no congestion, the transfer of a block consisting of $m$ contiguous words between two processors takes $O(\tau + \sigma m)$ time, where $\tau$ is a bound on the latency of the network and $\sigma$ is the time per word at which a processor can inject or receive data from the network. The cost of each of the collective communication primitives (see below) will be modeled by $O(\tau + \sigma m)$, where $m$ is the maximum amount of data transmitted or received by a processor. Such a cost can be justified by using our earlier work [17, 6, 5]. Using this cost model, we can evaluate the communication time $T_{\mathrm{comm}}(n,p)$ of an algorithm as a function of the input size $n$, the number of processors $p$, and the parameters $\tau$ and $\sigma$. The coefficient of $\tau$ gives the total number of times collective communication primitives are used, and the coefficient of $\sigma$ gives the maximum total amount of data exchanged between a processor and the remaining processors. This communication model is close to a number of similar models (e.g., the LogP [13], BSP [24], and LogGP [1] models) that have recently appeared in the literature and seems to be well-suited for designing parallel algorithms on current high-performance platforms. We define the computation time $T_{\mathrm{comp}}(n,p)$ as the maximum time taken by any processor to perform all of its local computation steps.
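As a purely illustrative reading of the model: if a hypothetical algorithm uses $k$ collective communication primitives and a processor exchanges at most $M$ words in total with the remaining processors, its communication time would be estimated as
\[
T_{\mathrm{comm}}(n,p) = O(k\,\tau + M\,\sigma),
\]
so the coefficient of $\tau$ counts primitive invocations and the coefficient of $\sigma$ counts the words exchanged.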
Our algorithms are implemented in SPLIT-C [11], an extension of C for distributed memory machines. The algorithms make use of MPI-like communication primitives but do not make any assumptions as to how these primitives are actually implemented. Our collective communication primitives, described in detail in [6], are similar to those of MPI [20], the IBM POWERparallel [7], and the Cray MPP systems [10] and, for example, include the following: transpose, bcast, gather, and scatter. Brief descriptions of these are as follows. The transpose primitive is an all-to-all personalized communication in which each processor has to send a unique block of data to every processor, and all the blocks are of the same size. The bcast primitive is called to broadcast a block of data from a single source to all the remaining processors. The primitives gather and scatter are companion primitives whereby scatter divides a single array residing on a processor into equal-sized blocks which are then distributed to the remaining processors, and gather coalesces these blocks residing on the different processors into a single array on one processor.
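Since the primitives are deliberately MPI-like, the short sketch below shows, purely for illustration, how the four collectives map onto standard MPI calls in C; the block size B, the use of doubles, and the choice of processor 0 as the root are assumptions made for this example and do not reflect the SPLIT-C interface used in our implementation.

/* Illustrative mapping of the four primitives onto standard MPI collectives.
   B (block size), MPI_DOUBLE, and root 0 are arbitrary choices for this sketch. */
#include <mpi.h>
#include <stdlib.h>

#define B 1024                              /* words per block (assumed) */

int main(int argc, char **argv)
{
    int p, id;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    (void)id;                               /* rank not needed in this sketch */

    double *send = malloc((size_t)p * B * sizeof(double)); /* p blocks of B words */
    double *recv = malloc((size_t)p * B * sizeof(double));

    /* transpose: all-to-all personalized communication; each processor
       sends a distinct block of B words to every processor */
    MPI_Alltoall(send, B, MPI_DOUBLE, recv, B, MPI_DOUBLE, MPI_COMM_WORLD);

    /* bcast: one block of B words from processor 0 to all processors */
    MPI_Bcast(send, B, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* scatter: processor 0 splits an array of p*B words into p equal
       blocks, one per processor; gather is the inverse operation */
    MPI_Scatter(send, B, MPI_DOUBLE, recv, B, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Gather(recv, B, MPI_DOUBLE, send, B, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(send);
    free(recv);
    MPI_Finalize();
    return 0;
}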