A fundamental problem in parallel computing is to design high-level, architecture independent, algorithms that execute efficiently on general-purpose parallel machines. The aim is to be able to achieve portability and high performance simultaneously. Note that it is considerably easier to achieve portability alone (say, by using PVM) or high performance (say, by using sophisticated programmers to fine tune the algorithm to the specific machine). There are currently two factors that make this fundamental problem more tractable. The first is the emergence of a dominant parallel architecture consisting of a number of powerful microprocessors interconnected by either a proprietary interconnect or a standard off-the-shelf interconnect. The second factor is the emergence of standards, such as the message passing standard MPI [20], for which machine builders and software developers will try to provide efficient support. Our work builds on these two developments by presenting a theoretical and an experimental framework for designing parallel algorithms. In this abstract, we sketch our contributions in two important problems: personalized communication and sorting. We start with a brief outline of the computation model.
We view a parallel algorithm as a sequence of local computations interleaved with communication steps, and we allow computation and communication to overlap. We account for communication costs as follows. Assuming no congestion, the transfer of a block consisting of m contiguous words between two processors takes O time, where is a bound on the latency of the network and is the time per word at which a processor can inject or receive data from the network. The cost of each of the collective communication primitives (see below) will be modeled by O , where m is the maximum amount of data transmitted or received by a processor. Such a cost can be justified by using our earlier work [17, 6, 5]. Using this cost model, we can evaluate the communication time of an algorithm as a function of the input size n, the number of processors p , and the parameters and . The coefficient of gives the total number of times collective communication primitives are used, and the coefficient of gives the maximum total amount of data exchanged between a processor and the remaining processors. This communication model is close to a number of similar models (e.g. the LogP [13], BSP [24], and LogGP [1] models) that have recently appeared in the literature and seems to be well-suited for designing parallel algorithms on current high performance platforms. We define the computation time as the maximum time taken by any processor to perform all of its local computation steps.
Our algorithms are implemented in SPLIT-C [11], an extension of C for distributed memory machines. The algorithms make use of MPI-like communication primitives but do not make any assumptions as to how these primitives are actually implemented. Our collective communication primitives, described in detail in [6], are similar to those of MPI [20], the IBM POWERparallel [7], and the Cray MPP systems [10] and, for example, include the following: transpose, bcast, gather, and scatter. Brief descriptions of these are as follows. The transpose primitive is an all-to-all personalized communication in which each processor has to send a unique block of data to every processor, and all the blocks are of the same size. The bcast primitive is called to broadcast a block of data from a single source to all the remaining processors. The primitives gather and scatter are companion primitives whereby scatter divides a single array residing on a processor into equal-sized blocks which are then distributed to the remaining processors, and gather coalesces these blocks residing on the different processors into a single array on one processor.