Introduction

Parallel processing promises a quantum leap in computational power that is likely to have a substantial impact on many aspects of computing, and that in particular can be exploited to investigate a wide range of what have been called ``grand challenge'' problems in science and engineering. It is widely recognized [23] that an important ingredient for the success of this technology is the emergence of computational models that can be used for algorithm development and for accurately predicting the performance of these algorithms on real machines. We take a view similar to that of [24], namely that the computation model should be a ``bridging model'' that links the hardware and software layers. Existing computation models tend to be biased towards one layer or the other; the Bulk Synchronous Parallel (BSP) model advocated by Valiant [24] is one of the few exceptions. In this paper, we introduce a computation model that specifically attempts to bridge the shared memory (single address space) programming model and distributed-memory message-passing architectures. Distributed memory systems configured as a single address space are usually referred to as (scalable) shared memory multiprocessors; these machines achieve the scalability of distributed memory architectures while offering the simple programming style of a single address space. Our model can also be used to predict the performance of data parallel algorithms running on distributed memory architectures.

Since a computation model should predict performance on real machines, we begin by discussing the basis of our measure of the communication costs incurred by accessing remote data. As indicated in [8], the hardware organizations of massively parallel processors (MPPs) seem to be converging towards a collection of powerful processors connected by a communication network that can be modeled as a complete graph, on which communication is subject to the restrictions imposed by the latency and bandwidth properties of the network. Under this common organization, communication between the different processors is handled by point-to-point messages whose routing times are controlled by parameters related to the network latency, the per-processor communication bandwidth, the overhead of preparing a message, and the network capacity. Such a model avoids describing the exact structure of the network, since algorithms that exploit specific features of a network are less likely to be robust enough to work well on a variety of architectures or to adapt easily to future technological changes. However, programming the machine at the message-passing level imposes a heavy burden on the programmer and makes algorithm development and evaluation quite complicated. On the other hand, the data-parallel and shared-memory programming models are appealing in terms of their ease of use and their close relationship to sequential programming. Both models assume a single address space.
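
For intuition only, a plausible instance of such a characterization (in the spirit of the LogP/BSP family of models, and not the specific cost measure defined in this paper) charges a point-to-point message of m words a routing time of roughly o + L + m/B, where o is the overhead of preparing the message, L is the network latency, and B is the communication bandwidth available to each processor; the network capacity then bounds how many such messages can be in transit simultaneously.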

The Block Distributed Memory (BDM) model introduced in the next section captures the performance of shared memory (single address space) algorithms by incorporating a cost measure for the interprocessor communication caused by remote memory accesses. The cost is modeled using the latency and the communication bandwidth of each processor. Since a remote memory access involves the transmission of a packet that typically contains a number of consecutive words, our model encourages the use of spatial locality by incorporating a parameter m that represents the cost of accessing up to m consecutive words; this cost is incurred even if only a single word is needed. Our model allows the initial placement of input data and includes the memory latency hiding technique of pipelined prefetching. Since we measure the amount of local computation and the amount of communication separately, we are able to normalize the communication cost and drop one parameter, which simplifies the analysis of the corresponding algorithms. We use our model to develop parallel algorithms for various data rearrangement problems, load balancing, sorting, the Fast Fourier Transform (FFT), and matrix multiplication. We show that most of these algorithms achieve optimal or near optimal communication complexity while simultaneously guaranteeing optimal speed-up in computational complexity.
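
As a concrete illustration (a minimal sketch, not the model's formal definition, which appears in the next section), the following hypothetical Python fragment estimates the time to read a run of consecutive remote words, assuming that each remote access returns a block of up to m words at a cost of tau + m time units and that pipelined prefetching lets the latencies of successive accesses overlap.

from math import ceil

def remote_access_cost(words_needed, m, tau, pipelined=True):
    # Assumed cost model (illustrative only): each remote access returns a
    # block of up to m consecutive words and costs tau + m time units.
    # With pipelined prefetching, the latencies of successive block accesses
    # overlap, so tau is paid roughly once instead of once per access.
    blocks = ceil(words_needed / m)      # number of remote block accesses
    if pipelined:
        return tau + blocks * m          # latency largely hidden by prefetching
    return blocks * (tau + m)            # latency paid on every access

# Example: reading 1000 consecutive remote words with m = 32 and tau = 100
print(remote_access_cost(1000, 32, 100))                    # 100 + 32*32 = 1124
print(remote_access_cost(1000, 32, 100, pipelined=False))   # 32*(100 + 32) = 4224

Note that under such an accounting the full block cost is charged even when only a single word of the block is needed, which is why algorithms designed for the model try to consume most of each block they fetch.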

In the next section, we provide the details of our model, while Section 3 describes a collection of algorithms for handling data rearrangements that occur frequently in shared memory algorithms. The load balancing problem is addressed in Section 4, where a communication-efficient algorithm is presented, and Section 5 is devoted to efficient algorithms for sorting, the FFT, and matrix multiplication. Most of the resulting algorithms appear to share a common structure with high-performance algorithms that have been tested on real machines.


