In this section we state the main results achieved for solving the problems of personalized communication and sorting.
Problem 1 (h-Relation Personalized Communication): A set of n messages is to be routed such that each processor is the origin or destination of at most h messages.
Result: A new deterministic algorithm using two rounds of the transpose primitive, with optimal complexity and very small constant coefficients, is shown to be very efficient and scalable across different platforms and over different input distributions.
Significance: The importance of this problem has been stated in many papers (e.g. [24, 22]). Our algorithm seems to achieve the best known experimental results for this problem. The general approach was independently described by Kaufmann et al. [18] and Ranka et al. [21] around the same time as our earlier draft [3] appeared, but our algorithm is simpler, has less overhead, and has a tighter bound size for the intermediate collective communication.
Problem 2 (Sorting): Rearrange n equally distributed elements amongst p processors such that they appear in non-decreasing order starting from the smallest indexed processor.
Results: (1) A novel variation of sample sort that uses only two rounds of regular all-to-all personalized communication and that maintains very good load balancing with virtually no overhead. (2) A new deterministic algorithm that achieves almost the same performance as (1) by using regular sampling. (3) An efficient implementation of radix sort that makes use of our personalized communication scheme.
Significance: We have developed a suite of benchmarks and conducted extensive experimentations related to Problem 2. Our algorithms have consistently performed very well across the different benchmarks and the different platforms. Algorithm (1) has outperformed all similar results known to the authors on our platforms. It even compares favorably to the machine-specific implementations reported by some of the machine vendors for the Class A NAS IS Benchmark, which only requires the easier task of ranking. Algorithm (2) has almost the same performance as (1) while guaranteeing memory and communication bounds, and Algorithm (3) achieves the best known execution times for a stable integer sorting algorithm. Additionally, all of our algorithms efficiently handle the presence of duplicates without the overhead of tagging each element.