Consider the problem of sorting n elements equally distributed
amongst p processors, where we assume without loss of generality that p
divides n evenly. The idea behind sorting by regular sampling
is to find a set of
$p - 1$ splitters to partition the n input elements
into p groups indexed from 1 up to p such that every element
in the $i^{th}$ group is less than or equal to each of the elements in
the $(i+1)^{th}$ group, for $1 \le i \le p - 1$. Then the task of
sorting each of the p groups
can be turned over to the correspondingly indexed processor, after which
the n elements will be arranged in sorted order. The
efficiency of this algorithm obviously depends on how well we divide the
input, and this in turn depends on how evenly we choose the splitters.
One way to choose the splitters is by regularly sampling the sorted
input elements at each processor, hence the name Sorting by
Regular Sampling.
A previous version of regular sample
sort [18, 16], known as Parallel Sorting by Regular Sampling
(PSRS), first sorts the elements at each processor and
then selects every $\left(\frac{n}{p^2}\right)^{th}$ element as a
sample. These samples are then routed to a single processor,
where they are sorted and every $p^{th}$ sample is selected as a
splitter. Each processor then uses these splitters to
partition the sorted input values and then routes the resulting
subsequences to the appropriate destinations, after which local
merging of these subsequences is done to complete the sorting process.
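As a concrete illustration of this selection arithmetic, the following minimal Python sketch (ours, not the cited papers'; it simulates the p processors sequentially, one list apiece) derives the $p - 1$ splitters exactly as described above:

```python
# PSRS splitter selection, simulated sequentially (illustrative only).
import random

def psrs_splitters(runs):
    """runs: p locally sorted lists of n/p elements each."""
    p = len(runs)
    step = len(runs[0]) // p          # sample every (n/p^2)-th element
    samples = []
    for run in runs:                  # p regular samples per processor
        samples.extend(run[step - 1::step])
    samples.sort()                    # gathered on a single processor
    return samples[p - 1::p][:p - 1]  # every p-th sample is a splitter

n, p = 64, 4
data = random.sample(range(1000), n)
runs = [sorted(data[i * (n // p):(i + 1) * (n // p)]) for i in range(p)]
print(psrs_splitters(runs))           # p - 1 = 3 splitters for 4 groups
```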
The first difficulty
with this approach is the load balance. There exist inputs for which
at least one processor will be left with
as many as $\left(\frac{2n}{p} - \frac{n}{p^2}\right)$ elements
at the completion of sorting.
This could be reduced by choosing more samples, but doing so would also
increase the overhead. Moreover, no matter how many samples are chosen,
previous studies have shown that the load balance still deteriorates
linearly with the number of duplicates [16], as the sketch following
this paragraph illustrates. One could, of course,
tag each item with a unique value, but this would also double the cost of
both memory access and interprocessor communication. The other
difficulty is that no matter how the routing is scheduled, there exist
inputs that give rise to large variations in the number of elements
destined for different processors, and this in turn results in an
inefficient use of the communication bandwidth. Moreover, such an
irregular communication scheme cannot take advantage of the regular
communication primitives proposed under the MPI standard [17].
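The duplicate sensitivity is easy to reproduce. The following sketch (again our own illustration, extending the one above to the partitioning step) feeds PSRS an all-duplicates input; every splitter takes the same value, so a single group absorbs the entire input:

```python
# PSRS-style partitioning collapses on duplicate-heavy input:
# all splitters equal 7, so one group receives all n elements.
from bisect import bisect_right

n, p = 64, 4
runs = [[7] * (n // p) for _ in range(p)]            # all duplicates
step = (n // p) // p
samples = sorted(x for run in runs for x in run[step - 1::step])
splitters = samples[p - 1::p][:p - 1]                # [7, 7, 7]
sizes = [0] * p
for run in runs:
    cuts = [0] + [bisect_right(run, b) for b in splitters] + [len(run)]
    sizes = [sz + cuts[j + 1] - cuts[j] for j, sz in enumerate(sizes)]
print(sizes)                                         # [64, 0, 0, 0]
```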
In our algorithm, which is parameterized by a sampling ratio
s, we guarantee that, at the
completion of sorting, each processor will have at most
$\left(\frac{n}{p} + \frac{n}{s} - p\right)$ elements, while incurring no overhead in
gathering the set of samples used to identify the splitters.
(Setting $s = p$ already matches the roughly $\frac{2n}{p}$ guarantee
of PSRS, and larger values of s drive the excess toward zero.)
This bound
holds regardless of the number of duplicate elements present in the input.
Moreover, we are able to replace the irregular routing with exactly two
calls to our transpose primitive.
The pseudocode for our algorithm is as follows:

Step (1): Each processor $P_i$ ($1 \le i \le p$) sorts its $\frac{n}{p}$ input elements and then deals the sorted list into p bins, with the element at sorted position q placed into bin $\left((q - 1) \bmod p\right) + 1$.

Step (2): Each processor $P_i$ routes the contents of bin j to processor $P_j$; this is the first call to the transpose primitive, after which processor $P_j$ holds p sorted subarrays of $\frac{n}{p^2}$ elements each.

Step (3): Processor $P_p$, whose p subarrays consist of every $p^{th}$ element of each sorted run, selects every $\left(\frac{n}{p^2 s}\right)^{th}$ element of each subarray, obtaining s regular samples per run with no additional communication.

Step (4): Processor $P_p$ sorts the ps samples; the sample with index js becomes splitter $b_j$, for $1 \le j \le p - 1$.

Step (5): Processor $P_p$ broadcasts the splitters, together with the counts of samples from each run preceding each splitter, using the bcast primitive.

Step (6): Each processor $P_i$ partitions each of its p subarrays into p groups by binary search against the splitters, with elements equal to a splitter divided among groups according to the broadcast counts.

Step (7): Each processor $P_i$ routes its $j^{th}$ group to processor $P_j$; this is the second call to the transpose primitive.

Step (8): Each processor merges the sorted subsequences it received.

Step (9): Each processor copies the merged result into its output array.
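Since the steps are referenced individually throughout the analysis, the following self-contained Python sketch simulates them on a single machine, one list per processor. The function name, the step boundaries, and the tagging of each element with its run and position (used here only to keep the tie handling short; the algorithm itself achieves the same effect through the counts broadcast in Step (5)) are our reconstruction, not the paper's pseudocode:

```python
# Sequential simulation of the nine steps (illustrative reconstruction).
import random
from bisect import bisect_right
from heapq import merge

def regular_sample_sort(data, p, s):
    n = len(data)
    assert n % (p * p * s) == 0                 # simplifying assumption
    blk = n // p
    # Step (1): local sort, then deal each run round-robin into p bins;
    # tagging with (run, position) emulates the positional tie rule.
    runs = [sorted(data[i * blk:(i + 1) * blk]) for i in range(p)]
    tagged = [[(v, m, q) for q, v in enumerate(r)]
              for m, r in enumerate(runs)]
    bins = [[r[b::p] for b in range(p)] for r in tagged]
    # Step (2): first transpose -- processor i gets bin i of every run.
    sub = [[bins[m][i] for m in range(p)] for i in range(p)]
    # Step (3): the last processor already holds every p-th element of
    # every run; taking every (n/(p^2 s))-th element of each of its
    # subarrays yields s regular samples per run at no gathering cost.
    step = n // (p * p * s)
    samples = [x for arr in sub[p - 1] for x in arr[step - 1::step]]
    samples.sort()                              # Step (4): sort ps samples
    # Step (5): every s-th sample becomes one of the p - 1 splitters
    # ("broadcast" is trivial in a sequential simulation).
    splitters = [samples[j * s - 1] for j in range(1, p)]
    # Step (6): partition every subarray by binary search on the splitters.
    pieces = [[[] for _ in range(p)] for _ in range(p)]   # [src][dst]
    for i in range(p):
        for arr in sub[i]:
            cuts = ([0] + [bisect_right(arr, b) for b in splitters]
                    + [len(arr)])
            for j in range(p):
                pieces[i][j].append(arr[cuts[j]:cuts[j + 1]])
    # Step (7): second transpose -- route the j-th pieces to processor j.
    recv = [[pc for i in range(p) for pc in pieces[i][j]] for j in range(p)]
    # Steps (8)-(9): merge the received sorted pieces and strip the tags.
    return [[v for (v, _, _) in merge(*parts)] for parts in recv]

p, s = 4, 8
n = p * p * s * 4                               # 512 elements
data = [random.randrange(50) for _ in range(n)] # duplicate-heavy input
out = regular_sample_sort(data, p, s)
assert [v for part in out for v in part] == sorted(data)
print("max load:", max(map(len, out)), " bound:", n // p + n // s - p)
```

Note that the two transposes in Steps (2) and (7) are the only communication involving the input elements, matching the claim above; in an actual implementation they would map to a regular all-to-all primitive rather than the list shuffles used here.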
Before analyzing the complexity of this algorithm, we need to establish the following theorem.
Theorem 1: The number of elements sent by processor $i$
to processor $j$
in Step (7) is at most
$\left(\frac{n}{p^2} + \frac{n}{ps} - p\right)$ for i = p and
$\left(\frac{n}{p^2} + \frac{n}{ps}\right)$ for i < p.
Consequently, at the completion of the algorithm, no processor holds
more than $\left(\frac{n}{p} + \frac{n}{s} - p\right)$ elements,
for $p \le s \le \frac{n}{p^2}$ and n divisible by $p^2 s$.
Proof:
Let $S_j$ be the set of samples from the sorted array of samples
in Step (4) with indices (js - s + 1) through (js), inclusively,
so that for $1 \le j \le p - 1$ the splitter $b_j$ is the largest sample
in $S_j$. Let $S_{j,i}$ be the subset of samples in $S_j$ originating from
processor $i$, and let $c_{j,i}$ be the cardinality of $S_{j,i}$, so that
$\sum_{i=1}^{p} c_{j,i} = s$. Let $T_j = S_1 \cup \cdots \cup S_j$,
let $k_{j,i}$ be the cardinality of the subset $T_{j,i}$ of samples in $T_j$
originating from processor $i$, let $l_{j,i}$ be the number of samples
in $T_{j,i}$ with value less than $b_j$, and let $e_{j,i}$ be the number of
samples in $T_{j,i}$ with value $b_j$, so that $k_{j,i} = l_{j,i} + e_{j,i}$
and $\sum_{i=1}^{p} k_{j,i} = js$. Obviously, $e_{j,i}$ will only be
nonzero if the value $b_j$ appears in the input at processor $i$.
Finally, for simplicity of discussion but without loss of generality,
we assume that n is a constant multiple of $p^2 s$.

Clearly, each sample can be mapped in a one-to-one fashion to the sorted
input generated during Step (1) (but before being distributed amongst
the bins). For example, the first sample in $T_{j,i}$ maps to the
$\left(\frac{n}{ps}\right)^{th}$ element in the
sorted input at processor $i$, the second sample in $T_{j,i}$ maps to the
$\left(\frac{2n}{ps}\right)^{th}$ element in the
sorted input at processor $i$, and so forth up to the
$\left(k_{j,i}\right)^{th}$ sample, which maps to the
$\left(k_{j,i}\frac{n}{ps}\right)^{th}$ element in the sorted input at
processor $i$. Hence, it follows that fewer than
$\left(k_{j,i} + 1\right)\frac{n}{ps}$ elements in the sorted input
of Step (1) at processor $i$ will be less than $b_j$,
where $1 \le j \le p - 1$, since the element at position
$\left(k_{j,i} + 1\right)\frac{n}{ps}$ is itself a sample with index
greater than $js$ and hence has value at least $b_j$.
It is also true that at least $k_{j,i}\frac{n}{ps}$
elements in the sorted input of
Step (1) will be less than or equal to $b_j$, where $1 \le j \le p - 1$,
since the elements at positions 1 through $k_{j,i}\frac{n}{ps}$ are all
bounded by the $\left(k_{j,i}\right)^{th}$ sample.

The shuffling of Step (1) together with the transpose of
Step (2) maps the element at position $q$ of the sorted input at
processor $m$ into
the $\left(\left\lceil \frac{q}{p} \right\rceil\right)^{th}$
position of the $m^{th}$
subarray at processor $i$, where $i \equiv q \pmod{p}$ (taking $i = p$
when $q$ is a multiple of $p$), a subarray which we
will denote as $Y_{i,m}$. Now, at most $\alpha_{i,m,j}$
elements in $Y_{i,m}$
will be less than $b_j$
and will unequivocally route to processors $1$
through $j$,
where:
\[
\alpha_{i,m,j} = \begin{cases}
\left(k_{j,m} + 1\right)\frac{n}{p^2 s} - 1 & \mbox{for } i = p, \\
\left(k_{j,m} + 1\right)\frac{n}{p^2 s} & \mbox{for } i < p,
\end{cases}
\]
because the candidate positions at processor $m$ are those less than
$\left(k_{j,m} + 1\right)\frac{n}{ps}$, only every $p^{th}$ of which lies
in $Y_{i,m}$, and for $i = p$ these positions are exactly the multiples
of $p$. Furthermore, at least $k_{j,m}\frac{n}{p^2 s}$ elements in
$Y_{i,m}$ will be less than or equal to
$b_j$, where $1 \le j \le p - 1$. This means
that the p subarrays at processor $i$
collectively have at least
$\frac{n}{p} - \frac{jn}{p^2} - \frac{n}{ps}$
elements which are greater than or equal to $b_j$. Furthermore,
Step (6) routes from $Y_{i,m}$ to processors $1$ through $j$ exactly
\[
\max\left( a_{i,m,j},\; k_{j,m}\frac{n}{p^2 s} \right)
= a_{i,m,j} + g_{i,m,j}
\]
elements, where $a_{i,m,j}$ is the number of elements in $Y_{i,m}$ with
value less than $b_j$, $g_{i,m,j}$ is the number of elements
equal to $b_j$
which the algorithm in Step (6) will seek
to route to processor $j$,
and the remaining elements of $Y_{i,m}$ equal to $b_j$ are those
which the algorithm in Step (6) will seek
to route to processors $j + 1$
through $p$.
From this it follows that the algorithm will always be able to route
a minimum of $k_{j,m}\frac{n}{p^2 s}$
elements from $Y_{i,m}$ to processors $1$
through $j$, since at least that many elements of $Y_{i,m}$ are less
than or equal to $b_j$.
On the other hand, the maximum number of elements that will be routed by
this algorithm to these processors is:
\[
\max\left( a_{i,m,j},\; k_{j,m}\frac{n}{p^2 s} \right)
\;\le\; \left(k_{j,m} + 1\right)\frac{n}{p^2 s}
- \begin{cases} 1 & \mbox{for } i = p, \\ 0 & \mbox{for } i < p. \end{cases}
\]
Hence, taking $k_{0,m} = 0$, the maximum number of elements sent by
processor $i$ to
processor $j$
is:
\[
\sum_{m=1}^{p} \left[ \left(k_{j,m} + 1\right)\frac{n}{p^2 s}
- k_{j-1,m}\frac{n}{p^2 s} \right]
= \sum_{m=1}^{p} \left(c_{j,m} + 1\right)\frac{n}{p^2 s}
= \frac{n}{p^2} + \frac{n}{ps}
\]
for $i < p$, and this quantity less $p$ (one for each of the p subarrays)
for $i = p$. (For $j = p$, each subarray sends at most the
$\left(s - k_{p-1,m}\right)\frac{n}{p^2 s}$ elements not routed to
processors $1$ through $p - 1$, for a total of at most $\frac{n}{p^2}$.)
Summing over the p source processors, no destination processor receives
more than
$\left(p - 1\right)\left(\frac{n}{p^2} + \frac{n}{ps}\right)
+ \left(\frac{n}{p^2} + \frac{n}{ps} - p\right)
= \frac{n}{p} + \frac{n}{s} - p$
elements, and Theorem 1 follows.
With the results of Theorem 1, the analysis of our algorithm
for sorting by regular sampling is as follows. Steps (3), (4),
(6), (8), and (9) require $O(sp)$, $O\left(sp \log (sp)\right)$,
$O\left(p^2 \log \frac{n}{p^2}\right)$,
$O\left(\left(\frac{n}{p} + \frac{n}{s}\right) \log p\right)$, and
$O\left(\frac{n}{p} + \frac{n}{s}\right)$ time,
respectively.
The cost of sequential sorting in Step (1) depends on the data type:
sorting integers using radix sort requires
$O\left(\frac{n}{p}\right)$ time, whereas
sorting floating point numbers using merge sort requires
$O\left(\frac{n}{p} \log \frac{n}{p}\right)$ time.
Steps (2), (5), and
(7) call the communication primitives transpose,
bcast, and transpose, respectively. The analysis of these
primitives in [6] shows that, with $\tau$ denoting the latency and
$\sigma$ the inverse of the bandwidth, these three steps require
$O\left(\tau + \frac{n}{p}\sigma\right)$,
$O\left(\tau + p^2 \sigma\right)$, and
$O\left(\tau + \left(\frac{n}{p} + \frac{n}{s}\right)\sigma\right)$,
respectively. Hence, the overall complexity of our
sorting algorithm is given (for floating point numbers) by
\[
T(n, p) = O\left(\frac{n}{p} \log n + \tau + \frac{n}{p}\sigma\right)
\]
for $p \le s \le \frac{n}{p^2}$ and
$n \ge p^3$.
Clearly, our algorithm is asymptotically optimal with very small coefficients. But a theoretical comparison of our running time with previous sorting algorithms is difficult, since there is no consensus on how to model the cost of the irregular communication used by the most efficient algorithms. Hence, it is very important to perform an empirical evaluation of an algorithm using a wide variety of benchmarks, as we will do next.