Performance graphs for matrix transposition execution times using
SPLIT-C on a 32-processor CM-5, SP-2, CS-2, and 8-processor Paragon
are given in Figures 6, 7, 8, and 9, respectively, in Appendix A.1.
These figures also show the attained data bandwidth per processor
for the transpose algorithm. For large enough data sets on the CM-5,
we achieve an average bandwidth of 7.62 MB/s per processor, which is
about 63.5% of the maximum user-payload bandwidth of 12 MB/s per
processor [28]. This is consistent with results reported by other
research teams, who have achieved
6.4 MB/s per processor (Culler at UC Berkeley, [11]) and
7.72 MB/s per processor (Ranka at Syracuse University, [46])
for similar data movements on the CM-5. Note that some of these cited
results are for low-level implementations using message passing
algorithms. For large enough data sets, the SP-2 achieves greater
than 24.8 MB/s per processor for the matrix transpose algorithm, using
high-performance switch hardware rated by the vendor at a
peak node-to-node bandwidth of 40 MB/s [27]. The Meiko CS-2
achieves greater than 10.7 MB/s per processor. Note that the CS-2
result is much less than the maximum attainable bandwidth of 50 MB/s
per processor [33] because our SPLIT-C version has not been
fully optimized to make use of the architecture's communications
coprocessor. The 8-processor Paragon achieves greater than 88.6 MB/s
per processor; Intel gives the maximum hardware bandwidth as
175 MB/s per processor and the application peak bandwidth as 135 MB/s
per processor [30].
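
The comparison between attained and rated bandwidth can be made explicit with a short calculation. The following is a minimal sketch in Python (not part of the SPLIT-C implementation) that uses only the figures quoted above; the names MACHINES, fraction_of_peak, and report are illustrative assumptions. Because the SP-2, CS-2, and Paragon figures are stated as "greater than" values, the percentages computed for those machines should be read as lower bounds.

```python
# Attained and rated per-processor bandwidths (MB/s), as quoted in the text.
# The SP-2, CS-2, and Paragon attained values are lower bounds ("greater than").
MACHINES = [
    ("CM-5", 7.62, 12.0),
    ("SP-2", 24.8, 40.0),
    ("CS-2", 10.7, 50.0),
    ("Paragon (hardware peak)", 88.6, 175.0),
    ("Paragon (application peak)", 88.6, 135.0),
]


def fraction_of_peak(attained, peak):
    """Return attained bandwidth as a fraction of the rated peak."""
    return attained / peak


def report(machines=MACHINES):
    """Map each machine name to its attained bandwidth as a percentage of peak."""
    return {
        name: round(100.0 * fraction_of_peak(attained, peak), 1)
        for name, attained, peak in machines
    }


if __name__ == "__main__":
    for name, pct in report().items():
        print(f"{name:28s} {pct:5.1f}% of peak")
```

Under these figures, the CM-5, the SP-2, and the Paragon (relative to its application peak) all reach roughly 62% to 66% of their rated bandwidth, while the CS-2 reaches about 21%, which matches the observation that the CS-2 version does not yet exploit the communications coprocessor.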