The performance graphs for broadcasting using the prefetching matrix transposition on a 32-processor CM-5, SP-2, and CS-2 and an 8-processor Paragon are given in Figures 6, 7, 8, and 9, respectively, in Appendix A.1. As expected, these graphs show that the SPLIT-C broadcasting algorithm takes roughly twice the time of the SPLIT-C matrix transpose algorithm. The figures also show the attained data bandwidth per processor for the broadcast algorithm, which is approximately the same as that of the transpose algorithm on each machine.
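One way to see why a factor of two would be expected is to assume that the broadcast is composed of two transpose phases separated by a local copy. This structure is an assumption made here for illustration and is not stated in this section. The following minimal sketch simulates it sequentially in Python rather than SPLIT-C; the names `transpose` and `broadcast_via_two_transposes` are hypothetical, and the sketch ignores prefetching and all communication costs.

```python
def transpose(blocks):
    """Simulated all-to-all exchange among p processors.

    blocks[i][j] is the block that processor i holds for processor j.
    After the exchange, processor j holds that block in slot i.
    """
    p = len(blocks)
    return [[blocks[i][j] for i in range(p)] for j in range(p)]


def broadcast_via_two_transposes(data, p):
    """Broadcast `data` from processor 0 to p simulated processors.

    Assumed structure (illustrative only): transpose, local copy, transpose.
    """
    q = len(data)
    assert q % p == 0, "sketch assumes p divides the number of elements"
    chunk = q // p

    # Initial layout: processor 0 holds p blocks, the others hold empty blocks.
    source_row = [data[j * chunk:(j + 1) * chunk] for j in range(p)]
    empty_rows = [[[] for _ in range(p)] for _ in range(p - 1)]
    blocks = [source_row] + empty_rows

    # First transpose: processor j receives block j from processor 0.
    blocks = transpose(blocks)

    # Local copy: each processor replicates its own block p times.
    blocks = [[row[0]] * p for row in blocks]

    # Second transpose: processor j receives block i from every processor i.
    blocks = transpose(blocks)

    # Concatenate the received blocks on each processor.
    return [[x for block in row for x in block] for row in blocks]


result = broadcast_via_two_transposes(list(range(8)), 4)
```

Under this assumption each processor takes part in two exchanges of the same shape as a single transpose, which would be consistent with roughly double the running time and a similar per-processor bandwidth.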