Given a matrix on a p processor machine, where p
divides q, the matrix transposition consists of rearranging the data
such that the first
rows of elements are moved to the
first processor, the second
rows to the second
processor, and so on, with the last
rows of the matrix
moved to the last processor. An efficient matrix transposition
algorithm consists of p iterations such that, during iteration i,
, each processor
prefetches the
appropriate block of
elements from processor
. The BDM algorithm and analysis for the matrix transpose
data movement is given in [4] and is similar to that of the
LogP model [16]. This matrix transpose algorithm has the
following complexity: