The analysis of the matrix transpose algorithm is similar to the
corresponding analysis in the LogP model [11]. The algorithm to perform a matrix transpose on a p-processor machine operates as
follows. The data layout of matrix A is straightforward; each column
i of q elements is stored on processor i, for
0 ≤ i ≤ p-1. Note that the first index of A contains the
processor number, while the second index provides the element offset
in that processor.
Processor i runs the following program:
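A minimal Python sketch of such a per-processor program, simulated sequentially on one machine, might look as follows. It is an illustration under stated assumptions, not the original listing: processors are taken to be numbered from 0, p is assumed to divide q, each block is assumed to hold q/p elements, and the staggered source order `(i + loop) % p`, the function name `transpose_distributed`, and the counter `fetched` are all hypothetical. The comment marking the prefetch indicates where the Step 1.2 referred to in the text presumably falls.

```python
def transpose_distributed(A, p, q):
    """Simulate a block-prefetch matrix transpose on p processors.

    A[i][j] is element j stored on processor i (0-based, assumed).
    Returns (B, fetched): B is the transposed layout, and fetched[i]
    counts the elements processor i obtained from other processors.
    """
    if q % p != 0:
        raise ValueError("this sketch assumes p divides q")
    b = q // p  # assumed block size: q/p elements

    B = [[None] * q for _ in range(p)]
    fetched = [0] * p

    for i in range(p):  # each iteration plays the role of processor i
        for loop in range(p):
            # Assumed staggering so that no two processors read from
            # the same source in the same round.
            r = (i + loop) % p
            # Prefetch (presumably Step 1.2): block i of processor r.
            block = A[r][i * b:(i + 1) * b]
            if r != i:  # only remote reads count as communication
                fetched[i] += len(block)
            # Store it as block r of processor i's transposed column.
            B[i][r * b:(r + 1) * b] = block
    return B, fetched


# Example: p = 4 processors, q = 8 elements per column.
p, q = 4, 8
A = [[(i, j) for j in range(q)] for i in range(p)]
B, fetched = transpose_distributed(A, p, q)
```

In this sketch every processor reads p-1 remote blocks, so each entry of `fetched` equals q - q/p; the read with `r == i` is a local copy and is not counted.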
Each prefetch in Step 1.2 requests a block of q/p elements.
Since each processor prefetches p-1 blocks of
q/p elements each, for a total of q - q/p elements,
this matrix transpose algorithm will take
communication complexity, or