The analysis of the matrix transpose algorithm is similar to the corresponding analysis in the LogP model [11]. The algorithm performs a matrix transpose on a p-processor machine as follows. The data layout of matrix A is straightforward: each column i of q elements is stored on processor i, for each of the p processors. Note that the first index of A is the processor number, while the second index is the element offset within that processor.
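To make the layout concrete, the following is a minimal sequential sketch of one plausible block transpose over this layout, not a definitive implementation of the algorithm analyzed here. It rests on several assumptions that the text does not state: processors are numbered from 0, q is a multiple of p, the block size is b = q/p, and processor i ends up holding rows i*b through (i+1)*b - 1 of A. The names `transpose_blocks` and `remote_fetches` are hypothetical. The sketch also counts how many blocks each processor must obtain from other processors.

```python
def transpose_blocks(A):
    """Simulate a block transpose on p 'processors'.

    A[j][k] is element k stored on processor j (column j of the matrix).
    Assumptions (not from the text): 0-based numbering, q divisible by p,
    block size b = q // p.
    """
    p = len(A)          # number of processors, one column each
    q = len(A[0])       # elements per column
    assert q % p == 0, "sketch assumes q is a multiple of p"
    b = q // p          # assumed block size

    result = []
    remote_fetches = []
    for i in range(p):
        blocks = []
        fetched = 0
        for j in range(p):
            # Block of column j that belongs to processor i after the transpose.
            blocks.append(A[j][i * b:(i + 1) * b])
            if j != i:
                fetched += 1  # this block lives on another processor
        # Store the b received rows one after another, p elements per row.
        local = [blocks[j][r] for r in range(b) for j in range(p)]
        result.append(local)
        remote_fetches.append(fetched)
    return result, remote_fetches


if __name__ == "__main__":
    A = [[0, 1, 2, 3], [10, 11, 12, 13]]  # p = 2 processors, q = 4
    B, fetches = transpose_blocks(A)
    print(B)
    print(fetches)
```

Under these assumptions every processor obtains exactly p - 1 blocks from other processors, and its own block needs no communication.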
Processor i runs the following program:
Each prefetch in Step 1.2 requests a block of elements. Since each processor prefetches p-1 blocks of each, this matrix transpose algorithm will take communication complexity, or