 
    
    
         
The analysis of the matrix transpose algorithm is similar to the
LogP model analysis [11]. The algorithm to perform a matrix transpose
on a p-processor machine operates as follows. The data layout of
matrix A is straightforward: column i, consisting of q elements, is
stored on processor i, so each of the p processors holds exactly one
column. Note that the first index of A contains the processor number,
while the second index gives the element offset within that processor.
Processor i runs the following program:
Each prefetch in Step 1.2 requests a block of q/p elements.
Since each processor prefetches p-1 blocks of q/p elements each,
this matrix transpose algorithm has a communication complexity of  , or
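The block count above can be checked with a small sequential simulation. This Python sketch is not the program referred to in the text; it assumes that p divides q, that numbering starts at 0, and that processor i collects rows i·q/p through (i+1)·q/p − 1 of the transposed matrix by pulling one block of q/p elements from every processor. The function name `transpose_by_prefetch` is hypothetical.

```python
def transpose_by_prefetch(A):
    """Sequentially simulate a transpose where each processor pulls its blocks.

    A[j] is the column of q elements held by processor j. Assumes p divides q.
    Returns (B, remote_blocks, remote_elems), where B[i] is processor i's new
    local array (rows i*b .. (i+1)*b - 1 of A^T, stored row-major, b = q // p).
    """
    p, q = len(A), len(A[0])
    if q % p != 0:
        raise ValueError("this sketch assumes p divides q")
    b = q // p                      # block size requested by one prefetch
    B = [[None] * q for _ in range(p)]
    remote_blocks = [0] * p         # prefetches issued to other processors
    remote_elems = [0] * p          # elements fetched from other processors
    for i in range(p):              # destination (requesting) processor
        for j in range(p):          # source processor
            block = A[j][i * b:(i + 1) * b]
            if j != i:              # fetching its own block is a local copy
                remote_blocks[i] += 1
                remote_elems[i] += len(block)
            for r, x in enumerate(block):
                # x = A[j][i*b + r] = A^T[i*b + r][j]; store at local row r, column j
                B[i][r * p + j] = x
    return B, remote_blocks, remote_elems


p, q = 4, 8
A = [[(i, j) for j in range(q)] for i in range(p)]
B, blocks, elems = transpose_by_prefetch(A)
print(blocks)    # [3, 3, 3, 3]
print(elems)     # [6, 6, 6, 6]
print(B[1][:4])  # [(0, 2), (1, 2), (2, 2), (3, 2)]
```

Under these assumptions each processor issues p-1 remote prefetches and moves (p-1)·q/p elements, which matches the count used in the analysis; concatenating the B arrays in processor order yields A^T in row-major order.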
 
 
    
   