In our solution, we use two rounds of the transpose collective communication primitive. In the first round, each element is routed to an intermediate destination, and during the second round, it is routed to its final destination.
The pseudocode for our algorithm is as follows:
The overall complexity of our algorithm is , which is asymptotically optimal with very small constants for .