We now consider the general case in which each processor is the source of at most elements and the destination of at most elements. We can use the same deterministic algorithm, with the block size of the transpose in Step (2) being and the block size of the transpose in Step (4) being . The resulting overall complexity is O. Alternatively, for large variances (), we can first apply the dynamic data redistribution algorithm of [7] and then the deterministic algorithm described earlier. The resulting overall complexity is the same.
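
To illustrate the two bounds that define this general case, the following minimal Python sketch computes the largest number of elements any processor sends and the largest number any processor receives. It assumes the communication pattern is given as a square matrix of message counts. This representation and the names `counts`, `relation_bounds`, `max_sent`, and `max_received` are assumptions made for this sketch and do not appear in the original. The sketch does not reproduce the block sizes used in Steps (2) and (4) or the complexity bound.

```python
def relation_bounds(counts):
    """Return (max_sent, max_received) for a square matrix of message counts.

    counts[i][j] is assumed to hold the number of elements that
    processor i sends to processor j (hypothetical representation).
    """
    p = len(counts)
    if p == 0 or any(len(row) != p for row in counts):
        raise ValueError("counts must be a non-empty square p x p matrix")

    # Largest row sum: the most elements any single processor is the source of.
    max_sent = max(sum(row) for row in counts)

    # Largest column sum: the most elements any single processor is the destination of.
    max_received = max(sum(counts[i][j] for i in range(p)) for j in range(p))

    return max_sent, max_received


# Example with three processors.
counts = [
    [0, 3, 1],
    [2, 0, 0],
    [4, 1, 0],
]
max_sent, max_received = relation_bounds(counts)
print(max_sent, max_received)  # prints "5 6"
```

In this example processor 2 sends the most elements (5) and processor 0 receives the most (6), so the two bounds differ. That difference is what separates the general case from the case in which a single bound covers both sending and receiving.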