We now consider the general case in which each processor is the source
of at most elements and the destination of at most
elements. We can use the same deterministic algorithm with the block
size of the transpose in Step (2) being
and the block size of the transpose in
Step (4) being
. The resulting overall
complexity is O
. Alternatively for large variances (
), we can use our dynamic data redistribution algorithm in
[7] followed by our deterministic algorithm described
earlier. The resulting overall complexity will also be the same.