In place cube reshuffling

I am not really sure If I should ask this question here but I know there are some really smart guys out there, so what the hell :)!!!.

Here you go : suppose you have a three dimensional cube of data that you would fill like this

for (int i = 0 ; i < dim1; i++)
for (int j = 0 ; j < dim2; j++)
for (int k = 0 ; k < dim3; k++)
data[i*dim2*dim3 + j*dim3 + k] = rand();

you now want to reshuffle the cube so that 2 of the three dimensions are switched or even three. HOW DO YOU DO THAT

I found a paper called “An Optimal Index Reshuffle Algorithm for Multidimensional Arrays and its Applications for Parallel Architectures”. that solves the problem for any multidimensional array but it is hardly parallelizable.

Does any one of you somewhere out there have any idea how this could be done in parallel and in-place? I insist on the In-Place part because otherwise, it would be enough to traverse the cube in the new order and create the mapping between the indices.

Thank you for being.

ya, just one more thing about the paper. when I said hardly parallelizable, I meant on a SIMD machine (In the paper, the length of the cycles is not known and the author proposes to use dynamic scheduling strategies vitch vi zont haf in COUDA :)).

Edit: sorry I didn’t read the “in place”. What I describe is certainly not “in place”. I’ll leave it here anyway:

I implemented a simple index reshuffling with CUDA. It can shuffle XYZ into ZXY. It outperformed my naiv implementation on the CPU by a factor of 10 (measured without data transfer overhead). A volume with over 13 million bins can be reshuffled in 10ms.

Here’s an image that explains the idea:

Basically the data from the source volume is read “as coalesced as possible” into shared memory and then written also “as coalesced as possible” to the target volume.

Thank you for your quick response Seb. If possible, I would very much like to know more about the algo you are using. If you have a document descibing your method, then I would appreciate if you could make it available somewhere where I can access it.



I don’t have a document but you can drop me a message and I will send you the code.
Be warned though: it’s rather confusing :-)
Maybe after a cleanup of the code and if more people are interested I’ll release it to the public domain. I don’t really know if this is a problem that many people want to solve though. If yes it would also be great to have maybe other shuffling modes in there.


Do you manage to write the code in a way that all the data will be horizontally coalesced into the block of memory instead of 2-by-2? I think it shouldn’t be too hard to implement that, but I am not sure.

It can be done if the volume is small enough. The data read by one block has to fit into shared memory (16KB per multiprozessor or block).
My idea was to try to make both the reads and writes coalesced and the best way to do that is to make the intermediary data in shared memory square or almost square.

This looks rather like a 3D extension of 2D matrix transpose. The same method (SMEM as a staging area) is used in the “transpose” SDK example if you want a simpler example to start from.


You’re right. Isn’t index reshuffling and matrix transposition essentially the same thing?