3D transpose

Has anyone developed a 3D transpose kernel?

I need to write to an array in an XYZ fashion, which is not coalesced, but if I could write it as ZYX it would be.
So what I need is to be able to write in ZYX order, then do a fast coalesced transpose to bring it back into XYZ form so that the rest of the program can execute.

I realize it's analogous to the transpose in the SDK, but with blockDim.z <= 64 and no z dimension in the grid, it looks like address calculation hell.

So if anyone has already got one working, would you be so kind as to post it?


Edit… after thinking about it for more than 2 seconds, I'm thinking Y 2D transposes would accomplish what I'm looking for.

I'm afraid I'm gonna need some help after all…
So, I have this array:



Well, it's really a 1D array. I've put it on multiple rows to make it easier to read.

What I need, in a kernel that will run on this data set, is to coalesce the reads on the ‘z’ coordinate. So basically, instead of having an XYZ array, I need to transform it into a ZXY one.
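
To pin down what "XYZ to ZXY" means in terms of addresses, here is a minimal host-side reference in C++ (not a kernel). It assumes the first-named axis is the contiguous one, which is my reading of the post and not something stated outright; the function names `idx_xyz`, `idx_zxy` and `xyz_to_zxy` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed convention: the first-named axis is the contiguous one.
// XYZ layout: x varies fastest, then y, then z.
inline std::size_t idx_xyz(std::size_t x, std::size_t y, std::size_t z,
                           std::size_t X, std::size_t Y) {
    return x + X * (y + Y * z);
}

// ZXY layout: z varies fastest, then x, then y.
inline std::size_t idx_zxy(std::size_t x, std::size_t y, std::size_t z,
                           std::size_t X, std::size_t Z) {
    return z + Z * (x + X * y);
}

// Reference (unoptimized) reordering from XYZ to ZXY.
std::vector<float> xyz_to_zxy(const std::vector<float>& src,
                              std::size_t X, std::size_t Y, std::size_t Z) {
    assert(src.size() == X * Y * Z);
    std::vector<float> dst(src.size());
    for (std::size_t z = 0; z < Z; ++z)
        for (std::size_t y = 0; y < Y; ++y)
            for (std::size_t x = 0; x < X; ++x)
                dst[idx_zxy(x, y, z, X, Z)] = src[idx_xyz(x, y, z, X, Y)];
    return dst;
}
```

Something like this is mostly useful as ground truth to compare a GPU result against.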

I cannot, as I have said, perform a series of 2D transposes, since that would only work, as far as I can see, when transforming it to a YXZ array.
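
One observation that may be worth checking here: if the first-named axis is the contiguous one (an assumption on my part), then XYZ to ZXY is a cyclic shift of the axes, and a cyclic shift can be read as a single 2D transpose once x and y are merged into one index. The sketch below is host-side C++, not a kernel, and the names `transpose_2d` and `xyz_to_zxy_merged` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain 2D transpose of a row-major matrix with `rows` rows and `cols` columns.
std::vector<float> transpose_2d(const std::vector<float>& src,
                                std::size_t rows, std::size_t cols) {
    assert(src.size() == rows * cols);
    std::vector<float> dst(src.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            dst[c * rows + r] = src[r * cols + c];
    return dst;
}

// XYZ -> ZXY by merging x and y into one column index c = x + X*y.
// Source element (x, y, z) sits at c + (X*Y)*z: row z, column c.
// Destination element sits at z + Z*c: row c, column z.
std::vector<float> xyz_to_zxy_merged(const std::vector<float>& src,
                                     std::size_t X, std::size_t Y, std::size_t Z) {
    return transpose_2d(src, /*rows=*/Z, /*cols=*/X * Y);
}
```

If that reading of the layout holds, the matrix being transposed has Z rows and X*Y columns, so a 2D transpose kernel could in principle be pointed at it with only the dimensions changed.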

So the way I see it, I need to learn from the 2D transpose example by loading 3D blocks of data into shared memory in a coalesced fashion and then reading from that array “out of order” to write to gmem coalesced. Simple enough so far.
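
To make the load-then-reorder pattern concrete, here is a rough host-side emulation in C++. It is not a kernel: the local `tile` array only stands in for shared memory, and the loops stand in for threads. It assumes the first-named axis is contiguous, uses a 16x16 tile in the x/z plane for each y (my choice, not something from the SDK), and the name `xyz_to_zxy_tiled` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t TILE = 16;  // assumed: one half warp per tile row

// Host-side emulation of a tiled XYZ -> ZXY reordering.
// Assumed convention: the first-named axis is the contiguous one.
std::vector<float> xyz_to_zxy_tiled(const std::vector<float>& src,
                                    std::size_t X, std::size_t Y, std::size_t Z) {
    assert(src.size() == X * Y * Z);
    std::vector<float> dst(src.size());
    float tile[TILE][TILE];  // stands in for a __shared__ array

    for (std::size_t y = 0; y < Y; ++y)
        for (std::size_t z0 = 0; z0 < Z; z0 += TILE)
            for (std::size_t x0 = 0; x0 < X; x0 += TILE) {
                // Clip the tile at the array edges.
                const std::size_t nx = std::min(TILE, X - x0);
                const std::size_t nz = std::min(TILE, Z - z0);

                // Load: the inner loop walks x, the contiguous axis of the source.
                for (std::size_t tz = 0; tz < nz; ++tz)
                    for (std::size_t tx = 0; tx < nx; ++tx)
                        tile[tz][tx] = src[(x0 + tx) + X * (y + Y * (z0 + tz))];

                // Store: the inner loop walks z, the contiguous axis of the destination.
                for (std::size_t tx = 0; tx < nx; ++tx)
                    for (std::size_t tz = 0; tz < nz; ++tz)
                        dst[(z0 + tz) + Z * ((x0 + tx) + X * y)] = tile[tz][tx];
            }
    return dst;
}
```

In an actual kernel the two inner loops would map to threadIdx.x, and the shared array would probably want one column of padding, as the SDK transpose does, to avoid bank conflicts.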

The maximum number of threads in a block is 512, so at most I can have an 8x8x8 block using an 8x8x8 shared memory block.

And here lies the problem. All threads of a half warp must participate in the coalesced memory transaction (this has to run on compute capability 1.1), so that's 16 threads. But I can only read 64 runs of 8 consecutive elements from global memory, since the sub-block is 8x8x8.
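
The arithmetic above assumes a cubic block, but nothing forces the block to be a cube. As a sketch, the C++ helper below (hypothetical name `tile_shape_ok`) checks whether a tile with one thread per element stays within 512 threads while keeping whole half warps along x (contiguous in the source) and z (contiguous in the destination, assuming a ZXY target with the first-named axis contiguous). The shared memory bound is only a rough check against the 16 KB available per multiprocessor on 1.x hardware.

```cpp
#include <cassert>
#include <cstddef>

// Limits for compute capability 1.x devices.
constexpr std::size_t kMaxThreadsPerBlock = 512;
constexpr std::size_t kHalfWarp = 16;
constexpr std::size_t kSharedBytesPerSM = 16 * 1024;

// Hypothetical helper: does a tx x ty x tz tile, one thread per element,
// keep whole half warps along x (source-contiguous) and z (destination-contiguous)?
bool tile_shape_ok(std::size_t tx, std::size_t ty, std::size_t tz,
                   std::size_t elem_bytes = sizeof(float)) {
    assert(elem_bytes > 0);
    const std::size_t threads = tx * ty * tz;
    return threads > 0
        && threads <= kMaxThreadsPerBlock
        && tx % kHalfWarp == 0
        && tz % kHalfWarp == 0
        && threads * elem_bytes <= kSharedBytesPerSM;
}
```

Under these assumptions an 8x8x8 tile fails the half warp test, while 16x1x16 (256 threads) and 16x2x16 (512 threads) pass. Full coalescing on 1.1 also needs each half warp's run to start on an aligned address, so array dimensions that are multiples of 16 would help.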

Does anybody see a way to do this?