3D transpose

Has anyone developed a 3D transpose kernel?

I need to write to an array in XYZ order, which is not coalesced, but if I could write it in ZYX order it would be.
So what I need is to write in ZYX order, then do a fast coalesced transpose to bring it back into XYZ form so that the rest of the program can execute.

I realize it's analogous to the transpose in the SDK, but with blockDim.z <= 64 and no z dimension available on the grid, it looks like address-calculation hell.

So if anyone has already got one working, would you be so kind as to post it?

Thanks!

Edit… after thinking about it for more than 2 seconds, I think Y 2D transposes would accomplish what I'm looking for.

I'm afraid I'm gonna need some help after all…
So, I have this array:

(0,0,0),(1,0,0),…,(n,0,0)
(0,1,0),(1,1,0),…,(n,1,0)

(0,n,0),(1,n,0),…,(n,n,0)
(0,0,1),(1,0,1),…,(n,0,1)

Well, it's really a 1D array. I've put it on multiple rows to make it easier to read.

What I need, in a kernel that will run on this data set, is to coalesce the reads along the ‘z’ coordinate. So basically, instead of having an XYZ array, I need to transform it into a ZXY one.
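To pin down what the two layouts mean, here is a minimal host-side C++ sketch of the XYZ to ZXY permutation. It assumes "XYZ" means x varies fastest in memory and "ZXY" means z varies fastest, then x, then y. The names `idx_xyz`, `idx_zxy` and `permute_xyz_to_zxy` are made up for illustration. It is only meant as a slow correctness reference to compare a kernel's output against, not as the kernel itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linear index in the original XYZ layout: x varies fastest, then y, then z.
inline std::size_t idx_xyz(std::size_t x, std::size_t y, std::size_t z,
                           std::size_t nx, std::size_t ny) {
    return x + nx * (y + ny * z);
}

// Linear index in the target ZXY layout: z varies fastest, then x, then y.
inline std::size_t idx_zxy(std::size_t x, std::size_t y, std::size_t z,
                           std::size_t nx, std::size_t nz) {
    return z + nz * (x + nx * y);
}

// Naive element-by-element permutation (hypothetical helper, reference only).
std::vector<float> permute_xyz_to_zxy(const std::vector<float>& in,
                                      std::size_t nx, std::size_t ny,
                                      std::size_t nz) {
    assert(in.size() == nx * ny * nz);
    std::vector<float> out(in.size());
    for (std::size_t z = 0; z < nz; ++z)
        for (std::size_t y = 0; y < ny; ++y)
            for (std::size_t x = 0; x < nx; ++x)
                out[idx_zxy(x, y, z, nx, nz)] = in[idx_xyz(x, y, z, nx, ny)];
    return out;
}
```

After the permutation, elements that differ only in z sit next to each other in memory, which is the property the later kernel needs for coalesced reads.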

I cannot, as I have said, perform a series of 2D transposes, since that would only work, as far as I can see, when transforming it to a YXZ array.

So the way I see it, I need to learn from the 2D transpose example: load 3D blocks of data into shared memory in a coalesced fashion, then read from that shared array “out of order” so that the writes to global memory are coalesced. Simple enough so far.
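The staging idea can be emulated on the host to check the address arithmetic before writing any kernel. The sketch below is plain C++, not CUDA: a `std::vector` stands in for the shared memory tile, the cubic tile edge `t` is an assumed parameter, and the array dimensions are assumed to be multiples of `t`. The function name `tiled_xyz_to_zxy` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of a tiled XYZ -> ZXY permutation.
// Assumes nx, ny and nz are all multiples of the tile edge t.
std::vector<float> tiled_xyz_to_zxy(const std::vector<float>& in,
                                    std::size_t nx, std::size_t ny,
                                    std::size_t nz, std::size_t t) {
    assert(t > 0 && in.size() == nx * ny * nz);
    assert(nx % t == 0 && ny % t == 0 && nz % t == 0);
    std::vector<float> out(in.size());
    std::vector<float> tile(t * t * t);  // stands in for the shared memory block

    for (std::size_t z0 = 0; z0 < nz; z0 += t)
      for (std::size_t y0 = 0; y0 < ny; y0 += t)
        for (std::size_t x0 = 0; x0 < nx; x0 += t) {
          // Load phase: the innermost loop walks x, so the source
          // addresses within each run are consecutive.
          for (std::size_t tz = 0; tz < t; ++tz)
            for (std::size_t ty = 0; ty < t; ++ty)
              for (std::size_t tx = 0; tx < t; ++tx)
                tile[tx + t * (ty + t * tz)] =
                    in[(x0 + tx) + nx * ((y0 + ty) + ny * (z0 + tz))];

          // Store phase: the innermost loop walks z, so the destination
          // addresses within each run are consecutive; the tile is read
          // with a stride instead.
          for (std::size_t ty = 0; ty < t; ++ty)
            for (std::size_t tx = 0; tx < t; ++tx)
              for (std::size_t tz = 0; tz < t; ++tz)
                out[(z0 + tz) + nz * ((x0 + tx) + nx * (y0 + ty))] =
                    tile[tx + t * (ty + t * tz)];
        }
    return out;
}
```

In a real kernel the two innermost loops would map to thread indices, and the length of each consecutive run (here `t`) is what decides whether the accesses can coalesce.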

The maximum number of threads in a block is 512, so, at most, I can have an 8x8x8 thread block using an 8x8x8 shared memory tile.

And here lies the problem. All threads of a half-warp must participate in a coalesced memory transaction (this has to run on compute capability 1.1). So that's 16 threads. But I can only read 64 runs of 8 consecutive elements from global memory, since the sub-block is 8x8x8.
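The arithmetic behind this can be written down as a small compile-time check. The C++ sketch below assumes one thread per tile element, reads that are contiguous along x (XYZ source) and writes that are contiguous along z (ZXY destination). It ignores the alignment requirement on the start address. `TileShape` and `fits_half_warp` are hypothetical names, and the 16x1x16 shape is only one candidate that satisfies these particular constraints, not a tested kernel configuration.

```cpp
#include <cstddef>

// Tile extents along x, y and z (hypothetical helper type).
struct TileShape {
    std::size_t tx, ty, tz;
};

constexpr std::size_t kHalfWarp = 16;             // threads per coalesced transaction
constexpr std::size_t kMaxThreadsPerBlock = 512;  // block size limit quoted above

// Threads needed when each thread handles exactly one tile element.
constexpr std::size_t threads(TileShape s) {
    return s.tx * s.ty * s.tz;
}

// True when both the x runs (reads) and the z runs (writes) are whole
// multiples of a half-warp and the block stays within the thread limit.
constexpr bool fits_half_warp(TileShape s) {
    return s.tx % kHalfWarp == 0 && s.tz % kHalfWarp == 0 &&
           threads(s) <= kMaxThreadsPerBlock;
}

// The cubic 8x8x8 tile uses all 512 threads but only offers runs of 8.
static_assert(threads(TileShape{8, 8, 8}) == 512, "8x8x8 fills the block");
static_assert(!fits_half_warp(TileShape{8, 8, 8}), "runs of 8 are too short");

// A flat tile in the x-z plane gives runs of 16 in both directions.
static_assert(threads(TileShape{16, 1, 16}) == 256, "16x1x16 needs 256 threads");
static_assert(fits_half_warp(TileShape{16, 1, 16}), "runs of 16 fit a half-warp");
```

The point of the check is that the tile does not have to be cubic: only the x and z extents matter for the two coalesced directions, so the y extent can shrink to make room.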

Does anybody see a way to do this?