Has anyone developed a 3D transpose kernel?
I need to write to an array in XYZ order, which is not coalesced, but if I could write it in ZYX order it would be.
So what I need is to write in ZYX order, then do a fast coalesced transpose to bring it back into XYZ form so that the rest of the program can execute.
I realize it's analogous to the transpose in the SDK, but with blockDim.z <= 64 and no z dimension in the grid, it looks like address calculation hell.
So if anyone has already got one working, would you be so kind as to post it?
Thanks!
Edit: after thinking about it for more than 2 seconds, I think Y separate 2D transposes (one X-Z transpose per Y slice) would accomplish what I'm looking for.
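
For what it's worth, here is a rough sketch of that per-slice idea, not a tested production kernel. It makes assumptions the post doesn't spell out: "ZYX" is taken to mean z is the fastest-varying index on input, "XYZ" means x is fastest on output, and the data is float. The kernel name `transposeXZ`, the `TILE` size and the test dimensions are made up for illustration. It only uses a 2D grid: the X and Z tile indices are folded into blockIdx.x, and blockIdx.y picks the Y slice.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE 16  // hypothetical tile size; tune for your hardware

// Assumed layouts (not stated in the original post):
//   in : z fastest,  in[(x*ny + y)*nz + z]
//   out: x fastest, out[(z*ny + y)*nx + x]
__global__ void transposeXZ(const float* in, float* out,
                            int nx, int ny, int nz)
{
    __shared__ float tile[TILE][TILE + 1];  // +1 padding avoids bank conflicts

    int tilesZ = (nz + TILE - 1) / TILE;
    int tz = blockIdx.x % tilesZ;  // tile index along z
    int tx = blockIdx.x / tilesZ;  // tile index along x
    int y  = blockIdx.y;           // one Y slice per grid row

    // Coalesced read: threadIdx.x runs along z, the fastest input index.
    int z = tz * TILE + threadIdx.x;
    int x = tx * TILE + threadIdx.y;
    if (x < nx && z < nz)
        tile[threadIdx.y][threadIdx.x] = in[((size_t)x * ny + y) * nz + z];

    __syncthreads();

    // Coalesced write: threadIdx.x now runs along x, the fastest output index.
    x = tx * TILE + threadIdx.x;
    z = tz * TILE + threadIdx.y;
    if (x < nx && z < nz)
        out[((size_t)z * ny + y) * nx + x] = tile[threadIdx.x][threadIdx.y];
}

int main()
{
    const int nx = 50, ny = 7, nz = 33;  // deliberately not multiples of TILE
    const size_t n = (size_t)nx * ny * nz;
    const size_t bytes = n * sizeof(float);

    float* h_in  = (float*)malloc(bytes);
    float* h_out = (float*)malloc(bytes);
    for (size_t i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    // 2D grid only: (x tiles * z tiles) blocks per row, one row per Y slice.
    dim3 block(TILE, TILE);
    dim3 grid(((nx + TILE - 1) / TILE) * ((nz + TILE - 1) / TILE), ny);
    transposeXZ<<<grid, block>>>(d_in, d_out, nx, ny, nz);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    // Compare against the index mapping computed on the host.
    int mismatches = 0;
    for (int x = 0; x < nx; ++x)
        for (int y = 0; y < ny; ++y)
            for (int z = 0; z < nz; ++z)
                if (h_out[((size_t)z * ny + y) * nx + x] !=
                    h_in[((size_t)x * ny + y) * nz + z])
                    ++mismatches;
    printf("%d mismatches\n", mismatches);

    cudaFree(d_in);
    cudaFree(d_out);
    free(h_in);
    free(h_out);
    return mismatches != 0;
}
```

Since Y stays in the middle of both layouts, each block only ever touches one Y slice, so the address math is the same as the SDK's 2D transpose plus a per-slice offset. Error checking on the CUDA calls is left out to keep the sketch short.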